Corpus revision

Before running any revision query, be sure to have a backup of the input file. Seriously.


Introduction

Revision queries allow users to make automatic changes to a corpus. They are very useful for correcting systematic errors or revising a corpus to fit new annotation guidelines. They can also be used to flag structures for manual review. Finally, they can be used as a "poor person's parser"; by running successive revision queries on a POS-tagged corpus (suitably transformed to conform to Penn Treebank Format), users can build successively larger constituents.

Corpus revisions are implemented by ordinary CS queries that are supplemented with indices linking nodes in the query to revision instructions. The revision-related indices (henceforth, “flags”) are enclosed in curly brackets. Here is the general idea:

query:     ({1}A function B)
       AND (C function {2}D)

revise{1}: info-x
revise{2}: info-y

Here is an example from the Tycho Brahe Corpus of historical Portuguese. Originally, portmanteau items like dos 'of the' were treated as single orthographic words, but it was later decided to split such items into a preposition and a determiner, with "@" indicating the split, as shown below.

Old:     (PP (P+D-P dos)
             (NP (ADJ-P grandes)
                 (N-P homens)))

New:     (PP (P d@)
             (NP (D-P @os)
                 (ADJ-P grandes)
                 (N-P homens)))
The change can be implemented with the following revision query:
node: PP*

query:     (PP iDoms {1}P+D-P)
       AND (P+D-P iDoms {2}dos)
       AND (P+D-P hasSister NP) 
       AND (P+D-P iPrecedes NP) 
       AND (NP iDomsFirst {3}.*)

replace_label{1}: P
replace_label{2}: d@
add_leaf_before{3}: (D-P @os)

The curly brackets around flags distinguish them from same-instance indices, which are enclosed in square brackets. Contrary to what one might expect, flags must follow same-instance indices; in other words, curly brackets follow square brackets. The following revision query contains both same-instance indices and revision flags.
node: IP*

query:     ([1]NP iDoms [2]{1}NP)
       AND ([2]NP iDomsFirst DF)

append_label{1}: -PART
The above query has the effect shown below.
Example input:

(NP (Q beaucoup)
    (NP (DF de) (NCPL livres)))

Example output:

(NP (Q beaucoup)
    (NP-PART (DF de) (NCPL livres)))

Don't repeat flags

CorpusSearch needs each revision flag to appear only once in a query. Repeated instances of a flag, as in the query below, are ignored.
query:     (NP* iDoms {1}ADV)
       AND (NP* iDoms {2}ADJ)
       AND ({1}ADV iPrecedes {2}ADJ)	← WRONG - repeated flags

add_internal_node{1, 2}: ADJP

In response to a query like the one above, CorpusSearch issues a warning like the following:

WARNING!  Subsequent flag {1} has been ignored.

WARNING!  Subsequent flag {2} has been ignored.

The proper version of the above query is as follows:

query:     (NP* iDoms {1}ADV)
       AND (NP* iDoms {2}ADJ)
       AND (ADV iPrecedes ADJ)		← RIGHT - no repeated flags

add_internal_node{1, 2}: ADJP
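
Repeated flags are easy to miss in a long query. The following is a minimal Python sketch, not part of CorpusSearch, of a check that counts curly-bracket flags in the text of a query clause before the query is run. The helper name repeated_flags is hypothetical, and the sketch assumes that it is given only the query clause, not the revision commands that follow it.

```python
import re
from collections import Counter

# Hypothetical helper: it inspects query text only and knows nothing
# about CorpusSearch itself.
FLAG = re.compile(r"\{(\d+)\}")


def repeated_flags(query):
    """Return the flag numbers that occur more than once in the query text."""
    counts = Counter(int(number) for number in FLAG.findall(query))
    return sorted(number for number, count in counts.items() if count > 1)


wrong = """(NP* iDoms {1}ADV)
AND (NP* iDoms {2}ADJ)
AND ({1}ADV iPrecedes {2}ADJ)"""

right = """(NP* iDoms {1}ADV)
AND (NP* iDoms {2}ADJ)
AND (ADV iPrecedes ADJ)"""

print(repeated_flags(wrong))  # [1, 2]
print(repeated_flags(right))  # []
```

An empty result means that every flag is introduced exactly once, which is what CorpusSearch expects.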

Creating a revised corpus

Revision queries are run like ordinary queries. For example:
CS revision.q /path/to/corpus.psd

CS revision.q /path/to/corpus.psd -out better-corpus.psd

Running revision queries like the ones above results in output consisting of only tokens that match the query, as modified by the specified revisions. This option can be useful in developing and testing complex revision queries, as it allows the developer to home in on the relevant tokens and changes.

In general, however, the desired output is a copy of the entire corpus, as modified by the revision query. This can be achieved by adding the following line to a revision query's preamble.

copy_corpus: t

Label changes

The simplest revisions are changes to labels in the tree that otherwise leave the structure intact.
For expository simplicity, the following example queries do not explicitly specify "node". A suitable specification would be $ROOT.

replace_label

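Replaces the entire label of the flagged node. Use caution when the flagged label is matched with a wildcard: everything that the wildcard matched, including any additional dash tags, is lost (see the example below).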

Format: replace_label{1}: new_label

Example query:

query:  ({1}NP-ACC* iDoms .*)

// any dash tags after ACC will be deleted!
replace_label{1}: NP-OBJ
Example input:
( (IP-MAT (NP-LFD (D This) (ONE one))
          (, ,)
          (NP-SBJ (PRO I))
          (VBP like)
          (NP-ACC-RSP (PRO it))
          (ADVP (ADVR better))
          (. .)))
Example output:
( (IP-MAT (NP-LFD (D This) (ONE one))
          (, ,)
          (NP-SBJ (PRO I))
          (VBP like)
          (NP-OBJ (PRO it))             ← -RSP tag not retained
          (ADVP (ADVR better))
          (. .)))

prepend_label

Format: prepend_label{1}: prefix

Example query:

query:     ({1}NP* iDoms CL)

prepend_label{1}: CL-
Example input:
( (IP-MAT (NP-SBJ (PRO Je))
          (NP-ACC (CL les))
          (VJ vois)
          (. .)))
Example output:
( (IP-MAT (NP-SBJ (PRO Je))
          (CL-NP-ACC (CL les))
          (VJ vois)
          (. .)))

append_label

Format: append_label{1}: suffix

Example query:

query:     ({1}NP* iDoms CL)

append_label{1}: -CL
Example input (same as for prepend_label):
( (IP-MAT (NP-SBJ (PRO Je))
          (NP-ACC (CL les))
          (VJ vois)
          (. .)))
Example output:
( (IP-MAT (NP-SBJ (PRO Je))
          (NP-ACC-CL (CL les))
          (VJ vois)
          (. .)))

pre_crop_label

Deletes the part of the label that ends with (and includes) the leftmost instance of the specified character.

Format: pre_crop_label{1}: delete_before_and_including_me

Example query:

query: (NP* iDoms {1}.*+N)

pre_crop_label{1}: +
Example input:
( (NP (ADJ+N wildlife)))

( (NP (NPR+P+D+N Jack-in-the-pulpit)))
Example output:
( (NP (N wildlife)))

( (NP (P+D+N Jack-in-the-pulpit)))     ← probably not the desired result

post_crop_label

Deletes the part of the label that begins with (and includes) the leftmost instance of the specified character.

Format: post_crop_label{1}: delete_after_and_including_me

Example query:

query: (NP* iDoms {1}N+.*|NS+.*)

post_crop_label{1}: +
Example input:
( (NP (N+ADJ court-martial)))

( (NP (NS+P+N mothers-in-law)))
Example output:
( (NP (N court-martial)))

( (NP (NS mothers-in-law)))

Here is an example combining "post_crop_label" and "append_label" to replace a dash tag:

query:  ({1}NP-ACC|NP-DTV iDoms N*)

post_crop_label{1}: -
append_label{1}: -OBJ
Example input:
( (IP-MAT (NP-SBJ (PRO You))
          (MD must)
          (NEG not)
          (VB exspecte)
          (NP-ACC (Q no) (ADJ greate) (NS matters))
          (NP-TMP (D this) (N time))
          (. ,)) (ID KNYVETT-1630,87.25))
Example output:
( (IP-MAT (NP-SBJ (PRO You))
          (MD must)
          (NEG not)
          (VB exspecte)
          (NP-OBJ (Q no) (ADJ greate) (NS matters))
          (NP-TMP (D this) (N time))
          (. ,))
  (ID KNYVETT-1630,87.25))
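
The effect of the cropping commands on a label can be sketched with ordinary string operations. The Python functions below are illustrative stand-ins, not CorpusSearch code; they borrow the command names for readability. Leaving a label unchanged when the character is absent is an assumption of this sketch.

```python
def pre_crop_label(label, char):
    """Drop everything up to and including the leftmost occurrence of char."""
    head, found, tail = label.partition(char)
    # Assumption: a label without the character is left unchanged.
    return tail if found else label


def post_crop_label(label, char):
    """Drop everything from the leftmost occurrence of char onwards."""
    head, found, tail = label.partition(char)
    return head  # equals the whole label when char is absent


def append_label(label, suffix):
    """Add a suffix to the end of the label."""
    return label + suffix


print(pre_crop_label("ADJ+N", "+"))      # N
print(pre_crop_label("NPR+P+D+N", "+"))  # P+D+N
print(post_crop_label("NS+P+N", "+"))    # NS

# Replacing a dash tag: crop at the first dash, then append the new tag.
print(append_label(post_crop_label("NP-ACC", "-"), "-OBJ"))  # NP-OBJ
```

Because both cropping commands key on the leftmost instance of the character, labels with several instances (such as NPR+P+D+N) may need more than one pass.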

co_index

Format: co_index{1, 2}:

Example query:

query: ({1}IP-MAT iDoms {2}IP-PPL)

co_index{1, 2}:
Example input:
( (IP-MAT (CONJ And)
          (ADVP (ADV so))
          (PP (P by)
              (NP (NS meanes)))
          (NP-SBJ (NPR kynge) (NPR Uther))
          (VBD send)
          (PP (P for)
              (NP (D this) (N duk)))
          (IP-PPL (VAG chargyng)
                  (NP-OB1 (PRO hym))
                  (IP-INF (TO to)
                          (VB brynge)
                          (NP-OB1 (PRO$ his) (N wyf))
                          (PP (P with)
                              (NP (PRO hym)))))
          (E_S ,))
  (ID CMMALORY,2.8))
Example output:
( (IP-MAT-1 (CONJ And)
	    (ADVP (ADV so))
	    (PP (P by)
		(NP (NS meanes)))
	    (NP-SBJ (NPR kynge) (NPR Uther))
	    (VBD send)
	    (PP (P for)
		(NP (D this) (N duk)))
	    (IP-PPL-1 (VAG chargyng)
		      (NP-OB1 (PRO hym))
		      (IP-INF (TO to)
			      (VB brynge)
			      (NP-OB1 (PRO$ his) (N wyf))
			      (PP (P with)
				  (NP (PRO hym)))))
	    (E_S ,))
  (ID CMMALORY,2.8))

Structural changes

If a revision would result in an illegal structure (for instance, a tree with crossing branches, or a tree containing an internal node without leaf descendants), CS issues a warning and does not change the tree.

delete_node

Implements "pruning". A node is deleted, but its descendants remain.

Format: delete_node{1}:

Example query:

query: ([1]ADVP* iDomsOnly [2]{1}ADVP*)

delete_node{1}:
Example input:
( (FRAG (WNP (WPRO What))
        (ADVP-TMP (ADVP (ADV neuer)))
        (NP (D a) (ADJ great) (N belly))
        (ADVP (ADV yet))
        (. ?)) (ID DELONEY,69.5))
Example output:
( (FRAG (WNP (WPRO What))
        (ADVP-TMP (ADV neuer))
        (NP (D a) (ADJ great) (N belly))
        (ADVP (ADV yet))
        (. ?))
  (ID DELONEY,69.5))
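
Pruning can be pictured as splicing a node's children into the place that the node occupied. The following Python sketch shows this on a toy tree; the nested-list representation and the function names are assumptions made for illustration and say nothing about how CorpusSearch stores trees internally.

```python
# A tree is a nested list: [label, child, child, ...]; a leaf is [POS, word].

def delete_node(parent, index):
    """Remove parent[index] and splice its children into its position."""
    node = parent[index]
    parent[index:index + 1] = node[1:]  # node[0] is the label, the rest are children


def show(node):
    """Render a nested-list tree in bracketed form."""
    if isinstance(node, str):
        return node
    return "(" + " ".join(show(part) for part in node) + ")"


advp_tmp = ["ADVP-TMP", ["ADVP", ["ADV", "neuer"]]]
delete_node(advp_tmp, 1)
print(show(advp_tmp))  # (ADVP-TMP (ADV neuer))
```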

move_up_node

Format: move_up_node{1}:

Example query:

query:  (NP* iDoms {1}PP)

move_up_node{1}:
Example input:
( (IP-MAT (NP-SBJ (PRO He))
          (VBD saw)
          (NP-ACC (D the) (N man)
		  (PP (P with)
		      (NP (D the) (N telescope))))
          (. .)))
Example output:

( (IP-MAT (NP-SBJ (PRO He))
	  (VBD saw)
	  (NP-ACC (D the) (N man))
	  (PP (P with)
	      (NP (D the) (N telescope)))
	  (. .)))

If the target node is a middle or only child, CS issues a warning and does not change the tree. In the following example, only the last PP ("in the tree") can be moved up on the first pass; if the first PP ("with the telescope") also needs to be moved up, the query must be run again on the output of the first pass.

Example input:

( (IP-MAT (NP-SBJ (PRO He))
          (VBD saw)
          (NP-ACC (D the) (N man)
		  (PP (P with)
		      (NP (D the) (N telescope)))
		  (PP (P in)
		      (NP (D the) (N tree))))
          (. .)))
Example output:
WARNING! could not move_up_node{1}: (12 PP)

( (IP-MAT (NP-SBJ (PRO He))
	  (VBD saw)
	  (NP-ACC (D the)
		  (N man)
		  (PP (P with)
		      (NP (D the) (N telescope))))
	  (PP (P in)
	      (NP (D the) (N tree)))
	  (. .)))
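
The restriction to edge children can be sketched in Python as follows. This is an illustration under stated assumptions, not CorpusSearch code: trees are nested lists, the function name merely echoes the command, and the landing site of a first child (just before its old parent) is inferred from the need to preserve word order rather than taken from the examples above.

```python
# A tree is a nested list: [label, child, child, ...]; a leaf is [POS, word].

def move_up_node(grandparent, parent_idx, child_idx):
    """Move grandparent[parent_idx][child_idx] up one level.

    Returns True on success and False (tree unchanged) when the target
    is a middle or only child.
    """
    parent = grandparent[parent_idx]
    n_children = len(parent) - 1  # parent[0] is the label
    is_first = child_idx == 1
    is_last = child_idx == n_children
    if n_children == 1 or not (is_first or is_last):
        return False
    node = parent.pop(child_idx)
    # Assumption: a first child lands just before its old parent and a
    # last child just after it, so that word order is preserved.
    grandparent.insert(parent_idx if is_first else parent_idx + 1, node)
    return True


pp_with = ["PP", ["P", "with"], ["NP", ["D", "the"], ["N", "telescope"]]]
pp_in = ["PP", ["P", "in"], ["NP", ["D", "the"], ["N", "tree"]]]
ip = ["IP-MAT",
      ["NP-SBJ", ["PRO", "He"]],
      ["VBD", "saw"],
      ["NP-ACC", ["D", "the"], ["N", "man"], pp_with, pp_in],
      [".", "."]]

# First pass: the middle PP is refused, the last PP moves up.
first_pass = [move_up_node(ip, 3, 3), move_up_node(ip, 3, 4)]
# Second pass: the remaining PP is now the last child and can move.
second_pass = move_up_node(ip, 3, 3)
labels = [child[0] for child in ip[1:]]
print(first_pass, second_pass, labels)
```

After the second pass, both PPs are sisters of NP-ACC and remain in their original order.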

move_up_nodes

Format: move_up_nodes{1, 2}:

Example query:

query:     (NP* iDoms {1}PP)
       AND (NP* iDoms {2}IP-PPL)
       AND (NP* iDoms N|NS)
       AND (N|NS iPrecedes PP)

move_up_nodes{1, 2}:
Example input:
( (IP-MAT (NP-SBJ (PRO He))
          (VBD saw)
          (NP-ACC (D the) (N man)
		  (PP (P with)
		      (NP (D the) (N telescope)))
		  (PP (P in)
		      (NP (D the) (N tree)))
		  (IP-PPL (VAG munching)
			  (NP-ACC (D an) (N apple))))
          (. .)))
Example output:
( (IP-MAT (NP-SBJ (PRO He))
	  (VBD saw)
	  (NP-ACC (D the) (N man))
	  (PP (P with)
	      (NP (D the) (N telescope)))
	  (PP (P in)
	      (NP (D the) (N tree)))
	  (IP-PPL (VAG munching)
		  (NP-ACC (D an) (N apple)))
	  (. .)))
As usual, if the revision would result in an illegal tree, CS issues a warning and does not change the tree.

move_to

Moves a node flagged {1} to become a daughter of a target node flagged {2}.

Format: move_to{1, 2}:

Example query:

query:     (IP* iDoms {2}NP*)
       AND (IP* iDoms {1}PP)
       AND (PP iDoms P)
       AND (P iDoms [oO]f)       
       AND (NP* iPrecedes PP)

move_to{1, 2}:

Example input:

( (IP-MAT (NP-SBJ (PRO He))
          (VBP knows)
          (NP-ACC (D the) (N king))
          (PP (P of)
              (NP (NPR England)))
          (. .)))
Example output:
( (IP-MAT (NP-SBJ (PRO He))
	  (VBP knows)
	  (NP-ACC (D the) (N king)
		  (PP (P of)
		      (NP (NPR England))))
	  (. .)))

extend_span

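Extends the span of a constituent over an immediately adjacent sister. The order of the arguments is important: the first flag identifies the constituent whose span is extended, and the second identifies the sister that it absorbs. In the example below, the NP is flagged {2} and the determiner {1}, so the command is written with the flags in the order {2, 1}.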

Format: extend_span{1, 2}:

Example query:

query:     ({1}D hasSister {2}NP*)
       AND (D iPrecedes NP*)

extend_span{2, 1}:
Example input:
( (IP-MAT (D the)
          (NP-SBJ (ADJ basic)
                  (N problem))
          (BEP is)
          (NP-OB1 (D this))
          (. .)))
Example output:
( (IP-MAT (NP-SBJ (D the) 
                  (ADJ basic)
                  (N problem))
          (BEP is)
          (NP-OB1 (D this))
          (. .)))

add_internal_node

Adds a parent node over a specified span. Repeating the first index yields a unary-branching parent node.

Format: add_internal_node{1, 2}: new_node

Example query:

query:  ({1}MD hasSister {2}VB)

add_internal_node{1, 2}: VERB-COMPLEX
Example input:
( (IP-MAT-SPE (' ')
              (NP-VOC (N Sir))
              (, ,)
              (' ')
              (IP-MAT-PRN (VBD said)
                          (NP-SBJ (NPR Ulfius)))
              (, ,)
              (' ')
              (NP-SBJ (PRO he))
              (MD wille)
              (NEG not)
              (VB dwelle)
              (NP-MSR (ADJ long))
              (. .)
              (' '))
  (ID CMMALORY,3.66))
Example output:
( (IP-MAT-SPE (' ')
              (NP-VOC (N Sir))
              (, ,)
              (' ')
              (IP-MAT-PRN (VBD said)
                          (NP-SBJ (NPR Ulfius)))
              (, ,)
              (' ')
              (NP-SBJ (PRO he))
              (VERB-COMPLEX (MD wille) (NEG not) (VB dwelle))
              (NP-MSR (ADJ long))
              (. .)
              (' '))
  (ID CMMALORY,3.66))

add_leaf_before, add_leaf_after

Format: add_leaf_before{1}: (preterminal terminal)

add_leaf_after{1}: (preterminal terminal)

Adds a sister either before or after a flagged node.

Example query:

query:  (PP iDoms {1}P)

add_leaf_before{1}: (X FOO)
add_leaf_after{1}: (Y BAR)
Example input:
( (IP-MAT (PP (P Unto)
              (NP (D that)))
          (NP-SBJ (PRO they)
                  (QP (Q all)))
          (ADVP (ADV well))
	  (VBD accordyd))
  (ID CMMALORY,5.110) )
Example output:
( (IP-MAT (PP (X FOO)
              (P Unto)
              (Y BAR)
              (NP (D that)))
          (NP-SBJ (PRO they)
                  (QP (Q all)))
          (ADVP (ADV well))
          (VBD accordyd))
  (ID CMMALORY,5.110))

trace_before

Adds a trace before the node flagged {2} and at the same time coindexes the trace with the node flagged {1}.

Format: trace_before{1, 2}: (preterminal terminal)

Example query:

query:     (CP* iDoms {1}WNP*)
       AND (CP* iDoms IP-SUB*)
       AND (IP-SUB* iDomsFirst {2}.*)

trace_before{1, 2}: (NP-SBJ *T*)
Example input:
( (CP-QUE-MAT-SPE (NP-VOC (NPR Sir) (NPR Melyas))
                  (, ,)
                  (WNP (WPRO who))
                  (IP-SUB-SPE (HVP hath)
                              (VBN wounded)
                              (NP-OB1 (PRO you)))
                  (. ?)) (ID CMMALORY,645.4103))
Example output:
( (CP-QUE-MAT-SPE (NP-VOC (NPR Sir) (NPR Melyas))
                  (, ,)
                  (WNP-1 (WPRO who))
                  (IP-SUB-SPE (NP-SBJ *T*-1)
                              (HVP hath)
                              (VBN wounded)
                              (NP-OB1 (PRO you)))
                  (. ?)) (ID CMMALORY,645.4103))

concat

Format: concat{1, 2}:

Concatenates the terminals dominated by two preterminal (POS) tags. Primarily useful for concatenating coding strings (see Coding queries for details), which formally are orthographic words dominated by the preterminal node CODING-* (where "*" is the boundary node for the coding query). Coding strings are constrained to contain information only about the structures dominated by the boundary node where they are inserted, but "concat" allows information associated with different nodes to appear in the same coding string.

For instance, in a study of relative clauses, it might be useful to study correlations between the properties of a relative clause and those of its head noun phrase. In the example below, the properties of the noun phrase and of the relative clause are captured in the coding strings CODING-NP* and CODING-CP-REL*. The first flag given to concat identifies the string that is copied, and the second identifies the string to which it is appended. Here, concat{2, 1} appends the coding string flagged {2} to the one flagged {1}, copying it upwards. (In what follows, the queries could be shortened with iDomsMod; we use iDoms for clarity.)

 
query:      (NP* iDoms CODING-NP*)
        AND (CODING-NP* iDoms [1]{1}.*)
        AND (NP* iDoms CP-REL*)
        AND (CP-REL* iDoms CODING-CP-REL*)
        AND (CODING-CP-REL* iDoms [2]{2}.*)

concat{2, 1}:

Example input (schematic):

( (IP-MAT (NP-SBJ (CODING-NP-SBJ a:b:c)
                  ...
                  (CP-REL (CODING-CP-REL d:e:f)))
          ...))

Example output:

( (IP-MAT (NP-SBJ (CODING-NP-SBJ a:b:c:d:e:f)
                  ...
                  (CP-REL (CODING-CP-REL d:e:f)))
          ...))
Concatenation in the reverse direction (downward) is of course also possible.

Example query (same query as above, but different order of flag indices):

 
query:      (NP* iDoms CODING-NP*)
        AND (CODING-NP* iDoms [1]{1}.*)
        AND (NP* iDoms CP-REL*)
        AND (CP-REL* iDoms CODING-CP-REL*)
        AND (CODING-CP-REL* iDoms [2]{2}.*)

concat{1, 2}:

Example input (same as for original query):

( (IP-MAT (NP-SBJ (CODING-NP-SBJ a:b:c)
                  ...
                  (CP-REL (CODING-CP-REL d:e:f)))
          ...))

Example output (schematic):

( (IP-MAT (NP-SBJ (CODING-NP-SBJ a:b:c)
                  ...
                  (CP-REL (CODING-CP-REL d:e:f:a:b:c)))
          ...))
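
The effect of argument order can be summarized with a small Python sketch. It is only an illustration of the behavior shown in the examples above: the function name echoes the command, and the colon separator is taken from the example coding strings.

```python
def concat(source, target, sep=":"):
    """Return the target coding string with the source string appended."""
    return target + sep + source


np_coding = "a:b:c"   # string flagged {1} in the queries above
rel_coding = "d:e:f"  # string flagged {2} in the queries above

# concat{2, 1}: the relative clause's string is appended to the noun phrase's.
print(concat(rel_coding, np_coding))  # a:b:c:d:e:f

# concat{1, 2}: the noun phrase's string is appended to the relative clause's.
print(concat(np_coding, rel_coding))  # d:e:f:a:b:c
```

In both cases the source string itself is left as it was; only the target grows.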

The following example illustrates how verb-level lemma information in a lemmatized corpus can be copied to an IP-level coding string.

Example query:

query:      (IP-MAT* iDoms CODING*)
        AND (CODING* iDoms [1]{1}.*)
        AND (IP-MAT* iDoms VB*)
        AND (VB* iDoms META)
        AND (META iDoms LEMMA)
        AND (LEMMA iDoms OEDID)
        AND (OEDID iDoms [2]{2}.*)

concat{2, 1}:
Example input:
( (IP-MAT (CODING-IP-MAT do-neg:subj-pro:v2-no)
          (NP-SBJ (PRO (ORTHO They)
                  (META (LEMMA (HEADWORD they) (OEDID 200700)))))
          (DOD (ORTHO did@)
               (META (LEMMA (HEADWORD they) (OEDID 56228))))
          (NEG (ORTHO @n't)
               (META (LEMMA (HEADWORD they) (OEDID 128494))))
          (VB (ORTHO come)
              (META (LEMMA (HEADWORD come) (OEDID 36824))))
          (. .)))
Example output:
( (IP-MAT (CODING-IP-MAT do-neg:subj-pro:v2-no:36824)
	  (NP-SBJ (PRO (ORTHO They)
		       (META (LEMMA (HEADWORD they) (OEDID 200700)))))
	  (DOD (ORTHO did@)
	       (META (LEMMA (HEADWORD they) (OEDID 56228))))
	  (NEG (ORTHO @n't)
	       (META (LEMMA (HEADWORD they) (OEDID 128494))))
	  (VB (ORTHO come)
	      (META (LEMMA (HEADWORD come) (OEDID 36824))))
	  (. .)))