Before running any revision query, be sure to have a backup of the input file. Seriously.
Revision queries allow users to make automatic changes to a corpus. They are very useful for correcting systematic errors or revising a corpus to fit new annotation guidelines. They can also be used to flag structures for manual review. Finally, they can be used as a "poor person's parser"; by running successive revision queries on a POS-tagged corpus (suitably transformed to conform to Penn Treebank Format), users can build successively larger constituents.
Corpus revisions are implemented by ordinary CS queries that are supplemented with indices linking nodes in the query to revision instructions. The revision-related indices (henceforth, “flags”) are enclosed in curly brackets. Here is the general idea:
query: ({1}A function B) AND (C function {2}D)
revise{1}: info-x
revise{2}: info-y
Here is an example from the Tycho Brahe Corpus of historical Portuguese. Originally, portmanteau items like dos 'of the' were treated as single orthographic words, but it was later decided to split such items into a preposition and a determiner, with "@" indicating the split, as shown below.
Old: (PP (P+D-P dos)
         (NP (ADJ-P grandes) (N-P homens)))
New: (PP (P d@)
         (NP (D-P @os) (ADJ-P grandes) (N-P homens)))
The change can be implemented with the following revision query:
node: PP*
query: (PP iDoms {1}P+D-P)
AND (P+D-P iDoms {2}dos)
AND (P+D-P hasSister NP)
AND (P+D-P iPrecedes NP)
AND (NP iDomsFirst {3}.*)
replace_label{1}: P
replace_label{2}: d@
add_leaf_before{3}: (D-P @os)
The curly brackets around flags distinguish them from same-instance indices, which are enclosed in square brackets. The following revision query contains both same-instance indices and revision flags.
node: IP*
query: ([1]NP iDoms [2]{1}NP)
AND ([2]NP iDomsFirst DF)
append_label{1}: -PART
Contrary to what one might expect, flags must follow same-instance indices. In other words, curly brackets follow square brackets.
The above query has the effect shown below.
Example input:  (NP (Q beaucoup) (NP (DF de) (NCPL livres)))
Example output: (NP (Q beaucoup) (NP-PART (DF de) (NCPL livres)))
Each flag should appear only once in a query. The following query repeats its flags:
query: (NP* iDoms {1}ADV)
AND (NP* iDoms {2}ADJ)
AND ({1}ADV iPrecedes {2}ADJ)    ← WRONG - repeated flags
add_internal_node{1, 2}: ADJP
In response to a query like the one above, CorpusSearch issues a warning like the following:
WARNING! Subsequent flag {1} has been ignored.
WARNING! Subsequent flag {2} has been ignored.
The proper version of the above query is as follows:
query: (NP* iDoms {1}ADV)
AND (NP* iDoms {2}ADJ)
AND (ADV iPrecedes ADJ)    ← RIGHT - no repeated flags
add_internal_node{1, 2}: ADJP
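The repeated-flag mistake can be caught before a query is ever run. The following is a minimal sketch of such a check, not part of CorpusSearch: `repeated_flags` is a hypothetical helper that scans only the text of the `query:` line for `{n}` flags and reports any number that occurs more than once.

```python
import re
from collections import Counter

# A revision flag is a number in curly brackets, e.g. {1}.
FLAG = re.compile(r"\{(\d+)\}")

def repeated_flags(query):
    """Return the flag numbers that occur more than once, in ascending order."""
    counts = Counter(int(n) for n in FLAG.findall(query))
    return sorted(n for n, c in counts.items() if c > 1)

wrong = "(NP* iDoms {1}ADV) AND (NP* iDoms {2}ADJ) AND ({1}ADV iPrecedes {2}ADJ)"
right = "(NP* iDoms {1}ADV) AND (NP* iDoms {2}ADJ) AND (ADV iPrecedes ADJ)"

print(repeated_flags(wrong))  # [1, 2]
print(repeated_flags(right))  # []
```

Same-instance indices such as `[1]` are written in square brackets and are therefore not counted by this check.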
CS revision.q /path/to/corpus.psd
CS revision.q /path/to/corpus.psd -out better-corpus.psd
Running revision queries like the ones above results in output consisting of only tokens that match the query, as modified by the specified revisions. This option can be useful in developing and testing complex revision queries, as it allows the developer to home in on the relevant tokens and changes.
In general, however, the desired output is a copy of the entire corpus, as modified by the revision query. This can be achieved by adding the following line to a revision query's preamble.
copy_corpus: t
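Since a run with copy_corpus produces a full modified copy of the corpus, the backup advice at the top of this page is worth automating. The sketch below is one way to do that outside CorpusSearch; the helper name `backup_corpus` and the `.bak` suffix are assumptions made for illustration.

```python
import shutil
import tempfile
from pathlib import Path

def backup_corpus(corpus_path):
    """Copy the corpus file to '<name>.bak' next to it and return the backup path.

    An existing backup is never overwritten, so an earlier good copy
    cannot be replaced by a file that has already been revised.
    """
    src = Path(corpus_path)
    dst = src.with_name(src.name + ".bak")
    if dst.exists():
        raise FileExistsError(f"backup already exists: {dst}")
    shutil.copy2(src, dst)
    return dst

if __name__ == "__main__":
    # Demonstration on a throwaway file rather than a real corpus.
    with tempfile.TemporaryDirectory() as tmp:
        corpus = Path(tmp) / "corpus.psd"
        corpus.write_text("( (NP (N wildlife)))\n")
        print(backup_corpus(corpus).name)  # corpus.psd.bak
```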
Format: replace_label{1}: new_label
Example query:
query: ({1}NP-ACC* iDoms .*)    // any dash tags after ACC will be deleted!
replace_label{1}: NP-OBJ
Example input:
( (IP-MAT (NP-LFD (D This) (ONE one))
          (, ,)
          (NP-SBJ (PRO I))
          (VBP like)
          (NP-ACC-RSP (PRO it))
          (ADVP (ADVR better))
          (. .)))
Example output:
( (IP-MAT (NP-LFD (D This) (ONE one))
          (, ,)
          (NP-SBJ (PRO I))
          (VBP like)
          (NP-OBJ (PRO it))    ← -RSP tag not retained
          (ADVP (ADVR better))
          (. .)))
Format: prepend_label{1}: prefix
Example query:
query: ({1}NP* iDoms CL)
prepend_label{1}: CL-
Example input:
( (IP-MAT (NP-SBJ (PRO Je))
          (NP-ACC (CL les))
          (VJ vois)
          (. .)))
Example output:
( (IP-MAT (NP-SBJ (PRO Je))
          (CL-NP-ACC (CL les))
          (VJ vois)
          (. .)))
Format: append_label{1}: suffix
Example query:
query: ({1}NP* iDoms CL)
append_label{1}: -CL
Example input (same as for prepend_label):
( (IP-MAT (NP-SBJ (PRO Je))
          (NP-ACC (CL les))
          (VJ vois)
          (. .)))
Example output:
( (IP-MAT (NP-SBJ (PRO Je))
          (NP-ACC-CL (CL les))
          (VJ vois)
          (. .)))
Format: pre_crop_label{1}: delete_before_and_including_me
Example query:
query: (NP* iDoms {1}.*+N)
pre_crop_label{1}: +
Example input:
( (NP (ADJ+N wildlife)))
( (NP (NPR+P+D+N Jack-in-the-pulpit)))
Example output:
( (NP (N wildlife)))
( (NP (P+D+N Jack-in-the-pulpit)))    ← probably not the desired result
Format: post_crop_label{1}: delete_after_and_including_me
Example query:
query: (NP* iDoms {1}N+.*|NS+.*)
post_crop_label{1}: +
Example input:
( (NP (N+ADJ court-martial)))
( (NP (NS+P+N mothers-in-law)))
Example output:
( (NP (N court-martial)))
( (NP (NS mothers-in-law)))
Here is an example combining "post_crop_label" and "append_label" to replace a dash tag:
query: ({1}NP-ACC|NP-DTV iDoms N*)
post_crop_label{1}: -
append_label{1}: -OBJ
Example input:
( (IP-MAT (NP-SBJ (PRO You))
          (MD must)
          (NEG not)
          (VB exspecte)
          (NP-ACC (Q no) (ADJ greate) (NS matters))
          (NP-TMP (D this) (N time))
          (. ,))
  (ID KNYVETT-1630,87.25))
Example output:
( (IP-MAT (NP-SBJ (PRO You))
          (MD must)
          (NEG not)
          (VB exspecte)
          (NP-OBJ (Q no) (ADJ greate) (NS matters))
          (NP-TMP (D this) (N time))
          (. ,))
  (ID KNYVETT-1630,87.25))
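The five label commands above amount to simple string operations on a node label. The following sketch models them in Python to make the differences concrete. It is an illustration only, not CorpusSearch code. Two points are assumptions inferred from the examples rather than stated rules: cropping acts on the first occurrence of the marker, and a label without the marker is left unchanged.

```python
def replace_label(label, new_label):
    # The whole label is replaced, so dash tags on the old label are lost.
    return new_label

def prepend_label(label, prefix):
    return prefix + label

def append_label(label, suffix):
    return label + suffix

def pre_crop_label(label, marker):
    # Delete everything up to and including the first occurrence of marker.
    _head, sep, tail = label.partition(marker)
    return tail if sep else label

def post_crop_label(label, marker):
    # Delete everything from the first occurrence of marker onward.
    head, sep, _tail = label.partition(marker)
    return head if sep else label

print(replace_label("NP-ACC-RSP", "NP-OBJ"))   # NP-OBJ
print(pre_crop_label("NPR+P+D+N", "+"))        # P+D+N
print(post_crop_label("NS+P+N", "+"))          # NS
# post_crop_label followed by append_label swaps one dash tag for another.
print(append_label(post_crop_label("NP-ACC", "-"), "-OBJ"))  # NP-OBJ
```

The second line of output shows why pre_crop_label gives a doubtful result for Jack-in-the-pulpit: only the material before the first "+" is removed.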
Format: co_index{1, 2}:
Example query:
query: ({1}IP-MAT iDoms {2}IP-PPL)
co_index{1, 2}:
Example input:
( (IP-MAT (CONJ And)
          (ADVP (ADV so))
          (PP (P by)
              (NP (NS meanes)))
          (NP-SBJ (NPR kynge) (NPR Uther))
          (VBD send)
          (PP (P for)
              (NP (D this) (N duk)))
          (IP-PPL (VAG chargyng)
                  (NP-OB1 (PRO hym))
                  (IP-INF (TO to)
                          (VB brynge)
                          (NP-OB1 (PRO$ his) (N wyf))
                          (PP (P with)
                              (NP (PRO hym)))))
          (E_S ,))
  (ID CMMALORY,2.8))
Example output:
( (IP-MAT-1 (CONJ And)
            (ADVP (ADV so))
            (PP (P by)
                (NP (NS meanes)))
            (NP-SBJ (NPR kynge) (NPR Uther))
            (VBD send)
            (PP (P for)
                (NP (D this) (N duk)))
            (IP-PPL-1 (VAG chargyng)
                      (NP-OB1 (PRO hym))
                      (IP-INF (TO to)
                              (VB brynge)
                              (NP-OB1 (PRO$ his) (N wyf))
                              (PP (P with)
                                  (NP (PRO hym)))))
            (E_S ,))
  (ID CMMALORY,2.8))
If a revision would result in an illegal structure (for instance, a tree
with crossing branches, or a tree containing an internal node without
leaf descendants), CS issues a warning and does not change the tree.
Format: delete_node{1}:
Format: move_up_node{1}:
If the target node is a middle or only child, CS issues a warning and
does not change the tree. In the second move_up_node example (the tree
containing two PPs), moving up the first PP ("with the telescope")
requires applying the query recursively.
Format: move_up_nodes{1, 2}:
Format: move_to{1, 2}:
Format: extend_span{1, 2}:
Format: add_internal_node{1, 2}: new_node
Format: add_leaf_before{1}: (preterminal terminal)
        add_leaf_after{1}: (preterminal terminal)
Adds a sister either before or after a flagged node.
Format: trace_before{1, 2}: (preterminal terminal)
Format: concat{1, 2}:
Concatenates the terminals dominated by two preterminal (POS) tags.
Primarily useful for concatenating coding strings (see Coding queries
for details), which formally are orthographic words dominated by the
preterminal node CODING-* (where "*" is the boundary node for the coding
query). Coding strings are constrained to contain information only about
the structures dominated by the boundary node where they are inserted,
but "concat" allows information associated with different nodes to
appear in the same coding string.
For instance, in a study of relative clauses, it might be useful to study
correlations between the properties of a relative clause and its head noun
phrase. In the first concat example below, the properties of the noun
phrase and of the relative clause are captured in the coding strings
CODING-NP* and CODING-CP-REL*. The concat command appends the coding
string specified by the index {2} to the one specified by the index {1},
copying it upwards. The example gives the query with a schematic input
and output, followed by the same query with the order of the flag indices
reversed. (In what follows, the queries could be shortened with
iDomsMod; we use iDoms for clarity.)
The last concat example illustrates how verb-level lemma information in
a lemmatized corpus can be copied to an IP-level coding string.
delete_node
Implements "pruning". A node is deleted, but its descendants remain.
Example query:
query: ([1]ADVP* iDomsOnly [2]{1}ADVP*)
delete_node{1}:
Example input:
( (FRAG (WNP (WPRO What))
(ADVP-TMP (ADVP (ADV neuer)))
(NP (D a) (ADJ great) (N belly))
(ADVP (ADV yet))
(. ?)) (ID DELONEY,69.5))
Example output:
( (FRAG (WNP (WPRO What))
(ADVP-TMP (ADV neuer))
(NP (D a) (ADJ great) (N belly))
(ADVP (ADV yet))
(. ?))
(ID DELONEY,69.5))
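To make the pruning behaviour concrete, here is a small Python model of delete_node. It is a sketch under stated assumptions, not the CorpusSearch implementation. Trees are written as nested lists of the form [label, child, ...], with a preterminal written as [tag, "word"]. This representation and the guard against pruning a preterminal are choices made for the illustration.

```python
def delete_node(parent, index):
    """Remove parent[index] but splice its children into its place."""
    node = parent[index]
    children = node[1:]                      # node[0] is the label
    if any(isinstance(child, str) for child in children):
        # Pruning a preterminal would leave a word without a tag.
        raise ValueError("cannot prune a preterminal node")
    parent[index:index + 1] = children

tree = ["FRAG",
        ["WNP", ["WPRO", "What"]],
        ["ADVP-TMP", ["ADVP", ["ADV", "neuer"]]],
        ["NP", ["D", "a"], ["ADJ", "great"], ["N", "belly"]],
        ["ADVP", ["ADV", "yet"]],
        [".", "?"]]

# The inner ADVP is the node flagged {1} by the query above.
delete_node(tree[2], 1)
print(tree[2])  # ['ADVP-TMP', ['ADV', 'neuer']]
```

The second ADVP ("yet") is untouched because it is not the only daughter of another ADVP, so the query does not match it.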
move_up_node
Example query:
query: (NP* iDoms {1}PP)
move_up_node{1}:
Example input:
( (IP-MAT (NP-SBJ (PRO He))
(VBD saw)
(NP-ACC (D the) (N man)
(PP (P with)
(NP (D the) (N telescope))))
(. .)))
Example output:
( (IP-MAT (NP-SBJ (PRO He))
(VBD saw)
(NP-ACC (D the) (N man))
(PP (P with)
(NP (D the) (N telescope)))
(. .)))
Example input (two PPs; the first is a middle child):
( (IP-MAT (NP-SBJ (PRO He))
(VBD saw)
(NP-ACC (D the) (N man)
(PP (P with)
(NP (D the) (N telescope)))
(PP (P in)
(NP (D the) (N tree))))
(. .)))
Example output:
WARNING! could not move_up_node{1}: (12 PP)
( (IP-MAT (NP-SBJ (PRO He))
(VBD saw)
(NP-ACC (D the)
(N man)
(PP (P with)
(NP (D the) (N telescope))))
(PP (P in)
(NP (D the) (N tree)))
(. .)))
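The edge-child restriction and the need for repeated application can be modelled in a few lines of Python. This is a hedged sketch, not CorpusSearch code: trees are nested lists of the form [label, child, ...], the function name and index arguments are hypothetical, and the treatment of a first child (placed just before its old parent) is an assumption, since the examples above only show a last child being moved.

```python
def move_up_node(grandparent, parent_index, child_index):
    """Move a first or last child of grandparent[parent_index] up one level.

    Returns True on success and False (tree unchanged) for a middle or
    only child, which corresponds to the warning case.
    """
    parent = grandparent[parent_index]
    last = len(parent) - 1                 # children sit at positions 1..last
    if last < 2 or child_index not in (1, last):
        return False
    node = parent.pop(child_index)
    # A last child lands just after its old parent, a first child just before.
    target = parent_index + 1 if child_index == last else parent_index
    grandparent.insert(target, node)
    return True

ip = ["IP-MAT",
      ["NP-SBJ", ["PRO", "He"]],
      ["VBD", "saw"],
      ["NP-ACC", ["D", "the"], ["N", "man"],
       ["PP", ["P", "with"], ["NP", ["D", "the"], ["N", "telescope"]]],
       ["PP", ["P", "in"], ["NP", ["D", "the"], ["N", "tree"]]]],
      [".", "."]]

print(move_up_node(ip, 3, 3))            # False: "with the telescope" is a middle child
print(move_up_node(ip, 3, 4))            # True: "in the tree" is the last child
print([child[0] for child in ip[1:]])    # ['NP-SBJ', 'VBD', 'NP-ACC', 'PP', '.']
```

Calling `move_up_node(ip, 3, 3)` a second time now succeeds, because "with the telescope" has become the last child of NP-ACC. That is the sense in which the query has to be applied recursively.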
move_up_nodes
As usual, if the revision would result in an illegal tree, CS issues a
warning and does not change the tree.
Example query:
query: (NP* iDoms {1}PP)
AND (NP* iDoms {2}IP-PPL)
AND (NP* iDoms N|NS)
AND (N|NS iPrecedes PP)
move_up_nodes{1, 2}:
Example input:
( (IP-MAT (NP-SBJ (PRO He))
(VBD saw)
(NP-ACC (D the) (N man)
(PP (P with)
(NP (D the) (N telescope)))
(PP (P in)
(NP (D the) (N tree)))
(IP-PPL (VAG munching)
(NP-ACC (D an) (N apple))))
(. .)))
Example output:
( (IP-MAT (NP-SBJ (PRO He))
(VBD saw)
(NP-ACC (D the) (N man))
(PP (P with)
(NP (D the) (N telescope)))
(PP (P in)
(NP (D the) (N tree)))
(IP-PPL (VAG munching)
(NP-ACC (D an) (N apple)))
(. .)))
move_to
Moves a node flagged {1} to become a daughter of a target node flagged {2}.
Example query:
query: (IP* iDoms {2}NP*)
AND (IP* iDoms {1}PP)
AND (PP iDoms P)
AND (P iDoms [oO]f)
AND (NP* iPrecedes PP)
move_to{1, 2}:
Example input:
( (IP-MAT (NP-SBJ (PRO He))
(VBP knows)
(NP-ACC (D the) (N king))
(PP (P of)
(NP (NPR England)))
(. .)))
Example output:
( (IP-MAT (NP-SBJ (PRO He))
(VBP knows)
(NP-ACC (D the) (N king)
(PP (P of)
(NP (NPR England))))
(. .)))
extend_span
Extends the span of some constituent over an immediately adjacent sister.
The order of the arguments is important.
Example query:
query: ({1}D hasSister {2}NP*)
AND (D iPrecedes NP*)
extend_span{2, 1}:
Example input:
( (IP-MAT (D the)
(NP-SBJ (ADJ basic)
(N problem))
(BEP is)
(NP-OB1 (D this))
(. .)))
Example output:
( (IP-MAT (NP-SBJ (D the)
(ADJ basic)
(N problem))
(BEP is)
(NP-OB1 (D this))
(. .)))
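The role of argument order in extend_span can be illustrated with a short Python model. This is a sketch rather than CorpusSearch code. Trees are nested lists of the form [label, child, ...], and the helper takes list positions instead of flags. The first position names the constituent being extended and the second the sister it absorbs, mirroring extend_span{2, 1} in the query above. The two error checks are assumptions about what counts as an illegal request.

```python
def extend_span(parent, node_index, sister_index):
    """Extend parent[node_index] over the adjacent sister parent[sister_index]."""
    if abs(node_index - sister_index) != 1:
        raise ValueError("the sister must be immediately adjacent")
    node = parent[node_index]
    if isinstance(node[1], str):
        raise ValueError("a preterminal cannot be extended over a sister")
    sister = parent.pop(sister_index)
    if sister_index < node_index:
        node.insert(1, sister)     # absorbed at the left edge
    else:
        node.append(sister)        # absorbed at the right edge

ip = ["IP-MAT",
      ["D", "the"],
      ["NP-SBJ", ["ADJ", "basic"], ["N", "problem"]],
      ["BEP", "is"],
      ["NP-OB1", ["D", "this"]],
      [".", "."]]

extend_span(ip, 2, 1)   # NP-SBJ is extended over the preceding D
print(ip[1])            # ['NP-SBJ', ['D', 'the'], ['ADJ', 'basic'], ['N', 'problem']]
```

Swapping the two positions would ask the preterminal D to absorb the NP, which this model rejects.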
add_internal_node
Adds a parent node over a specified span. Repeating the first index
yields a unary-branching parent node.
Example query:
query: ({1}MD hasSister {2}VB)
add_internal_node{1, 2}: VERB-COMPLEX
Example input:
( (IP-MAT-SPE (' ')
(NP-VOC (N Sir))
(, ,)
(' ')
(IP-MAT-PRN (VBD said)
(NP-SBJ (NPR Ulfius)))
(, ,)
(' ')
(NP-SBJ (PRO he))
(MD wille)
(NEG not)
(VB dwelle)
(NP-MSR (ADJ long))
(. .)
(' '))
(ID CMMALORY,3.66))
Example output:
( (IP-MAT-SPE (' ')
(NP-VOC (N Sir))
(, ,)
(' ')
(IP-MAT-PRN (VBD said)
(NP-SBJ (NPR Ulfius)))
(, ,)
(' ')
(NP-SBJ (PRO he))
(VERB-COMPLEX (MD wille) (NEG not) (VB dwelle))
(NP-MSR (ADJ long))
(. .)
(' '))
(ID CMMALORY,3.66))
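Note that the new node spans everything between the two flagged nodes, so NEG ends up inside VERB-COMPLEX even though it is not flagged. The Python sketch below models this on an abbreviated version of the tree. It is an illustration only: trees are nested lists of the form [label, child, ...], and the helper works on list positions rather than flags.

```python
def add_internal_node(parent, first, last, label):
    """Wrap the sisters parent[first..last] (inclusive) under a new node."""
    if not 1 <= first <= last < len(parent):
        raise IndexError("span must lie within the parent's children")
    parent[first:last + 1] = [[label] + parent[first:last + 1]]

# Abbreviated version of the example tree (quotes, vocative and ID omitted).
ip = ["IP-MAT-SPE",
      ["NP-SBJ", ["PRO", "he"]],
      ["MD", "wille"],
      ["NEG", "not"],
      ["VB", "dwelle"],
      ["NP-MSR", ["ADJ", "long"]]]

add_internal_node(ip, 2, 4, "VERB-COMPLEX")
print(ip[2])  # ['VERB-COMPLEX', ['MD', 'wille'], ['NEG', 'not'], ['VB', 'dwelle']]

# Using the same position twice gives a unary-branching parent.
add_internal_node(ip, 3, 3, "NP")
print(ip[3])  # ['NP', ['NP-MSR', ['ADJ', 'long']]]
```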
add_leaf_before, add_leaf_after
Example query:
query: (PP iDoms {1}P)
add_leaf_before{1}: (X FOO)
add_leaf_after{1}: (Y BAR)
Example input:
( (IP-MAT (PP (P Unto)
(NP (D that)))
(NP-SBJ (PRO they)
(QP (Q all)))
(ADVP (ADV well))
(VBD accordyd))
(ID CMMALORY,5.110))
Example output:
( (IP-MAT (PP (X FOO)
(P Unto)
(Y BAR)
(NP (D that)))
(NP-SBJ (PRO they)
(QP (Q all)))
(ADVP (ADV well))
(VBD accordyd))
(ID CMMALORY,5.110))
trace_before
Adds a trace before the node flagged {2} and at the same time coindexes
the trace with the node flagged {1}.
Example query:
query: (CP* iDoms {1}WNP*)
AND (CP* iDoms IP-SUB*)
AND (IP-SUB* iDomsFirst {2}.*)
trace_before{1, 2}: (NP-SBJ *T*)
Example input:
( (CP-QUE-MAT-SPE (NP-VOC (NPR Sir) (NPR Melyas))
(, ,)
(WNP (WPRO who))
(IP-SUB-SPE (HVP hath)
(VBN wounded)
(NP-OB1 (PRO you)))
(. ?)) (ID CMMALORY,645.4103))
Example output:
( (CP-QUE-MAT-SPE (NP-VOC (NPR Sir) (NPR Melyas))
(, ,)
(WNP-1 (WPRO who))
(IP-SUB-SPE (NP-SBJ *T*-1)
(HVP hath)
(VBN wounded)
(NP-OB1 (PRO you)))
(. ?)) (ID CMMALORY,645.4103))
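In the output above, the index is attached to the label of the antecedent (WNP-1) but to the terminal of the trace (*T*-1). The sketch below models just that detail in Python. It is not CorpusSearch code: trees are nested lists of the form [label, child, ...], and the index `n` is passed in by the caller because how CorpusSearch chooses the index is not described here.

```python
def trace_before(antecedent, parent, index, trace, n):
    """Insert a trace leaf before parent[index], coindexed with antecedent.

    The index n goes on the antecedent's label and on the trace's terminal.
    """
    tag, terminal = trace
    antecedent[0] = f"{antecedent[0]}-{n}"
    parent.insert(index, [tag, f"{terminal}-{n}"])

# Abbreviated version of the example tree (vocative, punctuation and ID omitted).
cp = ["CP-QUE-MAT-SPE",
      ["WNP", ["WPRO", "who"]],
      ["IP-SUB-SPE",
       ["HVP", "hath"],
       ["VBN", "wounded"],
       ["NP-OB1", ["PRO", "you"]]]]

# {1} is the WNP, {2} is the first daughter of IP-SUB-SPE.
trace_before(cp[1], cp[2], 1, ("NP-SBJ", "*T*"), 1)
print(cp[1][0])   # WNP-1
print(cp[2][1])   # ['NP-SBJ', '*T*-1']
```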
concat
Example query:
query: (NP* iDoms CODING-NP*)
AND (CODING-NP* iDoms [1]{1}.*)
AND (NP* iDoms CP-REL*)
AND (CP-REL* iDoms CODING-CP-REL*)
AND (CODING-CP-REL* iDoms [2]{2}.*)
concat{2, 1}:
Example input (schematic):
( (IP-MAT (NP-SBJ (CODING-NP-SBJ a:b:c)
...
(CP-REL (CODING-CP-REL d:e:f)))
...))
Example output:
( (IP-MAT (NP-SBJ (CODING-NP-SBJ a:b:c:d:e:f)
...
(CP-REL (CODING-CP-REL d:e:f)))
...))
Concatenation in the reverse direction (downward) is of course also
possible.
Example query (same query as above, but different order of flag indices):
query: (NP* iDoms CODING-NP*)
AND (CODING-NP* iDoms [1]{1}.*)
AND (NP* iDoms CP-REL*)
AND (CP-REL* iDoms CODING-CP-REL*)
AND (CODING-CP-REL* iDoms [2]{2}.*)
concat{1, 2}:
Example input (same as for original query):
( (IP-MAT (NP-SBJ (CODING-NP-SBJ a:b:c)
...
(CP-REL (CODING-CP-REL d:e:f)))
...))
Example output (schematic):
( (IP-MAT (NP-SBJ (CODING-NP-SBJ a:b:c)
...
(CP-REL (CODING-CP-REL d:e:f:a:b:c)))
...))
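The two directions of concatenation differ only in which coding string is extended. The following Python sketch shows this on the schematic strings above. It is an illustration, not CorpusSearch code: coding leaves are written as [tag, string] lists, and the ":" separator is inferred from the example output.

```python
def concat(source_leaf, target_leaf, sep=":"):
    """Append the source coding string to the target; the source is unchanged."""
    target_leaf[1] = target_leaf[1] + sep + source_leaf[1]

# concat{2, 1}: the relative-clause string is copied up onto the NP string.
np_coding = ["CODING-NP-SBJ", "a:b:c"]
rel_coding = ["CODING-CP-REL", "d:e:f"]
concat(rel_coding, np_coding)
print(np_coding[1], rel_coding[1])   # a:b:c:d:e:f d:e:f

# concat{1, 2}: the NP string is copied down onto the relative-clause string.
np_coding = ["CODING-NP-SBJ", "a:b:c"]
rel_coding = ["CODING-CP-REL", "d:e:f"]
concat(np_coding, rel_coding)
print(np_coding[1], rel_coding[1])   # a:b:c d:e:f:a:b:c
```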
Example query:
query: (IP-MAT* iDoms CODING*)
AND (CODING* iDoms [1]{1}.*)
AND (IP-MAT* iDoms VB*)
AND (VB* iDoms META)
AND (META iDoms LEMMA)
AND (LEMMA iDoms OEDID)
AND (OEDID iDoms [2]{2}.*)
concat{2, 1}:
Example input:
( (IP-MAT (CODING-IP-MAT do-neg:subj-pro:v2-no)
(NP-SBJ (PRO (ORTHO They)
(META (LEMMA (HEADWORD they) (OEDID 200700)))))
(DOD (ORTHO did@)
(META (LEMMA (HEADWORD they) (OEDID 56228))))
(NEG (ORTHO @n't)
(META (LEMMA (HEADWORD they) (OEDID 128494))))
(VB (ORTHO come)
(META (LEMMA (HEADWORD come) (OEDID 36824))))
(. .)))
Example output:
( (IP-MAT (CODING-IP-MAT do-neg:subj-pro:v2-no:36824)
(NP-SBJ (PRO (ORTHO They)
(META (LEMMA (HEADWORD they) (OEDID 200700)))))
(DOD (ORTHO did@)
(META (LEMMA (HEADWORD they) (OEDID 56228))))
(NEG (ORTHO @n't)
(META (LEMMA (HEADWORD they) (OEDID 128494))))
(VB (ORTHO come)
(META (LEMMA (HEADWORD come) (OEDID 36824))))
(. .)))