In this section, we present the most basic type of CorpusSearch command file: the ordinary query file. Other types of command files are similar in form but have different functions, so it is convenient to discuss them separately.
Ordinary query files must have names ending in the extension .q. By default, their output has the same basename as the query file, but with the extension .out. The basename of the output file (but not the .out extension) can be changed with an "-out" switch on the command line (see Installing CorpusSearch).

The command file for an ordinary CorpusSearch query has the following schematic form. Obligatory components are marked as such:
- Preamble
  - Node specification
  - "Ignore" commands
  - Output format commands
  - Reference to a definition file
- Command specification (obligatory)

Here is an example:

    // node specification
    node: IP-MAT*
    // "ignore" command
    ignore_nodes: null
    // output specification
    print_indices: t
    // definition file
    define: mideng.def
    // command specification - find short examples for slides
    query: (IP-MAT* iDomsTotal< 10)

The commands in the preamble must all precede the search specification,
but the various components of the preamble can appear in any order. They
are read in order. This means that the node specification can reference a
label from a definition file, but
only if "define:" precedes "node:" in the command file.
Notational variants for Boolean values
CorpusSearch allows the following notational variants for Boolean values:
Value | Notational variants |
---|---|
true | t, true, T, TRUE |
false | f, false, F, FALSE |
The issue of node choice is perhaps less confusing than it used to be given the advent of coding queries, where the effects of node choice are more transparent.
The "node" command specifies the search domain within which CorpusSearch attempts to execute the search specification. Possible values for the node specification are:
Consider the following query with node boundary IP-MAT*:

    node: IP-MAT*
    query: (NP* iDoms PRO*)

    /~*
    and he made them grete chere out of mesure
    (CMMALORY,2.13)
    *~/
    /*
    1 IP-MAT: 4 NP-SBJ, 5 PRO
    1 IP-MAT: 9 NP-OB2, 10 PRO
    */
    (0 (1 IP-MAT (2 CONJ and)
                 (4 NP-SBJ (5 PRO he))
                 (7 VBD made)
                 (9 NP-OB2 (10 PRO them))
                 (12 NP-OB1 (13 ADJ grete) (15 N chere))
                 (17 ADVP (18 ADV out) (20 PP (21 P of) (23 NP (24 N mesure)))))
       (26 ID CMMALORY,2.13))
    /*
    SUMMARY:
    source files, hits/tokens/total
    CMMALORY 1/1/1
    whole search, hits/tokens/total
    1/1/1
    */

CorpusSearch counts 1 hit because even though there are 2 NP* nodes that match the query, they are contained within a single instance of IP-MAT*. Running the same query with node boundary NP* yields a different result:

    node: NP*
    query: (NP* iDoms PRO*)

    /~*
    and he made them grete chere out of mesure
    (CMMALORY,2.13)
    *~/
    /*
    4 NP-SBJ: 4 NP-SBJ, 5 PRO
    9 NP-OB2: 9 NP-OB2, 10 PRO
    */
    (0 (1 IP-MAT (2 CONJ and)
                 (4 NP-SBJ (5 PRO he))
                 (7 VBD made)
                 (9 NP-OB2 (10 PRO them))
                 (12 NP-OB1 (13 ADJ grete) (15 N chere))
                 (17 ADVP (18 ADV out) (20 PP (21 P of) (23 NP (24 N mesure)))))
       (26 ID CMMALORY,2.13))
    /*
    SUMMARY:
    source files, hits/tokens/total
    CMMALORY 2/1/1
    whole search, hits/tokens/total
    2/1/1
    */

Now CorpusSearch counts 2 hits, because each of the 2 instances of NP* is an instance of the node boundary. The way that CorpusSearch counts hits in this case becomes particularly clear if we set "nodes_only" to "true":

    node: NP*
    nodes_only: t
    query: (NP* iDoms PRO*)

This yields the following output:

    /~*
    and he made them grete chere out of mesure
    (CMMALORY,2.13)
    *~/
    /*
    4 NP-SBJ: 4 NP-SBJ, 5 PRO
    9 NP-OB2: 9 NP-OB2, 10 PRO
    */
    ( (4 NP-SBJ (5 PRO he))
      (26 ID CMMALORY,2.13))

    ( (9 NP-OB2 (10 PRO them))
      (26 ID CMMALORY,2.13))
    /*
    SUMMARY:
    source files, hits/tokens/total
    CMMALORY 2/1/1
    whole search, hits/tokens/total
    2/1/1
    */

The following table is intended as a convenient summary of the commands in this section. For more details, see the entries for the individual commands.
The spaces setting off the alternatives in the following table are intended to improve legibility; they need to be omitted in an actual query (or they will abort the search).
Command | In conjunction with ... | Default value |
---|---|---|
ignore_words | domsWords and variants | CODE | COMMENT | E_S | ID | LB | RMV:* | ' | \" | , | \. | / | 0 | \** |
ignore_nodes | all other commands | same as above, but not the last two items |
add_to_ignore_words | ignore_words | no default |
add_to_ignore | ignore_nodes | no default |
The spaces setting off the alternatives in the following list are intended to improve legibility; they need to be omitted in an actual query (or they will abort the search).
Default: COMMENT | CODE | E_S | ID | LB | RMV:* | ' | \" | , | \. | /
In some corpora, punctuation is tagged as, say, PUNC or PON or PONFP. Those tags are not on the default "ignore_nodes" list, so the user must add them (preferably in the preference file).
"ignore_nodes" tells CorpusSearch to ignore the specified nodes in connection with all search functions except for domsWords and its variants. (For the latter, CorpusSearch uses the list specified by ignore_words.) For instance, running this query:

    (NP* iPrecedes PP*)

returns the following sentence despite the CODE node intervening between NP-1 and PP. This is because the label "CODE" is on the default "ignore_nodes" list.

    /*
    1 IP-MAT-SPE: 5 NP-1, 9 PP
    */
    /~*
    There ar two bretheren beyond the see,
    (CMMALORY,15.439)
    *~/
    (0 (1 IP-MAT-SPE (2 NP-SBJ-1 (3 EX There))
                     (5 BEP ar)
                     (7 NP-1 (8 NUM two) (10 NS bretheren))
                     (12 CODE <P_15>)
                     (14 PP (15 P beyond) (17 NP (18 D the) (20 N see)))
                     (22 E_S ,))
       (24 ID CMMALORY,15.439))

You can replace the default list with a list of your own choosing, and you can also tell CorpusSearch not to ignore any nodes with this command:
ignore_nodes: null
If your query makes reference to an item on your "ignore_nodes" list, CorpusSearch will not issue a warning. The only indication you will get of the incoherent character of your query is the puzzling absence of hits in the output, as illustrated in what follows.
Here is an example of an incoherent query that makes reference to a node on the ignore_nodes list:

    node: IP*
    ignore_nodes: NP*
    query: (NP* iDoms PRO*)

Running the query on the following input:

    (0 (1 IP-MAT (2 CONJ and)
                 (4 NP-SBJ (5 PRO he))
                 (7 VBD made)
                 (9 NP-OB2 (10 PRO them))
                 (12 NP-OB1 (13 ADJ grete) (15 N chere))
                 (17 ADVP (18 ADV out) (20 PP (21 P of) (23 NP (24 N mesure)))))
       (26 ID CMMALORY,2.13))

yields an output file without any hits.

    node: IP*
    query: (NP* iDoms PRO*)
    */
    /*
    HEADER:
    source file: CMMALORY
    */
    /*
    FOOTER
    source file, hits/tokens/total
    CMMALORY 0/0/1
    */
    /*
    SUMMARY:
    source files, hits/tokens/total
    CMMALORY 0/0/1
    whole search, hits/tokens/total
    0/0/1
    */

This is because CorpusSearch is literally following your instruction to ignore all NP* nodes, including the NP* nodes with node addresses 4 and 9, which would otherwise match the query.
add_to_ignore: {list_of_labels}
Default: n/a
"add_to_ignore" adds nodes to the "ignore_nodes" list (whether the default list or your own). For instance:
add_to_ignore: INTJ*
Note that the command name is "add_to_ignore", not "add_to_ignore_nodes".
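Assuming that "add_to_ignore" accepts the same "|"-separated list format as the default lists (an inference from the {list_of_labels} placeholder, not something stated explicitly here), the corpus-specific punctuation tags mentioned above could be added in one line. The tags are the ones named in the note on punctuation; substitute whatever your corpus uses.

```
// sketch: add corpus-specific punctuation tags to the ignore_nodes list
// (no spaces around "|", since spaces would abort the search)
add_to_ignore: PUNC|PON|PONFP
```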
ignore_words: {list_of_labels}
The spaces setting off the alternatives in the following list are intended to improve legibility; they need to be omitted in an actual query (or they will abort the search).
Default: COMMENT | CODE | E_S | ID | LB | RMV:* | ' | \" | , | \. | / | 0 | \**
"ignore_words" tells CorpusSearch which nodes to ignore when counting words in connection with domsWords and its variants. (For all other search functions, CorpusSearch uses the list specified in ignore_nodes.) The default list for "ignore_words" is the same list as for "ignore_nodes", except that it also includes "0" as well as traces and empty categories (expressions beginning with an asterisk). Nodes on the "ignore_words" list can be terminals or preterminals (in other words, either elements of the text itself or their associated POS tags).
As with the ignore_nodes list, you can edit the "ignore_words" list, replacing the default list with a list of your own or with the value "null" (if you want to count all terminal nodes in the text, without ignoring any).
add_to_ignore_words: {list_of_labels}
Default: n/a
"add_to_ignore_words" adds nodes to the "ignore_words" list (whether the default list or your own). For instance:
add_to_ignore_words: [oO]h
Corpus encoding
Default: corpus_encoding: US-ASCII
CorpusSearch also supports ISO-LATIN-1 (a.k.a. ISO-8859-1), UTF-8, and UTF-16.
Output format commands
Output format commands control how search results are printed to the output file. They do not influence the results of the current search. However, because they can cause the output of the current search to include certain nodes or not, they may influence the contents of searches that depend on the output of the current one.
The following table is intended as a convenient summary of the commands in this section. For more details, see the entries for the individual commands.
Command | Default value |
---|---|
begin_remark ... end_remark | n/a |
nodes_only | false |
print_complement | |
print_indices | |
remove_nodes | |
ur_text_only | |
This two-part command tells CorpusSearch to print a remark in the preface to the output; for instance, a note about the purpose of a particular search. The following command:

    begin_remark: pronoun objects end_remark

would give rise to an output preface as follows:

    /*
    PREFACE: regular output file.
    CorpusSearch copyright Beth Randall 1999.
    Date: Wed Nov 03 19:12:03 EST 1999
    command file: pro-obj.q
    input file: ip-mat-2vb.out
    output file: pro-obj.out
    remark: pronoun objects
    node: IP*
    query: (NP-OB* iDoms PRO)
    */

When the output of some initial search that sets nodes_only to "true" becomes the input to subsequent searches, $ROOT in the subsequent searches refers to the root of the output trees, whose label is determined by (though not necessarily identical with) the node boundary of the initial search.
If "nodes_only" is set to true, CorpusSearch prints out only the nodes that match the query. If it is set to false, CorpusSearch prints out the entire sentence token containing the structure matching the query. For instance, with the default value of "false", the following query:

    node: ADVP*
    nodes_only: f
    query: (ADVP* iDoms ADVP*)

gives the following output:

    /~*
    certayn and wit-owte doute, Ihon is is name.
    (CMAELR3,45.589)
    *~/
    /*
    2 ADVP: 3 ADVP
    */
    (0 (1 IP-MAT (2 ADVP (3 ADVP (4 ADV certayn))
                         (6 CONJP (7 CONJ and) (9 PP (10 P wit-owte) (12 NP (13 N doute)))))
                 (15 , ,)
                 (16 NP-OB1 (17 NPR Ihon))
                 (19 BEP is)
                 (21 NP-SBJ (22 PRO$ is) (24 N name))
                 (26 E_S .))
       (28 ID CMAELR3,45.589))

The output for the same query, but with "nodes_only" set to "true", looks like this:

    /~*
    certayn and wit-owte doute, Ihon is is name.
    (CMAELR3,45.574)
    *~/
    /*
    2 ADVP: 3 ADVP
    */
    (0 (2 ADVP (3 ADVP (4 ADV certayn))
               (6 CONJP (7 CONJ and) (9 PP (10 P wit-owte) (12 NP (13 N doute))))
               (15 , ,))
       (28 ID CMAELR3,45.574))

The effect of "print_complement" can now be achieved (more easily and transparently) with coding queries, notably with the ELSE condition.
CorpusSearch ordinarily outputs only nodes or tokens that match the search specification. Setting "print_complement" to "true" causes CorpusSearch to print the matching tokens to the regular output file (with the extension .out) and also all the tokens that don't match to a separate complement file (with the extension .cmp). In other words, "print_complement" applies logical NOT to an entire query.
For instance, the following query could be used on an input file containing all IPs with objects in order to divide those IPs into two complementary sets: those with two objects (in the .out file) and the remainder, with at most one object (in the .cmp file).

    print_complement: t
    node: IP*
    query: (IP* iDoms NP-OB1) AND (IP* iDoms NP-OB2)

Here is an example from the regular output file. It matches the query and has two objects:

    /~*
    And there is no knyght now lyvynge that ought to yelde God so grete thanke os ye,
    (CMMALORY,655.4474)
    *~/
    /*
    1 IP-SUB-SPE: 10 NP-OB2, 13 NP-OB1
    1 IP-SUB-SPE: 13 NP-OB1, 10 NP-OB2
    */
    (0 (1 IP-SUB-SPE (2 NP-SBJ *T*-2)
                     (4 MD ought)
                     (6 TO to)
                     (8 VB yelde)
                     (10 NP-OB2 (11 NPR God))
                     (13 NP-OB1 (14 ADJP (15 ADVR so) (17 ADJ grete))
                                (19 N thanke)
                                (21 PP (22 P os) (24 NP (25 PRO ye)))))
       (27 ID CMMALORY,655.4474))

And here is an example from the complement file. It has only one object and so fails to match the query.

    /~*
    The kynge lyked and loved this lady wel,
    (CMMALORY,2.12)
    *~/
    (0 (1 IP-MAT (2 NP-SBJ (3 D The) (5 N kynge))
                 (7 VBD (8 VBD lyked) (10 CONJ and) (12 VBD loved))
                 (14 NP-OB1 (15 D this) (17 N lady))
                 (19 ADVP (20 ADV wel))
                 (22 E_S ,))
       (24 ID CMMALORY,2.12))

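Because "print_complement" splits its input into two complementary sets, it is typically run on the output of an earlier search. A sketch of such a two-step workflow follows. The file names and the first query are hypothetical, chosen only for illustration, and how the output of the first search is passed to the second depends on how CorpusSearch is invoked, which is not covered here.

```
// step 1 (hypothetical file ip-obj.q): collect all IPs that contain an object
node: IP*
query: (IP* iDoms NP-OB*)

// step 2 (hypothetical file two-obj.q, run on the output of step 1):
// two-object IPs go to the .out file, the remainder to the .cmp file
print_complement: t
node: IP*
query: (IP* iDoms NP-OB1) AND (IP* iDoms NP-OB2)
```

The two parts are separate command files; they are shown in one block only for compactness.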
The "print_complement" command should only be used on the output of a previous search that has yielded some set that it is sensible to divide into two complementary subsets. Using a query with "print_complement" on an entire corpus will generally result in a .cmp file consisting of a grab bag of tokens without any theoretical significance.
"print_indices" tells CorpusSearch whether to print node address indices in the output. Node address indices should not be confused with prefix indices (though prefix indices use addresses to fix the reference of search arguments). Node address indices start at 0 for $METAROOT and systematically label every node in the tree, including the words (terminal nodes). By convention, addresses for terminals are not printed even when "print_indices" is set to "true".
Here's a piece of output structure with and without node addresses:

    (10 NP-OB1 (11 NPR Morgan)
               (13 NPR le)
               (15 NPR Fay))

    (NP-OB1 (NPR Morgan)
            (NPR le)
            (NPR Fay))

In the first structure, the indices of the terminals (12 "Morgan", 14 "le", 16 "Fay") are omitted by convention.
Output that includes indices can be a bit cumbersome to read, but when troubleshooting a query that isn't having the intended effect, indices in the tree structures are invaluable pointers for understanding what CorpusSearch is matching and where the query needs to be revised. This is also true more generally in connection with complex queries or with corpora with very long sentence tokens.
"remove_nodes" removes subtrees rooted in the same syntactic category as the node boundary when the subtrees are themselves dominated by an instance of that category (see below for details as to what counts as the "same"). The removed subtree is replaced by a label indicating the material that has been removed. If the removed subtree itself matches the query, it will appear in its complete form as a separate output token later in the output file.
CorpusSearch determines the syntactic category of the nodes to be removed according to the following algorithm:
For instance, all of the following node boundaries - IP-MAT-PRN*, IP*, IP - yield IP* as the syntactic category to be removed.
All NP node boundaries yield NP* as the node to be removed. In addition to matching NPs, this will also match the POS tag NPR, which is likely not intended. To avoid this problem, add a hyphen ("NP-*") to the node boundary. A more radical solution is to replace "NPR" by a non-interfering label like "NR" via a revision query or a sed script.
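The first workaround might look like this in a command file. This is a sketch that pairs the hyphenated node boundary with a query used elsewhere in this section. It rests on the assumption that "NP-*" matches only labels carrying a dash tag (NP-SBJ, NP-OB1, and so on), so bare NP nodes would fall outside the node boundary.

```
// hyphenated node boundary, intended to keep the POS tag NPR
// from being treated as an instance of the category to be removed
node: NP-*
remove_nodes: t
query: (NP-OB* iDoms PRO)
```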
In the following command file, "remove_nodes" is set to "true":

    node: IP*
    remove_nodes: true
    query: (NP-OB* iDoms PRO)

Running it on the following input

    ( (IP-MAT (CONJ and)
              (NP-SBJ *con*)
              (VBD counceilled)
              (NP-OB2 (PRO hym))
              (IP-INF (TO to)
                      (VB folowe)
                      (NP-OB1 (PRO hem))
                      (NP-MSR (ADJP (Q no) (ADJR further))))
              (PUNC .))
      (ID CMMALORY-M4,14.416))

yields the following output:

    /~*
    and counceilled hym to folowe hem no further.
    (CMMALORY-M4,14.416)
    *~/
    /*
    1 IP-MAT: 8 NP-DTV, 9 PRO
    11 IP-INF: 16 NP-ACC, 17 PRO
    */
    ( (1 IP-MAT (2 CONJ and)
                (4 NP-SBJ *con*)
                (6 VBD counceilled)
                (8 NP-OB2 (9 PRO hym))
                (11 IP-INF RMV:to_folowe_hem...)
                (25 PUNC .))
      (27 ID CMMALORY-M4,14.416))

    ( (11 IP-INF (12 TO to)
                 (14 VB folowe)
                 (16 NP-OB1 (17 PRO hem))
                 (19 NP-MSR (20 ADJP (21 Q no) (23 ADJR further))))
      (27 ID CMMALORY-M4,14.416))

The structure of the infinitival IP "to folowe hem no further" is removed in the first hit and replaced with the label "RMV:{rmv_string}", where "rmv_string" stands for the concatenation of (up to) the first three words (terminal nodes) of the removed material.
"ur_text" refers to the text in its unannotated form. "ur_text_only"
prints only the ur_text of the tokens matching the query, suppressing
printing of the labeled bracketing and associated information.
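A command file using "ur_text_only" might look like this. The node boundary and query are borrowed from the other examples in this section and are otherwise arbitrary; the resulting output is not reproduced here.

```
// sketch: print only the unannotated text of the matching tokens
node: IP*
ur_text_only: t
query: (NP-OB* iDoms PRO)
```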
Command specification
query
Every command file must contain a command specification, which instructs CorpusSearch as to what action to carry out on the input. The command specification for ordinary queries is:
query: { condition(s) }
The conditions contain one or more search functions in accordance with the syntax of the CorpusSearch query language.
reformat_corpus
A command file consisting of only the following command specification (without a preamble):
reformat_corpus: t
normalizes the format of an input file (adjusting indentation, spacing, and so on). The output has the extension .fmt, which is appended to the full name of the input file, including any existing extensions.
The "reformat_corpus" command can be thought of as applying a vacuous query to a file. The result is an output file formatted according to the same rules as would apply with an ordinary query.
"reformat_corpus" is useful as a first step before using Unix "diff" to compare two versions of the same file or corpus - say, before and after running a corpus revision query.
Comments
Comments may appear anywhere in a command file. They are also allowed anywhere in the input file(s).
By default, single-line comments are introduced by "//", and multi-line comments (block comments) appear between "/*" and "*/". These comment delimiters can be changed for input files, but not for command files.
The command "corpus_line_comment" changes the line comment delimiter. For instance:
corpus_line_comment: \\
The paired commands "corpus_comment_begin" and "corpus_comment_end" change the block comment delimiters. For instance:
corpus_comment_begin: <+
corpus_comment_end: +>
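With the three settings above in the command file, comments in an input file would presumably be written as follows. This is an illustrative sketch of an input file, not of a command file: the comment text is invented, and the token is borrowed from an earlier example with its indices stripped.

```
\\ a line comment in the input file
<+ a block comment
   spanning two lines +>
( (IP-MAT (NP-SBJ (D The) (N kynge))
          (VBD (VBD lyked) (CONJ and) (VBD loved))
          (NP-OB1 (D this) (N lady))
          (ADVP (ADV wel))
          (E_S ,))
  (ID CMMALORY,2.12))
```

Comments in the command file itself would still use "//" and "/* ... */", since the delimiters can be changed only for input files.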