The command file


Introduction

In this section, we present the most basic type of a CorpusSearch command file: the ordinary query file. Other types of command files are similar in form, but have different functions, and it is convenient to discuss them separately:

Ordinary query files must have names ending in the extension .q. By default, their output has the same basename as the query file, but the extension is .out. The basename of the output file (but not the .out extension) can be changed with an "-out" switch on the command line (see Installing CorpusSearch).
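
The naming rule can be summarized in a few lines of Python. This is only a sketch of the rule as stated above, not CorpusSearch code: the function name and the "out_basename" parameter are hypothetical, and the directory in which the output is written is not modeled.

```python
from pathlib import Path


def default_output_name(query_file, out_basename=None):
    """Return the output file name for an ordinary query file.

    out_basename models the effect of the "-out" switch: it replaces the
    basename, but never the .out extension. (Hypothetical helper; the exact
    spelling of the switch on the command line is not modeled here.)
    """
    path = Path(query_file)
    if path.suffix != ".q":
        raise ValueError("ordinary query files must end in .q")
    base = out_basename if out_basename is not None else path.stem
    return base + ".out"


print(default_output_name("pro-obj.q"))            # pro-obj.out
print(default_output_name("pro-obj.q", "slides"))  # slides.out
```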

The command file for an ordinary CorpusSearch query has the following schematic form. Obligatory components are underlined:

Here is an example:

// node specification
node: IP-MAT*

// "ignore" command
ignore_nodes: null

// output specification
print_indices: t

// definition file
define: mideng.def

// command specification - find short examples for slides
query: (IP-MAT* iDomsTotal< 10)

The commands in the preamble must all precede the search specification, but the various components of the preamble can appear in any order. They are read in order. This means that the node specification can reference a label from a definition file, but only if "define:" precedes "node:" in the command file.

Notational variants for Boolean values

CorpusSearch allows the following notational variants for Boolean values:

Value    Notational variants
true     t, true, T, TRUE
false    f, false, F, FALSE
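
For scripts that generate or check command files, the accepted spellings can be captured in a small lookup. The following is a sketch only: "parse_boolean" is a hypothetical helper, and rejecting every spelling not listed above is an assumption, since the manual does not say how CorpusSearch treats other forms.

```python
TRUE_FORMS = {"t", "true", "T", "TRUE"}
FALSE_FORMS = {"f", "false", "F", "FALSE"}


def parse_boolean(value):
    """Map one of the documented notational variants to a Python bool."""
    value = value.strip()
    if value in TRUE_FORMS:
        return True
    if value in FALSE_FORMS:
        return False
    # Assumption: anything outside the documented list is treated as an error.
    raise ValueError("not a documented Boolean variant: " + repr(value))


print(parse_boolean("t"))      # True
print(parse_boolean("FALSE"))  # False
```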

Node specification

The issue of node choice is perhaps less confusing than it used to be given the advent of coding queries, where the effects of node choice are more transparent.

The "node" command specifies the search domain within which CorpusSearch attempts to execute the search specification. Possible values for the node specification are:

The choice of node boundary determines the following: We illustrate this by running the same query on the same simple sentence, but with different node boundaries. Given the following query:
node: IP-MAT*

query: (NP* iDoms PRO*)
CorpusSearch counts 1 hit because even though there are 2 NP* nodes that match the query, they are contained within a single instance of IP-MAT*.
/~*
and he made them grete chere out of mesure
(CMMALORY,2.13)
*~/
/*
1 IP-MAT:  4 NP-SBJ, 5 PRO
1 IP-MAT:  9 NP-OB2, 10 PRO
*/
(0 (1 IP-MAT (2 CONJ and)
	     (4 NP-SBJ (5 PRO he))
	     (7 VBD made)
	     (9 NP-OB2 (10 PRO them))
	     (12 NP-OB1 (13 ADJ grete) (15 N chere))
	     (17 ADVP (18 ADV out)
		      (20 PP (21 P of)
			     (23 NP (24 N mesure)))))
  (26 ID CMMALORY,2.13)) 

/*
SUMMARY:  
source files, hits/tokens/total
  CMMALORY			1/1/1
whole search, hits/tokens/total
				1/1/1
*/
Running the same query with node boundary NP* yields a different result:
node: NP*

query: (NP* iDoms PRO*)
Now CorpusSearch counts 2 hits, because each of the 2 instances of NP* is an instance of the node boundary:
/~*
and he made them grete chere out of mesure
(CMMALORY,2.13)
*~/
/*
4 NP-SBJ:  4 NP-SBJ, 5 PRO
9 NP-OB2:  9 NP-OB2, 10 PRO
*/
(0 (1 IP-MAT (2 CONJ and)
	     (4 NP-SBJ (5 PRO he))
	     (7 VBD made)
	     (9 NP-OB2 (10 PRO them))
	     (12 NP-OB1 (13 ADJ grete) (15 N chere))
	     (17 ADVP (18 ADV out)
	              (20 PP (21 P of)
		             (23 NP (24 N mesure)))))
    (26 ID CMMALORY,2.13))
/*
SUMMARY:  
source files, hits/tokens/total
  CMMALORY		2/1/1
whole search, hits/tokens/total
			2/1/1
*/
The way that CorpusSearch counts hits in this case becomes particularly clear if we set "nodes_only" to "true":
node: NP*

nodes_only: t

query: (NP* iDoms PRO*)
This yields the following output:
/~*
and he made them grete chere out of mesure
(CMMALORY,2.13)
*~/
/*
4 NP-SBJ:  4 NP-SBJ, 5 PRO
9 NP-OB2:  9 NP-OB2, 10 PRO
*/
( (4 NP-SBJ (5 PRO he))
  (26 ID CMMALORY,2.13)) 

( (9 NP-OB2 (10 PRO them))
  (26 ID CMMALORY,2.13)) 
/*
SUMMARY:  
source files, hits/tokens/total
  CMMALORY		2/1/1
whole search, hits/tokens/total
			2/1/1
*/
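
The counting behavior shown in these runs can be modeled in Python. The sketch below is a simplified illustration, not the CorpusSearch implementation: it assumes one hit per node-boundary instance that contains at least one match, supports only iDoms, and uses shell-style wildcard matching for labels. All function names are hypothetical.

```python
import re
from fnmatch import fnmatchcase

SENTENCE = (
    "(IP-MAT (CONJ and) (NP-SBJ (PRO he)) (VBD made) (NP-OB2 (PRO them)) "
    "(NP-OB1 (ADJ grete) (N chere)) "
    "(ADVP (ADV out) (PP (P of) (NP (N mesure)))))"
)


def parse(text):
    """Parse a bracketed tree into (label, children) tuples; words stay strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # consume ")"
        return (label, children)

    return node()


def subtree(tree):
    """Yield the node itself and every labeled node below it."""
    yield tree
    for child in tree[1]:
        if isinstance(child, tuple):
            yield from subtree(child)


def idoms(tree, parent_pat, child_pat):
    """True if some node matching parent_pat immediately dominates child_pat."""
    return any(
        fnmatchcase(label, parent_pat)
        and any(isinstance(c, tuple) and fnmatchcase(c[0], child_pat) for c in kids)
        for label, kids in subtree(tree)
    )


def count_hits(tree, boundary, parent_pat, child_pat):
    """Count one hit per boundary node that contains at least one match."""
    return sum(
        1
        for n in subtree(tree)
        if fnmatchcase(n[0], boundary) and idoms(n, parent_pat, child_pat)
    )


tree = parse(SENTENCE)
print(count_hits(tree, "IP-MAT*", "NP*", "PRO*"))  # 1
print(count_hits(tree, "NP*", "NP*", "PRO*"))      # 2
```

With the boundary IP-MAT*, both pronoun NPs fall inside the same boundary instance and are counted once; with the boundary NP*, each pronoun NP is its own boundary instance.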

Ignoring nodes in searches

Certain node labels in the corpus indicate metalinguistic or other information that should generally not be considered in searches. For instance, intervening editorial comments, line breaks, page numbers, and the like should not affect searches involving immediate precedence. It is also often (though not always) desirable to ignore punctuation, traces, other empty categories, and clitics. Some of these items should also generally be ignored when counting the number of words in a constituent. In deciding whether a given structure in the corpus matches a query, CorpusSearch ignores nodes with labels on the "ignore_nodes" list. In connection with domsWords and its variants, CorpusSearch uses the "ignore_words" list instead. Labels can be added to these lists using "add_to_ignore" and "add_to_ignore_words", respectively, or the lists can be revised more extensively (including the option of not ignoring any nodes or words), as discussed under the commands at issue. Any of the commands mentioned in this paragraph can be specified in the preference file or in individual command files, where, as usual, they override any preference file specifications.

The following table is intended as a convenient summary of the commands in this section. For more details, see the entries for the individual commands.

The spaces setting off the alternatives in the following table are intended to improve legibility; they need to be omitted in an actual query (or they will abort the search).

Command              In conjunction with ...   Default value
ignore_words         domsWords and variants    CODE | COMMENT | E_S | ID | LB | RMV:* | ' | \" | , | \. | / | 0 | \**
ignore_nodes         all other commands        same as above, but not the last two items
add_to_ignore_words  ignore_words              no default
add_to_ignore        ignore_nodes              no default

ignore_nodes: {list_of_labels}

The spaces setting off the alternatives in the following list are intended to improve legibility; they need to be omitted in an actual query (or they will abort the search).

Default: COMMENT | CODE | E_S | ID | LB | RMV:* | ' | \" | , | \. | /

In some corpora, punctuation is tagged as, say, PUNC or PON or PONFP. Those tags are not on the default "ignore_nodes" list, and so they must be added to the list by the user (best in the preference file).

"ignore_nodes" tells CorpusSearch to ignore the specified nodes in connection with all search functions except for domsWords and its variants. (For the latter, CorpusSearch uses the list specified by ignore_words.) For instance, running this query:

(NP* iPrecedes PP*)
returns the following sentence despite the CODE node intervening between NP-1 and PP. This is because the label "CODE" is on the default "ignore_nodes" list.
/*
1 IP-MAT-SPE: 5 NP-1, 9 PP
*/
/~*
There ar two bretheren beyond the see,
(CMMALORY,15.439)
*~/

(0 (1 IP-MAT-SPE (2 NP-SBJ-1 (3 EX There))
                 (5 BEP ar)
                 (7 NP-1 (8 NUM two) (10 NS bretheren))
                 (12 CODE <P_15>)
                 (14 PP (15 P beyond)
                        (17 NP (18 D the) (20 N see)))
                 (22 E_S ,))
   (24 ID CMMALORY,15.439))
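
The effect of the ignore list on immediate precedence can be sketched in Python. This is a deliberately simplified model, not CorpusSearch's algorithm: it looks only at the labels of the daughters of one node, uses shell-style wildcards, and includes only a subset of the default list (the escaped punctuation entries are left out). The function names are hypothetical.

```python
from fnmatch import fnmatchcase

# Subset of the default ignore_nodes list; escaped punctuation entries omitted.
IGNORE_NODES = ["COMMENT", "CODE", "E_S", "ID", "LB", "RMV:*"]


def is_ignored(label, ignore_list):
    """True if the label matches any pattern on the ignore list."""
    return any(fnmatchcase(label, pat) for pat in ignore_list)


def i_precedes(siblings, first_pat, second_pat, ignore_list):
    """True if a node matching first_pat is followed by a node matching
    second_pat with nothing but ignored nodes in between."""
    visible = [label for label in siblings if not is_ignored(label, ignore_list)]
    return any(
        fnmatchcase(a, first_pat) and fnmatchcase(b, second_pat)
        for a, b in zip(visible, visible[1:])
    )


# Daughters of IP-MAT-SPE in the token above.
siblings = ["NP-SBJ-1", "BEP", "NP-1", "CODE", "PP", "E_S"]
print(i_precedes(siblings, "NP*", "PP*", IGNORE_NODES))  # True
print(i_precedes(siblings, "NP*", "PP*", []))            # False
```

The empty list in the second call plays the role of an ignore list on which nothing is ignored, so the CODE node blocks the match.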

You can replace the default list with a list of your own choosing, and you can also tell CorpusSearch not to ignore any nodes with this command:

ignore_nodes: null

If your query makes reference to an item on your "ignore_nodes" list, CorpusSearch will not issue a warning. The only indication you will get of the incoherent character of your query is the puzzling absence of hits in the output, as illustrated in what follows.

Here is an example of an incoherent query that makes reference to a node on the ignore_nodes list:

node:   IP*

ignore_nodes: NP*

query:  (NP* iDoms PRO*)
Running the query on the following input:
(0 (1 IP-MAT (2 CONJ and)
	     (4 NP-SBJ (5 PRO he))
	     (7 VBD made)
	     (9 NP-OB2 (10 PRO them))
	     (12 NP-OB1 (13 ADJ grete) (15 N chere))
	     (17 ADVP (18 ADV out)
	              (20 PP (21 P of)
		             (23 NP (24 N mesure)))))
    (26 ID CMMALORY,2.13))
yields an output file without any hits.
/*
node:  IP*
query:  (NP* iDoms PRO*) 
*/
/*
HEADER:
source file:  CMMALORY
*/
/*
FOOTER
  source file, hits/tokens/total
  CMMALORY		0/0/1
*/
/*
SUMMARY:  
source files, hits/tokens/total
  CMMALORY		0/0/1
whole search, hits/tokens/total
			0/0/1
*/

This is because CorpusSearch is literally following your instruction to ignore all NP* nodes, including the NP* nodes with node addresses 4 and 9, which would otherwise match the query.
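
Since CorpusSearch itself gives no warning, a command file can be checked for this mistake before it is run. The following Python sketch shows one conservative way to do so, under the assumption that the labels used in the query and the entries of the ignore list have already been extracted as strings. The helper name is hypothetical, and the check only catches query labels that are literally covered by an ignore pattern, not every possible overlap between two wildcard patterns.

```python
from fnmatch import fnmatchcase


def ignored_query_labels(query_labels, ignore_list):
    """Return the query labels that are covered by an ignore_nodes pattern."""
    return [
        label
        for label in query_labels
        if any(fnmatchcase(label, pat) for pat in ignore_list)
    ]


# Labels from the query (NP* iDoms PRO*), checked against ignore_nodes: NP*
print(ignored_query_labels(["NP*", "PRO*"], ["NP*"]))  # ['NP*']

# The same labels checked against a subset of the default list
default_subset = ["COMMENT", "CODE", "E_S", "ID", "LB", "RMV:*"]
print(ignored_query_labels(["NP*", "PRO*"], default_subset))  # []
```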

add_to_ignore: {list_of_labels}

Default: n/a

"add_to_ignore" adds nodes to the "ignore_nodes" list (whether the default list or your own). For instance:

add_to_ignore: INTJ*              ← "add_to_ignore", not "add_to_ignore_nodes"

ignore_words: {list_of_labels}

The spaces setting off the alternatives in the following list are intended to improve legibility; they need to be omitted in an actual query (or they will abort the search).

Default: COMMENT | CODE | E_S | ID | LB | RMV:* | ' | \" | , | \. | / | 0 | \**

"ignore_words" tells CorpusSearch what nodes to ignore when counting words in connection with domsWords and its variants. (For all other search functions, CorpusSearch uses the list specified in ignore_nodes.) The default list for "ignore_words" is the same list as for "ignore_nodes", except that it also includes "0" and traces and empty categories (expressions beginning with asterisk). Nodes on the "ignore_words" list can be terminals or preterminals (in other words, either elements of the text itself or their associated POS tag).

As with the ignore_nodes list, you can edit the "ignore_words" list, replacing the default list with a list of your own or with the value "null" (if you want to count all terminal nodes in the text, without ignoring any).

add_to_ignore_words: {list_of_labels}

Default: n/a

"add_to_ignore_words" add nodes to the "ignore_words" list (whether the default list or your own). For instance:

add_to_ignore_words: [oO]h

Corpus encoding

Default: corpus_encoding: US-ASCII

CorpusSearch also supports ISO-LATIN-1 (also known as ISO-8859-1), UTF-8, and UTF-16.

Output format commands

Output format commands control how search results are printed to the output file. They do not influence the results of a current search. However, because they can cause the output of the current search to include certain nodes or not, they may influence the contents of searches that depend on the output of the current one.

The following table is intended as a convenient summary of the commands in this section. For more details, see the entries for the individual commands.

Command                       Default value
begin_remark ... end_remark   n/a
nodes_only                    false
print_complement              false
print_indices                 false
remove_nodes                  false
ur_text_only                  false

begin_remark: {remark string} end_remark

Default: n/a

This two-part command tells CorpusSearch to print a remark in the preface to the output; for instance, a note about the purpose of a particular search. The following command:

begin_remark: 
pronoun objects
end_remark
would give rise to an output preface as follows:
/*
    PREFACE:  regular output file.
    CorpusSearch copyright Beth Randall 1999.
    Date:  Wed Nov 03 19:12:03 EST 1999

    command file:       pro-obj.q
    input file:         ip-mat-2vb.out
    output file:        pro-obj.out

    remark:
        pronoun objects

    node:   IP*
    query:  (NP-OB* iDoms PRO)
*/

nodes_only: (Boolean true or false)

Default: false

When the output of some initial search that sets nodes_only to "true" becomes the input to subsequent searches, $ROOT in the subsequent searches refers to the root of the output trees, whose label is determined by (though not necessarily identical with) the node boundary of the initial search.

If "nodes_only" is set to true, CorpusSearch prints out only the nodes that match the query. If it is set to false, CorpusSearch prints out the entire sentence token containing the structure matching the query. For instance, with the default value of "false", the following query:

node: ADVP*

nodes_only: f

query:  (ADVP* iDoms ADVP*)
gives the following output:
/~*
certayn and wit-owte doute, Ihon is is name.
(CMAELR3,45.589)
*~/

/*
2 ADVP: 3 ADVP
*/

(0 (1 IP-MAT (2 ADVP (3 ADVP (4 ADV certayn))
                     (6 CONJP (7 CONJ and)
                              (9 PP (10 P wit-owte)
                                    (12 NP (13 N doute)))))
             (15 , ,)
             (16 NP-OB1 (17 NPR Ihon))
             (19 BEP is)
             (21 NP-SBJ (22 PRO$ is) (24 N name))
             (26 E_S .))
  (28 ID CMAELR3,45.589))
The output for the same query, but with "nodes_only" set to "true", looks like this:
/~*
certayn and wit-owte doute, Ihon is is name.
(CMAELR3,45.574)
*~/

/*
2 ADVP: 3 ADVP
*/

(0 (2 ADVP (3 ADVP (4 ADV certayn))
           (6 CONJP (7 CONJ and)
               	    (9 PP (10 P wit-owte)
             	          (12 NP (13 N doute))))
   (15 , ,))
   (28 ID CMAELR3,45.574))

print_complement: (Boolean true or false)

Default: false

The effect of "print_complement" can now be achieved (more easily and transparently) with coding queries, notably with the ELSE condition.

CorpusSearch ordinarily outputs only nodes or tokens that match the search specification. Setting "print_complement" to "true" causes CorpusSearch to print the matching tokens to the regular output file (with the extension .out) and also all the tokens that don't match to a separate complement file (with the extension .cmp). In other words, "print_complement" applies logical NOT to an entire query.

For instance, the following query could be used on an input file containing all IPs with objects in order to divide those IPs into two complementary sets: those with two objects (in the .out file) and the remainder, with at most one object (in the .cmp file).

print_complement: t

node: IP*

query:    (IP* iDoms NP-OB1)
      AND (IP* iDoms NP-OB2)
Here is an example from the regular output file. It matches the query and has two objects:
/~*
And there is no knyght now lyvynge that ought to yelde God so grete thanke os ye,
(CMMALORY,655.4474)
*~/
/*
1 IP-SUB-SPE: 10 NP-OB2, 13 NP-OB1
1 IP-SUB-SPE: 13 NP-OB1, 10 NP-OB2
*/

(0 (1 IP-SUB-SPE (2 NP-SBJ *T*-2)
                 (4 MD ought)
                 (6 TO to)
                 (8 VB yelde)
                 (10 NP-OB2 (11 NPR God))
                 (13 NP-OB1 (14 ADJP (15 ADVR so) (17 ADJ grete))
                            (19 N thanke)
                            (21 PP (22 P os)
                                   (24 NP (25 PRO ye)))))
   (27 ID CMMALORY,655.4474))
And here is an example from the complement file. It has only one object and so fails to match the query.
/~*
The kynge lyked and loved this lady wel,
(CMMALORY,2.12)
*~/

(0 (1 IP-MAT (2 NP-SBJ (3 D The) (5 N kynge))
             (7 VBD (8 VBD lyked) (10 CONJ and) (12 VBD loved))
             (14 NP-OB1 (15 D this) (17 N lady))
             (19 ADVP (20 ADV wel))
             (22 E_S ,))
   (24 ID CMMALORY,2.12))

The "print_complement" command should only be used on the output of a previous search that has yielded some set that it is sensible to divide into two complementary subsets. Using a query with "print_complement" on an entire corpus will generally result in a .cmp file consisting of a grab bag of tokens without any theoretical significance.
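
Conceptually, "print_complement" partitions the input tokens into two disjoint sets. The Python sketch below illustrates that partition under simplifying assumptions: each token is reduced to a hypothetical dictionary listing the labels of the daughters of its IP, and the query is replaced by an ordinary predicate function. None of these names belong to CorpusSearch.

```python
def partition(tokens, matches):
    """Split tokens into (matching, complement), preserving input order."""
    out_tokens, cmp_tokens = [], []
    for token in tokens:
        (out_tokens if matches(token) else cmp_tokens).append(token)
    return out_tokens, cmp_tokens


def has_both_objects(token):
    """Stand-in for: (IP* iDoms NP-OB1) AND (IP* iDoms NP-OB2)."""
    return "NP-OB1" in token["daughters"] and "NP-OB2" in token["daughters"]


tokens = [
    {"id": "CMMALORY,655.4474",
     "daughters": ["NP-SBJ", "MD", "TO", "VB", "NP-OB2", "NP-OB1"]},
    {"id": "CMMALORY,2.12",
     "daughters": ["NP-SBJ", "VBD", "NP-OB1", "ADVP", "E_S"]},
]

out_tokens, cmp_tokens = partition(tokens, has_both_objects)
print([t["id"] for t in out_tokens])  # ['CMMALORY,655.4474']
print([t["id"] for t in cmp_tokens])  # ['CMMALORY,2.12']
```

Every input token lands in exactly one of the two lists, which is why the complement is only meaningful when the input is already a sensible set to divide.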

print_indices: (Boolean true or false)

Default: false

"print_indices" tells CorpusSearch whether to print node address indices in the output. Node address indices should not be confused with prefix indices (though prefix indices use addresses to fix the reference of search arguments). Node address indices start at 0 for $METAROOT and systematically label every node in the tree, including the word (terminal nodes). By convention, addresses for terminals are not printed even when "print_indices" is set to "true".

Here's a piece of output structure with and without node addresses:

(10 NP-OB1 (11 NPR Morgan)     ← indices 12 "Morgan", 14 "le", 16 "Fay" omitted by convention
           (13 NPR le)
           (15 NPR Fay))

(NP-OB1 (NPR Morgan)
        (NPR le)
        (NPR Fay))
Output that includes node address indices can be a bit cumbersome to read, but when troubleshooting a query that isn't having the intended effect, indices in the tree structures are invaluable pointers for understanding what CorpusSearch is matching and where the query needs to be revised. This is also true more generally in connection with complex queries or with corpora with very long sentence tokens.
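
The numbering scheme is consistent with a pre-order walk of the tree in which every node, including each word, receives the next free address. The Python sketch below reproduces the addresses printed for the Malory token used earlier in this section. It is an illustration inferred from the printed examples, not CorpusSearch code, and the function names are hypothetical.

```python
import re

TOKEN = (
    "( (IP-MAT (CONJ and) (NP-SBJ (PRO he)) (VBD made) (NP-OB2 (PRO them)) "
    "(NP-OB1 (ADJ grete) (N chere)) "
    "(ADVP (ADV out) (PP (P of) (NP (N mesure))))) "
    "(ID CMMALORY,2.13))"
)


def parse(text):
    """Parse a bracketed tree into (label, children); the wrapper has label ''."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                                  # consume "("
        label = ""
        if tokens[pos] not in ("(", ")"):         # unlabeled wrapper has no label
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                                  # consume ")"
        return (label, children)

    return node()


def addresses(tree):
    """Return (address, label) pairs for labeled nodes in pre-order.

    Words use up an address but are not listed, mirroring the printing
    convention described above.
    """
    result = []
    counter = 0

    def visit(node):
        nonlocal counter
        if isinstance(node, tuple):
            result.append((counter, node[0] or "$METAROOT"))
            counter += 1
            for child in node[1]:
                visit(child)
        else:
            counter += 1

    visit(tree)
    return result


for address, label in addresses(parse(TOKEN)):
    print(address, label)
```

The first lines printed are "0 $METAROOT", "1 IP-MAT", "2 CONJ", and "4 NP-SBJ"; address 3 is skipped because it belongs to the word "and".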

remove_nodes: (Boolean true or false)

Default: false

"remove_nodes" removes subtrees rooted in the same syntactic category as the node boundary when the subtrees are themselves dominated by an instance of that category (see below for details as to what counts as the "same"). The removed subtree is replaced by a label indicating the material that has been removed. If the removed subtree itself matches the query, it will appear in its complete form as a separate output token later in the output file.

CorpusSearch determines the syntactic category of the nodes to be removed according to the following algorithm:

For instance, all of the following node boundaries - IP-MAT-PRN*, IP*, IP - yield IP* as the syntactic category to be removed.

All NP node boundaries yield NP* as the node to be removed. In addition to matching NPs, this will also match the POS tag NPR, which is likely not intended. To avoid this problem, add a hyphen ("NP-*") to the node boundary. A more radical solution is to replace "NPR" by a non-interfering label like "NR" via a revision query or a sed script.

In the following command file, "remove_nodes" is set to "true":

node: IP*

remove_nodes: true

query: (NP-OB* iDoms PRO)

Running it on the following input

( (IP-MAT (CONJ and)
          (NP-SBJ *con*)
          (VBD counceilled)
          (NP-OB2 (PRO hym))
          (IP-INF (TO to)
                  (VB folowe)
                  (NP-OB1 (PRO hem))
                  (NP-MSR (ADJP (Q no) (ADJR further))))
          (PUNC .))
  (ID CMMALORY-M4,14.416))
yields the following output:
/~*
and counceilled hym to folowe hem no further.
(CMMALORY-M4,14.416)
*~/
/*
1 IP-MAT:  8 NP-OB2, 9 PRO
11 IP-INF:  16 NP-OB1, 17 PRO
*/
( (1 IP-MAT (2 CONJ and)
            (4 NP-SBJ *con*)
            (6 VBD counceilled)
            (8 NP-OB2 (9 PRO hym))
	    (11 IP-INF RMV:to_folowe_hem...)
            (25 PUNC .))
  (27 ID CMMALORY-M4,14.416))

( (11 IP-INF (12 TO to)
             (14 VB folowe)
             (16 NP-OB1 (17 PRO hem))
             (19 NP-MSR (20 ADJP (21 Q no) (23 ADJR further))))
  (27 ID CMMALORY-M4,14.416))
The structure of the infinitival IP "to folowe hem no further" is removed in the first hit and replaced with the label "RMV:{rmv_string}", where "rmv_string" stands for the concatenation of (up to) the first three words (terminal nodes) of the removed material.
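
The construction of the replacement label can be sketched as follows. The underscore separator and the trailing "..." are inferred from the single example above ("RMV:to_folowe_hem..."); whether the dots also appear when the removed material has three words or fewer is an assumption here. The function name is hypothetical.

```python
def rmv_label(words, max_words=3):
    """Build an RMV label from (up to) the first three removed words.

    Assumption: the words are joined with "_" and the label always ends in
    "...", as in the one example shown in the manual.
    """
    return "RMV:" + "_".join(words[:max_words]) + "..."


print(rmv_label(["to", "folowe", "hem", "no", "further"]))  # RMV:to_folowe_hem...
```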

ur_text_only: (Boolean true or false)

Default: false

"ur_text" refers to the text in its unannotated form. "ur_text_only" prints only the ur_text of the tokens matching the query, suppressing printing of the labeled bracketing and associated information.

Command specification

query

Every command file must contain a command specification, which instructs CorpusSearch as to what action to carry out on the input. The command specification for ordinary queries is:
query: { condition(s) }
The conditions contain one or more search functions in accordance with the syntax of the CorpusSearch query language.

reformat_corpus

A command file consisting of only the following command specification (without a preamble):
reformat_corpus: t
normalizes the format of an input file (adjusting indentation, spacing, and so on). The output has the extension .fmt, which is appended to the full name of the input file, including any existing extensions.

The "reformat_corpus" command can be thought of as applying a vacuous query to a file. The result is an output file formatted according to the same rules as would apply with an ordinary query.

"reformat_corpus" is useful for following up with Unix "diff" when comparing two versions of the same file or corpus - say, before and after running a corpus revision query.

Comments

Comments may appear anywhere in a command file. They are also allowed anywhere in the input file(s).

By default, single-line comments are introduced by "//", and multi-line comments (block comments) appear between "/*" and "*/". These comment delimiters can be changed for input files, but not for command files.

The command "corpus_line_comment" changes the line comment delimiter. For instance:

corpus_line_comment: \\
The paired commands "corpus_comment_begin" and "corpus_comment_end" change the block comment delimiters. For instance:
corpus_comment_begin: <+
corpus_comment_end: +>