Understanding CorpusSearch output

General form of the output
Sample output file
Using nodes_only and remove_nodes

General form of the output

By default, the name of an output file is the same as that of the command file, but with the extension .out (for instance, the command file inversion.q yields the output file inversion.out).

Output files have the following general form:

Each output file contains a preface and a summary.
For each input file reported in the output file, there is a header and a footer.
Each token in the input file that matches the query is associated with the following information:
- the original text (or ur-text),
- a result block specifying the nodes that match the query, and
- the parsed structure itself.
Here and throughout, "token" refers to a parsed expression surrounded by unlabeled parens (also known as wrapper parens).

Sample output file

In what follows, we walk through a sample output file resulting from a query that searches for inverted pronoun subjects - that is, pronoun subjects that follow the tensed verb. (The query is somewhat simplified compared with what would be required for an actual search on Middle English files.)

Preface

/*
    PREFACE:
    CorpusSearch copyright Beth Randall 2004.
    Date:  Sun Apr 30 07:05:51 EDT 2004

    command file:       inversion.q
    output file:        inversion.out

    remark:   This query searches for inverted pronoun subjects.

    node:   IP*

    print_indices: true

    query:      (ADJP*|ADVP*|NP*|PP* iPrecedes BE[DP]|DO[DP]|HV[DP]|MD|VB[DP])
            AND (ADJP*|ADVP*|NP*|PP* iDoms !\*T*)
            AND (BE[DP]|DO[DP]|HV[DP]|MD|VB[DP] iPrecedes NP-SBJ*)
            AND (NP-SBJ* iDoms PRO)

*/

The preface begins with a copyright declaration and the date and time of the search.

It then lists the names of the command file and output file.

The optional remark serves as a reminder of the purpose of the query.

The preface then prints a copy of the query. Any labels from a definition file are expanded according to their definitions at runtime (this prevents confusion resulting from any subsequent changes to the definition file). For instance, lines 1 and 3 of the above query might have looked as follows in the original query file:

    (ADJP*|ADVP*|NP*|PP* iPrecedes finiteVerb)
AND ...
AND (finiteVerb iPrecedes NP-SBJ*)
AND ...

(In that case, the preface would also contain a reference to the relevant definition file.)

Header

The header lists each source file with its name as it appears in the corpus directory. If the source file is itself the output of a CorpusSearch run, the name is taken instead from the ID nodes of that file (for the sample file, this would be CMCAPCHR). In the absence of an ID node, the filename is reported as NULL. For more information on searching output files, see Using nodes_only and remove_nodes.

/*
    HEADER:
    source file:  cmcapchr.m4.psd
*/

Output token

Each output token is divided into three parts:

- the original text (also known as the ur-text),
- the result block, which specifies the nodes that match the query, and
- the token in its parsed form.

Here is a short example:

/~*
His fadir scheep kepte he ful mekly;
(CMCAPCHR,32.13)
*~/

/*
    1 IP-MAT: 2 NP-OB1, 10 VBD kepte, 8 NS scheep, 12 NP-SBJ, 13 PRO he
*/

(0  (1 IP-MAT (2 NP-OB1 (3 NP-POS (4 PRO$ His) (6 N$ fadir))
			(8 NS scheep))
	      (10 VBD kepte)
	      (12 NP-SBJ (13 PRO he))
	      (15 ADVP (16 ADVR ful) (18 ADV mekly))
	      (20 E_S ;))
    (22 ID CMCAPCHR,32.13))

The original text is surrounded by special markers, "/~*" and "*~/". If a subsequent search is run on the output file, CorpusSearch finds and records this block of data. In this way, the original text of the entire token is conserved across multiple searches (with possibly different boundary nodes).
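Because the ur-text markers differ from the result-block markers, the ur-text can be recovered from an output file with a simple pattern match. The following Python sketch is illustrative only and is not part of CorpusSearch; the function name extract_ur_texts is hypothetical, and the sketch assumes that the markers appear exactly as in the sample above.

```python
import re

# Assumption: ur-text blocks are delimited exactly by "/~*" and "*~/".
# Result blocks use "/*" and "*/" and are therefore not matched.
UR_TEXT = re.compile(r"/~\*(.*?)\*~/", re.DOTALL)

def extract_ur_texts(output_text):
    """Return the stripped contents of every ur-text block (hypothetical helper)."""
    return [block.strip() for block in UR_TEXT.findall(output_text)]

sample = """/~*
His fadir scheep kepte he ful mekly;
(CMCAPCHR,32.13)
*~/

/*
    (result block omitted)
*/
"""

for ur_text in extract_ur_texts(sample):
    print(ur_text)
```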

The result block is surrounded by special markers of its own, "/*" and "*/". The first item in the result block is the boundary node (in this case, 1 IP-MAT), which matches the value of the node specification in the command file. The boundary node is separated by a colon from the rest of the list, which specifies the nodes in the structure that match the nodes specified in the body of the query.
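The layout of a result line (boundary node, colon, comma-separated list of matching nodes) can be illustrated with a short Python sketch. The sketch is not part of CorpusSearch: the function name parse_result_line and the sample line are made up for illustration, and it assumes that each entry begins with an index followed by a label.

```python
def parse_result_line(line):
    """Split one result line into (boundary_node, matching_nodes).

    Hypothetical helper. Each node is returned as an (index, label) pair;
    anything after the label in an entry is ignored.
    """
    def node(entry):
        parts = entry.split()
        return int(parts[0]), parts[1]

    boundary, _, rest = line.partition(":")
    matches = [node(entry) for entry in rest.split(",") if entry.strip()]
    return node(boundary), matches

# Illustrative line, following the general shape described above.
boundary, matches = parse_result_line("1 IP-MAT:  2 NP-SBJ, 5 MD, 7 HV")
print(boundary)
print(matches)
```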

The numbers preceding the nodes in the result block are the indices on the nodes of the labeled bracketing. These indices are intended to clarify how the token comes to match the query. They are always reported in the result block. In the labeled bracketing, they are included only when the command file contains the line "print_indices: true".

In cases where more than one node matches a single search-function argument, the result block generally reports the first matching node. But in the case of negated arguments, the result block reports the last matching node (since the logic of negation requires the query to check every possible match). For instance, the argument "!\*T*" in the second line of the sample query matches both "3 NP-POS" and "8 NS" in the sample input, and CorpusSearch reports the second node.

The parsed version of the output token is formatted to show the structure of the tree. Sisters have the same indentation (for instance, "2 NP-OB1" and "10 VBD"). Daughters are indented further than their mothers. If a node dominates only leaves, they are printed on the same line to save space.
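Since the indentation is purely cosmetic, the structure of a token can be recovered from the parentheses alone. The following Python sketch, which is not part of CorpusSearch, reads an indexed labeled bracketing into nested tuples. The function name parse_token is hypothetical, and the sketch assumes that every node carries an index (that is, that indices were printed) and that the wrapper has an index but no label.

```python
import re

def parse_token(text):
    """Parse an indexed bracketing into (index, label, children) tuples.

    Hypothetical helper. Leaves appear as plain strings in the children
    list; the wrapper node has the label None.
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip the opening paren
        index = int(tokens[pos])
        pos += 1
        label = None
        if tokens[pos] != "(":        # the wrapper has no label
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip the closing paren
        return index, label, children

    return node()

# Abbreviated version of the sample token, for illustration only.
tree = parse_token("(0 (1 IP-MAT (12 NP-SBJ (13 PRO he))) (22 ID CMCAPCHR,32.13))")
print(tree)
```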

Footer

The footer gives the statistics for hits, tokens, and total as found in a particular input file.

/*
FOOTER
  source file, hits/tokens/total
cmcapchr.m4.psd     220/220/4175
*/

This same information appears again as a line in the summary.

Hits/tokens/total

CorpusSearch reports the following statistics:

- hits - number of distinct boundary nodes containing the searched-for structure
- tokens - number of tokens containing hits
- total - total number of independent parsed expressions searched

Tokens can contain several distinct boundary nodes. For instance, a root IP-MAT might contain several NP nodes, each of which counts as a boundary node in a query with the value "NP*" for "node". If more than one of these NPs matches the query, the single token in question contains more than one hit. It is thus possible (and indeed common) for the number of hits to exceed the number of tokens. The number of tokens, however, can never exceed the number of hits, since every token that is counted contains at least one hit.
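The relationship among the three statistics can be made concrete with a small Python sketch. It is illustrative only and does not reflect how CorpusSearch is implemented; the function name tally and the input representation (one list of matching boundary nodes per token searched) are assumptions.

```python
def tally(matches_per_token):
    """Return (hits, tokens, total) for a list of per-token match lists.

    Hypothetical helper: each inner list holds the boundary nodes of one
    token that match the query (empty if the token has no hit).
    """
    hits = sum(len(nodes) for nodes in matches_per_token)
    tokens = sum(1 for nodes in matches_per_token if nodes)
    total = len(matches_per_token)
    return hits, tokens, total

# Three tokens searched: two matching NPs, no match, one matching NP.
hits, tokens, total = tally([["NP-1", "NP-2"], [], ["NP-1"]])
print(f"{hits}/{tokens}/{total}")

# Every counted token contributes at least one hit.
assert tokens <= hits
```

In this made-up case, the hits outnumber the tokens because the first token contains two matching boundary nodes.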

Summary

The summary repeats the information in the footers for each individual input file, but reported together in one place - at the end of the output file. The following sample summary was produced by a search on all files in the Middle English corpus (PPCME2) from the fourth time period (1420-1500), indicated by the "m4" in the filename.

/*
SUMMARY:  
source files, hits/tokens/total:
  cmaelr4.m4.psd    46/46/766
  cmcapchr.m4.psd   220/220/4175
  cmcapser.m4.psd   12/12/91
  cmedmund.m4.psd   2/2/300
  cmfitzja.m4.psd   14/14/228
  cmgregor.m4.psd   14/14/2631
  cminnoce.m4.psd   6/6/208
  cmkempe.m4.psd    203/202/3851
  cmmalory.m4.psd   214/213/4995
  cmreynar.m4.psd   36/36/547
  cmreynes.m4.psd   0/0/245
  cmsiege.m4.psd    6/6/731
whole search, hits/tokens/total   773/771/18772
*/
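For further processing, the per-file lines of a summary can be read back in and totaled. The following Python sketch is illustrative only and is not part of CorpusSearch; the function name parse_summary is hypothetical, and the sketch assumes that each per-file line consists of a filename without spaces followed by the three counts separated by slashes. The input is a three-line excerpt of the summary above.

```python
import re

# Assumption: a per-file line is "<filename> <hits>/<tokens>/<total>".
# Heading lines and the "whole search" line do not fit this pattern.
FILE_LINE = re.compile(r"^[ \t]*(\S+)[ \t]+(\d+)/(\d+)/(\d+)[ \t]*$", re.MULTILINE)

def parse_summary(text):
    """Return {filename: (hits, tokens, total)} (hypothetical helper)."""
    return {
        name: (int(hits), int(tokens), int(total))
        for name, hits, tokens, total in FILE_LINE.findall(text)
    }

excerpt = """source files, hits/tokens/total:
  cmcapser.m4.psd   12/12/91
  cmedmund.m4.psd   2/2/300
  cmkempe.m4.psd    203/202/3851
"""

counts = parse_summary(excerpt)
# Sum each of the three columns over the files in the excerpt.
sums = tuple(sum(column) for column in zip(*counts.values()))
print(counts)
print(sums)
```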

Using nodes_only and remove_nodes

The following section is largely obsolete given the availability of coding queries.

Consider the following query (2vb.q), which searches for matrix clauses containing a subject that precedes at least two verbs (including modals). We begin with a version of this query that sets "nodes_only" to "f"; "remove_nodes" is then necessarily set to "f" as well. Later on, we discuss the effect of setting these commands to "t".

nodes_only: f

// the following command is commented out since it is redundant
// remove_nodes: f

node:  IP-MAT*

query:     (IP-MAT* iDoms NP-SBJ*)
       AND (NP-SBJ* hasSister BE[DP]|DO[DP]|HV[DP]|VB[DP]|MD)
       AND (NP-SBJ* precedes BE[DP]|DO[DP]|HV[DP]|VB[DP]|MD)
       AND (NP-SBJ* hasSister BE|BEN|DO|D[AO]N|HV|H[AV]N|VB|VA[GN])
       AND (NP-SBJ* precedes BE|BEN|DO|D[AO]N|HV|H[AV]N|VB|VA[GN])

Here is a sample sentence that matches the query.

/~*
They would have told you if you had visited them.
*~/
/*
1 IP-MAT:  1 IP-MAT, 2 NP-SBJ, 5 MD, 7 HV
*/
(0  (1 IP-MAT (2 NP-SBJ (3 PRO They))
	      (5 MD would)
	      (7 HV have)
	      (9 VBN told)
	      (11 NP-OB2 (12 PRO you))
	      (14 PP (15 P if)
		     (17 CP-ADV (18 C 0)
				(20 IP-SUB (21 NP-SBJ (22 PRO you))
					   (24 HVD had)
					   (26 VBN visited)
					   (28 NP-OB1 (29 PRO them)))))
	      (31 . .)))

Let's now run the following query to find pronominal objects (pro-obj.q) on the above sentence.

node: NP-OB*

nodes_only: f

query: (NP-OB* iDoms PRO)

This gives the following output:

/~*
They would have told you if you had visited them.
*~/
/*
11 NP-OB2:  11 NP-OB2, 12 PRO
28 NP-OB1:  28 NP-OB1, 29 PRO
*/
(0  (1 IP-MAT (2 NP-SBJ (3 PRO They))
	      (5 MD would)
	      (7 HV have)
	      (9 VBN told)
	      (11 NP-OB2 (12 PRO you))
	      (14 PP (15 P if)
		     (17 CP-ADV (18 C 0)
				(20 IP-SUB (21 NP-SBJ (22 PRO you))
					   (24 HVD had)
					   (26 VBN visited)
					   (28 NP-OB1 (29 PRO them)))))
	      (31 . .)))
/*
FOOTER
  source file, hits/tokens/total
  NULL			2/1/1
*/

The query matches the two pronouns in the one token and therefore reports 2 hits and 1 token. But what if we want to restrict the second query to only the boundary nodes that match the first query? In other words, what if we want to find only pronominal objects in matrix clauses with subjects that precede at least two verbs (without writing a single query to that effect)? This can be done by setting the values of "nodes_only" and "remove_nodes" in 2vb.q to "t", resulting in 2vb-mod.q.

nodes_only: t

remove_nodes: t

node:  IP-MAT*

query:     (IP-MAT* iDoms NP-SBJ*)
       AND (NP-SBJ* hasSister BE[DP]|DO[DP]|HV[DP]|VB[DP]|MD)
       AND (NP-SBJ* precedes BE[DP]|DO[DP]|HV[DP]|VB[DP]|MD)
       AND (NP-SBJ* hasSister BE|BEN|DO|D[AO]N|HV|H[AV]N|VB|VA[GN])
       AND (NP-SBJ* precedes BE|BEN|DO|D[AO]N|HV|H[AV]N|VB|VA[GN])

The relevantly different output of 2vb-mod.q is as follows:

( (1 IP-MAT (2 NP-SBJ (3 PRO They))
	    (5 MD would)
	    (7 HV have)
	    (9 VBN told)
	    (11 NP-OB2 (12 PRO you))
	    (14 PP (15 P if)
		   (17 CP-ADV (18 C 0)
			      (20 IP-SUB RMV:you_had_visited...)))
	    (31 . .))
  )

Running pro-obj.q on the output of 2vb-mod.q results in the following output; again, only the parts that differ are shown.

/*
11 NP-OB2:  11 NP-OB2, 12 PRO
*/
(0  (1 IP-MAT (2 NP-SBJ (3 PRO They))
	      (5 MD would)
	      (7 HV have)
	      (9 VBN told)
	      (11 NP-OB2 (12 PRO you))
	      (14 PP (15 P if)
		     (17 CP-ADV (18 C 0)
				(20 IP-SUB RMV:you_had_visited...)))
	      (22 . .)))
/*
FOOTER
  source file, hits/tokens/total
  NULL			1/1/1
*/

Because "remove_nodes: t" removes the subordinate clause, the query can match only the pronoun in the matrix clause - as desired. The result block therefore contains only 1 line rather than 2, and the number of hits reported is also 1 rather than 2.