Coding queries


Coding queries are used to create input for multivariate analysis programs like Varbrul; general statistical programming environments like S, Splus, and R; and statistical analysis packages like Datadesk, JMP, SAS, and SPSS. They are extremely powerful and flexible tools. In particular, they allow us to circumvent the difficulties inherent in using OR and NOT.

Coding query files must have the extension .c (preferred) or .q (deprecated). By default, their output has the same basename as the coding query file, but with the extension .cod. The basename of the output file (but not the .cod extension) can be changed with the same "-out" switch as for ordinary query files (see Installing CorpusSearch).

CS foo.c bar.psd                    ← default output = foo.cod

CS foo.c bar.psd -out bar.cod

The command file for a coding query follows exactly the same conventions as for an ordinary search query, except that the command specification is "coding_query" (rather than "query"). In addition, the contents of the command specification essentially consist of multiple ordinary queries, as described in more detail below. Obligatory components are underlined.

Here is an example:

// node specification
node: IP-MAT*

// "ignore" command
ignore_nodes: null

// output specification
print_indices: t

// definition file
define: mideng.def

// command specification - classify sentences by length
coding_query:

1: {
     short:   (IP-MAT* iDomsTotal< 10)
     medium:  (IP-MAT* iDomsTotal< 20)
     long:     ELSE
}
The output file of a coding query contains every token of the input file, with a coding node inserted at every boundary node as the first daughter of that node.

Including "copy_corpus: t" in a coding query is redundant and, in fact, aborts the search.

Sample command specification

The command specification for a coding query consists of the command "coding_query" followed by a colon and a list of one or more coding columns.
coding_query:

column_number: {
                 label-1: condition(s)
                 label-2: condition(s)
                 .
                 .
                 .
               }

Here is a sample coding file that generates a coding string with four columns.

node: IP-*

coding_query:

1: {
        spe: (IP-*SPE* iDoms NP-SBJ*)
        -:   ELSE
   }

2: {
        mat: (IP-MAT* iDoms NP-SBJ*)
        sub: (IP-SUB* iDoms NP-SBJ*)
        inf: (IP-INF* iDoms NP-SBJ*)
        -:   ELSE

   }

3: {
        neg: (IP* iDoms NEG)
        pos: (IP* iDoms !NEG)
        -:   ELSE
   }

4: {
        \1: (NP-SBJ* domsWords 1)
        \2: (NP-SBJ* domsWords 2)
        \3: (NP-SBJ* domsWords> 2)
        \0: ELSE
   }

In this example, column 1 of the coding string will contain "spe" if there is an IP-*SPE* that immediately dominates a subject. If not, the column will contain "-", as specified by the ELSE condition (used only in coding queries). In the absence of an explicit ELSE condition, if none of the stated conditions for the column are met, CorpusSearch fills the column by default with "_" (underscore).

The conditions for each column are evaluated in order. As soon as a condition is met, the appropriate value is added to the coding string, and CorpusSearch moves on to the next column. Depending on the input data and the logical relationships among the conditions being evaluated, the order in which they appear in the query can have an effect on the coding string. For instance, if sentences can contain both "ne" and "not", then the following conditions will code "ne ... not" sentences as "ne".

3: {
        ne:  (IP* iDoms NEG) AND (NEG iDoms ne)
        not: (IP* iDoms NEG) AND (NEG iDoms not)
        pos: (IP* iDoms !NEG)
        -:   ELSE
   }
Reversing the order of the "ne" and the "not" conditions would lead to the "ne ... not" sentences being coded as "not". In the case at hand, probably neither of these outcomes is the desired one. Rather, it is probably best to distinguish the "ne ... not" cases from the simple "ne" and the simple "not" cases with conditions along the following lines:
3: {
        ne-not:     (IP* iDoms [1]NEG) AND ([1]NEG iDoms ne)
                AND (IP* iDoms [2]NEG) AND ([2]NEG iDoms not)
        ne:         (IP* iDoms NEG) AND (NEG iDoms ne)
        not:        (IP* iDoms NEG) AND (NEG iDoms not)
        pos:        (IP* iDoms !NEG)
        -:          ELSE
   }
In this query, the order of the simple "ne" and simple "not" conditions has no effect on the coding of the "ne ... not" sentences (provided, of course, that the "ne-not" condition precedes the other two).
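
The first-match behavior described above can be illustrated outside CorpusSearch with a minimal Python sketch. It is not CorpusSearch code: the function code_column, the helper predicates, and the modelling of a clause as the set of negation words it contains are all hypothetical, chosen only to show why the order of conditions matters.

```python
from typing import Callable, FrozenSet, Optional, Sequence, Tuple

# Hypothetical model: a clause is the set of negation words it contains.
Clause = FrozenSet[str]
Condition = Callable[[Clause], bool]

DEFAULT_LABEL = "_"  # used when no condition matches and there is no ELSE


def code_column(rules: Sequence[Tuple[str, Optional[Condition]]], clause: Clause) -> str:
    """Return the label of the first rule whose condition holds.

    A rule whose condition is None plays the role of ELSE.
    """
    for label, condition in rules:
        if condition is None or condition(clause):
            return label
    return DEFAULT_LABEL


def has_ne(clause: Clause) -> bool:
    return "ne" in clause


def has_not(clause: Clause) -> bool:
    return "not" in clause


def has_no_neg(clause: Clause) -> bool:
    return not clause


def has_ne_and_not(clause: Clause) -> bool:
    return has_ne(clause) and has_not(clause)


# Same order as the first version of column 3 above.
naive = [("ne", has_ne), ("not", has_not), ("pos", has_no_neg), ("-", None)]
# The more specific condition is tested first.
refined = [("ne-not", has_ne_and_not)] + naive

both = frozenset({"ne", "not"})
print(code_column(naive, both))    # ne
print(code_column(refined, both))  # ne-not
```

Because evaluation stops at the first match, the more specific "ne-not" rule is only ever reached if it is listed before the rules it overlaps with.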

When numerals, including dates, are used as values in a coding string, they must be escaped with backslash ("\") in the query, as illustrated for column 4 in the sample file. In the coding string, the numerals appear without the backslash.

Sample output

As mentioned earlier, the output file of a coding query contains every token of the input file, with a coding node inserted at every boundary node as the first daughter of that node. Coding nodes have the form:
(CODING-<node_label> <coding_string>)
The "node_label" suffix of the CODING node is the full label of the current instance of the boundary node. For instance, if the boundary node is IP*, and the particular instantiation is IP-SUB-PRN-SPE, then the label for that IP's CODING node is CODING-IP-SUB-PRN-SPE. If a sentence token contains more than one instance of the boundary node, the output token will contain multiple coding nodes.

Here is what an output token resulting from the sample coding query from the previous section would look like:

/~*
They do not know the value of a dollar.
*~/

( (IP-MAT (CODING-IP-MAT -:mat:neg:1)
          (NP-SBJ (PRO They))
          (DOP do)
          (NEG not)
          (VB know)
          (NP-OB1 (D the)
                  (N value)
                  (PP (P of)
                      (NP (D a)
                          (N dollar))))
          (. .)))
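
For post-processing outside CorpusSearch, coding nodes of this form can be picked out of an output file with a regular expression. The following Python sketch is only an illustration: the function name coding_nodes is hypothetical, and the pattern assumes that neither the node label nor the coding string contains whitespace or parentheses. Unsuffixed CODING nodes from older versions of CorpusSearch are not matched.

```python
import re

# Matches "(CODING-<node_label> <coding_string>)".
# Assumption: neither part contains whitespace or parentheses.
CODING_NODE = re.compile(r"\(CODING-(?P<label>[^\s()]+)\s+(?P<coding>[^\s()]+)\)")


def coding_nodes(text):
    """Return (node_label, [column values]) for each CODING node in text."""
    return [
        (match.group("label"), match.group("coding").split(":"))
        for match in CODING_NODE.finditer(text)
    ]


sample = """( (IP-MAT (CODING-IP-MAT -:mat:neg:1)
          (NP-SBJ (PRO They))
          (DOP do)"""

print(coding_nodes(sample))  # [('IP-MAT', ['-', 'mat', 'neg', '1'])]
```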

Searching coding strings

Coding strings may be searched using the column function. For instance, the following query would find the sample coded sentence above:
query:  (CODING-IP-MAT* column 2 mat|sub)
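
What such a column test amounts to can be sketched in Python as follows. This is a simplified model, not CorpusSearch's implementation: column_matches is a hypothetical helper, columns are counted from 1 as in the examples above, and only literal alternatives separated by "|" are handled (no wildcards).

```python
def column_matches(coding_string, column, pattern):
    """Check whether the given 1-based column equals one of the alternatives.

    pattern is a "|"-separated list of literal values, e.g. "mat|sub".
    """
    values = coding_string.split(":")
    if not 1 <= column <= len(values):
        return False
    return values[column - 1] in pattern.split("|")


print(column_matches("-:mat:neg:1", 2, "mat|sub"))  # True
print(column_matches("-:mat:neg:1", 3, "mat|sub"))  # False
```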
Coding queries can themselves search already existing coding strings. For instance, assume a coded corpus with three columns. Column 1 codes whether the clause-initial constituent is the subject ("subj"), the finite verb ("verb"), or the direct object ("obj"). Columns 2 and 3 code the same information for the second and third constituents in the sentence. The following coding query could then be run on the coded corpus to classify sentences by clause type based on the information in the already existing three columns:
4: {
     svo:     (CODING-IP* column 1 subj) 
          AND (CODING-IP* column 2 verb) 
          AND (CODING-IP* column 3 obj) 
     sov:     (CODING-IP* column 1 subj) 
          AND (CODING-IP* column 2 obj) 
          AND (CODING-IP* column 3 verb) 

     osv: ...
     ovs: ...
     vos: ...
     vso: ...
     -:   ELSE
}
The query for column 4 must be stored in a coding query file distinct from the one for columns 1-3 and run separately. If all four queries are combined in a single coding query file, CorpusSearch will code the input file, but it will assign its default elsewhere value "_" to column 4 throughout, since the input file doesn't contain the CODING nodes referenced by the conditions for column 4.

Extracting coding strings

A command file containing only the following command specification (without a preamble) extracts all of the coding strings in a coded file:
print_only: CODING*

The resulting output file has the extension .ooo.

The trailing asterisk on CODING is necessary because of the suffix on each CODING node that specifies its associated syntactic category. Coding strings for particular categories can be extracted with a "print_only" command that makes specific reference to those categories. For instance:

print_only: CODING-IP*

print_only: CODING-IP-MAT*

In coded files generated by older versions of CorpusSearch (before version 74), coding strings lacked node label suffixes. For such files, the trailing asterisk is not necessary:

print_only: CODING

It is generally useful to append a sentence token's ID information to each extracted coding string. This can be done by including a second command:

print_only: CODING*

add_IDs: t
In the resulting output file, each coding string is followed by an "@" sign, which in turn is followed by the ID information. For instance:
1182:verse:v2:fullNP:transitive@ROLAND,3.102

When exporting the coding string, "@" needs to be added to the list of column delimiters (or it needs to be replaced by ":").
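
As one way of handling the "@" delimiter, the extracted lines can be converted to CSV before being loaded into a statistics package. The Python sketch below assumes that every non-empty input line is a coding string of the form shown above; the function names split_coded_line and to_csv are hypothetical.

```python
import csv
import io


def split_coded_line(line):
    """Split 'a:b:c@ID' into (['a', 'b', 'c'], 'ID'); ID is '' if there is no '@'."""
    coding, _, token_id = line.strip().partition("@")
    return coding.split(":"), token_id


def to_csv(lines):
    """Render extracted lines as CSV text, one row per coding string, ID last."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for line in lines:
        if line.strip():  # skip blank lines
            columns, token_id = split_coded_line(line)
            writer.writerow(columns + [token_id])
    return buffer.getvalue()


print(to_csv(["1182:verse:v2:fullNP:transitive@ROLAND,3.102"]), end="")
# 1182,verse,v2,fullNP,transitive,"ROLAND,3.102"
```

The csv module quotes the ID field because it contains a comma, so the ID survives as a single column when the file is read back in.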

Manually editing coding strings

In general, the coding strings that CorpusSearch generates completely algorithmically are used "as is" as input to further statistical analysis. But in connection with certain research questions, it is desirable to encode information about properties that are not generally annotated in the corpus. In such cases, it is possible to edit the coding strings by hand. For instance, a coding query might be formulated to generate a column for information concerning a noun phrase's discourse status ("old", "new", etc.), filling it with a default elsewhere value ("-"), the statistically most likely value, or whatever value seems most useful. This column can then be reviewed and corrected by hand, and the resulting file can form the input to further CorpusSearch searches or to statistical analysis.

Best practice: In the situation just described, the wisest course is to add the information to (a copy of) the corpus itself. For instance, discourse status could be indicated by adding "-OLD" or "-NEW" to existing NP labels (NP-SBJ-OLD, NP-OB1-NEW, etc.). The work required to add the information manually is the same, whether the information is added to the corpus annotation or to the coding string. The information is safer in the annotation (it can't be accidentally overwritten by re-running the coding query on the corpus), and once it is added there, it is easy to retrieve algorithmically with appropriate coding queries.

Tricks of the trade