Coding queries

Sample command specification
Sample output
Searching coding strings
Extracting coding strings
Manually editing coding strings
Tricks of the trade

Coding queries are used to create input for multivariate analysis programs like Varbrul; general statistical programming environments like S, Splus, and R; and statistical analysis packages like Datadesk, JMP, SAS, and SPSS. They are extremely powerful and flexible tools. In particular, they allow us to circumvent the difficulties inherent in using OR and NOT.

Coding query files must have the extension .c (preferred) or .q (deprecated). By default, their output has the same basename as the coding query file, but with the extension .cod. The basename of the output file (but not the .cod extension) can be changed with the same "-out" switch as for ordinary query files (see Installing CorpusSearch).

CS foo.c bar.psd                    ← default output = foo.cod

CS foo.c bar.psd -out bar.cod

The command file for a coding query follows exactly the same conventions as for an ordinary search query, except that the command specification is "coding_query" (rather than "query"). In addition, the contents of the command specification essentially consist of multiple ordinary queries, as described in more detail below. Obligatory components are underlined.

Preamble

Node specification
"Ignore" commands
Output format commands
Reference to definition file

Command specification

Here is an example:

// node specification
node: IP-MAT*

// "ignore" command
ignore_nodes: null

// output specification
print_indices: t

// definition file
define: mideng.def

// command specification - classify sentences by length
coding_query:

1: {
     short:   (IP-MAT* iDomsTotal< 10)
     medium:  (IP-MAT* iDomsTotal< 20)
     long:     ELSE
}

The output file of a coding query contains every token of the input file, with a coding node inserted at every boundary node as the first daughter of that node.

Including "copy_corpus: t" in a coding query would be redundant, and in fact, it aborts the search.

Sample command specification

The command specification for a coding query consists of the command "coding_query" followed by a colon and a list of one or more coding columns.

coding_query:

column_number: {
                 label-1: condition(s)
                 label-2: condition(s)
                 .
                 .
                 .
               }

Here is a sample coding file that generates a coding string with four columns.

node: IP-*

coding_query:

1: {
        spe: (IP-*SPE* iDoms NP-SBJ*)
        -:   ELSE
   }

2: {
        mat: (IP-MAT* iDoms NP-SBJ*)
        sub: (IP-SUB* iDoms NP-SBJ*)
        inf: (IP-INF* iDoms NP-SBJ*)
        -:   ELSE

   }

3: {
        neg: (IP* iDoms NEG)
        pos: (IP* iDoms !NEG)
        -:   ELSE
   }

4: {
        \1: (NP-SBJ* domsWords 1)
        \2: (NP-SBJ* domsWords 2)
        \3: (NP-SBJ* domsWords> 2)
        \0: ELSE
   }

In this example, column 1 of the coding string will contain "spe" if there is an IP-*SPE* that immediately dominates a subject. If not, the column will contain "-", as specified by the ELSE condition (used only in coding queries). In the absence of an explicit ELSE condition, if none of the stated conditions for the column are met, CorpusSearch fills the column by default with "_" (underscore).

The conditions for each column are evaluated in order. As soon as a condition is met, the appropriate value is added to the coding string, and CorpusSearch moves on to the next column. Depending on the input data and the logical relationships among the conditions being evaluated, the order in which they appear in the query can have an effect on the coding string. For instance, if sentences can contain both "ne" and "not", then the following conditions will code "ne ... not" sentences as "ne".

3: {
        ne:  (IP* iDoms NEG) AND (NEG iDoms ne)
        not: (IP* iDoms NEG) AND (NEG iDoms not)
        pos: (IP* iDoms !NEG)
        -:   ELSE
   }

Reversing the order of the "ne" and the "not" conditions would lead to the "ne ... not" sentences being coded as "not". In the case at hand, neither outcome is likely to be the desired one. Rather, it is probably best to distinguish the "ne ... not" cases from the simple "ne" and the simple "not" cases with conditions along the following lines:

3: {
        ne-not:     (IP* iDoms [1]NEG) AND ([1]NEG iDoms ne)
                AND (IP* iDoms [2]NEG) AND ([2]NEG iDoms not)
        ne:         (IP* iDoms NEG) AND (NEG iDoms ne)
        not:        (IP* iDoms NEG) AND (NEG iDoms not)
        pos:        (IP* iDoms !NEG)
        -:          ELSE
   }

In this query, the order of the simple "ne" and simple "not" conditions has no effect on the coding of the "ne ... not" sentences (provided, of course, that the "ne-not" condition precedes the other two).
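
The first-match rule described above can be modelled in a few lines of Python. This is only a sketch of the evaluation order, not CorpusSearch's implementation: the `code_column` helper and the representation of a clause as a set of negator words are assumptions made for illustration.

```python
# Toy model of how one coding column is filled: conditions are tried in
# order and the first one that holds supplies the value.
# Assumption: a clause is represented simply as the set of negators it contains.

def code_column(clause, conditions, else_value=None):
    """Return the label of the first condition that `clause` satisfies.

    `conditions` is an ordered list of (label, predicate) pairs.
    If nothing matches, fall back to the ELSE value, or to "_" when the
    column has no ELSE line.
    """
    for label, predicate in conditions:
        if predicate(clause):
            return label
    return else_value if else_value is not None else "_"


def has_ne(negators):
    return "ne" in negators


def has_not(negators):
    return "not" in negators


def has_both(negators):
    return has_ne(negators) and has_not(negators)


def no_neg(negators):
    return not negators


# Ordering from the first version of column 3: "ne" is tested before "not".
simple = [("ne", has_ne), ("not", has_not), ("pos", no_neg)]

# Refined ordering: the combined case is tested first.
refined = [("ne-not", has_both)] + simple

ne_not_clause = {"ne", "not"}
print(code_column(ne_not_clause, simple, "-"))   # ne
print(code_column(ne_not_clause, refined, "-"))  # ne-not
print(code_column(set(), simple, "-"))           # pos
```

In this model, a clause whose only negator is neither "ne" nor "not" falls through to the ELSE value "-", and to "_" if no ELSE value is supplied.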

When numerals, including dates, are used as values in a coding string, they must be escaped with backslash ("\") in the query, as illustrated for column 4 in the sample file. In the coding string, the numerals appear without the backslash.

Sample output

As mentioned earlier, the output file of a coding query contains every token of the input file, with a coding node inserted at every boundary node as the first daughter of that node. Coding nodes have the form:

(CODING-<node_label> <coding_string>)

The "node_label" suffix of the CODING node is the full label of the current instance of the boundary node. For instance, if the boundary node is IP*, and the particular instantiation is IP-SUB-PRN-SPE, then the label for that IP's CODING node is CODING-IP-SUB-PRN-SPE. If a sentence token contains more than one instance of the boundary node, the output token will contain multiple coding nodes.

Here is what an output token resulting from the sample coding query from the previous section would look like:

/~*
They do not know the value of a dollar.
*~/

( (IP-MAT (CODING-IP-MAT -:mat:neg:1)
          (NP-SBJ (PRO They))
          (DOP do)
          (NEG not)
          (VB know)
          (NP-OB1 (D the)
                  (N value)
                  (PP (P of)
                      (NP (D a)
                          (N dollar))))
          (. .)))
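
For downstream processing it can be handy to pull the coding node out of such a token. The following Python sketch does this with a regular expression. It assumes that the coding node carries a label suffix, that the coding string contains no whitespace or parentheses, and that columns are separated by colons, as in the sample above. The function name `parse_coding_nodes` is hypothetical.

```python
import re

# Matches "(CODING-<node_label> <coding_string>)".
# Assumption: neither the label nor the coding string contains whitespace
# or parentheses.
CODING_NODE = re.compile(r"\(CODING-(\S+)\s+([^\s()]+)\)")


def parse_coding_nodes(token_text):
    """Return a list of (node_label, columns) pairs found in a parsed token."""
    return [
        (label, coding_string.split(":"))
        for label, coding_string in CODING_NODE.findall(token_text)
    ]


sample = """( (IP-MAT (CODING-IP-MAT -:mat:neg:1)
          (NP-SBJ (PRO They))
          (DOP do)
          (NEG not)
          (VB know)))"""

print(parse_coding_nodes(sample))
# [('IP-MAT', ['-', 'mat', 'neg', '1'])]
```

A token with several instances of the boundary node would yield one pair per coding node, in the order in which the nodes appear.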

Searching coding strings

Coding strings may be searched using the column function. For instance, the following query would find the sample coded sentence above:

query:  (CODING-IP-MAT* column 2 mat|sub)
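
As a rough mental model of what the column function checks, the Python sketch below splits a coding string on colons and compares one column against a list of alternatives. It is an approximation under stated assumptions, not a description of CorpusSearch internals: columns are taken to be 1-indexed, "|" to separate alternatives, and "*" to act as a wildcard, which is how these appear in the queries in this section. The helper `column_matches` is hypothetical.

```python
from fnmatch import fnmatchcase


def column_matches(coding_string, column, pattern):
    """Check whether the given 1-indexed column matches any alternative in `pattern`."""
    columns = coding_string.split(":")
    if not 1 <= column <= len(columns):
        return False
    value = columns[column - 1]
    # fnmatchcase treats "*" as "any sequence of characters". It also gives
    # "?" and "[...]" a special meaning, which this sketch does not try to
    # reconcile with CorpusSearch's own pattern syntax.
    return any(fnmatchcase(value, alt) for alt in pattern.split("|"))


coding_string = "-:mat:neg:1"
print(column_matches(coding_string, 2, "mat|sub"))  # True
print(column_matches(coding_string, 3, "pos"))      # False
```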

Coding queries can themselves search already existing coding strings. For instance, assume a coded corpus with three columns. Column 1 codes whether the clause-initial constituent is the subject ("subj"), the finite verb ("verb"), or the direct object ("obj"). Columns 2 and 3 code the same information for the second and third constituents in the sentence. The following coding query could then be run on the coded corpus to classify sentences by clause type based on the information in the already existing three columns:

4: {
     svo:     (CODING-IP* column 1 subj) 
          AND (CODING-IP* column 2 verb) 
          AND (CODING-IP* column 3 obj) 
     sov:     (CODING-IP* column 1 subj) 
          AND (CODING-IP* column 2 obj) 
          AND (CODING-IP* column 3 verb) 

     osv: ...
     ovs: ...
     vos: ...
     vso: ...
     -:   ELSE
}

The query for column 4 must be stored in a coding query file distinct from the one for columns 1-3 and run separately. If all four columns are combined in a single coding query file, CorpusSearch will code the input file, but it will assign its default elsewhere value "_" to column 4 throughout, since the input file doesn't contain the CODING nodes referenced by the conditions for column 4.

Extracting coding strings

A command file containing only the following command specification (without a preamble) extracts all of the coding strings in a coded file:

print_only: CODING*

The resulting output file has the extension .ooo.

The trailing asterisk on CODING is necessary because of the suffix on each CODING node that specifies its associated syntactic category. Coding strings for particular categories can be extracted with a "print_only" command that makes specific reference to those categories. For instance:

print_only: CODING-IP*

print_only: CODING-IP-MAT*

In coded files generated by older versions of CorpusSearch (before version 74), coding strings lacked node label suffixes. For such files, the trailing asterisk is not necessary:

print_only: CODING

It is generally useful to append a sentence token's ID information to each extracted coding string. This can be done by including a second command:

print_only: CODING*

add_IDs: t

In the resulting output file, each coding string is followed by an "@" sign, which in turn is followed by the ID information. For instance:

1182:verse:v2:fullNP:transitive@ROLAND,3.102

When exporting the coding string, "@" needs to be added to the list of column delimiters (or it needs to be replaced by ":").
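
A minimal Python sketch of that export step follows. It assumes the extracted file has one coding string per line in the `columns@ID` form shown above; the function name `split_coded_line` is hypothetical.

```python
import csv
import io


def split_coded_line(line):
    """Split 'c1:c2:...@ID' into (list of columns, ID).

    Lines without an "@" (extracted without add_IDs) yield an empty ID.
    """
    coding_string, _, token_id = line.strip().partition("@")
    return coding_string.split(":"), token_id


lines = ["1182:verse:v2:fullNP:transitive@ROLAND,3.102"]

# Write comma-separated rows; csv quotes the ID because it contains a comma.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
for line in lines:
    columns, token_id = split_coded_line(line)
    writer.writerow(columns + [token_id])

print(buffer.getvalue(), end="")
# 1182,verse,v2,fullNP,transitive,"ROLAND,3.102"
```

Treating "@" as a delimiter in this way keeps the ID in its own final field, so it can be dropped or kept as a row label in the statistics package.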

Manually editing coding strings

In general, the coding strings that CorpusSearch generates completely algorithmically are used "as is" as input to further statistical analysis. But in connection with certain research questions, it is desirable to encode information about properties that are not generally annotated in the corpus. In such cases, it is possible to edit the coding strings by hand. For instance, a coding query might be formulated to generate a column for information concerning a noun phrase's discourse status ("old", "new", etc.), filling it with a default elsewhere value ("-"), the statistically most likely value, or whatever value seems most useful. This column can then be reviewed and corrected by hand, and the resulting file can form the input to further CorpusSearch searches or to statistical analysis.

Best practice: In the situation just described, the wisest course is to add the information to (a copy of) the corpus itself. For instance, discourse status could be indicated by adding "-OLD" or "-NEW" to existing NP labels (NP-SBJ-OLD, NP-OB1-NEW, etc.). The work required to add the information manually is the same, whether the information is added to the corpus annotation or to the coding string. The information is safer in the annotation (it can't be accidentally overwritten by re-running the coding query on the corpus), and once it is added there, it is easy to retrieve algorithmically with appropriate coding queries.

Tricks of the trade

CorpusSearch generates columns in the output in numerical order regardless of their order in the query. In other words, the following coding file, in which column 2 is listed before column 1, produces the same output as a file that lists the two columns in ascending numerical order.

coding_query:

2: { label(s): condition(s) }
1: { label(s): condition(s) }

The same input file(s) can be coded sequentially with different coding query files and different boundary node choices. For statistical analysis, the coding strings can all be extracted together or separately by individual label.

The two properties just discussed can be exploited in situations where it is desirable to generate certain columns with one choice of boundary node or "ignore" nodes, and other columns with some other choice. In such a situation, the input file can be coded with two (or more) separate coding queries along the following lines. Running coding file 1 results in four columns. CorpusSearch codes columns 1 and 4 according to the conditions associated with those columns. It generates columns 2 and 3 and fills them with the default value "_" (underscore).

Coding file 1:
```
node: {shared-boundary-node}

ignore_nodes: {ignore-list-1}

coding_query:

1: { label(s): condition(s) }
4: { label(s): condition(s) }
```
The output of coding file 1 can then be coded using coding file 2. (For the purposes under consideration, the two coding files share the same boundary node, but that is not formally necessary.) CorpusSearch now codes columns 2 and 3 according to the specifications for those columns, overwriting the underscore. In addition, it codes a fifth column in the ordinary way.

Coding file 2:
```
node: {shared-boundary-node}

ignore_nodes: {ignore-list-2}

coding_query:

2: { label(s): condition(s) }
3: { label(s): condition(s) }
5: { label(s): condition(s) }
```
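
The way the two passes combine can be pictured with the following Python sketch. It models only the column bookkeeping described above (numerical ordering, "_" for unspecified columns, later passes overwriting earlier values). The function `apply_pass` and the placeholder values `a1`, `b2`, and so on are invented for illustration.

```python
def apply_pass(existing, new_values):
    """Merge one coding pass into an existing coding string.

    `existing` is the current list of column values (possibly empty).
    `new_values` maps 1-indexed column numbers to the values coded in this pass.
    Columns are laid out in numerical order; gaps are filled with "_".
    """
    width = max([len(existing)] + list(new_values))
    columns = existing + ["_"] * (width - len(existing))
    for number, value in new_values.items():
        columns[number - 1] = value
    return columns


# Coding file 1 specifies columns 1 and 4 (listed here out of order on purpose).
after_first = apply_pass([], {4: "a4", 1: "a1"})
print(":".join(after_first))   # a1:_:_:a4

# Coding file 2 specifies columns 2, 3 and 5.
after_second = apply_pass(after_first, {2: "b2", 3: "b3", 5: "b5"})
print(":".join(after_second))  # a1:b2:b3:a4:b5
```
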
Coding a corpus using different boundary nodes provides information about the structures of different syntactic domains. In general, one eventually wants to combine this information under a single CODING node for statistical analysis. This can be done by running a corpus revision query with an appropriate concat command on the coded file(s).

Adding information about external (sociolinguistic) variables to a coding string is particularly simple if each token in the corpus is associated with the appropriate information (as is the case, for instance, in the PCEEC).

// date of composition (for PCEEC, where each token includes metadata)
4: {
      ____:  (LETTER column 3 _*)
      \1410: (LETTER column 3 1410*)
      \1411: (LETTER column 3 1411*)
      \1412: (LETTER column 3 1412*)
      \1413: (LETTER column 3 1413*)
      \1414: (LETTER column 3 1414*)
      \1415: (LETTER column 3 1415*)
      \1416: (LETTER column 3 1416*)
      ...
}

In other corpora, the information can be imported indirectly by means of reference to information in a token's ID, as in the following examples. Note that numeric codes need to be "escaped" with a backslash.

// date of author's birth
11: {
      \1490:  (ABOTT-E1* inID)
      \1630:  (ALHATTON2-E3* inID)
      \1680:  (ALHATTON-E3* inID)
      \1472:  (AMBASS-E1* inID)
      \1668:  (ANHATTON-E3* inID)
      \1458:  (APLUMPT-E1* inID)
      \1485:  (APOOLE-E1* inID)
      \1568:  (ARMIN-E2* inID)
      \1515:  (ASCH-E1* inID)
      \1632:  (AUNGIER-E3* inID)
      ...
}

// author's sex
13: {
      f: (ABOTT*|ALHATTON*|ANHATTON*|APLUMPT*|APOOLE*|BEHN*|BOETHEL*|DELAPOLE*|DERING*|DPLUMPT*|EBEAUM*|ECUMBERL*|EHATTON*|ELIZ*|EOXINDEN*|EPOOLE*|EVERARD*|FHATTON*|FIENNES*|GREY*|HARLEY*|HOBY*|IPLUMPT*|JACKSON*|JPINNEY*|JUBARRING*|KOXINDEN*|KPASTON*|KSCROPE*|MANNERS*|MASHAM*|MHATTON*|MHOWARD*|MONTAGUE*|MOXINDEN*|MROPER*|MTUDOR*|NEVILL*|PEYTON*|PROUD*|SOUTHARD*|ZOUCH* inID)
      m: ELSE
}
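
Because columns like these are long and repetitive, it may be convenient to generate them from a metadata table rather than typing them by hand. The Python sketch below is one way to do so. It assumes a mapping from text-ID prefixes to birth years (the three entries are copied from the birth-date example above) and simply emits lines in the format shown there, including the backslash that numeric labels require. The function `birth_year_column` is hypothetical.

```python
def birth_year_column(column_number, birth_years):
    """Build the text of a coding column that codes birth year by token ID.

    `birth_years` maps an ID prefix (e.g. "ABOTT-E1") to a year.
    """
    lines = [f"{column_number}: {{"]
    for prefix, year in birth_years.items():
        # Numeric labels must be escaped with a backslash in the query.
        lines.append(f"      \\{year}:  ({prefix}* inID)")
    lines.append("}")
    return "\n".join(lines)


birth_years = {"ABOTT-E1": 1490, "ALHATTON2-E3": 1630, "ALHATTON-E3": 1680}
print(birth_year_column(11, birth_years))
# 11: {
#       \1490:  (ABOTT-E1* inID)
#       \1630:  (ALHATTON2-E3* inID)
#       \1680:  (ALHATTON-E3* inID)
# }
```

The generated text can be pasted into the command specification of a coding query file. Since the sketch emits no ELSE line, tokens whose ID matches none of the prefixes would receive the default value "_".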