Coding queries are used to create input for multivariate analysis programs like Varbrul; general statistical programming environments like S, Splus, and R; and statistical analysis packages like Datadesk, JMP, SAS, and SPSS. They are extremely powerful and flexible tools. In particular, they allow us to circumvent the difficulties inherent in using OR and NOT.
Coding query files must have the extension .c (preferred) or .q (deprecated). By default, their output has the same basename as the coding query file, but with the extension .cod. The basename of the output file (but not the .cod extension) can be changed with the same "-out" switch as for ordinary query files (see Installing CorpusSearch).
CS foo.c bar.psd                ← default output = foo.cod
CS foo.c bar.psd -out bar.cod
The command file for a coding query follows exactly the same conventions as for an ordinary search query, except that the command specification is "coding_query" (rather than "query"). In addition, the contents of the command specification essentially consist of multiple ordinary queries, as described in more detail below. Obligatory components are underlined.
- Preamble
- Node specification
- "Ignore" commands
- Output format commands
- Reference to definition file
- Command specification
Here is an example:
// node specification
node: IP-MAT*
// "ignore" command
ignore_nodes: null
// output specification
print_indices: t
// definition file
define: mideng.def
// command specification - classify sentences by length
coding_query:
1: {
    short: (IP-MAT* iDomsTotal< 10)
    medium: (IP-MAT* iDomsTotal< 20)
    long: ELSE
}
The output file of a coding query contains every token of the input file, with a coding node inserted at every boundary node as the first daughter of that node.
coding_query:
column_number: {
    label-1: condition(s)
    label-2: condition(s)
    . . .
}
Here is a sample coding file that generates a coding string with four columns.
node: IP-*
coding_query:
1: {
    spe: (IP-*SPE* iDoms NP-SBJ*)
    -: ELSE
}
2: {
    mat: (IP-MAT* iDoms NP-SBJ*)
    sub: (IP-SUB* iDoms NP-SBJ*)
    inf: (IP-INF* iDoms NP-SBJ*)
    -: ELSE
}
3: {
    neg: (IP* iDoms NEG)
    pos: (IP* iDoms !NEG)
    -: ELSE
}
4: {
    \1: (NP-SBJ* domsWords 1)
    \2: (NP-SBJ* domsWords 2)
    \3: (NP-SBJ* domsWords> 2)
    \0: ELSE
}
In this example, column 1 of the coding string will contain "spe" if there is an IP-*SPE* that immediately dominates a subject. If not, the column will contain "-", as specified by the ELSE condition (used only in coding queries). In the absence of an explicit ELSE condition, if none of the stated conditions for the column are met, CorpusSearch fills the column by default with "_" (underscore).
The conditions for each column are evaluated in order. As soon as a condition is met, the appropriate value is added to the coding string, and CorpusSearch moves on to the next column. Depending on the input data and the logical relationships among the conditions being evaluated, the order in which they appear in the query can affect the coding string. For instance, if sentences can contain both "ne" and "not", then the following conditions will code "ne ... not" sentences as "ne".
3: {
    ne: (IP* iDoms NEG) AND (NEG iDoms ne)
    not: (IP* iDoms NEG) AND (NEG iDoms not)
    pos: (IP* iDoms !NEG)
    -: ELSE
}
Reversing the order of the "ne" and the "not" conditions would lead to the "ne ... not" sentences being coded as "not". In the case at hand, probably neither of these options is the desired one. Rather, it is probably best to distinguish the "ne ... not" cases from the simple "ne" and the simple "not" cases with conditions along the following lines:
3: {
    ne-not: (IP* iDoms [1]NEG) AND ([1]NEG iDoms ne) AND (IP* iDoms [2]NEG) AND ([2]NEG iDoms not)
    ne: (IP* iDoms NEG) AND (NEG iDoms ne)
    not: (IP* iDoms NEG) AND (NEG iDoms not)
    pos: (IP* iDoms !NEG)
    -: ELSE
}
In this query, the order of the simple "ne" and simple "not" conditions has no effect on the coding of the "ne ... not" sentences (provided, of course, that the "ne-not" condition precedes the other two).
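The first-match behaviour described above can be illustrated outside CorpusSearch. The following Python sketch is not CorpusSearch code: the function name `code_column` and the modelling of a sentence as the set of negators it contains are assumptions made purely for illustration. It shows why the position of the "ne-not" condition matters, and how the default "_" arises when no condition holds and there is no ELSE.

```python
def code_column(conditions, sentence, default="_"):
    """Return the label of the first condition that holds.

    `conditions` is an ordered list of (label, predicate) pairs.
    A predicate of None stands in for the ELSE branch.
    If nothing matches, the default "_" is returned.
    """
    for label, predicate in conditions:
        if predicate is None or predicate(sentence):
            return label
    return default


# Hypothetical sentence model: the set of negators found in the clause.
ne_first = [
    ("ne", lambda negs: "ne" in negs),
    ("not", lambda negs: "not" in negs),
    ("pos", lambda negs: not negs),
    ("-", None),  # ELSE
]

# Same conditions, but with the combined case tested first.
ne_not_first = [("ne-not", lambda negs: {"ne", "not"} <= negs)] + ne_first

print(code_column(ne_first, {"ne", "not"}))      # ne
print(code_column(ne_not_first, {"ne", "not"}))  # ne-not
```

Because evaluation stops at the first match, the more specific condition has to come before the more general ones it overlaps with.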
When numerals, including dates, are used as values in a coding string,
they must be escaped with backslash ("\") in the query, as illustrated for
column 4 in the sample file. In the coding string, the numerals appear
without the backslash.
Sample output
As mentioned earlier, the output file of a coding query contains every token of the input file, with a coding node inserted at every boundary node as the first daughter of that node. Coding nodes have the form:
(CODING-<node_label> <coding_string>)
The "node_label" suffix of the CODING node is the full label of the current instance of the boundary node. For instance, if the boundary node is IP*, and the particular instantiation is IP-SUB-PRN-SPE, then the label for that IP's CODING node is CODING-IP-SUB-PRN-SPE. If a sentence token contains more than one instance of the boundary node, the output token will contain multiple coding nodes.
Here is what an output token resulting from the sample coding query from the previous section would look like:
/~*
They do not know the value of a dollar.
*~/
( (IP-MAT (CODING-IP-MAT -:mat:neg:1)
(NP-SBJ (PRO They))
(DOP do)
(NEG not)
(VB know)
(NP-OB1 (D the)
(N value)
(PP (P of)
(NP (D a)
(N dollar))))
(. .)))
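For post-processing outside CorpusSearch, coding nodes of the form shown above can be pulled out of an output token with a regular expression. The Python sketch below is an illustration only; the function name `coding_nodes` is hypothetical, and the pattern assumes that a coding string contains no whitespace or parentheses.

```python
import re

# Matches "(CODING-<node_label> <coding_string>)".
# Assumes the coding string contains no whitespace or parentheses.
CODING_NODE = re.compile(r"\(CODING-(\S+)\s+([^\s()]+)\)")


def coding_nodes(token_text):
    """Return a list of (node_label, columns) pairs found in one token."""
    return [
        (label, coding.split(":"))
        for label, coding in CODING_NODE.findall(token_text)
    ]


# Abbreviated version of the sample output token.
sample = (
    "( (IP-MAT (CODING-IP-MAT -:mat:neg:1) "
    "(NP-SBJ (PRO They)) (DOP do) (NEG not) (VB know)))"
)
print(coding_nodes(sample))  # [('IP-MAT', ['-', 'mat', 'neg', '1'])]
```

A token with several instances of the boundary node yields one pair per coding node, in order of appearance.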
Searching coding strings
Coding strings may be searched using the column function. For instance, the following query would find the sample coded sentence above:
query: (CODING-IP-MAT* column 2 mat|sub)
Coding queries can themselves search already existing coding strings. For instance, assume a coded corpus with three columns. Column 1 codes whether the clause-initial constituent is the subject ("subj"), the finite verb ("verb"), or the direct object ("obj"). Columns 2 and 3 code the same information for the second and third constituents in the sentence. The following coding query could then be run on the coded corpus to classify sentences by clause type based on the information in the already existing three columns:
4: {
    svo: (CODING-IP* column 1 subj) AND (CODING-IP* column 2 verb) AND (CODING-IP* column 3 obj)
    sov: (CODING-IP* column 1 subj) AND (CODING-IP* column 2 obj) AND (CODING-IP* column 3 verb)
    osv: ...
    ovs: ...
    vos: ...
    vso: ...
    -: ELSE
}
The query for column 4 must be stored in a coding query file distinct from the one for columns 1-3 and run separately. If all four queries are combined in a single coding query file, CorpusSearch will code the input file, but it will assign its default elsewhere value "_" to column 4 throughout, since the input file doesn't contain the CODING nodes referenced by the conditions for column 4.
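To make the way a column condition reads a coding string concrete, here is a small Python emulation. It is a sketch under assumptions, not CorpusSearch's implementation: the function name `column_matches` is hypothetical, columns are taken to be 1-based and colon-delimited as in the examples above, and "*" and "|" are approximated with Python's `fnmatch` wildcards and a simple split.

```python
from fnmatch import fnmatchcase


def column_matches(coding_string, column, pattern):
    """Emulate a condition like "column 2 mat|sub" on one coding string.

    Columns are counted from 1 and separated by ":".
    Alternatives in the pattern are separated by "|"; "*" is a wildcard.
    """
    fields = coding_string.split(":")
    if not 1 <= column <= len(fields):
        return False
    value = fields[column - 1]
    return any(fnmatchcase(value, alt) for alt in pattern.split("|"))


print(column_matches("-:mat:neg:1", 2, "mat|sub"))  # True
print(column_matches("-:mat:neg:1", 3, "pos"))      # False
```

Note that `fnmatch` also gives "?" and "[...]" a special meaning, which may differ from CorpusSearch's own pattern rules.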
print_only: CODING*
The resulting output file has the extension .ooo.
The trailing asterisk on CODING is necessary because of the suffix on each CODING node that specifies its associated syntactic category. Coding strings for particular categories can be extracted with a "print_only" command that makes specific reference to those categories. For instance:
print_only: CODING-IP*
print_only: CODING-IP-MAT*
In coded files generated by older versions of CorpusSearch (before version 74), coding strings lacked node label suffixes. For such files, the trailing asterisk is not necessary:
print_only: CODING
It is generally useful to append a sentence token's ID information to each extracted coding string. This can be done by including a second command:
print_only: CODING*
add_IDs: t
In the resulting output file, each coding string is followed by an "@" sign, which in turn is followed by the ID information. For instance:
1182:verse:v2:fullNP:transitive@ROLAND,3.102
When exporting the coding string, "@" needs to be added to the list of column delimiters (or it needs to be replaced by ":").
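When the extracted strings are processed with a script rather than imported directly into a statistics package, the "@" can be handled explicitly. The following Python sketch (the function name `parse_coded_line` is hypothetical) splits one extracted line into its columns and its token ID, using the example line above.

```python
def parse_coded_line(line):
    """Split "col1:col2:...@ID" into (list of columns, ID).

    The ID is None when the line carries no "@" part.
    """
    coding, sep, token_id = line.strip().partition("@")
    return coding.split(":"), (token_id if sep else None)


columns, token_id = parse_coded_line(
    "1182:verse:v2:fullNP:transitive@ROLAND,3.102"
)
print(columns)   # ['1182', 'verse', 'v2', 'fullNP', 'transitive']
print(token_id)  # ROLAND,3.102
```

Splitting on the first "@" only keeps the ID intact even if it happens to contain a colon.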
Best practice: In the situation just described, the wisest course
is to add the information to (a copy of) the corpus itself. For
instance, discourse status could be indicated by adding "-OLD" or "-NEW"
to existing NP labels (NP-SBJ-OLD, NP-OB1-NEW, etc.). The work required
to add the information manually is the same, whether the information is
added to the corpus annotation or to the coding string. The information
is safer in the annotation (it can't be accidentally overwritten by
re-running the coding query on the corpus), and once it is added there, it
is easy to retrieve algorithmically with appropriate coding queries.
Tricks of the trade
coding_query:
2: { label(s): condition(s) }
1: { label(s): condition(s) }
Coding file 1:
node: {shared-boundary-node}
ignore_nodes: {ignore-list-1}
coding_query:
1: { label(s): condition(s) }
4: { label(s): condition(s) }
Coding file 2:
node: {shared-boundary-node}
ignore_nodes: {ignore-list-2}
coding_query:
2: { label(s): condition(s) }
3: { label(s): condition(s) }
5: { label(s): condition(s) }
The output of coding file 1 can then be coded using coding file 2. (For
the purposes under consideration, the two coding files share the same
boundary node, but that is not formally necessary.) CorpusSearch now
codes columns 2 and 3 according to the specifications for those columns,
overwriting the underscore. In addition, it codes a fifth column in the
ordinary way.
In other corpora, the information can be imported indirectly by means of
reference to information in a token's ID, as in the following examples.
Note that numeric codes need to be "escaped" with a backslash.
// date of composition (for PCEEC, where each token includes metadata)
4: {
____: (LETTER column 3 _*)
\1410: (LETTER column 3 1410*)
\1411: (LETTER column 3 1411*)
\1412: (LETTER column 3 1412*)
\1413: (LETTER column 3 1413*)
\1414: (LETTER column 3 1414*)
\1415: (LETTER column 3 1415*)
\1416: (LETTER column 3 1416*)
....
}
// date of author's birth
11: {
\1490: (ABOTT-E1* inID)
\1630: (ALHATTON2-E3* inID)
\1680: (ALHATTON-E3* inID)
\1472: (AMBASS-E1* inID)
\1668: (ANHATTON-E3* inID)
\1458: (APLUMPT-E1* inID)
\1485: (APOOLE-E1* inID)
\1568: (ARMIN-E2* inID)
\1515: (ASCH-E1* inID)
\1632: (AUNGIER-E3* inID)
...
}
// author's sex
13: {
f: (ABOTT*|ALHATTON*|ANHATTON*|APLUMPT*|APOOLE*|BEHN*|BOETHEL*|DELAPOLE*|DERING*|DPLUMPT*|EBEAUM*|ECUMBERL*|EHATTON*|ELIZ*|EOXINDEN*|EPOOLE*|EVERARD*|FHATTON*|FIENNES*|GREY*|HARLEY*|HOBY*|IPLUMPT*|JACKSON*|JPINNEY*|JUBARRING*|KOXINDEN*|KPASTON*|KSCROPE*|MANNERS*|MASHAM*|MHATTON*|MHOWARD*|MONTAGUE*|MOXINDEN*|MROPER*|MTUDOR*|NEVILL*|PEYTON*|PROUD*|SOUTHARD*|ZOUCH* inID)
m: ELSE
}
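Long ID-based columns like the ones above are tedious to type by hand. One option is to generate them from a lookup table. The Python sketch below is only an illustration: the helper names `numeric_label` and `id_column` are hypothetical, and it assumes that only purely numeric labels need the backslash escape. The two table entries are taken from the birth-date example above.

```python
def numeric_label(value):
    """Prefix a purely numeric label with a backslash, as coding queries require."""
    text = str(value)
    return "\\" + text if text.isdigit() else text


def id_column(column, table, else_label=None):
    """Build the text of one coding-query column from {ID pattern: label}."""
    lines = [f"{column}: {{"]
    for id_pattern, value in table.items():
        lines.append(f"    {numeric_label(value)}: ({id_pattern} inID)")
    if else_label is not None:
        lines.append(f"    {else_label}: ELSE")
    lines.append("}")
    return "\n".join(lines)


birth_years = {"ABOTT-E1*": 1490, "ALHATTON2-E3*": 1630}
print(id_column(11, birth_years))
```

The printed block can be pasted into a coding query file; keeping the table in a spreadsheet or CSV file makes it easier to check than a long hand-written column.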