Coding

contents of this chapter:

what is coding?
the York project
coding file example
output file example
how to search coding nodes
just the codes

what is coding?

Coding is used for creating input to multivariate analysis programs like varbrul or datadesk. If you're not using such a program, you don't need to read this chapter.

the York project

The development of the coding portion of CorpusSearch has been funded under a grant from the English Arts and Humanities Research Board to Anthony Warner and Susan Pintzuk at the University of York, England.

coding file example

Here's an example of a basic coding file, written by Ann Taylor. It's called "obj.c". All coding file names must end with ".c". If you don't understand the first line, "define: Ann.def", see definition files. This is just the beginning of the file: it goes on to describe 9 columns, but to save space I'm only showing the first 3:

define: Ann.def

1: {
        s: (IP-SPE* iDoms NP-OB*)
        n: ELSE
   }

2: {
        m: (IP-MAT* iDoms NP-OB*)
        s: (IP-SUB* iDoms NP-OB*)
        i: (IP-INF* iDoms NP-OB*)
        e: ELSE

   }

3: {
        t: ((IP* iDoms NEG)
          AND (NEG iDoms !ne))
        p: (IP* iDoms !NEG)
        n: ELSE
   }

In general, coding files have this form:

column_number: {
	label: condition
	label: condition 
	.
	.
	.
	}

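So, in the example above, column 1 of the coding node will contain an "s" if the sentence satisfies (IP-SPE* iDoms NP-OB*), that is, if a node whose label begins with IP-SPE immediately dominates a node whose label begins with NP-OB. Otherwise, the column will contain an "n".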
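The way a label is chosen for each column can be pictured as a first-match lookup. The Python sketch below is only an illustration, not CorpusSearch code. It assumes that conditions are tried in the order they are listed and that ELSE always matches, and it replaces real tree searches with a hypothetical set of (parent, child) pairs. The names DOMINANCE, idoms, code_column and coding_string are invented for this sketch, and only the first two columns of "obj.c" are modelled.

```python
from fnmatch import fnmatchcase

# Hypothetical stand-in for a parsed sentence: each pair means
# "parent immediately dominates child". The pairs mirror the top level
# of the IP-SUB sentence used in the output file example.
DOMINANCE = {("IP-SUB", "NP-SBJ"), ("IP-SUB", "VBD"), ("IP-SUB", "NP-OB1")}


def idoms(parent_pat, child_pat, pairs=DOMINANCE):
    """True if some pair matches both "*" wildcard patterns (assumed semantics)."""
    return any(fnmatchcase(p, parent_pat) and fnmatchcase(c, child_pat)
               for p, c in pairs)


def ELSE():
    """ELSE is modelled as a condition that always holds."""
    return True


# One ordered list of (label, condition) per column, following "obj.c".
COLUMNS = [
    [("s", lambda: idoms("IP-SPE*", "NP-OB*")),
     ("n", ELSE)],
    [("m", lambda: idoms("IP-MAT*", "NP-OB*")),
     ("s", lambda: idoms("IP-SUB*", "NP-OB*")),
     ("i", lambda: idoms("IP-INF*", "NP-OB*")),
     ("e", ELSE)],
]


def code_column(rules):
    """Return the label of the first rule whose condition holds."""
    for label, condition in rules:
        if condition():
            return label
    # What CorpusSearch does without ELSE is not modelled; the sketch just fails.
    raise ValueError("no rule matched and no ELSE was given")


def coding_string(columns):
    """Join one label per column with colons, as in a coding node."""
    return ":".join(code_column(rules) for rules in columns)


print(coding_string(COLUMNS))  # n:s
```

With all of the columns in the coding file, the same loop would add one more label per column to the string.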

Coding files are used instead of query files. So, to code a file, use this command:

java CorpusSearch <coding_file> <file_to_code>

output file example

Output files resulting from coding are labelled ".cod". They contain every sentence or node of the input file, with coding nodes inserted. Here's a sentence from the output file resulting from the above coding file:

/~*
knewe kyndes & complexciones of men & of bestus
(CMHORSES,85.2)
*~/


(0 NODE (0 CODING n:s:p)
        (1 IP-SUB (2 NP-SBJ *T*-1)
                  (3 VBD knewe)
                  (4 NP-OB1 (5 NS kyndes)
                            (6 CONJ &)
                            (7 NS complexciones)
                            (8 PP
                                  (9 PP (10 P of)
                                        (11 NP (12 NS men)))
                                  (13 CONJP (14 CONJ &)
                                            (15 PP (16 P of)
                                                   (17 NP (18 NS bestus)))))))
        (19 ID CMHORSES,85.2))

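The coding node occupies a position like that of the ID node: it is outside of the parsed sentence but inside the "wrapper", the extra set of parentheses surrounding the sentence or node. The coding string holds one label per column, separated by colons. Here, n:s:p means "n" in column 1 (no IP-SPE* immediately dominates an NP-OB*), "s" in column 2 (IP-SUB immediately dominates NP-OB1), and "p" in column 3 (the IP contains no NEG, and it immediately dominates a node that is not NEG).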
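To make the position of the coding node concrete, here is a small Python sketch that reads a wrapper and lists the labels of its immediate children. It is an illustration under assumptions, not part of CorpusSearch: the helper names parse and label are invented, the sample tree is an abbreviated copy of the sentence above, and the parser handles only a single well-formed bracketed tree.

```python
import re

# Abbreviated copy of the output sentence above (the object NP is left out).
SAMPLE = """
(0 NODE (0 CODING n:s:p)
        (1 IP-SUB (2 NP-SBJ *T*-1)
                  (3 VBD knewe))
        (19 ID CMHORSES,85.2))
"""


def parse(text):
    """Parse one bracketed tree into nested lists of string tokens."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    stack = [[]]
    for tok in tokens:
        if tok == "(":
            stack.append([])          # open a new node
        elif tok == ")":
            node = stack.pop()        # close it and attach it to its parent
            stack[-1].append(node)
        else:
            stack[-1].append(tok)
    return stack[0][0]


def label(node):
    """Return a node's label, skipping a leading node number if there is one."""
    atoms = [t for t in node if isinstance(t, str)]
    if len(atoms) > 1 and atoms[0].isdigit():
        return atoms[1]
    return atoms[0]


wrapper = parse(SAMPLE)
children = [label(child) for child in wrapper if isinstance(child, list)]
print(children)  # ['CODING', 'IP-SUB', 'ID']
```

The CODING and ID nodes turn up as sisters of the parsed sentence, directly under the wrapper.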

how to search coding nodes

Coding nodes may be searched using column. For instance, to find all sentences whose coding node contains "m" or "p" in the 7th column, use this query:

query:  (CODING column7 m|p)
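What such a query tests can be sketched in a few lines of Python. This is a rough model only: it assumes columns are counted from 1, that the coding string is split on colons, and that each alternative separated by "|" must equal the column's label exactly. The function name column_matches and the seven-column coding string are invented for the illustration.

```python
def column_matches(coding, column, alternatives):
    """True if the 1-based column of a colon-separated coding string
    holds one of the "|"-separated alternatives."""
    fields = coding.split(":")
    if not 1 <= column <= len(fields):
        return False                  # the coding string has no such column
    return fields[column - 1] in alternatives.split("|")


# Hypothetical seven-column coding string.
print(column_matches("n:s:p:a:b:c:m", 7, "m|p"))
# The three-column string from the output example has no 7th column.
print(column_matches("n:s:p", 7, "m|p"))
```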

just the codes

Susan Pintzuk has written the following perl script to extract the coding information from CorpusSearch output:

#!/usr/local/bin/perl

#Usage: make_cs inputfile > outputfile
#this script takes a coded CorpusSearch file and outputs
#only the coding strings in the following format:
# (f:f:f:f:
# (f:f:f:f:
#the outputfile should then be imported to a word processor
#and the colons removed for varbrul or replaced by
#tabs for datadesk


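while (<>) {
        # find a numbered coding node such as "(0 CODING n:s:p)" and
        # capture the coding string in the same match, so $1 is never
        # left over from an earlier line
        if (/\(\d+\s+CODING\s+([^\)]+)\)/) {
                print "($1\n";
        }
}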
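The comments in the script leave the last step, removing the colons or turning them into tabs, to a word processor. As a sketch of how that step could be automated, here is a Python version that extracts the coding strings and converts them. It is not Susan Pintzuk's script: the function names are invented, and keeping the opening parenthesis in both output formats is an assumption based on the script's comments.

```python
import re

# A numbered coding node such as "(0 CODING n:s:p)"; group 1 is the coding string.
CODING_RE = re.compile(r"\(\d+\s+CODING\s+([^)]+)\)")


def extract_codes(lines):
    """Yield "(" plus the coding string for every line holding a coding node."""
    for line in lines:
        match = CODING_RE.search(line)
        if match:
            yield "(" + match.group(1)


def for_varbrul(code):
    """Remove the colons, as the script's comments describe for varbrul."""
    return code.replace(":", "")


def for_datadesk(code):
    """Replace the colons by tabs, as the script's comments describe for datadesk."""
    return code.replace(":", "\t")


sample = [
    "(0 NODE (0 CODING n:s:p)",
    "        (1 IP-SUB (2 NP-SBJ *T*-1)",
]
codes = list(extract_codes(sample))
print(codes)                                  # ['(n:s:p']
print([for_varbrul(c) for c in codes])        # ['(nsp']
print([for_datadesk(c) for c in codes])       # ['(n\ts\tp']
```

To run it on a real ".cod" file, pass an open file object to extract_codes in place of the sample list.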
