Basic concepts

What is CorpusSearch?

The basic purpose of CorpusSearch is to find linguistic structures in a corpus of parsed sentences. It also has several other features, including support for:

automatic creation of coding strings for statistical analysis
automatic corpus revision
automatic creation of local "frames" for words in a corpus
automatic creation of a lexicon for a corpus

Input to CorpusSearch

CorpusSearch needs two pieces of information:

one or more source files - a corpus of sentences to search
a command file - a specification of what structures to search for

Source file(s)

A source file is any file of parsed sentences that satisfies CorpusSearch's compatibility requirements. This could be a file from the Penn Parsed Corpora of Historical English, a file from another parsed corpus, or a file of parsed sentences that the user has assembled by cutting and pasting.

CorpusSearch allows output files from previous searches to serve as source files for subsequent searches.

Multiple source files can be searched in a single run.

Command file

The command file minimally contains a query, which specifies the structures being searched for, and a boundary node, which specifies the syntactic domain within which the search is to take place. Beyond that, the command file can also include further specifications, mainly concerning the format of the output.

Output of CorpusSearch

Ordinary search output

An ordinary output file contains the structures from the source file(s) that match the specifications in the search query. The output can also include information pinpointing where the structures were found, which is useful for very long sentences. CorpusSearch reports three statistics: the number of structures matching the search query ("hits"), the number of sentence tokens containing hits ("tokens"), and the total number of tokens in the file.

The number of hits may change depending on the definition of the boundary node.
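The effect of the boundary node on the hit count can be illustrated with a toy model in Python. This is only a sketch and not CorpusSearch's own counting procedure: it assumes that each boundary node containing at least one matching structure counts as one hit, and the names `parse`, `subtrees`, and `count_hits`, as well as the sample tree, are invented for the example.

```python
import re
from fnmatch import fnmatchcase

def parse(text):
    """Parse one bracketed tree into (label, children); leaves are plain strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0
    def node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)
    return node()

def subtrees(tree):
    """Yield the tree itself and every phrase or POS node below it."""
    yield tree
    for child in tree[1]:
        if isinstance(child, tuple):
            yield from subtrees(child)

def count_hits(tree, boundary, target):
    """Assumption: one hit per boundary node that contains the target label."""
    hits = 0
    for candidate in subtrees(tree):
        if fnmatchcase(candidate[0], boundary):
            inside = list(subtrees(candidate))[1:]   # proper descendants only
            if any(n[0] == target for n in inside):
                hits += 1
    return hits

tree = parse("(IP-MAT (NP-SBJ (PRO she)) (VBD said) "
             "(CP-THT (C that) (IP-SUB (NP-SBJ (PRO he)) (VBD left))))")

hits_any_ip = count_hits(tree, "IP*", "NP-SBJ")      # IP-MAT and IP-SUB both qualify
hits_matrix = count_hits(tree, "IP-MAT", "NP-SBJ")   # only the matrix clause qualifies
print(hits_any_ip, hits_matrix)
```

Under this assumption, the same sentence yields two hits when every IP is a boundary node but only one when the boundary node is restricted to the matrix clause.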

Coding output

CorpusSearch can add coding strings to a corpus; these strings are suitable as input to multivariate statistical analysis. Each coding string consists of columns associated with various properties of interest to the researcher, and the values for the columns are generated automatically according to specifications in a special type of query known as a coding query.

For instance, the value in the first column of a coding string might encode the syntactic category of a sentence's first constituent. The second and third columns might encode the same information for the sentence's second and third constituents, respectively. The information from all three columns could then be used to calculate the frequency of basic word order patterns in the corpus ("SVO", "SOV", etc.). (In principle, the same statistics could also be obtained from the output of multiple ordinary search queries, but that process would be much more laborious and prone to human error.)
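The word order example can be sketched in Python as follows. This is an illustration of the idea only, not the output of a real coding query: the colon-separated column layout, the category codes, the constituent labels, and the names `CATEGORY_CODES` and `coding_string` are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical mapping from constituent labels to one-letter column values.
CATEGORY_CODES = {"NP-SBJ": "S", "VBD": "V", "VBP": "V", "NP-OB1": "O"}

def coding_string(constituents):
    """Encode the first three constituents of a sentence as three columns."""
    columns = [CATEGORY_CODES.get(label, "x") for label in constituents[:3]]
    columns += ["-"] * (3 - len(columns))   # pad sentences with fewer constituents
    return ":".join(columns)

# Each sentence is reduced to the labels of its top-level constituents, in order.
sentences = [
    ["NP-SBJ", "VBD", "NP-OB1"],
    ["NP-SBJ", "NP-OB1", "VBD"],
    ["NP-SBJ", "VBP", "NP-OB1"],
    ["NP-SBJ", "VBD"],
]

codes = [coding_string(s) for s in sentences]
pattern_counts = Counter(codes)

for pattern, count in pattern_counts.most_common():
    print(pattern, count)
```

Once every sentence carries such a string, the frequency of each word order pattern follows from a simple tally over the three columns.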

Corpus revision output

CorpusSearch can produce a copy of a corpus in which certain structures are automatically revised according to user specifications. This feature can be used to:

apply global changes in annotation guidelines to an entire corpus
automatically correct systematic annotation errors, or at least flag possible errors for human review
build parsed corpora from scratch out of POS-tagged corpora (for instance, in situations where no parsed training data exist)

Frames output

CorpusSearch can generate the set of local frames for given words. The frame of a word is defined as the syntactic sisters of the word's POS tag. Frames can be helpful in constructing word classes, for instance in comparing the distribution of double-object verbs (The children gave their parents a present) and double-complement verbs (The children gave a present to their parents).
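The definition of a frame can be made concrete with a short Python sketch. It is not CorpusSearch's implementation and does not reproduce its output format; the names `parse`, `is_pos_node`, and `frames`, and the two sample trees with their labels, are invented for illustration.

```python
import re

def parse(text):
    """Parse one bracketed tree into (label, children); leaves are plain strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0
    def node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)
    return node()

def is_pos_node(n, word):
    """True if n is a POS node whose only child is the given word."""
    return isinstance(n, tuple) and n[1] == [word]

def frames(tree, word):
    """Return (POS tag, [sister labels]) for every occurrence of word."""
    found = []
    children = tree[1]
    for i, child in enumerate(children):
        if is_pos_node(child, word):
            sisters = [c[0] for j, c in enumerate(children)
                       if j != i and isinstance(c, tuple)]
            found.append((child[0], sisters))
        elif isinstance(child, tuple):
            found.extend(frames(child, word))
    return found

double_object = parse("(IP (NP-SBJ (D The) (N children)) (VBD gave) "
                      "(NP-OB2 (PRO$ their) (N parents)) (NP-OB1 (D a) (N present)))")
double_complement = parse("(IP (NP-SBJ (D The) (N children)) (VBD gave) "
                          "(NP-OB1 (D a) (N present)) "
                          "(PP (P to) (NP (PRO$ their) (N parents))))")

print(frames(double_object, "gave"))
print(frames(double_complement, "gave"))
```

In this sketch the two uses of "gave" receive different frames (two noun phrase objects versus a noun phrase and a prepositional phrase), which is the kind of contrast that frames are meant to expose.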

Lexicon output

CorpusSearch can generate a lexicon for a corpus. The output is a list of every word in the corpus along with the number of times it occurs under each POS tag that it can have.
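The idea behind the lexicon can be approximated in a few lines of Python. This is a minimal sketch, not CorpusSearch's implementation or output format: it assumes Penn-style bracketed trees in which every word appears as `(TAG word)`, it lowercases words (an assumption made here for simplicity), and the names `PRETERMINAL` and `build_lexicon` and the sample trees are hypothetical.

```python
import re
from collections import Counter, defaultdict

# Matches a POS node of the form "(TAG word)" with no nested brackets.
PRETERMINAL = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")

def build_lexicon(trees):
    """Map each word to a Counter of the POS tags it occurs under."""
    lexicon = defaultdict(Counter)
    for tree in trees:
        for tag, word in PRETERMINAL.findall(tree):
            lexicon[word.lower()][tag] += 1
    return lexicon

trees = [
    "(IP (NP-SBJ (D the) (N children)) (VBD gave) "
    "(NP-OB2 (PRO$ their) (N parents)) (NP-OB1 (D a) (N present)))",
    "(IP (NP-SBJ (PRO they)) (VBP present) (NP-OB1 (D the) (N gift)))",
]

lexicon = build_lexicon(trees)
for word in sorted(lexicon):
    print(word, dict(lexicon[word]))
```

In the sample data, "present" is counted once as a noun and once as a verb, which shows why the counts are kept per POS tag and not per word alone.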