What Is CorpusSearch?

contents of this chapter:

what is CorpusSearch?
input to CorpusSearch

source file(s)
command file

output of CorpusSearch

output file
complement file

what is CorpusSearch?

CorpusSearch is a program that searches for linguistic structures in a corpus of parsed, labelled sentences.

input to CorpusSearch

CorpusSearch needs two pieces of information:

what sentences to search (source file(s)).
what structures to search for (command file).

source file(s)

A source file is any file that contains parsed, labelled sentences. This could be a file from the Middle English (or other) corpus, an output file from a previous search, or a file of sentences that the user has cut and pasted together. Any number of source files can be searched in one run of CorpusSearch.

command file

The command file contains a query, which describes the structures to search for, and may contain additional settings. These settings can specify the boundary nodes within which to search and can select options for printing the output.
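As a rough illustration of the idea, the sketch below reads a command file as a series of `key: value` lines, modelled on the `print_complement: true` line shown later in this chapter. The helper `parse_command_file`, the `node` and `query` keys, and their values are assumptions made for the example; they are not taken from this chapter and may not match the real command-file syntax.

```python
def parse_command_file(text):
    """Parse 'key: value' lines into a dict (hypothetical helper)."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        key, sep, value = line.partition(":")
        if not sep:
            continue  # ignore lines that are not 'key: value'
        settings[key.strip()] = value.strip()
    return settings


# Illustrative command file. Only print_complement appears in this chapter;
# the node and query lines are assumed for the sake of the example.
example = """\
node: IP*
query: (NP* iDominates N)
print_complement: true
"""

settings = parse_command_file(example)
wants_complement = settings.get("print_complement", "false") == "true"
```

Here the query is the one required piece, and the other lines stand for the optional settings described above.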

output of CorpusSearch

CorpusSearch always prints a standard output file and, optionally, a complement file.

output file

The output file contains the sentences found to contain the searched-for structure, along with comments describing where the structures were found. It also reports three statistics: the number of "hits" (distinct boundary nodes containing the structure), the number of sentences containing hits, and the total number of sentences in the file. Note that the number of hits may change depending on how the boundary node is defined.
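The dependence of the hit count on the boundary node can be seen in a small sketch. It is a simplified model, not CorpusSearch's own algorithm: trees are nested tuples of the form `(label, child, ...)`, the "structure" is just the presence of a node with a given label, and a label prefix stands in for a boundary-node pattern. The function names and the toy sentences are all assumptions made for the example.

```python
def subtrees(tree):
    """Yield every labelled node in a tree given as (label, child, ...)."""
    yield tree
    for child in tree[1:]:
        if isinstance(child, tuple):
            yield from subtrees(child)


def contains(tree, target):
    """True if some node below `tree` carries the label `target`."""
    return any(
        node[0] == target
        for child in tree[1:]
        if isinstance(child, tuple)
        for node in subtrees(child)
    )


def count(sentences, boundary_prefix, target):
    """Return (hits, sentences with hits, total sentences)."""
    hits = 0
    sentences_with_hits = 0
    for sentence in sentences:
        # A hit is a distinct boundary node that contains the target.
        n = sum(
            1
            for node in subtrees(sentence)
            if node[0].startswith(boundary_prefix) and contains(node, target)
        )
        hits += n
        if n:
            sentences_with_hits += 1
    return hits, sentences_with_hits, len(sentences)


sentences = [
    ("IP-MAT",
     ("NP", ("N", "dogs")),
     ("VB", "think"),
     ("CP", ("IP-SUB", ("NP", ("N", "cats")), ("VB", "sleep")))),
    ("IP-MAT", ("VB", "go")),
]

narrow = count(sentences, "IP-MAT", "NP")  # only matrix clauses are boundaries
broad = count(sentences, "IP", "NP")       # any IP node is a boundary
```

With the narrow boundary, the first sentence yields one hit; with the broad boundary, the embedded clause counts as a second boundary node, so the same sentence yields two hits while the sentence counts stay the same.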

complement file

A complement file is produced if the command file contains this line:

print_complement: true

The complement file, if there is one, contains all the sentences in the source file that do not contain the searched-for structure. The output file and complement file are complementary sets that together contain all the sentences in the source file.
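The relationship between the two files can be sketched as a simple partition. This is a minimal illustration under stated assumptions: sentences are plain strings, and the `matches` test is a stand-in substring check, not a real CorpusSearch query.

```python
def partition(sentences, matches):
    """Split sentences into (output, complement) according to `matches`."""
    output, complement = [], []
    for sentence in sentences:
        (output if matches(sentence) else complement).append(sentence)
    return output, complement


source = [
    "(IP (NP dogs) (VB bark))",
    "(IP (VB go))",
    "(IP (NP cats) (VB sleep))",
]

# Stand-in for a query: does the sentence contain an NP node?
output, complement = partition(source, lambda s: "(NP " in s)
```

Every source sentence lands in exactly one of the two lists, so the lists never overlap and together they reproduce the source.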
