Definition files


What do definition files do?

From a purely formal point of view, definition files are an optional convenience. In real-life research projects, however, it is best to think of them as a resource that should be exploited whenever possible.

In the course of doing research on, say, the verbal syntax of English, you might find yourself making reference to finite verbs over and over again. You might start out with a disjunction like this:

BE[DP]|DO[DP]|HV[DP]|MD|VB[DP]

Then you realize that given the complexity of the annotation system you're using, you need to include versions of the items above with prefixed particles or other material.

*+BE[DP]|*+DO[DP]|*+HV[DP]|*+MD|*+VB[DP]

So now your list of search terms includes both the simple and the prefixed variants.

BE[DP]|DO[DP]|HV[DP]|MD|VB[DP]|*+BE[DP]|*+DO[DP]|*+HV[DP]|*+MD|*+VB[DP]

The items on the list have to be separated by a single pipe symbol ("|") to be interpreted correctly. Two adjacent pipe symbols will cause CorpusSearch to crash, but the inadvertent absence of a pipe symbol is more problematic: CorpusSearch will run, but it will not give you the correct results. For instance, the following disjunction will fail to find any instances of simple finite BE or finite DO. Can you see why? (The answer is given at the end of this document.)

BE[DP]DO[DP]|HV[DP]|MD|VB[DP]|*+BE[DP]|*+DO[DP]|*+HV[DP]|*+MD|*+VB[DP]
If you have several queries that are intended to refer to the same search term, it can become a job in itself to ensure that those terms are defined consistently across all your queries. Wouldn't it be nice if there were a single place to define all your search terms, so that when you make revisions, they affect all queries making reference to those terms in a uniform way?

There is such a place. It's called a definition file.

Content of definition files

A definition file is simply a file that lists the labels you want to group together, with each list paired with a definition (an abbreviation or alias) that lets you refer to it. Here's an example:

// definitions for Middle English

finite_verb:    BE[DP]|DO[DP]|HV[DP]|MD|VB[DP]|*+BE[DP]|*+DO[DP]|*+HV[DP]|*+MD|*+VB[DP]
nonfinite_verb: BE|DO|HV|MD0|VB|*+BE|*+DO|*+HV|*+MD0|*+VB
The definition on the left must be an orthographic word (that is, not contain spaces). It is followed by a colon, and then the list of labels that it stands for. Each definition must be associated with a list unique to it. So the following definition file is not good, because the definition "finite_verb" is ambiguous:
finite_verb:  BE[DP]
finite_verb:  VB[DP]
However, a list on the right can be associated with more than one definition, as in the following example:
obj:     NP-OB1*|NP-OB2*
object:  NP-OB1*|NP-OB2*
This can be useful, as discussed further in Recursive definitions.
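
The two constraints described above (a definition must be a single word, and it may be defined only once) can be checked mechanically before a search is run. The following Python sketch is not part of CorpusSearch; the function name check_definitions is hypothetical, and the sketch assumes that lines beginning with "//" are comments, as in the example above.

```python
def check_definitions(text):
    """Return a list of problems found in the text of a definition file."""
    problems = []
    seen = {}  # name -> line number of its first definition
    for lineno, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        # Skip blank lines and comment lines (assumed to start with "//").
        if not stripped or stripped.startswith("//"):
            continue
        name, colon, _labels = stripped.partition(":")
        if not colon:
            problems.append(f"line {lineno}: no colon found")
            continue
        name = name.strip()
        if not name or any(ch.isspace() for ch in name):
            problems.append(f"line {lineno}: name {name!r} is not a single word")
        if name in seen:
            problems.append(
                f"line {lineno}: {name!r} already defined on line {seen[name]}"
            )
        else:
            seen[name] = lineno
    return problems


bad = "finite_verb:  BE[DP]\nfinite_verb:  VB[DP]\n"
good = "obj:     NP-OB1*|NP-OB2*\nobject:  NP-OB1*|NP-OB2*\n"
print(check_definitions(bad))   # reports the second "finite_verb"
print(check_definitions(good))  # no problems: two names may share one list
```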

Calling definition files

Definition files must have the extension .def, and they must be stored in the same directory as the command file. They are called by including a line like the following in a query file.
define: ppche.def

An entire query might then read as follows:

node:   IP*

define: ppche.def

query:     (finite_verb hasSister nonfinite_verb)
       AND (finite_verb precedes nonfinite_verb)

The "define" command instructs CorpusSearch where to find the definitions for "finite_verb" and "nonfinite_verb". Without this command, "finite_verb" and "nonfinite_verb" would be read as literal strings and not be replaced by their definitions. Most likely, this would result in CorpusSearch reporting no hits for the search (since most parsed corpora will not contain instances of the literal strings "finite_verb" or "nonfinite_verb").
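
The substitution described above can be pictured as a whole-word replacement over the query. The Python sketch below only illustrates that idea and is not CorpusSearch's actual implementation; the helper name substitute and the rule that search terms are delimited by whitespace and parentheses are assumptions, and the definitions are abbreviated.

```python
import re

# Abbreviated versions of the definitions in ppche.def.
DEFINITIONS = {
    "finite_verb": "BE[DP]|DO[DP]|HV[DP]|MD|VB[DP]",
    "nonfinite_verb": "BE|DO|HV|MD0|VB",
}


def substitute(query, definitions):
    """Replace each whole search term that names a definition."""
    # A term is a maximal run of characters other than whitespace and
    # parentheses, so "finite_verb" inside "nonfinite_verb" is not touched.
    return re.sub(
        r"[^\s()]+",
        lambda m: definitions.get(m.group(0), m.group(0)),
        query,
    )


query = "(finite_verb hasSister nonfinite_verb)"
print(substitute(query, DEFINITIONS))
# With no definitions loaded, the names stay in the query as literal strings.
print(substitute(query, {}))
```

Replacing whole terms rather than substrings matters here, because the name "finite_verb" is contained in the name "nonfinite_verb".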

It is also possible to call a definition file from your preference file. As usual, specifications in a query file override corresponding specifications in a preference file.

A directory must not contain more than one preference file, but it can contain multiple definition files. This is very useful in connection with running the same queries on corpora with different annotation labels; see Some reasons to use definition files.

Reference to definition files in output files

The output from a search that calls a definition file will include a preface along the following lines:
/*
    PREFACE:
    CorpusSearch copyright Beth Randall 2000.
    Date:  Thu Apr 13 08:57:07 EDT 2000

    command file:       search.q
    output file:        search.out

    definition file:  ppche.def
    node:   IP*
    query:  (BE[DP]|DO[DP]|HV[DP]|MD|VB[DP]|*+BE[DP]|*+DO[DP]|*+HV[DP]|*+MD|*+VB[DP] precedes BE|DO|HV|MD0|VB|*+BE|*+DO|*+HV|*+MD0|*+VB)
*/
The entry under "definition file" gives the name of the definition file that CorpusSearch used when running the query in search.q. Since you might have made changes to the file between running the query and reviewing the output of the search, CorpusSearch also reports the actual query that it ran (that is, the query that resulted when it expanded the definition at runtime). This can be helpful in troubleshooting.

If a particular search reports no hits or suspiciously low numbers of hits for a particular search term, this is likely due to:

Recursive definitions

Definitions may be recursive, allowing complex definitions to be built up out of more basic ones. For instance:
finite_verb_simple:  BE[DP]|DO[DP]|HV[DP]|MD|VB[DP]
finite_verb_complex: *+BE[DP]|*+DO[DP]|*+HV[DP]|*+MD|*+VB[DP]
finite_verb:         $finite_verb_simple|$finite_verb_complex

If you can never remember whether you refer to finite verbs in your queries as "finite_verb" or "Vfin", you can use recursive definitions to render the question irrelevant by including lines like the following (best right after the non-recursive, basic definition, as shown here):

finite_verb:          $finite_verb_simple|$finite_verb_complex
Vfin:                 $finite_verb
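
To see what a recursive definition amounts to once it is fully expanded, the expansion can be modelled in a few lines of Python. This is a sketch under assumptions, not CorpusSearch's own code: the function name expand is hypothetical, and the check for circular definitions is an addition of the sketch (the behavior of CorpusSearch in that case is not described here).

```python
def expand(name, definitions, active=()):
    """Expand `name` into a flat, pipe-separated list of labels."""
    if name not in definitions:
        raise KeyError(f"undefined name: {name}")
    if name in active:
        chain = " -> ".join(active + (name,))
        raise ValueError(f"circular definition: {chain}")
    parts = []
    for item in definitions[name].split("|"):
        if item.startswith("$"):
            # A "$" prefix marks a reference to another definition.
            parts.append(expand(item[1:], definitions, active + (name,)))
        else:
            parts.append(item)
    return "|".join(parts)


definitions = {
    "finite_verb_simple": "BE[DP]|DO[DP]|HV[DP]|MD|VB[DP]",
    "finite_verb_complex": "*+BE[DP]|*+DO[DP]|*+HV[DP]|*+MD|*+VB[DP]",
    "finite_verb": "$finite_verb_simple|$finite_verb_complex",
    "Vfin": "$finite_verb",
}
print(expand("Vfin", definitions))
```

Both "Vfin" and "finite_verb" expand to the same ten-item disjunction of simple and prefixed finite verb labels.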

Some reasons to use definition files

As mentioned at the outset, definition files are an extremely useful tool, powerful and flexible at once, and we urge CorpusSearch users to use them whenever the search terms in their queries become even the slightest bit complex.

They are a powerful aid in enforcing consistency across queries of all sorts (whether ordinary, revision, or coding queries). Any revision to your search terms is made only once, in the definition file.

Definition files can greatly facilitate comparative searches across corpora from various languages or linguistic stages that use different annotation labels for the same (or very similar) linguistic concepts. For instance, in conducting research on Old English and later stages of English using the York corpora of Old English and the Penn Parsed Historical Corpora of English, one can set up distinct definition files because of the divergent annotation guidelines, but use query files that are identical in every respect but the "define" line.

// old-english.def

subj:        NP-NOM*
dir-obj:     NP-ACC*
indir-obj:   NP-DTV*
finite_be:   BE[DP]|BE[DP][IS]|BEPH


// later-english.def

subj:        NP-SBJ*
dir-obj:     NP-OB1*
indir-obj:   NP-OB2*
finite_be:   BE[DP]


// sample-search-old-english.q

node:   IP*

define: old-english.def

query:     (subj hasSister finite_be)
       AND (subj iPrecedes finite_be)


// sample-search-later-english.q

Same as the previous query file except for the "define" line, which would read "define: later-english.def".
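
Sharing one query across corpora only works if every definition file defines the same set of names. The Python sketch below checks this for the two files shown above; it is an illustration rather than a CorpusSearch feature, the function name defined_names is hypothetical, and lines beginning with "//" are assumed to be comments.

```python
def defined_names(text):
    """Return the set of names defined in the text of a definition file."""
    names = set()
    for line in text.splitlines():
        stripped = line.strip()
        # Skip blank lines and comment lines (assumed to start with "//").
        if not stripped or stripped.startswith("//"):
            continue
        name, colon, _labels = stripped.partition(":")
        if colon:
            names.add(name.strip())
    return names


old_english = """\
// old-english.def
subj:        NP-NOM*
dir-obj:     NP-ACC*
indir-obj:   NP-DTV*
finite_be:   BE[DP]|BE[DP][IS]|BEPH
"""

later_english = """\
// later-english.def
subj:        NP-SBJ*
dir-obj:     NP-OB1*
indir-obj:   NP-OB2*
finite_be:   BE[DP]
"""

# Names present in only one of the files would make a shared query
# behave differently across the two corpora.
only_in_one = defined_names(old_english) ^ defined_names(later_english)
print(only_in_one)
```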

Definition files can be used as a "poor person's lemmatizer" and "poor person's verb classifier" along the following lines. (The entries are very simplified; there are many more spelling variants to be considered in historical texts.)

give:  [gG][eiy][uv]e
gave:  [gG]ave
given: [gG][eiy][uv]en
send:  [sS]end|[sS]ende
sent:  [sS]ent|[sS]ente
GIVE:  $give|$gave|$given
SEND:  $send|$sent
double-object-verb: $GIVE|$SEND
In conjunction with corpus revision queries, these entries could be used to associate lemmas with verb forms. The following revision query gives outputs in the style of IcePAHC.
query: (V* iDoms {1}GIVE)

append_label{1}: =give
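
The lemmatizing idea can be tried out on a handful of word forms outside CorpusSearch. The Python sketch below is an approximation: it assumes that bracketed character sets behave like the shell-style wildcards of Python's fnmatch module, it writes the recursive definitions out in flat form, and the function name lemma_of is hypothetical.

```python
from fnmatch import fnmatchcase

# The GIVE and SEND definitions above, written out in full.
LEMMAS = {
    "give": "[gG][eiy][uv]e|[gG]ave|[gG][eiy][uv]en",
    "send": "[sS]end|[sS]ende|[sS]ent|[sS]ente",
}


def lemma_of(word):
    """Return the first lemma with an alternative matching `word`, else None."""
    for lemma, patterns in LEMMAS.items():
        if any(fnmatchcase(word, p) for p in patterns.split("|")):
            return lemma
    return None


for form in ["giue", "Gave", "sente", "yeuen", "take"]:
    print(form, lemma_of(form))
```

The form "yeuen" is not assigned a lemma, which illustrates why the simplified entries would need more spelling variants for historical texts.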


Answer to "Can you see why?": There's a pipe symbol missing after BE[DP]. Unlike the correct query, the query with the error instructs CorpusSearch to search for the expression "BE[DP]DO[DP]", which expands to BEDDOD, BEDDOP, BEPDOD, and BEPDOP. None of these labels exists in the corpus.
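
The effect of the missing pipe symbol can be reproduced with a small Python sketch. It assumes that the bracketed sets in a search term behave like the character sets of Python's fnmatch module, which only approximates CorpusSearch's matching; the function name matches and the shortened disjunctions are for illustration only.

```python
from fnmatch import fnmatchcase

correct = "BE[DP]|DO[DP]|HV[DP]"
broken = "BE[DP]DO[DP]|HV[DP]"  # pipe missing after BE[DP]


def matches(label, disjunction):
    """True if `label` matches any pipe-separated alternative."""
    return any(fnmatchcase(label, alt) for alt in disjunction.split("|"))


for label in ["BED", "DOP", "HVD", "BEDDOP"]:
    print(label, matches(label, correct), matches(label, broken))
```

BED and DOP match only the correct disjunction, HVD matches both, and the nonexistent label BEDDOP matches only the broken one.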