Troubleshooting


This section gives tips on troubleshooting a number of common problems and errors that arise when using CorpusSearch. The reader is assumed to have a general familiarity with the rest of the CorpusSearch manual. Some of the example queries refer to a definition file containing the following definition:
finite_verb:  BE[DP]|DO[DP]|HV[DP]|VB[DP]|MD

Adjacent pipe symbols

Adjacent pipe symbols (||) in a query will crash CorpusSearch. The adjacent pipe symbols may be in a definition file called by the query, rather than in the query itself. The error message will begin like this:
ERROR!  In Meat.CrankThrough:  
Exception:  String index out of range: 0
String index out of range: 0
java.lang.StringIndexOutOfBoundsException: String index out of range: 0
java.lang.StringIndexOutOfBoundsException: String index out of range: 0
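
Because the offending pipes may be in a definition file rather than in the query, it can save time to scan both before running the search. The following is a minimal sketch of such a check, not part of CorpusSearch itself; the function names are hypothetical, and it flags only the literal sequence "||".

```python
from pathlib import Path


def find_adjacent_pipes(text):
    """Return (line_number, line) pairs for lines that contain '||'."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if "||" in line
    ]


def scan_files(paths):
    """Print every line with adjacent pipes in the given query/definition files."""
    for path in paths:
        text = Path(path).read_text(encoding="utf-8")
        for number, line in find_adjacent_pipes(text):
            print(f"{path}:{number}: adjacent pipes: {line.strip()}")
```

A call such as scan_files(["query.q", "my.def"]) (file names assumed for illustration) would list each offending line with its file name and line number.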

Confusing (immediate) precedence and sisterhood

As mentioned in the entries for precedes and iPrecedes on the one hand and hasSister on the other, (immediate) precedence does not imply sisterhood, or vice versa. The two relations must be specified separately; see the entries for those commands for examples.

"Escaped" vs. bare asterisks

In connection with searching for expressions enclosed in asterisks like traces and empty categories, it is necessary to carefully distinguish between "\*" and "*". The first expression (the "escaped" asterisk) matches an asterisk character in the input, whereas the second expression matches any string (including the null string).

regular expression   matches input string                   does not match input string
\*T\*                *T*                                    *T*-1, *T239, T, The, ATE, VAT
\*T\**               *T*, *T*-1                             *T239, T, The, ATE, VAT
\*T*                 *T*, *T*-1, *T239                      T, The, ATE, VAT
*T*                  *T*, *T*-1, *T239, T, The, ATE, VAT    (none)
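
The distinction in the table above can be checked with a small model of the two kinds of asterisk. This is only a sketch under stated assumptions: it handles nothing but the bare "*" (any string, including the null string) and the escaped "\*" (a literal asterisk) as described in this section, it is case-sensitive, and it is not CorpusSearch's own matcher. The function names are hypothetical.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a pattern that uses only '*' and '\\*' into a compiled regex."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("\\*", i):
            parts.append(re.escape("*"))  # escaped asterisk: a literal '*'
            i += 2
        elif pattern[i] == "*":
            parts.append(".*")  # bare asterisk: any string, including the empty one
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def matches(pattern, label):
    """True if the whole label matches the pattern."""
    return wildcard_to_regex(pattern).fullmatch(label) is not None


INPUTS = ["*T*", "*T*-1", "*T239", "T", "The", "ATE", "VAT"]

for pattern in [r"\*T\*", r"\*T\**", r"\*T*", "*T*"]:
    hits = [label for label in INPUTS if matches(pattern, label)]
    print(pattern, "->", hits)
```

For each pattern, the printed list corresponds to the "matches input string" column of the table.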

Ill-formed input

Preliminary: Redirecting error output

When CorpusSearch encounters ill-formed input, it stops writing to the output file and starts writing to standard output, which by default is the terminal screen. When debugging, particularly in connection with mismatched parens, it is generally necessary to redirect the output to a named error file (say, "ERR") with a command along the following lines:
CS query.q file.psd >& ERR

Mismatched parens

The most seriously ill-formed input involves mismatched parens, which generally arise from editing a parsed file with an ordinary text editor like emacs rather than with Annotald or a similar program.
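
A quick way to narrow down where the parens go wrong is to count them independently of CorpusSearch. The following is a minimal sketch, assuming that parentheses occur in the file only as tree brackets (never as terminals); the function name is hypothetical.

```python
def check_parens(text):
    """Return a list of (line_number, message) problems found in text."""
    problems = []
    open_lines = []  # line numbers of '(' characters that are still unclosed
    for line_number, line in enumerate(text.splitlines(), start=1):
        for char in line:
            if char == "(":
                open_lines.append(line_number)
            elif char == ")":
                if open_lines:
                    open_lines.pop()
                else:
                    problems.append((line_number, "unmatched ')'"))
    for line_number in open_lines:
        problems.append((line_number, "unclosed '('"))
    return problems
```

The reported line is where the surplus paren sits or where the unclosed paren was opened, which is not necessarily the line that was damaged during editing, so treat it as a starting point for inspection.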

Missing wrapper parens

Wrapper parens are the unlabeled parens that delimit each token in Penn Treebank format, as shown in (1a). If the wrapper parens are missing, the token looks like (1b), and you will have to add the relevant parens in order to meet CorpusSearch's compatibility requirements.
(1) a.  ( (IP-MAT ... ))

    b.    (IP-MAT ... )
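
Tokens of the (1b) kind can be located before running a search. The sketch below assumes that the parens in the file are otherwise balanced and that a wrapped token always begins with an unlabeled paren followed (after optional whitespace) by another paren, as in (1a); the function name is hypothetical.

```python
def tokens_missing_wrapper(text):
    """Return the starting line numbers of top-level tokens that lack wrapper parens."""
    missing = []
    depth = 0
    line_number = 1
    for index, char in enumerate(text):
        if char == "\n":
            line_number += 1
        elif char == "(":
            if depth == 0:
                # A wrapped token starts "( (": the next non-blank character
                # after the opening paren is another paren, not a label.
                rest = text[index + 1:].lstrip()
                if not rest.startswith("("):
                    missing.append(line_number)
            depth += 1
        elif char == ")":
            depth = max(depth - 1, 0)
    return missing
```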

Other ill-formed input sequences

CorpusSearch helpfully breaks on various other ill-formed input sequences like the following:
(2) a.  ( (IP-MAT (NP-SBJ (PRO$ My) (N neighbor))
                  (VBD told)
                  me                                    ← bare word (missing preterminal)
          (. .)))

    b.  ( (IP-MAT (NP-SBJ (PRO$ My) (N neighbor))
                  (VBD told)
                  (NP-OB2 (PRO m e))                    ← terminal contains space
          (. .)))

    c.  ( (IP-MAT (NP-SBJ (PRO$ My                      ← preterminal isn't unary-branching
                                (N neighbor)))
                  (VBD told)
                  (NP-OB2 (PRO me)))
          (. .))

Missing asterisk

In general, it is wise to be liberal with asterisks and to omit them only if you are sure that you don't need them. For instance, "NP-SBJ" as a search term finds only a subset of subjects. It does not match subjects that are resumptive ("NP-SBJ-RSP"), that are coindexed with a clause ("NP-SBJ-1"), or that have any other additional material after the dash tag. Using "NP-SBJ*" will find all subjects, no matter what might be added on to the end of the label.

When you want to refer to all the variants of a label except for one or two, you have to explicitly list all the desired variants. For instance, if you are interested in all instances of "ADVP*" except for "ADVP-DIR", you must use a disjunction like the following:

ADVP|ADVP-LOC*|ADVP-TMP*

In such cases, best practice is to include the disjunction in a definition file.

Missing "define"

A common error is to intend to use a definition file, but to omit the requisite "define" command in the preamble. CorpusSearch issues no warning message, but the search will not yield the intended output because CorpusSearch interprets the strings intended as definitions as literal strings, which generally do not match anything in the input.

For more details and suggestions for troubleshooting, see Definition file.
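
Since CorpusSearch gives no warning in this situation, a check outside CorpusSearch can help. The following is a minimal sketch resting on two assumptions: that definitions have the form "name: value", as in the finite_verb example at the top of this section, and that the query file calls a definition file with a line beginning "define:". The function names are hypothetical.

```python
import re


def defined_names(definition_text):
    """Collect the names from lines of the form 'name: A|B|C'."""
    names = []
    for line in definition_text.splitlines():
        match = re.match(r"\s*([A-Za-z_][\w-]*)\s*:", line)
        if match:
            names.append(match.group(1))
    return names


def names_used_without_define(query_text, names):
    """Return the defined names that a query file uses without a 'define:' line."""
    has_define = any(
        line.strip().lower().startswith("define:")
        for line in query_text.splitlines()
    )
    if has_define:
        return []
    return [
        name
        for name in names
        # Match the name only as a whole search term, not inside a longer label.
        if re.search(rf"(?<![\w-]){re.escape(name)}(?![\w-])", query_text)
    ]
```

A non-empty result suggests that the query relies on a definition that CorpusSearch will read as a literal string.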

Missing prefix indices

A very common error is to forget to add prefix indices to arguments of a search function (in other words, to unintentionally impose same-instance). This is the chief cause of a baffling absence of hits. Here is an example of a query intended to find clauses in which both the subject and the object are pronouns.
query:    (NP-SBJ* iDoms PRO)
      AND (NP-OB1* iDoms PRO)

This query can never return any hits, because the two instances of PRO are interpreted by default as referring to the same node. But that configuration is not a possible tree structure, as one and the same node cannot be simultaneously dominated by the subject and the object. The two instances of PRO must be distinguished with prefix indices:

query:    (NP-SBJ* iDoms [1]PRO)
      AND (NP-OB1* iDoms [2]PRO)

Missing same-instance

Errors due to missing prefix indices are cases of unintentionally overusing same-instance. Same-instance can also be underused, as in the following query, intended to retrieve instances of V2 with clause-initial adverb phrases.
query:    (IP* iDomsFirst ADVP*)
      AND (finite_verb iPrecedes NP-SBJ*)

The intended instances will be retrieved, but so will unintended tokens like the following, where the finite verb and the subject are not clausemates of the adverb phrase, as clearly indicated by the node indices in the result block.

/*
1 IP-MAT:  1 IP-MAT, 2 ADVP-TMP, 16 BEP, 18 NP-SBJ
*/
(0  (1 IP-MAT (2 ADVP-TMP (3 ADV Yesterday))
	      (5 NP-SBJ (6 PRO they))
	      (8 VBD asked)
	      (10 , ,)
	      (12 " ")
	      (14 CP-QUE-MAT-SPE (15 IP-SUB-SPE (16 BEP Are)
						(18 NP-SBJ (19 PRO you))
						(21 VAG coming)
						(23 PP (24 P with)
						       (26 NP (27 PRO us)))))
	      (29 . ?)
	      (31 " ")))
The reason for this error is that the query fails to impose the clausemate condition. The solution is to "tie" the constituents in the second clause of the query to one or more constituents in the first clause of the query by exploiting same-instance. Here is one way of doing that:
query:    (IP* iDomsNumber 1 ADVP*)
      AND (IP* iDomsNumber 2 finite_verb)
      AND (IP* iDomsNumber 3 NP-SBJ*)

Here is another way:

query:    (IP* iDomsFirst ADVP*)
      AND (ADVP* hasSister finite_verb)
      AND (finite_verb hasSister NP-SBJ*)
      AND (ADVP* iPrecedes finite_verb)
      AND (finite_verb iPrecedes NP-SBJ*)

Redundant "exists"

The following query is not ill-formed, but it is inefficient.

query:     (NP-SBJ* exists) 
       AND (ADJ exists)
       AND (NP-SBJ* iDoms ADJ)

The final clause in the query implies the two preceding ones, and the same effect can therefore be obtained more simply with:

query:     (NP-SBJ* iDoms ADJ)

Stalled search

Before running a CorpusSearch job in the background, make sure there is no .out file corresponding to the query you intend to run. CorpusSearch ordinarily warns you that it is about to overwrite the existing .out file and prompts you to allow the search to go forward. But if you background the job, the prompt is backgrounded too and you never get a chance to permit the search. As a result, the command stalls and will have to be killed.

The above scenario is a common cause of searches that take an unexpectedly long time. (This can be confirmed with Unix/Linux's "jobs" command.)
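
The check for an existing .out file can be done before backgrounding the job. The following is a minimal sketch, assuming that the output file has the same name as the query file with the extension replaced by ".out" and sits in the same directory; the function name is hypothetical.

```python
from pathlib import Path


def existing_output_file(query_path):
    """Return the Path of an existing .out file for the query, or None."""
    out_path = Path(query_path).with_suffix(".out")
    return out_path if out_path.exists() else None
```

If the function returns a path, move or delete that file (or run the search in the foreground) so that the overwrite prompt cannot stall the job.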