Tips for corpus builders


Introduction

We would be remiss if we did not point out the obvious: Make sure to have a functioning backup system!

It is wise to distinguish between a private (work-in-progress) version of the corpus and a public release version. The public release version should conform to the relevant annotation guidelines (apart from known issues). The private version can (and in general does) contain more information than is available in the public version. In particular, it can contain provisional distinctions that facilitate the correction phase and that are collapsed in the public version. It can also serve as a testbed for annotation ideas and implement those ideas inconsistently or otherwise imperfectly. In understanding the relation between the private and the public version, it is important to recognize that, as we know from historical linguistics, mergers are irreversible. Once a distinction is merged in the private version, it is not recoverable. Collapsing distinctions and losing information in the private version is therefore not a step to take lightly, since restoring the information would by definition require duplicating the human effort involved in annotating the distinction the first time around.

For the foreseeable future, human correction of parsed corpora is necessary. However, compared to automatic processing, it is both time-consuming and error-prone. When building a corpus, it is therefore wise to be on the lookout for ways to add information at early stages of processing in order to inform later stages and to minimize human case-by-case review at those later stages.

When adding information, it is extremely useful to think in terms of normalization in the mathematical sense. Normalized structures are ones where conceptually similar cases are treated in a uniform and exceptionless way. Though normalized structures may diverge from standard orthographic convention, they are well worth introducing since they are conceptually and computationally so much more tractable than their non-normalized counterparts.

Using punctuation to guide the parser

In audio-aligned corpora, punctuation is not part of the underlying audio signal and must be added. In addition to its obvious purpose of providing conventional transcripts for human consumption, the process of adding punctuation to the text (henceforth, "punctuating") is fertile ground for adding information of the type just mentioned. Conventions concerning punctuation can greatly facilitate or inhibit downstream stages of processing and correcting, and it is therefore sensible and efficient to introduce conventions that diverge from standard orthographic practice when they result in normalized representations that are not completely outlandish-looking. By way of example, we discuss question marks and quotation marks below.

Question marks. In a series of questions that would standardly form a single orthographic sentence, standard orthographic practice would generally delimit the constituent questions (except for the last) using commas. It is also possible to delimit each individual constituent question using a question mark. The latter option is the normalized form (and in this particular case contains more information, though that is not always true of normalized forms), and it is therefore the one we implement.

( (CP-QUE-MAT (WNP-1 (WPRO What))
              (IP-SUB (MD would)
	              (NP-SBJ (PRO you))
		      (VP (VB like)
			  (NP-OB1 *T*-1)))
	      (PUNC ?)))				← like this; not COMMA or PERIOD

( (CP-QUE-MAT (CONJ And)
	      (WADVP-1 (WADV when))
	      (IP-SUB (MD would)
		      (NP-SBJ (PRO you))
		      (VP (VB like)
			  (IP-ECM (NP-SBJ (PRO it))
				  (VP (ADVP-TMP *T*-1)
				      (VAN delivered)))))
	      (PUNC ?)))

In connection with dangling conjunctions followed by BREAK, provisional token-internal question marks should be used to guide the parser in postulating a root CP-QUE-MAT. The token-internal question mark can be replaced by a comma in the published version of the corpus if desired.

( (CP-QUE-MAT (IP-SUB (BEP Are)
		      (NP-SBJ (PRO they))
		      (VP (VAG coming)))
	      (PUNC ?)
	      (CONJ or)
	      (CODE <BREAK>)))

Quotation marks. Another example concerns the addition of quotation marks in places where standard (modern) orthographic practice does not have them. See Double quotes for the annotation convention and Quotations below for considerations of implementation. Other ways that punctuation can be used to inform later stages of the annotation, not necessarily related to normalization, are discussed below.

Information can also be added at the tagging stage. Here are two examples:

In principle, empty terminals can be added at the punctuating stage, but the tagging stage is probably more sensible from an ergonomic point of view.

Marking constructions during corpus building

Compound words

Since compound word formation is productive, it is not really possible to list all compound words in a script to ensure that they are surrounded by the properly labelled brackets in the parsed structure. Moreover, it would be extremely useful to have compound words already identified as such in the input to the parser. A simple solution is to use a designated symbol, say CARET (^), to join words temporarily, as illustrated in the flowchart below:

Elaborations and clause-adjoined constituents

COLON (:) can be used to demarcate elaborations (ELAB) or clause-adjoined (-CAR) constituents. Depending on each corpus's conventions, COLON is either retained in the public version or replaced by COMMA.

( (NP (D a)
      (N problem)
      (PUNC :)
      (ELAB (NP (NP-POS (PRO$ her))
		(N inability)
		(IP-INF (TO to)
			(VP (VB play)
			    (NP-OB1 (N tennis))))))))

( (IP-MAT (NP-CAR (D a)
		  (ADJP (ADJ good))
		  (N example))
	  (PUNC :)
	  (NP-SBJ (QP (Q some))
		  (N people))
	  (VP (MD do@)
	      (NEG @n't)
	      (VP (VB own)
		  (NP-OB1 (D a)
			  (N car))))
	  (PUNC .)))

Elided forms

APOSTROPHE can be used to mark elided forms, notably word-initial truncation. This can be useful to distinguish an elided form like 'MOST (< ALMOST) from the otherwise homonymous quantifier MOST and to assist the tagger in giving them the proper tags (ADV and QS, respectively). Depending on each project's transcription conventions, the elided forms may be replaced by the full forms in the public corpus, or the elided forms with APOSTROPHE may be retained.
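
By way of illustration, a lookup of this kind could be consulted before tagging. This is a sketch only: the table contents beyond the 'MOST / MOST pair, and the function names, are hypothetical, and a real project would draw the entries from its own transcription conventions.

```python
# Elided form -> (full form, tag). Only the 'MOST entry comes from the
# discussion above; further entries would be project-specific.
ELIDED = {"'MOST": ("ALMOST", "ADV")}

# Homonymous full words whose tag differs from the elided form's tag.
PLAIN = {"MOST": "QS"}

def pretag(word: str):
    """Return a tag suggestion for the word, or None if the tables are silent."""
    key = word.upper()
    if key in ELIDED:
        return ELIDED[key][1]
    return PLAIN.get(key)

def expand(word: str) -> str:
    """Return the full form of an elided word; other words pass through."""
    entry = ELIDED.get(word.upper())
    return entry[0] if entry else word
```

Projects that retain the elided forms in the public corpus would use only pretag; projects that restore the full forms would also call expand when generating the release.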

Integrated adnominal IP-MAT

SEMI-COLON (;) is used to delimit integrated adnominal IP-MATs. Once it has served its function, it is replaced by COMMA.

Quotations (QTP)

In the parsed structure, direct speech is enclosed in QTP brackets. Since current parsers are unable to tell whether a particular sequence is direct speech or not, it is useful to delimit the left and right periphery of (what will eventually become) the QTP by adding DOUBLE QUOTEs. The annotator can add DOUBLE QUOTEs at any time before the parsing stage, probably most conveniently during punctuating. Since the transcript needs to be punctuated anyway, no significant human effort is added. But if marking direct speech and adding QTP is left to the parsing correction stage, the human effort is considerable, since the annotator there has to essentially duplicate the effort of the punctuating pass by reading each token to determine if it is an instance of direct speech.

It is possible to shave some time off the addition of double quotes at the punctuating stage by adding only initial quotes in tokens that will eventually be dominated by a root QTP. Any automatic procedure that encloses tokens delimited by initial and final quotation marks in QTP can be revised to add a final quotation mark in addition to the QTP. But it is possible to err on the side of too much cleverness, and it may be wiser and less confusing to everyone concerned to add both open and close quotes explicitly by hand.

It is important to understand that once the QTP brackets are part of the parsed structure, the quotation marks themselves can be deleted without loss of information. If they are kept, they are redundant. If deleted, they can be restored algorithmically (that is, by a script and without case-by-case human review).
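
The algorithmic restoration just mentioned might look roughly as follows. This is a sketch under stated assumptions: that the node label is exactly QTP (no dash tags or indices), that the quotation marks are to appear as (PUNC ") nodes at the left and right edge inside QTP, and that literal parentheses do not occur as unescaped words. The function name and the example tree are hypothetical.

```python
QUOTE_NODE = '(PUNC ")'

def add_quotes_around_qtp(tree: str, label: str = "QTP") -> str:
    """Insert (PUNC ") as first and last daughter of every node labelled `label`."""
    opener = "(" + label
    out = []
    stack = []  # one entry per open bracket: True if it opened a QTP node
    i = 0
    while i < len(tree):
        ch = tree[i]
        end = i + len(opener)
        if tree.startswith(opener, i) and end < len(tree) and tree[end].isspace():
            # Opening bracket of a QTP node: emit it plus the initial quote.
            out.append(opener + " " + QUOTE_NODE)
            stack.append(True)
            i = end
        elif ch == "(":
            out.append(ch)
            stack.append(False)
            i += 1
        elif ch == ")":
            # Closing bracket: if it closes a QTP node, emit the final quote first.
            if stack and stack.pop():
                out.append(" " + QUOTE_NODE)
            out.append(ch)
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)

bare = ("( (IP-MAT (NP-SBJ (PRO She)) (VBD said) "
        "(QTP (IP-MAT (NP-SBJ (PRO I)) (VBP agree))) (PUNC .)))")
quoted = add_quotes_around_qtp(bare)
```

Because the brackets alone determine where the quotation marks go, the same pass can be run at release time without any case-by-case review.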

Splitting and joining words

In general, splitting and joining words should be done by script and not by hand. However, for cases that have been overlooked by the relevant scripts, splitting and joining can be done at any stage of processing in accordance with the applicable conventions. It goes without saying that the cases should be added to the relevant scripts to expedite future processing.

Tag questions

TAG is used as a provisional disfluency marker to enclose tag questions (as opposed to other parenthetical questions). Once tag questions are marked with a TAG dash tag (CP-QUE-TAG), the TAG disfluency marker is replaced by PAREN.

Problematic characters

Exclamation point

As noted in the section on punctuation, the use of EXCLAMATION POINT (affectionately referred to as "bang" in programming circles) is deprecated because the character has a special meaning in many shells (for example, history expansion) and therefore often leads to problems in connection with regular expression searches.

A workaround is to provisionally replace EXCLAMATION POINT in the private version of the corpus with some innocuous expression that is not part of the corpus itself (say, XBANGX) and to have the script that generates the public release version of the corpus replace XBANGX by the punctuation mark.
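
A minimal sketch of the two directions of this substitution, in Python. The function names are hypothetical, and a real release script would operate on corpus files rather than on single strings.

```python
PLACEHOLDER = "XBANGX"  # must not occur anywhere in the corpus text itself

def hide_bang(text: str) -> str:
    """Private version: replace every exclamation point by the placeholder."""
    return text.replace("!", PLACEHOLDER)

def restore_bang(text: str) -> str:
    """Public version: put the exclamation points back."""
    return text.replace(PLACEHOLDER, "!")

private_leaf = hide_bang("(PUNC !)")
public_leaf = restore_bang(private_leaf)
```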

An alternative is to use XBANGX during the parsing correction phase to attach an appropriate dash tag to the root of the token in question, yielding, say, IP-MAT-EXL. Once this is done, XBANGX can be replaced by PERIOD, since the information it contains is now represented on the dash tag.

Quotation marks

Quotation marks, both single and double, can raise more or less serious complications for formatting and searching in the course of corpus building. The workaround is to replace them in the private corpus with innocuous sequences (SINGLE_QUOTE and DOUBLE_QUOTE, SQUO and DQUO, or the like) and to restore them in the public version.
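
As a sketch of how such a substitution could be made safe, the following Python fragment refuses to proceed if a placeholder already occurs in the text. The names SQUO and DQUO are taken from the suggestion above; the function names are hypothetical, and the sketch makes no attempt to tell single quotes from apostrophes.

```python
# Quotation mark -> placeholder used in the private corpus.
PLACEHOLDERS = {"'": "SQUO", '"': "DQUO"}

def hide_quotes(text: str) -> str:
    """Private version: replace quotation marks by placeholders."""
    for name in PLACEHOLDERS.values():
        # A pre-existing occurrence would make restoration ambiguous.
        if name in text:
            raise ValueError(f"placeholder {name} already occurs in the text")
    for mark, name in PLACEHOLDERS.items():
        text = text.replace(mark, name)
    return text

def restore_quotes(text: str) -> str:
    """Public version: put the quotation marks back."""
    for mark, name in PLACEHOLDERS.items():
        text = text.replace(name, mark)
    return text

private_text = hide_quotes('She said "yes"')
public_text = restore_quotes(private_text)
```

Since APOSTROPHE and the single quotation mark are typically the same character, a project that uses APOSTROPHE for elided or contracted forms would need to exempt those from the substitution.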