Splitting and joining words


At times, orthographic conventions conflict with the requirements of morphosyntactic annotation. We resolve such conflicts by splitting and joining orthographic words (sequences of characters without whitespace) so that morphosyntactic annotation can proceed. The conventions preserve the information needed to restore the conventional orthographic form, either with reference to a short list (in the case of clitics) or completely algorithmically (otherwise).

Splitting words

When orthographic words are split, the site of the split is marked by a trailing AT sign (@) on the first part and a leading AT sign (@) on the second.
( (IP-MAT (NP-SBJ (PRO We@))
	  (VP (BEP @'re)
	      (VP (VAG gon@)
		  (IP-INF (TO @na)
			  (VP (VB go)
			      (ADVP-TMP (ADV now))))))
	  (PUNC .)))

( (IP-IMP (VP (VBI Lem@)
	      (IP-ECM (NP-SBJ (PRO @me))
		      (VP (VB go))))
	  (PUNC .)))

( (IP-IMP (VP (VBI Let@)
	      (IP-ECM (NP-SBJ (PRO @'s))
		      (VP (VB go))))
	  (PUNC .)))

( (IP-MAT (NP-SBJ (D That))
	  (VP (MD wo@)
	      (NEG @n't)
	      (VP (DO do)))
	  (PUNC .)))
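Because both sides of a split carry an AT sign, the orthographic word can be restored mechanically. The following is a minimal Python sketch of that idea, not part of the corpus tooling; the names LEAF, leaves, and rejoin_split_words are hypothetical. It assumes that every terminal node has the shape "(TAG word)" and makes no attempt to filter out CODE nodes or traces.

```python
import re

# Assumption: a terminal node is "(TAG word)" with no nested parentheses.
LEAF = re.compile(r"\(([^()\s]+)\s+([^()\s]+)\)")


def leaves(tree: str) -> list[str]:
    """Return the words of all terminal nodes, in textual order."""
    return [word for _tag, word in LEAF.findall(tree)]


def rejoin_split_words(tokens: list[str]) -> list[str]:
    """Merge a token ending in '@' with a following token starting with '@'."""
    out: list[str] = []
    for tok in tokens:
        if out and out[-1].endswith("@") and tok.startswith("@"):
            # Drop both AT signs and glue the two parts together.
            out[-1] = out[-1][:-1] + tok[1:]
        else:
            out.append(tok)
    return out


tree = """( (IP-MAT (NP-SBJ (PRO We@))
  (VP (BEP @'re)
      (VP (VAG gon@)
          (IP-INF (TO @na)
                  (VP (VB go)
                      (ADVP-TMP (ADV now))))))
  (PUNC .)))"""

words = rejoin_split_words(leaves(tree))
print(words)  # ["We're", 'gonna', 'go', 'now', '.']
```

Since a merged token that still ends in an AT sign is merged again with the next part, the same loop would also handle a word split in more than one place.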

Joining words

It is not always obvious when to join two orthographic words and when to treat them as a compound word. When standard orthography is not a guide, as it is in the case of "mother-in-law" below, a helpful rule of thumb is to consider the function of the sequence in context and ask what POS label best fits that function. If none of the POS tags for the constituents matches that label, joining is the more intuitive option. For instance, in the first example below, "ten quart" modifies the noun, and ADJ is the best-fit label for the sequence. But neither "ten" nor "quart" is an adjective, so it is best to join. By contrast, "high school" has the same distribution as the noun "school", and one of its constituents is a noun. In this case, it is best not to join but to treat the sequence as a compound noun.

In addition to single orthographic words that need to be split, there is also the converse case of multi-word sequences that must be (or are best) treated as single orthographic words. There are two cases to consider: either the desired orthographic word reflects standard orthographic conventions or not.

In the first case, the text is revised in accordance with standard orthography by adding a hyphen.

( (NP (D a)
      (ADJP (ADJ ten-quart))				← like this
      (N bottle)))

( (NP (D a)
      (ADJP (ADJ-COMP (NUM ten) (N quart)))		← not like this
      (N bottle)))

( (NP (NP-POS (PRO$ her))
      (N mother-in-law)))				← like this

( (NP (NP-POS (PRO$ her))
      (N mother)					← not like this
      (PP (P in)
	  (NP (N law)))))

In the second case, we join the individual parts into a single orthographic word with an EQUALS sign (=). The original orthographic form can be recovered algorithmically by replacing a word-internal EQUALS sign with SPACE. (The EQUALS sign has a further use, described in Clitics, which precludes globally replacing EQUALS with SPACE.)
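The restriction to word-internal EQUALS signs can be sketched in Python as follows. This is an illustration only; the names INTERNAL_EQUALS and restore_joined are hypothetical. It assumes that "word-internal" means an EQUALS sign with an ordinary character on each side, so that leading or trailing EQUALS signs (the clitic use) and an EQUALS sign adjacent to an AT sign are left untouched.

```python
import re

# Assumption: an EQUALS sign is word-internal when the characters on both
# sides are neither whitespace, nor "=", nor "@".
INTERNAL_EQUALS = re.compile(r"(?<=[^=@\s])=(?=[^=@\s])")


def restore_joined(token: str) -> str:
    """Replace word-internal EQUALS signs with SPACE; leave others alone."""
    return INTERNAL_EQUALS.sub(" ", token)


print(restore_joined("you=all"))     # you all
print(restore_joined("your=all's"))  # your all's
print(restore_joined("=un"))         # =un  (clitic marker, unchanged)
print(restore_joined("a="))          # a=   (clitic marker, unchanged)
```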

The following is an exhaustive list of joined words and morphemes in the APPCAppE, with their POS tags; it can serve as a guide for the other corpora. The "Item" column gives citation forms; the forms that actually occur may be inflected (notably, for plural and possessive).

Item                              POS tag   Examples and notes
=all                              WPRO      what=all
                                  PRO       you=all
                                  PRO$      your=all's
ever=how                          WADV
ever=what, ever=which, every=who  WD, WPRO
every=which                       Q
=ful                              N         bag=ful, can=ful, glass=ful, jar=ful, pan=ful, pot=ful
=here                             D         these=here, this=here
how=come                          WADV
how=much                          N         as a synonym of AMOUNT; not as the mass noun counterpart to HOW MANY
kind=of                           ADV       also: kindly=of
like=to                           ADV
no=count                          ADJ
no=how                            ADV
sort=of                           ADV
such=like                         ADJR
=there                            C, D      that=there, them=there
=un                               N         old=un, young=un
=uns                              PRO       them=uns, we=uns, you=uns
up=raise                          VB        presumably a nonce use
used=to                           ADV
you=guys                          PRO

Certain difficult cases defy the conventions, and we annotate them as best we can (here, using X).

( (CP-QUE-MAT (CODE <KShepherd_xmin=344.64>)
	      (INTJ Um)
	      (CODE <$$KShepherd_xmax=345.03>)
	      (CODE <KShepherd_xmin=346.24>)
	      (WADVP-1 (WADV when))
	      (IP-QQQ (IGNORE-DOD did)
		      (NP-SBJ (PRO you=@)
			      (DIS-FS (FS-FS m-))
			      (CODE <$$KShepherd_xmax=346.93>)
			      (CODE <KShepherd_xmin=347.93>)
			      (X @all))
		      (VP (GT get)
			  (VP (ADVP-TMP *T*-1)
			      (VAN married))))
	      (PUNC ?)
	      (CODE <$$KShepherd_xmax=348.75>)))

Topics

Acronyms

In contrast to initials in (nick)names of persons, the letters in acronyms are grouped together as far as feasible. In APPCAppE and CoNyCE, the letters are not followed by trailing periods.

( (NP (N IV)))

( (NP (N TB)))

( (NP (N TV)))

( (NP (N-COMP (NPR US) (NPR Steel))))

( (NP (D the)
      (NPR UMWA)))

( (NP (D the)
      (N-COMP (NPR UMW)
	      (PP (P of)
		  (NP (NPR A))))))

( (NP (D the)
      (NPR USA)))

( (NP (D the)
      (N-COMP (NPR US)
	      (PP (P of)
		  (NP (NPR A))))))

Clitics

As discussed in Joining words, the EQUALS sign can be used to join two orthographic words in order to represent them as a single lexical item.

The EQUALS sign is also used to indicate the clitic status of items that are phonologically reduced, but do not form a lexical item with an adjacent orthographic word.

Apart from the cases in Joining words, we do not join clitic forms with their hosts because doing so would massively complicate the annotation system as well as the queries necessary to search the corpus. For instance, if "THIS =UN" were instead represented as "THIS=UN", the POS tag system would have to be complicated to include, in this case, a compound tag D+N. This would have the unwelcome result that CorpusSearch queries targeting D or N would need to include the new compound tag. Enforcing consistency across queries would also be more difficult.

The items treated as clitics in the APPCAppE are given below:

                         Corpus form   Conventional orthography
Proclitic (leans right)  a=            a- joined with following verb
Enclitic (leans left)    =ud           'd
                         =un(s)        'un(s)
                         =uv           've
                         =uz           's
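Restoring conventional orthography for clitics requires a lookup in this short list rather than a purely mechanical replacement. The Python sketch below illustrates one way such a lookup might work; the names ENCLITICS and restore_clitics are hypothetical. The form mapping is taken from the list above. Attaching every enclitic directly to the preceding word without a space is an assumption of this sketch, not a documented convention, and may not suit all items (notably 'un).

```python
# Corpus form -> conventional orthography, taken from the list above.
ENCLITICS = {
    "=ud": "'d",
    "=un": "'un",
    "=uns": "'uns",
    "=uv": "'ve",
    "=uz": "'s",
}


def restore_clitics(tokens: list[str]) -> list[str]:
    """Replace clitic corpus forms with conventional spellings.

    Assumption: enclitics attach to the preceding token without a space;
    the proclitic "a=" becomes "a-" joined with the following token.
    """
    out: list[str] = []
    pending = ""  # proclitic prefix waiting for its host
    for tok in tokens:
        key = tok.lower()
        if key == "a=":
            pending = "a-"
        elif key in ENCLITICS and out:
            out[-1] += ENCLITICS[key]
        else:
            out.append(pending + tok)
            pending = ""
    if pending:
        # A proclitic with no following host is kept as a bare prefix.
        out.append(pending)
    return out


print(restore_clitics(["I", "=uv", "been", "a=", "working"]))
# ["I've", 'been', 'a-working']
```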

Initials

Initial letters in (nick)names of persons are not treated as acronyms. Instead, each letter is treated as a separate terminal node. In APPCAppE and CoNYCE, there are no trailing periods (as with acronyms).
( (NP (N-COMP (NPR F) (NPR D) (NPR R))))

( (NP (N-COMP (NPR J) (NPR C) (NPR Hall))))