How the Penn Parsed Corpora of Historical English came to be at Penn

Abstract

In this presentation, I give a timeline of the Penn Parsed Corpora of Historical English (PPCHE) and discuss the various factors that contributed more or less directly to the development of this resource. These include:

- the core team of contributors,
- the intellectual ecosystem in linguistics and computational linguistics at the University of Pennsylvania,
- the origins and development of that ecosystem, including
  - the breakup of Bell Labs, with the resulting exodus of leading computational linguists to academia,
  - the Penn Treebank Project and related computational advances,
- in a more speculative vein, the philosophical tradition of American pragmatism.

I would also like to raise some outstanding issues, to be addressed, I hope, with the collaboration of the DIGS community.

Timeline of grants

Corpus or deliverable	Size in words	Personnel	Grant period	Funding source	Release date
Penn Treebank, Phase 1	1M, then 5M, with goal of demonstrating feasibility of 100M	Mitch Marcus, Beatrice Santorini, David Magerman, various annotators	1989–1992	DARPA	1999 (final release)
Penn-Helsinki Parsed Corpus of Middle English, 1st edition	510K	Tony Kroch, Ann Taylor	1990–1993	NSF, Head-complement word order in the history of the West Germanic clause	1994
Penn-Helsinki Parsed Corpus of Middle English, 2nd edition	1.3M	Tony Kroch, Ann Taylor	1996–1999	NSF, The historical syntax of Middle English from a comparative perspective	2000
CorpusSearch		Tony Kroch, Beth Randall	—	Beth Randall's M.S. thesis, corpus sales	ca. 1995–ca. 2015
Penn-Helsinki Parsed Corpus of Early Modern English	1.8M	Tony Kroch, Beatrice Santorini, Lauren Delfs	1999–2003	NEH, Creating an electronic corpus of Early Modern English	2005
Penn-Helsinki Parsed Corpus of Early Modern English	1.8M	Tony Kroch, Beatrice Santorini, Lauren Delfs	1999–2004	NSF, The emergence of Modern English syntax	2005
Penn Parsed Corpus of Modern British English, 1st edition	1M	Tony Kroch, Beatrice Santorini, Ariel Diertani (formerly Lauren Delfs)	2004–2011	NSF, A parsed historical corpus of Modern British English	2010
Penn Parsed Corpus of Modern British English, 1st edition	1M		2005–2006	NSF, Enriching parser output for treebank construction	2010
Penn Parsed Corpus of Modern British English, 2nd edition	3M		2010–2016	NSF, Testing and improving methods for efficient annotation through the construction of a large parsed corpus	2016

Dramatis personae

Tony Kroch

Originally trained as an anthropologist. Undergraduate thesis on the structure of myth; fieldwork in Senegal (Bassari) and Brazil (Kayapo).

Interest in linguistics via anthropology. Like some others at MIT, tries to avoid getting sucked into the Generative Syntax wars. Talent for syntax-semantics interface. Disenchanted with conventional generative syntax. Gets fired from University of Connecticut for anti-racist organizing. Moves to Philadelphia for a job at Temple University. Eventually quits that job and gets an NSF grant to conduct sociolinguistic interviews with the Philadelphia upper class to supplement Bill Labov's work on middle class and working class speech patterns. Already knows Bill from anti-racist organizing at the LSA.

Gets hired at Penn. Develops interest in do-support in connection with questions about the diffusion of linguistic change. Also develops close connections with Aravind Joshi and Mitch Marcus in Computer and Information Science (CIS).

The variationist connection

William Labov

Bill is important for our purposes because he's the reason that Tony got hired at Penn.
Bill had come to Penn from Columbia in order to study the Philadelphia sound system, with a view towards "using the present to understand the past".
Tony was interested in linguistic change for the same kinds of reasons, though not exactly the same reason. He thought that perturbations of a linguistic system were likely to yield insights into the structure of the system that were otherwise inaccessible. Like whacking a crystal and splitting it along a plane of cleavage.
Bill and Tony in Bill's lab, long since demolished and replaced by Wharton, Penn's business school

Don Hindle

Student of Bill Labov's. Now goes by Morris.
Don wrote a thesis on phonetic variation, but his earliest interest in linguistics concerned the difference between spoken and written language. Tony and Don get a joint grant from the U.S. Department of Education to study the topic.
Don writes a deterministic parser with the somewhat peculiar property that it provides only parses of structures that it is sure about.
The parser is called Fidditch (after Miss Fidditch, the prototypically prescriptive English teacher who features in an essay by Martin Joos from the 1960s).

The computational linguistics connection

Background: Bell Labs

Originally a part of AT&T, the phone monopoly (cf. Deutsche Telekom). Site of many important discoveries and inventions (radio astronomy, the transistor, the laser, information theory, Unix, C and C++, solar cells, ...).
Targeted by an anti-trust suit in 1974. Breakup mandated in 1982. AT&T retained control of half of Bell Labs; the other half went to newly formed regional telephone companies ("Baby Bells").
In its heyday, Bell Labs was attractive because you could do research without the responsibilities of teaching.
The breakup of AT&T made the environment much less secure. So during the 1980s, several computational linguists at Bell Labs jumped ship for conventional academic positions.
Part of the exodus: Mitch Marcus (and also Mark Liberman).

Mitch Marcus

Mitch's Ph.D. thesis project at MIT was a pioneering deterministic parser.
Mitch and Tony knew each other from conferences from the early 1980s on. At one of these conferences, Tony had convinced Mitch of the existence of resumptive pronouns. Mitch was initially skeptical, but Tony kept track of resumptive pronouns that Mitch was using in the conversation and repeated them back at an appropriate point in the discussion.
Tony was on the committee to hire Mitch. (He had been put there by Aravind Joshi, the chair of CIS at the time and a major force behind the exceptionally close relation among linguistics, computer science, and psychology at Penn. Aravind had developed a mathematical formalism called Tree-Adjoining Grammar, and Tony collaborated with Aravind and his students in exploring its relevance to syntactic theory.)
Immediately after coming to Penn in 1988, Mitch headed the Penn Treebank Project.
In addition to his work in computational linguistics, Mitch was interested in linguistics, and specifically historical linguistics. His tutor at Harvard was Susumu Kuno, who later taught linguistics there, but was also interested in computational linguistics (specifically, machine translation). Mitch admired the work of Lila Gleitman, in the Penn Psychology Department, who worked on language acquisition. He supervised the theses of several Ph.D. students who could just as well have gotten degrees in linguistics for their work, and he has co-authored many papers with Charles Yang. The close connection among computational linguistics, linguistics, and psychology was a major reason that Mitch chose to come to Penn, and he in turn contributed greatly to it.

Mitch's students

David Magerman (undergraduate programmer for Penn Treebank)
Eric Brill (POS tagging)
Michael Collins (statistical parsing)
Dan Bikel (statistical parsing)

The companheiras

The name comes from a conference organized by Charlotte Galves in Campinas. We used it there to refer to Beatrice and a group of Portuguese linguists who were annotating historical corpora of Portuguese.
Here, it refers to:
- Beatrice Santorini (Penn Treebank administrator and PPCHE annotator)
- Ann Taylor (Penn Treebank administrator and PPCHE annotator)
- Lauren Delfs (now Ariel Diertani) (PPCHE annotator)
- Beth Randall (programmer, CorpusSearch)
Ann Taylor started working on the PPCME2 in 1990. In 1991, she also took over the Penn Treebank from Beatrice, when Beatrice left Penn for Northwestern.
In 1997, Beatrice returned to Penn. Tony created positions for her that involved 50% teaching for the Penn Department of Linguistics and 50% other activities (teaching for other departments, corpus construction funded by grants and corpus sales).
Ann Taylor left Penn for York (England), and Beatrice took over annotating the PPCHE.
Midway through the construction of the PPCEME, the work was significantly behind schedule (this was before the existence of the CorpusSearch revision feature). Tony hired Lauren Delfs, a grad student in the Penn Linguistics Department, to get the work back on schedule.
After she finished her degree, Lauren/Ariel continued to work on the PPCHE as well as other corpus projects at Penn and elsewhere.
More about Beth Randall in connection with CorpusSearch.

The raw material

The Helsinki Corpus

Neither the Penn Treebank nor the PPCHE was created out of thin air. In both cases, forward-looking linguists had already created online corpora consisting of raw text.
In the case of the PPCHE, the raw material was the Helsinki Corpus, compiled by Matti Rissanen at the University of Helsinki and his team of Finnish companheiras (or seuralaisia, if Google Translate is to be trusted).
Rissanen was very cooperative and happy to let Tony use the Helsinki Corpus as the basis for the parsed corpora.

Data entry by Blaise

The Helsinki Corpus doesn't contain texts beyond roughly 1700, so for Modern British English, we were on our own.
For the 18th century, we used Eighteenth Century Collections Online (ECCO) as our main basis for texts. For the remaining time up to 1914, we found suitable print texts.
For simplicity, we replicated the genre distribution of the Early Modern English portion of the Helsinki Corpus.
The online texts for the PPCMBE were produced from page images downloaded from ECCO and from scans for the remaining sources. We sent these images to an Indian data entry service called Blaise, which did a superlative job.
https://www.blaiseinfo.com/ITES.htm
The first edition of the PPCMBE consisted of 1M words. But right from the beginning, Tony was thinking ahead to a second edition, to include a further 2M words. Since text entry is a major bottleneck, we had Blaise enter the entire 3M words in one go.

Remaining points

Why did Mitch get the DARPA grant to build the Penn Treebank? In the late 1980s, Fred Jelinek at IBM was interested in building a large database to serve as a testbed for statistical parsing (in other words, the approach to natural language processing that has given rise to Large Language Models). He famously remarked, "Every time I fire a linguist, my payroll goes down and the performance of my speech recognizer goes up."
Jelinek didn't consider IBM to be a suitable place for that testbed to be constructed or to be maintained. As he said in his ACL Lifetime Achievement Award talk,
"One of the problems was where the eventual corpus should reside. Deep-pocketed IBM would be unsuitable: Possessors of desirable corpora would charge immoderate sums for the acquisition of rights. I thought that only a university would do."
After Mitch got the grant to build the Penn Treebank, he needed an administrator for it. Tony proposed Beatrice and later Ann Taylor. The position was a half-time position, and the other half of the administrator's salary was covered by other grants. In Beatrice's case, this was work related to Tree-Adjoining Grammar, funded by grants to Aravind Joshi. (Joshi funded so many students that we nicknamed him Ganesh, because Ganesh is known as the "remover of obstacles" in Hinduism.) In Ann's case, the second half of her salary was funded by Tony's grant to build the PPCME2.
Initially, the parser for the Penn Treebank was Don Hindle's Fidditch. The output was hand-corrected by a group of five or so annotators and further reviewed for correctness and consistency by whoever was administering the Treebank at the time (Beatrice or Ann).
Given his close connection with Mitch, Tony was able to leverage the work on the Penn Treebank for the PPCHE. The Penn Treebank's annotation philosophy was carried over wholesale (I can't tell you whether that philosophy is due more to Mitch or to Tony):
- Only annotate uncontroversial properties.
- Use defaults for unclear or ambiguous cases.
- Monotonic addition of information.
At first, the Penn Treebank tools were not very suitable for historical data. The rules underlying the parsers were for modern English, and they would have been unable to parse the rampant word order variation characteristic of Middle English. So the first edition of the PPCME2 was skeletally parsed with Perl scripts written by Ann. This is essentially the same idea as using the corpus revision feature of CorpusSearch to produce a parse from POS-tagged text.
But over time, Mitch's students developed trainable taggers and parsers. In particular, Michael Collins was interested in optimizing the trainability of his parser, and he saw the utility of training on Middle English. His parses were much more accurate than those produced by Fidditch, and the time spent correcting the parses went down by a factor of 4.
The first grants to build the historical corpora were conventional research grants. Tony pitched the first NSF grant to Paul Chapin (the project officer for linguistics) as "We'll get some results based on a corpus that (by the way) we'll be building along the way."
Later grants were explicitly grants to build corpora. The last grants were pitched as work on tools to improve parser performance.
From the beginning, the NSF granted Tony permission to sell the corpora. This generated a rainy-day fund to help cover Beatrice's salary and also made it possible to pay Beth Randall to develop CorpusSearch.

CorpusSearch

Rich Pito (a student of Mitch's) developed a search tool for the Penn Treebank called tgrep. A major shortcoming was that tgrep couldn't be used on its own output.
Meanwhile, Beth Randall was pursuing a Master's degree in computer science at Drexel University in Philadelphia and needed a thesis project.
Beth is married to Don Ringe, the historical linguist at Penn, and so Tony's colleague. Tony and Don had a close working relationship and were co-authors (together with Ann Taylor) on papers on Middle English syntax. So Tony heard about Beth's problem from Don.
Tony persuaded Beth's thesis supervisor that a program should count as a thesis project. Beth wrote the program and for many years thereafter added features at the request of Tony and also Ann Taylor and Susan Pintzuk at York. As mentioned earlier, she was paid with proceeds from corpus sales.
CorpusSearch is crucial not just for searching corpora; its corpus revision feature is also a game-changer for corpus construction.
As far as I am aware, there is no other tool like it in computational linguistics.

General reflections

Tony had an entrepreneurial streak. He was good at identifying resources and very persuasive in the pursuit of his goals.
He knew talent when he saw it and was willing to work with quite idiosyncratic people when they had talents that he needed.
The companheiras were not particularly interested in conventional academic careers. But they were happy to contribute to what they considered intellectually interesting projects.
The creation of the PPCHE was facilitated by an auspicious constellation of events in the 1980s and 1990s:
- Breakup of Bell Labs and exodus of technical staff to academia
- Industry need for large amounts of parsed data; development of computational tools
- Tony's connections with Penn CIS
- Companheiras in need of jobs, with appropriate linguistic training and basic computational literacy
- Tony's grantsmanship and resourcefulness in finding salaries

Outstanding issues

Distribution

The PPCHE are currently distributed by the Linguistic Data Consortium (LDC) for a fee.
Corpus sales paid half of Beatrice's salary for the last two years.
Salary support is no longer needed.
The LDC doesn't have exclusive distribution rights, so the PPCHE could be distributed elsewhere free of charge.
It would be optimal to have a site with the competence to maintain the corpora (correct errors, ...).

Lemmatization

Parsed Corpus of Early English Correspondence

References

Randall, Beth. N.d. Notes related to history of CorpusSearch and PPCHE more generally. https://www.ling.upenn.edu/~dringe/CorpStuff/Thesis/history.html
Taylor, Ann. 2020. Treebanks in Historical Syntax. Annual Review of Linguistics 6(1), 195–212. https://www.annualreviews.org/doi/10.1146/annurev-linguistics-011619-030515