How the Penn Parsed Corpora of Historical English came to be at Penn

Abstract

In this presentation, I present a timeline of the Penn Parsed Corpora of Historical English (PPCHE) and discuss the various factors that contributed, more or less directly, to the development of this resource. I would also like to raise some outstanding issues, which I hope can be addressed in collaboration with the DIGS community.

Timeline of grants

| Corpus or deliverable | Size in words | Personnel | Grant period | Funding source | Release date |
|---|---|---|---|---|---|
| Penn Treebank, Phase 1 | 1M, then 5M, with goal of demonstrating feasibility of 100M | Mitch Marcus, Beatrice Santorini, David Magerman, various annotators | 1989–1992 | | 1999 (final release) |
| Penn-Helsinki Parsed Corpus of Middle English, 1st edition | 510K | Tony Kroch, Ann Taylor | 1990–1993 | NSF, Head-complement word order in the history of the West Germanic clause | 1994 |
| Penn-Helsinki Parsed Corpus of Middle English, 2nd edition | 1.3M | | 1996–1999 | NSF, The historical syntax of Middle English from a comparative perspective | 2000 |
| CorpusSearch | | Tony Kroch, Beth Randall | | Beth Randall's M.S. thesis, corpus sales | ca. 1995–ca. 2015 |
| Penn-Helsinki Parsed Corpus of Early Modern English | 1.8M | Tony Kroch, Beatrice Santorini, Lauren Delfs | 1999–2003 | NEH, Creating an electronic corpus of Early Modern English | 2005 |
| | | | 1999–2004 | NSF, The emergence of Modern English syntax | |
| Penn Parsed Corpus of Modern British English, 1st edition | 1M | Tony Kroch, Beatrice Santorini, Ariel Diertani (formerly Lauren Delfs) | 2004–2011 | NSF, A parsed historical corpus of Modern British English | 2010 |
| | | | 2005–2006 | NSF, Enriching parser output for treebank construction | |
| Penn Parsed Corpus of Modern British English, 2nd edition | 3M | | 2010–2016 | NSF, Testing and improving methods for efficient annotation through the construction of a large parsed corpus | 2016 |

Dramatis personae

Tony Kroch

Originally trained as an anthropologist. Undergraduate thesis on the structure of myth; fieldwork in Sénégal (Bassari) and Brazil (Kayapo).

Interest in linguistics via anthropology. Like some others at MIT, tries to avoid getting sucked into the Generative Syntax wars. Talent for the syntax-semantics interface. Disenchanted with conventional generative syntax. Gets fired from the University of Connecticut for anti-racist organizing. Moves to Philadelphia for a job at Temple University. Eventually quits that job and gets an NSF grant to conduct sociolinguistic interviews with the Philadelphia upper class to supplement Bill Labov's work on middle-class and working-class speech patterns. Already knows Bill from anti-racist organizing at the LSA.

Gets hired at Penn. Develops an interest in do-support in connection with questions about the diffusion of linguistic change. Also develops close connections with Aravind Joshi and Mitch Marcus in Computer and Information Sciences (CIS).

The variationist connection

William Labov

Don Hindle

The computational linguistics connection

Background: Bell Labs

Mitch Marcus

Mitch's students

The companheiras

The raw material

The Helsinki Corpus

Data entry by Blaise

Remaining points

CorpusSearch

General reflections

Outstanding issues

Distribution

Lemmatization

Parsed Corpus of Early English Correspondence

References

Randall, Beth. N.d.
Notes related to history of CorpusSearch and PPCHE more generally. https://www.ling.upenn.edu/~dringe/CorpStuff/Thesis/history.html

Taylor, Ann. 2020.
Treebanks in Historical Syntax. Annual Review of Linguistics 6(1), 195–212. https://www.annualreviews.org/doi/10.1146/annurev-linguistics-011619-030515