| Corpus or deliverable | Size in words | Personnel | Funding source | Release date |
|---|---|---|---|---|
| Penn Treebank, Phase 1 | 1M, then 5M, with goal of demonstrating feasibility of 100M | Mitch Marcus, Beatrice Santorini, David Magerman, various annotators | | 1999 (final release) |
| Penn-Helsinki Parsed Corpus of Middle English, 1st edition | 510K | Tony Kroch, Ann Taylor | NSF, Head-complement word order in the history of the West Germanic clause | 1994 |
| Penn-Helsinki Parsed Corpus of Middle English, 2nd edition | 1.3M | Tony Kroch, Ann Taylor | NSF, The historical syntax of Middle English from a comparative perspective | 2000 |
| CorpusSearch | — | Tony Kroch, Beth Randall | Beth Randall's M.S. thesis, corpus sales | |
| Penn-Helsinki Parsed Corpus of Early Modern English | 1.8M | Tony Kroch, Beatrice Santorini, Lauren Delfs | NEH, Creating an electronic corpus of Early Modern English | 2005 |
| | | | 1999–2004: NSF, The emergence of Modern English syntax | |
| Penn Parsed Corpus of Modern British English, 1st edition | 1M | Tony Kroch, Beatrice Santorini, Ariel Diertani (formerly Lauren Delfs) | NSF, A parsed historical corpus of Modern British English | 2010 |
| | | | NSF, Enriching parser output for treebank construction | |
| Penn Parsed Corpus of Modern British English, 2nd edition | 3M | Tony Kroch, Beatrice Santorini, Ariel Diertani | NSF, Testing and improving methods for efficient annotation through the construction of a large parsed corpus | 2016 |

Interest in linguistics via anthropology. Like some others at MIT, tries to avoid getting sucked into the Generative Syntax wars. Talent for the syntax-semantics interface. Disenchanted with conventional generative syntax. Gets fired from the University of Connecticut for anti-racist organizing. Moves to Philadelphia for a job at Temple University. Eventually quits that job and gets an NSF grant to conduct sociolinguistic interviews with the Philadelphia upper class to supplement Bill Labov's work on middle-class and working-class speech patterns. Already knows Bill from anti-racist organizing at the LSA.
Gets hired at Penn. Develops an interest in do-support in connection with questions about the diffusion of linguistic change. Also develops close connections with Aravind Joshi and Mitch Marcus in Computer and Information Sciences (CIS).
"One of the problems was where the eventual corpus should reside. Deep-pocketed IBM would be unsuitable: Possessors of desirable corpora would charge immoderate sums for the acquisition of rights. I thought that only a university would do."