Text Mining at Penn Libraries

A guide to text mining resources at Penn Libraries

Example: The Pace of Change of Literary Standards

We know that standards for literary value change over time. Moby Dick, now canonized as a classic of American literature, was regarded as a commercial and critical failure for seventy years after its publication. But how quickly do literary standards change? To try to answer that question, Ted Underwood and Jordan Sellers started a text mining project to track the pace of that change. In addition to publishing an article based on their project, "The Longue Durée of Literary Prestige," they released the data and code they used in a GitHub repository.

To build their corpus, Underwood and Sellers narrowed their focus to nineteenth-century poetry, and created two lists. The first was a list of 360 volumes of poetry that had been published roughly between 1820 and 1920, and that had been reviewed in at least one prestigious literary periodical of the era. The second was another list of 360 volumes that had been published in the same period, but that they selected at random from among all the volumes in the HathiTrust Digital Library. They then acquired the full text data and compiled word frequency counts and detailed metadata for each volume. In addition to containing information about title, publication date, author name, gender, and nationality, their metadata contained a field indicating whether a given volume was on the "reviewed" or the "random" list.
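A record in metadata of this kind might look like the following sketch. The field names here are illustrative, not the project's actual column names; see their GitHub repository for the real schema.

```python
# A hypothetical metadata record for one volume. Field names and values
# are invented for illustration; the project's actual metadata uses its
# own column names.
volume = {
    "title": "Poems",
    "pub_date": 1867,
    "author": "Unknown Author",
    "gender": "f",
    "nationality": "us",
    "list": "reviewed",  # "reviewed" or "random"
}

print(volume["list"])
```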

With this data, they trained supervised machine learning software to use word frequencies to guess the list on which a given volume would appear. They tested the software by training it on portions of the data and testing it on the remainder, and its accuracy came close to 80%. Finally, by examining when and how the software made errors, they developed their argument. Because a model trained to detect reviewed volumes from the beginning of the century remained good at detecting reviewed volumes from the end of the century, they argued that the literary standards in question were relatively stable over the nineteenth century.
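The train-and-test procedure described above can be sketched in a few lines. This is a minimal illustration using scikit-learn and a tiny invented toy corpus, not the project's actual code or data; Underwood and Sellers's real models, features, and validation scheme live in their GitHub repository.

```python
# A minimal sketch of the classification approach: fit a logistic
# regression model on word-frequency features, then measure accuracy on
# held-out volumes. The toy "volumes" below are invented strings that
# stand in for full texts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Label 1 = "reviewed" list, label 0 = "random" list.
volumes = [
    "lofty verse sublime muse lyre",   # reviewed
    "grand ode sublime muse elegy",    # reviewed
    "lofty sublime lyre verse ode",    # reviewed
    "cheap rhyme jingle ballad tune",  # random
    "jingle tune ballad rhyme song",   # random
    "ballad song cheap tune rhyme",    # random
] * 10  # repeated so the train/test split has enough examples
labels = [1, 1, 1, 0, 0, 0] * 10

# Turn each volume into word-frequency counts.
X = CountVectorizer().fit_transform(volumes)

# Hold out a portion of the data for testing, as in their procedure.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=0
)

model = LogisticRegression().fit(X_train, y_train)
print(f"held-out accuracy: {model.score(X_test, y_test):.2f}")
```

On this trivially separable toy data the accuracy is perfect; the interesting result in the real project is that accuracy on genuine volumes approached 80%, and that errors patterned over time.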

Existing Corpora

The process of building a corpus is complex, and researchers just getting started with text mining might want to rely on corpora that have been compiled and rigorously validated by others. The corpora listed on this page were built with particular goals or research questions in mind, and are likely to provide solid foundations for a first text mining project. Some are very large, and can be broken into subsets useful for answering questions about narrower time spans, genres, or geographic regions. Researchers interested in building entirely new corpora should look at Sources of Text Data instead.

Corpora Available to Penn Affiliates

Corpora Freely Available to the Public

Library TDM Services

Penn Libraries is committed to providing research support and services for TDM projects to students and researchers across campus. To keep these services accessible to a large audience, we limit them in scope.

We can provide

  • This research guide, with resources, contact information, and other details.
  • Consultations and workshops with specialists at the library and elsewhere.
  • Negotiation with vendors for university-wide text mining licenses, subject to budgetary approval.

We generally cannot provide

  • Licensing for individual text mining projects with needs not covered by university-wide licenses.
  • Secure, long-term storage for vendor data or research output.
  • Enforcement of restrictions on vendor data use.

Depending on vendor policies and our own support capacity, we may be able to provide some of the above services on a case-by-case basis, but we can make no guarantees. These limits have been adapted from the Northeastern University TDM guide.

Library TDM Contacts

Scott Enderle, Digital Humanities Specialist Librarian.