
Text Mining at Penn Libraries

A guide to text mining resources at Penn Libraries

Sources of Text Data

This is a catalog of text data sources available to researchers at Penn. The data from these sources was not gathered with any one project in mind, and may be biased in unpredictable ways. It's like data from a poll that allows voters to vote as often as they like: it might be reliable for a particular project, or it might not be, and there is no easy way to tell.

That places an additional burden on researchers using data from these sources. To ensure the consistency and intellectual coherence of their corpus, they may need to compile an ideal list of documents based on well-defined criteria, and then collect the items on that list from different sources. That strategy will ensure that most biases in the corpus will arise from a known source: the criteria used by the researchers to compile the list. Those willing to accept this extra burden will be able to build new, potentially ground-breaking corpora in an intellectually rigorous way.
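The list-first strategy described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Document` fields, the selection criteria (novels published 1800 to 1850), the placeholder titles, and the source names `source_one` and `source_two` are all hypothetical and do not correspond to any catalog or vendor listed in this guide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """A bibliographic record for one candidate text (hypothetical fields)."""
    title: str
    year: int
    genre: str


def meets_criteria(doc: Document) -> bool:
    """The explicit, documented selection rule (an invented example)."""
    return doc.genre == "novel" and 1800 <= doc.year <= 1850


def plan_collection(candidates, holdings):
    """Build the ideal list first, then look for each item in the sources.

    Returns a mapping of title -> source name for items that were found,
    plus a list of titles on the ideal list that no source holds.
    """
    ideal_list = [doc for doc in candidates if meets_criteria(doc)]
    found, missing = {}, []
    for doc in ideal_list:
        # Take the first source that holds this title, if any.
        source = next(
            (name for name, titles in holdings.items() if doc.title in titles),
            None,
        )
        if source is None:
            missing.append(doc.title)
        else:
            found[doc.title] = source
    return found, missing


# Placeholder records and holdings, invented for illustration only.
candidates = [
    Document("Title A", 1813, "novel"),
    Document("Title B", 1847, "novel"),
    Document("Title C", 1859, "novel"),   # outside the date range
    Document("Title D", 1820, "poetry"),  # wrong genre
]
holdings = {
    "source_one": {"Title A", "Title C"},
    "source_two": {"Title D"},
}

found, missing = plan_collection(candidates, holdings)
print("Found:", found)
print("Still to locate:", missing)
```

Because the ideal list is compiled before any source is consulted, the items that remain missing are visible as gaps rather than silently absent, which keeps the selection criteria as the main known source of bias.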

Text Data Sources Available to Penn Affiliates

Text Data Sources Freely Available to the Public

Library TDM Services

Penn Libraries is committed to providing research support and services for text and data mining (TDM) projects to students and researchers across campus. To keep these services accessible to a large audience, we limit their scope.

We can provide

  • This research guide, with resources, contact information, and other details.
  • Consultations and workshops with specialists at the library and elsewhere.
  • Negotiation with vendors for university-wide text mining licenses, subject to budgetary approval.

We generally cannot provide

  • Licensing for individual text mining projects with needs not covered by university-wide licenses.
  • Secure, long-term storage for vendor data or research output.
  • Enforcement of restrictions on vendor data use.

Depending on vendor policies and our own support capacity, we may be able to provide some of the services in the second list on a case-by-case basis, but we cannot guarantee this. These limits are adapted from the Northeastern University TDM guide.

Library TDM Contacts

Scott Enderle, Digital Humanities Specialist Librarian.