Text Mining at Penn Libraries

A guide to text mining resources at Penn Libraries

What is text mining?

As textual data has proliferated, researchers have begun using computers to search, organize, and analyze it. Text Mining, Data Mining, and the abbreviation TDM are broad terms for these developing research practices, which often involve building and processing a corpus: a collection of text that may contain millions or even billions of words.

The methods used to process corpora vary widely between disciplines and draw on insights from machine learning, statistics, computational linguistics, sociology, and many other fields. A guide such as this can provide only the broadest of introductions to text mining, and we encourage researchers and students interested in starting a text mining project to contact their library liaison or one of our digital scholarship specialists.

From data to corpus

Because some online databases can now provide trillions of words of textual data, covering huge historical and geographic spans, researchers getting started in text mining may be tempted to collect "all the data" and expect text mining tools to produce useful results automatically. Unfortunately, data provided at this scale is rarely organized enough to produce meaningful results without substantial modification. Often, a carefully constructed sampling of the data will yield more useful results than "all the data," which is likely to suffer from hidden biases and unpredictable gaps in coverage.

In this guide, we use source or database to describe data collected without a strong organizing principle, and corpus to describe data collected with specific questions in mind about certain geographic regions, time periods, or social phenomena. This guide will help researchers use available data sources to construct a corpus that can support an intellectually rigorous argument.
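As a rough illustration of the difference between "all the data" and a constructed sample, the sketch below draws an equal number of documents from each decade of a source that is heavily skewed toward one period. Everything in it is hypothetical: the `build_corpus` function, the `year` and `text` fields, and the toy source records are assumptions made for this example, and stratifying by decade is only one of many possible sampling designs.

```python
import random
from collections import defaultdict


def build_corpus(records, per_decade, seed=0):
    """Draw up to `per_decade` documents from each decade found in `records`.

    `records` is assumed to be a list of dicts with an integer "year" key.
    """
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced

    # Group the documents by decade (1985 -> 1980, 1995 -> 1990, ...).
    by_decade = defaultdict(list)
    for record in records:
        by_decade[record["year"] // 10 * 10].append(record)

    # Sample the same number of documents from every decade, or all of
    # them when a decade has fewer than `per_decade` documents.
    corpus = []
    for decade in sorted(by_decade):
        group = by_decade[decade]
        corpus.extend(rng.sample(group, min(per_decade, len(group))))
    return corpus


# Hypothetical source: 5 documents from the 1980s, 500 from the 1990s.
source = (
    [{"year": 1985, "text": f"doc {i}"} for i in range(5)]
    + [{"year": 1995, "text": f"doc {i}"} for i in range(500)]
)

corpus = build_corpus(source, per_decade=5)
decades = sorted({doc["year"] // 10 * 10 for doc in corpus})
```

Analyzing the full source here would let the 1990s dominate any result, while the sampled corpus gives each decade equal weight. Whether equal weighting is appropriate depends on the research question.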

Scope of this guide

This guide focuses on the first step necessary for any text mining project: building a corpus. Although large quantities of textual data are freely available, acquiring the particular data necessary to answer a given research question can be a challenge. Many database vendors place restrictions on automated downloading, and some may require one-time payments from researchers who want to use their data. Others provide text mining data at no additional cost to Penn-affiliated researchers, and still others provide free access to the general public. This guide will help you to navigate these resources to quickly find the most useful, most readily available data sources for your project, and to compile them into a coherent corpus.

Once you have built your corpus, you will need to use specialized software to analyze it. Different kinds of software are suited to different disciplines, and it would be impossible to cover the full range of possibilities in a brief guide. This guide will describe a few general-purpose tools that may be useful to researchers in many fields. Researchers hoping to move beyond these general-purpose tools should consult with their library liaisons, with digital scholarship specialists at the library and elsewhere on campus, and with other researchers in their own field.

Library TDM Services

Penn Libraries is committed to providing research support and services for TDM projects to students and researchers across campus. To make these services accessible to a large audience, they are limited in scope.

We can provide

  • This research guide, with resources, contact information, and other details.
  • Consultations and workshops with specialists at the library and elsewhere.
  • Negotiation with vendors for university-wide text mining licenses, subject to budgetary approval.

We generally cannot provide

  • Licensing for individual text mining projects with needs not covered by university-wide licenses.
  • Secure, long-term storage for vendor data or research output.
  • Enforcement of restrictions on vendor data use.

Depending on vendor policies and our own support capacity, we may be able to provide some of the above services on a case-by-case basis, but we can make no guarantees. These limits have been adapted from the Northeastern University TDM guide.

Library TDM Contacts

Scott Enderle, Digital Humanities Specialist Librarian.