As textual data has proliferated, researchers have begun using computers to search, organize, and analyze it. Text Mining, Data Mining, and the abbreviation TDM are broad terms for these developing research practices, which often involve building and processing a corpus: a collection of text that may contain millions or even billions of words.
The methods used to process corpora vary widely across disciplines and draw on insights from machine learning, statistics, computational linguistics, sociology, and many other fields. A guide such as this can only provide the broadest of introductions to text mining, and we encourage researchers and students interested in starting a text mining project to contact their library liaison or one of our digital scholarship specialists.
Because some online databases can now provide trillions of words of textual data, covering huge historical and geographic spans, researchers getting started in text mining may be tempted to collect "all the data" and expect text mining tools to produce useful results automatically. Unfortunately, data provided at this scale is rarely organized enough to produce meaningful results without substantial modification. Often, a carefully constructed sample of the data will yield more useful results than "all the data," which is likely to suffer from hidden biases and unpredictable gaps in coverage.
In this guide, we use source or database to describe data collected without a strong organizing principle, and corpus to describe data collected with specific questions in mind about certain geographic regions, time periods, or social phenomena. This guide will help researchers use available data sources to construct a corpus that can support an intellectually rigorous argument.
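The distinction between a source and a corpus can be illustrated with a minimal Python sketch. It assumes a hypothetical source whose records carry `year` and `region` fields; the field names, the `build_corpus` helper, and the generated sample records are all invented for illustration and do not correspond to any particular database. The sketch filters the source by region and time period, then draws the same maximum number of documents from each decade so that heavily digitized decades do not dominate the result.

```python
import random
from collections import defaultdict

def build_corpus(source, regions, start_year, end_year, per_decade, seed=0):
    """Filter a source by region and year range, then sample evenly by decade.

    All parameter and field names here are hypothetical.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_decade = defaultdict(list)
    for doc in source:
        if doc["region"] in regions and start_year <= doc["year"] <= end_year:
            decade = doc["year"] // 10 * 10
            by_decade[decade].append(doc)

    corpus = []
    for decade in sorted(by_decade):
        docs = by_decade[decade]
        # Take at most per_decade documents; keep everything if fewer exist.
        corpus.extend(rng.sample(docs, min(per_decade, len(docs))))
    return corpus

# Invented stand-in for a large source with uneven coverage:
# very few documents from the 1850s, many from the 1890s.
# Real records would also carry the document text.
source = (
    [{"id": f"a{i}", "year": 1850 + i % 10, "region": "Pennsylvania"} for i in range(5)]
    + [{"id": f"b{i}", "year": 1890 + i % 10, "region": "Pennsylvania"} for i in range(500)]
    + [{"id": f"c{i}", "year": 1890 + i % 10, "region": "Ohio"} for i in range(200)]
)

corpus = build_corpus(source, {"Pennsylvania"}, 1850, 1899, per_decade=20)
print(len(source), "documents in source;", len(corpus), "documents in corpus")
```

Capping each decade is only one possible sampling strategy. The appropriate strata (decade, publication, genre, and so on) depend on the research question, and a decade with too few documents, such as the 1850s in this invented data, is a coverage gap worth noting in any resulting argument.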
This guide focuses on the first step necessary for any text mining project: building a corpus. Although large quantities of textual data are freely available, acquiring the particular data necessary to answer a given research question can be a challenge. Many database vendors place restrictions on automated downloading, and some may require one-time payments from researchers who want to use their data. Others provide text mining data at no additional cost to Penn-affiliated researchers, and still others provide free access to the general public. This guide will help you to navigate these resources to quickly find the most useful, most readily available data sources for your project, and to compile them into a coherent corpus.
Once you have built your corpus, you will need to use specialized software to analyze it. Different kinds of software are suited to different disciplines, and it would be impossible to cover the full range of possibilities in a brief guide. This guide will describe a few general-purpose tools that may be useful to researchers in many fields. Researchers hoping to move beyond these general-purpose tools should consult with their library liaisons, with digital scholarship specialists at the library and elsewhere on campus, and with other researchers in their own field.
Penn Libraries is committed to providing research support and services for TDM projects to students and researchers across campus. To keep these services accessible to a large audience, we limit their scope.
We can provide
We generally cannot provide
Depending on vendor policies and our own support capacity, we may be able to provide some of the above services on a case-by-case basis, but we can make no guarantees. These limits have been adapted from the Northeastern University TDM guide.