
Text Analysis

A guide to text mining tools and methods

Sources of Text Data

This is a catalog of text data sources available to researchers at Penn. The data from these sources was not gathered with any one project in mind, and it may be biased in unpredictable ways. It is like data from a poll that lets voters vote as often as they like: it might be reliable for a particular project or it might not, and there is no easy way to tell.

That places an additional burden on researchers who use data from these sources. To ensure the consistency and intellectual coherence of their corpus, they may need to compile an ideal list of documents based on well-defined criteria, and then collect the items on that list from different sources. That strategy ensures that most biases in the corpus arise from a known source: the criteria the researchers used to compile the list. Those willing to accept this extra burden will be able to build new, potentially ground-breaking corpora in an intellectually rigorous way.
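The list-first strategy described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Document` fields, the selection criteria (a date range and a language), and the source names are all hypothetical and do not correspond to any actual Penn Libraries platform or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """A hypothetical bibliographic record for one candidate text."""
    title: str
    year: int
    language: str


def build_target_list(candidates, start_year, end_year, language):
    """Keep only the candidates that satisfy the explicit, documented criteria."""
    return [
        doc for doc in candidates
        if start_year <= doc.year <= end_year and doc.language == language
    ]


def locate_documents(targets, sources):
    """Look up each target in the available sources.

    `sources` maps a (hypothetical) source name to the set of titles it holds.
    Returns a dict of title -> first source holding it, plus a list of
    titles that no source holds, so that gaps in the corpus are recorded.
    """
    found = {}
    missing = []
    for doc in targets:
        holder = next(
            (name for name, titles in sources.items() if doc.title in titles),
            None,
        )
        if holder is None:
            missing.append(doc.title)
        else:
            found[doc.title] = holder
    return found, missing
```

Keeping the list of missing titles matters as much as the list of found ones: it shows where the collected corpus departs from the ideal list, so that remaining bias can be traced to a documented gap.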

Text Data Sources Available to Penn Affiliates

TDM Platforms for Historic Newspaper & Text Data

For a detailed guide to using these platforms, see: Text Analysis Using TDM Platforms at Penn Libraries


A corpus is a collection of written texts, particularly the entire body of work on a subject or by a specific creator. More specifically, it is a collection of written or spoken material in machine-readable form, assembled for the purpose of studying linguistic structures, frequencies, and similar features.
