This is a catalog of text data sources available to researchers at Penn. The data from these sources has not been gathered with any one project in mind, and may be biased in unpredictable ways. It's like data from a poll that allows voters to vote as often as they like: it might be reliable for a particular project, but it might not be, and there's not an easy way to tell.
That places an additional burden on researchers using data from these sources. To ensure the consistency and intellectual coherence of their corpus, they may need to compile an ideal list of documents based on well-defined criteria, and then collect the items on that list from different sources. That strategy ensures that most biases in the corpus arise from a known source: the criteria the researchers used to compile the list. Those willing to accept this extra burden will be able to build new, potentially ground-breaking corpora in an intellectually rigorous way.
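As a rough illustration of that list-first strategy, the sketch below filters a set of candidate records by explicit criteria and then checks which source, if any, holds each item on the resulting list. Everything in it is hypothetical: the `Document` fields, the criteria (a date range and a set of genres), the titles, and the source names are placeholders, not records from any of the sources in this catalog.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """A hypothetical bibliographic record; the fields are illustrative only."""
    title: str
    year: int
    genre: str


def build_target_list(candidates, start_year, end_year, genres):
    """Keep only the records that satisfy the stated selection criteria."""
    return [
        doc for doc in candidates
        if start_year <= doc.year <= end_year and doc.genre in genres
    ]


def locate(target_list, holdings):
    """Map each target title to the first source that holds it.

    `holdings` maps a source name to the set of titles it can supply.
    Returns (found, missing): a dict of title -> source name, and a list
    of titles that no source holds.
    """
    found = {}
    missing = []
    for doc in target_list:
        source = next(
            (name for name, titles in holdings.items() if doc.title in titles),
            None,
        )
        if source is None:
            missing.append(doc.title)
        else:
            found[doc.title] = source
    return found, missing


# Placeholder data, invented for this example.
candidates = [
    Document("Title A", 1850, "narrative"),
    Document("Title B", 1790, "narrative"),
    Document("Title C", 1860, "sermon"),
    Document("Title D", 1845, "oral history"),
]
holdings = {
    "Source 1": {"Title A", "Title C"},
    "Source 2": {"Title B"},
}

targets = build_target_list(candidates, 1800, 1865, {"narrative", "oral history"})
found, missing = locate(targets, holdings)
```

Keeping the list of missing titles is the point of the exercise: it records where the corpus falls short of the ideal list, so the remaining gaps are known rather than hidden.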
Full text and metadata from JSTOR resources are now available for visualization and analysis within JSTOR's Constellate platform. For more information on Constellate, please contact the Research Data & Digital Scholarship team.
Penn Libraries can provide access to full-text data from numerous Gale research databases, including ECCO, the Gale NewsVault, and Archives Unbound. For details about holdings and access procedures, please contact the Research Data & Digital Scholarship team.
Plain text data from LexisNexis is now available through an app under development by Penn Libraries. If you have a project in mind for this data, please contact the Research Data & Digital Scholarship team.
Full-text data from ProQuest databases is now available for visualization and analysis within the ProQuest platform. If you have a project in mind for this data, please contact the Research Data & Digital Scholarship team.
The HathiTrust library collects data from more than fourteen million books, and is an extraordinary source of text data for historical research. Full text and page images are available through an API, and can also be downloaded in bulk.
These sixteen collections were built by the Documenting the American South project at UNC Chapel Hill. They include first-person narratives, oral histories, and state records, making them a rich resource for building special-purpose corpora.
Co-created by Donald Sturgeon and users worldwide, this large database of full-text Chinese classics is free for anyone to use, and provides a newly created API for downloading texts and creating applications that link to the database.