Sources of Text Data

This is a catalog of text data sources available to researchers at Penn. The data from these sources has not been gathered with any one project in mind, and may be biased in unpredictable ways. It's like data from a poll that allows voters to vote as often as they like: it might be reliable for a particular project, but it might not be, and there's not an easy way to tell.

That places an additional burden on researchers using data from these sources. To ensure the consistency and intellectual coherence of their corpus, they may need to compile an ideal list of documents based on well-defined criteria, and then collect the items on that list from different sources. That strategy will ensure that most biases in the corpus will arise from a known source: the criteria used by the researchers to compile the list. Those willing to accept this extra burden will be able to build new, potentially ground-breaking corpora in an intellectually rigorous way.
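
As a rough illustration of that strategy, the sketch below writes the inclusion criteria as an explicit function and groups the qualifying items by the source they would be collected from. The records, field names, source labels, and the criteria themselves (English-language items dated 1850 to 1880) are all hypothetical and invented for this example; a real project would define its own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    title: str
    year: int
    language: str
    source: str  # hypothetical label for where a copy could be obtained


# Hypothetical candidate records, invented for illustration only.
CANDIDATES = [
    Record("Newspaper A", 1855, "en", "source_one"),
    Record("Newspaper B", 1912, "en", "source_two"),
    Record("Newspaper C", 1860, "fr", "source_one"),
    Record("Newspaper D", 1871, "en", "source_two"),
]


def meets_criteria(record):
    """Example inclusion criteria: English-language items dated 1850-1880."""
    return record.language == "en" and 1850 <= record.year <= 1880


def build_collection_plan(candidates):
    """Group the titles that satisfy the criteria by the source to collect them from."""
    plan = {}
    for record in candidates:
        if meets_criteria(record):
            plan.setdefault(record.source, []).append(record.title)
    return plan


PLAN = build_collection_plan(CANDIDATES)
```

Because the criteria live in one function, they can be documented alongside the corpus and changed in one place, and `PLAN` maps each source to the titles to request from it.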

Text Data Sources Available to Penn Affiliates

TDM Platforms for Historic Newspaper & Text Data

For a detailed guide to using these platforms, please see: Text Analysis Using TDM Platforms at Penn Libraries

Constellate
Full text and metadata from JSTOR resources are now available for visualization and analysis within JSTOR's platform.
Gale Text Data
Penn Libraries can provide access to full-text data from numerous Gale research databases, including ECCO, the Gale NewsVault, and Archives Unbound.
ProQuest TDM Studio
Full-text data from ProQuest databases is available for visualizing and analyzing within the ProQuest platform.

Corpus

A corpus is a collection of written texts, particularly the entire body of work on a subject or by a specific creator. More specifically, it is a collection of written or spoken material in machine-readable form, assembled for the purpose of studying linguistic structures, frequencies, and similar features.
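
To make "machine-readable" and "frequencies" concrete, here is a minimal sketch that counts word frequencies across a tiny corpus using only the Python standard library. The two documents, their file names, and the simple regular-expression tokenizer are assumptions made for illustration; real projects usually need more careful tokenization.

```python
import re
from collections import Counter

# A toy corpus: hypothetical file names mapped to invented text.
CORPUS = {
    "doc1.txt": "The library holds the archive.",
    "doc2.txt": "The archive holds newspapers.",
}


def tokenize(text):
    """Lowercase the text and return runs of letters and apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())


def term_frequencies(documents):
    """Count how often each token occurs across all documents."""
    counts = Counter()
    for text in documents.values():
        counts.update(tokenize(text))
    return counts


FREQUENCIES = term_frequencies(CORPUS)
```

`FREQUENCIES.most_common(5)` would then list the five most frequent tokens with their counts.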

Library Databases (Contact your Subject Specialist)
Open Source Search:
- re3data.org
- datasetsearch.research.google.com
- huggingface.co
- kaggle.com/datasets
- github.com/awesomedata/awesome-public-datasets
Web Scraping Data: Web scraping (also called web crawling or spidering) means copying information from websites in order to extract large amounts of data and save it to a local file.
- Please note that not all online resources allow text mining and that there are legal and ethical limitations to consider.
- Scraping Open Data from the Web with BeautifulSoup (workshop material and blog post)
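
The two steps behind the notes above, checking whether a site permits automated access and then pulling text out of a page, can be sketched as follows. This sketch deliberately uses Python's standard library (`urllib.robotparser` and `html.parser`) rather than BeautifulSoup so that it runs without extra installs. The robots.txt rules, the HTML, the `research-bot` user agent, and the example.org URLs are invented for illustration. A robots.txt file is only one signal; a site's terms of use and any licenses still apply.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Invented robots.txt content and HTML page, for illustration only.
SAMPLE_ROBOTS = """User-agent: *
Disallow: /private/
"""

SAMPLE_HTML = (
    "<html><body><h1>Title</h1>"
    "<p>First  paragraph with <em>emphasis</em>.</p>"
    "<p>Second paragraph.</p></body></html>"
)


def allowed_by_robots(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class ParagraphExtractor(HTMLParser):
    """Collect the text of each <p> element (nested <p> tags are not handled)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            # Join the pieces and collapse runs of whitespace.
            text = " ".join("".join(self._chunks).split())
            if text:
                self.paragraphs.append(text)
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._chunks.append(data)


def extract_paragraphs(html):
    """Return the paragraph texts found in an HTML string."""
    extractor = ParagraphExtractor()
    extractor.feed(html)
    return extractor.paragraphs
```

In a real project, the robots.txt text and the page HTML would be downloaded from the site rather than defined inline, and the extracted paragraphs would be written to a local file.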
Social Media (Twitter) Data Mining: Social media data access and availability depend on the platform(s) and time period of your research. There are legal and ethical considerations when it comes to social media data mining.
- Starting February 13, 2023, Twitter will no longer support free access to the Twitter API (v1.1 and v2). See the Twitter Dev team's February 2 announcement and their February 8 update. This page will be updated as more information becomes available.
- For more information, please see Twitter Data, GMU LibGuide
  - Tweepy (Python)
  - Social Feed Manager
  - rtweet (R)
  - DocNow
Extracting Texts:
- Perform Optical Character Recognition (OCR) using ABBYY FineReader at the Butler Assistive Technology Room.
- Textract, a Python library, extracts text from .docx files, PDFs, images, and audio files.
- Tesseract is free OCR software. Using it requires experience with the command line.
- More resources:
  - ABBYY FineReader Tutorial, NYU Libraries Scholarly Communications and Information Policy Department
  - Introduction to OCR and Searchable PDFs, Illinois Library
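
For those who would rather drive Tesseract from a script than type commands by hand, the following minimal sketch wraps its command-line interface from Python. It assumes that the `tesseract` executable is installed and on the system PATH and that the `eng` language data is available. The function names and the file name `scan.png` are hypothetical.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, language="eng"):
    """Return the argument list for running Tesseract on one image.

    Passing "stdout" as the output base asks Tesseract to print the
    recognized text instead of writing it to a file.
    """
    return ["tesseract", str(image_path), "stdout", "-l", language]


def ocr_image(image_path, language="eng"):
    """Run Tesseract on an image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("Tesseract is not installed or not on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, language),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Tesseract fails
    )
    return result.stdout
```

Calling `ocr_image("scan.png")` would return the text Tesseract recognizes in that image. OCR output usually needs proofreading before analysis.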