
Text Analysis at Penn Libraries

A guide to text mining tools and methods

Guide

  • Accessing UPenn Proquest TDM
    1. Anyone with a valid UPenn email address can request access to a workbench. To request one, please email the Research Data & Digital Scholarship team or your subject specialist with the names and email addresses of everyone who needs access to the same workbench. You can add new users later. By default, each workbench supports 1-5 users.
    2. Once your workbench is created, you will receive a confirmation email with instructions on how to get started. To access your account or reset your password, visit https://tdmstudio.proquest.com/home, enter the email address you initially provided, and select “Forgot Your Password”. You will be sent a password reset link that you can use to set your password.
    3. A one-page quick start guide can be found here: ProQuest LibGuides Quick Start
  • Create Dataset
    1. You can create a maximum of 10 datasets of up to 2,000,000 documents. You can search by specific publication title or across licensed ProQuest databases, including ProQuest Historical Newspapers.
    2. Once you click "Create Dataset", you are returned to your dashboard, where the dataset you just defined appears with the status "Queued". TDM Studio processes 100,000 documents an hour. Once your dataset is ready, its status changes to "Completed".
    3. A one-page dataset guide can be found here: ProQuest LibGuides Dataset Creation
  • Analyze Dataset
    1. When you're ready to use a dataset, turn on the Jupyter Notebook environment. It can take up to 10 minutes for this process to complete. We recommend starting with the Start Here.ipynb file, which will help you visually select and transfer the datasets you would like to use for analysis.
  • Python / R Scripts / Templates
    1. Once you have opened the Jupyter Notebook environment, you will find detailed user manuals, tips for using the workbench, and a collection of sample code to get you started. We recommend starting with the Start Here.ipynb file or the folder named ProQuest TDM Studio Samples.
    2. You can also upload your own scripts to the Jupyter notebook environment for data processing and analysis.
  • Export Data for further use / analysis
    1. You can export any derived data, as well as scripts, tables and visualizations. Due to copyright restrictions, you cannot export full text or any consumptive information that could be used to reconstruct the full text. The current export limit is 15 MB per week.
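
As a rough illustration of the export rule above, the following sketch writes a small table of derived term counts (not full text) to a CSV file and checks its size against the 15 MB figure quoted in this guide. The file name, the sample counts, and the helper functions are hypothetical, and the sketch does not show TDM Studio's actual export mechanism. Since the limit is per week, the size check here covers only a single file and would need to be tracked across exports.

```python
import csv
import os
import tempfile
from collections import Counter

# 15 MB weekly limit as stated in this guide; treating 1 MB as 1024 * 1024
# bytes is an assumption made for this sketch.
EXPORT_LIMIT_BYTES = 15 * 1024 * 1024


def write_counts_csv(counts, path):
    """Write term counts (derived data, not full text) to a CSV file.

    Returns the size of the written file in bytes.
    """
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["term", "count"])
        for term, count in counts.most_common():
            writer.writerow([term, count])
    return os.path.getsize(path)


def fits_export_limit(size_bytes, limit=EXPORT_LIMIT_BYTES):
    """Return True if a file of this size is within the given limit."""
    return size_bytes <= limit


# Hypothetical derived data: counts of a few terms across a dataset.
counts = Counter({"election": 42, "strike": 17, "tariff": 5})
out_path = os.path.join(tempfile.mkdtemp(), "term_counts.csv")
size = write_counts_csv(counts, out_path)
print(f"{out_path}: {size} bytes, within limit: {fits_export_limit(size)}")
```

Aggregated tables like this one are usually far smaller than the source documents, which makes them a practical form for exported results.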

Vocabulary

  • Dataset / Corpus: A collection of documents or other data in a form that a computer can parse (read and process)
  • Scripts: Instructions that are executed by a program in a specific order to accomplish a task
  • Methods: Procedures used to analyze data
  • Jupyter Notebook: Open-source software that enables users to create and share documents containing text, code, and visualizations
  • Python: Open-source programming language
  • R: Open-source programming language and software for statistical analysis
  • Workbench Dashboard: The Workbench is designed for experienced researchers who use their own coding methodologies. Workbenches are available to researchers in the designated project.
  • Visualization Dashboard: TDM Studio Visualization is designed for users of all levels to quickly spot trends and generate insights. 
  • Document: A single text, such as a novel, tweet, or article
  • Vocabulary: All the words used in a document or corpus
  • Term/word frequency: How often a term appears in a document or corpus, usually expressed relative to the total number of terms in that document or corpus
  • n-grams: A sequence of n consecutive items from a given sample of text or speech - usually unigrams (single items), bigrams (two-item phrases), or trigrams (three-item phrases)
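
The terms vocabulary, term frequency, and n-grams can be illustrated with a short Python sketch. This is a minimal example using only the standard library on a made-up two-document corpus. The tokenizer is a deliberately simple assumption (lowercase, split on non-alphanumeric characters), and the function names are illustrative rather than part of TDM Studio or its sample notebooks.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def relative_frequencies(tokens):
    """Map each term to its count divided by the total number of tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


# A tiny made-up corpus of two documents.
corpus = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
]

# Vocabulary: all distinct words used in the corpus.
corpus_tokens = [tok for doc in corpus for tok in tokenize(doc)]
vocabulary = sorted(set(corpus_tokens))

# Term frequency: share of all corpus tokens accounted for by each term.
term_freq = relative_frequencies(corpus_tokens)

# Bigrams are counted per document so that no pair spans two documents.
bigram_counts = Counter(
    bigram for doc in corpus for bigram in ngrams(tokenize(doc), 2)
)

print(vocabulary)
print(round(term_freq["the"], 3))
print(bigram_counts.most_common(2))
```

For real research corpora, a dedicated tokenizer from a text analysis library will generally handle punctuation, hyphenation, and non-English text better than the regular expression used here.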

Links to More Resources:

Analyzing and Visualizing Text with Constellate and ProQuest TDM Studio: A guide introducing text analysis as a research method and a demonstration of Constellate and ProQuest TDM Studio. The guide contains slides, a recorded workshop, and instructions for using the platforms. 

Slides 

Videos

Multilingual Resources

ProQuest TDM LibGuide

ProQuest Website

Support Articles

Demo Webinar

Related UPenn Infoguides: 

Glossary