
Text Analysis

A guide to text mining tools and methods

Text and Data Mining Platforms

See the Text Analysis Platforms Tasting Menu by the University of Toronto for a comparison of text and data mining (TDM) platforms.

Please share any feedback you have about your experience at LibraryRDDS@upenn.edu.

ProQuest TDM Studio

ProQuest TDM Studio allows researchers to mine and computationally analyze large volumes of published content from over 150 databases and 50,000 publications, including magazines, books, conference papers, dissertations and theses, scholarly journals, and current and historical newspapers such as the Wall Street Journal, the New York Times, and the Washington Post. Researchers can work together in a coding workbench to:

  • build a corpus
  • conduct data analysis, text mining, and visualization
  • search, find patterns, discover relationships, and analyze large amounts of content

Anyone with a valid UPenn email address can access a workbench by logging in. By default, each workbench can support 1-5 users. 

Constellate

Constellate is an exploratory platform for text analysis that allows users to mine, visualize, and computationally analyze content from major collections like JSTOR, Portico, Chronicling America, DocSouth, and RevealDigital. Create a ready-to-analyze dataset with point-and-click ease from over 30 million documents, including primary and secondary texts relevant to every discipline and perfect for learning text analytics or conducting original research. 

The platform provides the content and tools you need together in one place, alongside a defined curriculum and robust tutorials, live classes taught by text analysis experts, and a community you can connect to for inspiration and guidance.

Resources:

More information about Constellate can be found on the Constellate webpage.

Glossary

  • Dataset / Corpus: A collection of data whose items can each be parsed (read and processed) by a computer
  • Scripts: Instructions that are executed by a program in a specific order to accomplish a task
  • Methods: Procedures used to analyze data
  • Jupyter Notebook: Open-source software that enables users to create and share documents containing text, code, and visualizations
  • Python: Open-source programming language
  • R: Open-source programming language and software for statistical analysis
  • Workbench Dashboard: The Workbench is designed for experienced researchers who use their own coding methodologies. Workbenches are available to researchers in the designated project.
  • Visualization Dashboard: TDM Studio Visualization is designed for users of all levels to quickly spot trends and generate insights. 
  • Document: A single text, such as a novel, tweet, or article
  • Vocabulary: All the words used in a document or corpus
  • Term/word frequency: How often a term appears in a document or corpus, relative to the total number of terms in that document or corpus
  • n-grams: A sequence of n consecutive items (such as words) from a given sample of text or speech; usually unigrams (single items), bigrams (two-item phrases), or trigrams (three-item phrases)
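
To make the last few glossary terms concrete, here is a minimal Python sketch that computes a vocabulary, relative term frequencies, and bigrams for a tiny corpus. It uses only the standard library and is not how TDM Studio or Constellate compute these values. The two sample sentences, the simple regular-expression tokenizer, and the function names (tokenize, term_frequencies, ngrams) are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def term_frequencies(tokens):
    """Return each term's count divided by the total number of tokens."""
    total = len(tokens)
    if total == 0:
        return {}
    counts = Counter(tokens)
    return {term: count / total for term, count in counts.items()}


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical two-document corpus, used only for illustration.
corpus = ["The cat sat on the mat.", "The dog sat on the log."]

# Tokenize each document separately so n-grams never span two documents.
tokenized_docs = [tokenize(doc) for doc in corpus]
all_tokens = [token for doc in tokenized_docs for token in doc]

vocabulary = sorted(set(all_tokens))          # all distinct words in the corpus
frequencies = term_frequencies(all_tokens)    # corpus-level term frequencies
bigrams = [ngrams(doc, 2) for doc in tokenized_docs]

print("Vocabulary:", vocabulary)
print("Frequency of 'the':", round(frequencies["the"], 3))
print("Bigrams in first document:", bigrams[0])
```

Real projects usually rely on a more careful tokenizer (for example, one that handles punctuation, hyphenation, and stop words), so treat the splitting rule here as a placeholder.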