
Text Analysis at Penn Libraries

A guide to text mining tools and methods

Software for Text Analysis

Once you have built your corpus, you will need specialized software to analyze it. Different tools suit different disciplines and research questions. The tools listed below do not require programming knowledge.

  • Voyant Tools: An open-source, web-based text reading and analysis environment.
  • MALLET: MAchine Learning for LanguagE Toolkit, a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
  • AntConc: (Tutorial) A freeware corpus analysis toolkit for concordancing and for finding clusters (frequency patterns of word sequences) or n-grams (sequences of n words within your corpus or document).
  • Google Pinpoint: Part of Google’s Journalist Studio; lets you search for keywords and identify entities in large amounts of text.
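
None of the tools above require code, but it can help to see what "concordancing" and "n-grams" mean in practice. The following is a minimal Python sketch of both ideas, using only the standard library. The function names (tokenize, ngrams, concordance) and the sample sentence are illustrative assumptions; this is not how AntConc or any other listed tool is implemented.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def concordance(tokens, keyword, width=3):
    """Return keyword-in-context lines with up to `width` tokens on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


# Hypothetical two-sentence "corpus" for illustration only.
sample = "The cat sat on the mat. The cat saw the dog on the mat."
tokens = tokenize(sample)

# Count two-word sequences (bigrams) and show the most frequent ones.
bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts.most_common(3))

# Show each occurrence of "cat" with its surrounding words.
for line in concordance(tokens, "cat"):
    print(line)
```

A real corpus would need more careful tokenization (punctuation, hyphenation, non-English characters), which is part of what dedicated tools handle for you.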

Text and Data Mining Platforms

ProQuest TDM Studio

ProQuest TDM Studio allows researchers to mine and computationally analyze large volumes of published content from over 150 databases and 50,000 publications, including magazines, books, conference papers, dissertations and theses, scholarly journals, and current and historical newspapers such as the Wall Street Journal, the New York Times, and the Washington Post. Researchers can work together in a coding workbench to:

  • build a corpus
  • conduct data analysis, text mining, and visualization
  • search, find patterns, discover relationships, and analyze large amounts of content

Anyone with a valid UPenn email address can access a workbench by logging in. By default, each workbench can support 1-5 users. 


Constellate

Constellate is an exploratory platform for text analysis that allows users to mine, visualize, and computationally analyze content from major collections like JSTOR, Portico, Chronicling America, DocSouth, and RevealDigital. Create a ready-to-analyze dataset with point-and-click ease from over 30 million documents, including primary and secondary texts relevant to every discipline and perfect for learning text analytics or conducting original research. Constellate is in beta and still under development.

More information about Constellate can be found here