
Text Analysis

A guide to text mining tools and methods

What is Topic Modeling?

Topic modeling is a type of statistical modeling used to identify topics or themes within a collection of documents. It involves automatically clustering words that tend to co-occur frequently across multiple documents, with the aim of identifying groups of words that represent distinct topics. The ultimate goal is to identify the underlying themes or topics that run through a large corpus of text data.
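To make the idea of co-occurrence concrete, here is a minimal sketch that counts which word pairs appear together in more than one document. The four-sentence corpus and the tiny stopword set are made up for illustration, and counting pairs is only the intuition behind topic modeling, not a topic model in itself.

```python
from collections import Counter
from itertools import combinations

# Toy corpus, invented for illustration only.
docs = [
    "the senate passed the budget bill",
    "the budget vote split the senate",
    "the team won the championship game",
    "fans cheered as the team won the game",
]
stopwords = {"the", "as"}  # hypothetical minimal stopword list

pair_counts = Counter()
for doc in docs:
    # Unique non-stopword terms in this document, sorted so that each
    # pair is always recorded in the same (alphabetical) order.
    terms = sorted(set(doc.split()) - stopwords)
    pair_counts.update(combinations(terms, 2))

# Keep only pairs that occur together in more than one document.
frequent_pairs = {pair: n for pair, n in pair_counts.items() if n > 1}
print(frequent_pairs)
```

The surviving pairs fall into two groups (budget and senate; game, team, and won), which is the kind of grouping a topic model infers statistically across a much larger corpus.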

When to Use Topic Modeling?

Topic modeling is most useful when you want to discover what a collection of text is about without defining categories in advance. It has numerous applications, including:

1. Document classification: categorizing documents based on their content.

2. Information retrieval: assisting search engines in finding the most relevant documents for a given query.

3. Text summarization: condensing a large piece of writing into a shorter summary.

4. Customer segmentation: grouping customers based on their feedback or reviews.

5. Sentiment analysis: identifying the topics in a large collection of text so that tone (positive, negative, or neutral) can then be assessed topic by topic.

6. Exploratory data analysis: discovering hidden patterns and themes in a large corpus of text data.


Here are some examples of research topics and questions in the social sciences that could benefit from topic modeling:

1. History:

  • Analyzing the discourse surrounding the American Civil War in historical texts and popular media.
  • Exploring the evolution of history education over time.

2. Sociology:

  • Analyzing the usage of terminology and stereotypes related to gender in newspapers and popular media.
  • Exploring the most prevalent issues and trends discussed in Reddit forums related to mental health.

3. Political Science:

  • Analyzing the main policy issues and their implications in the US presidential debates through the analysis of transcripts.
  • Studying the patterns of communication and negotiation strategies among diplomats using official documents.

4. Psychology:

  • Identifying common mental health issues discussed on patient forums and analyzing their sentiment.
  • Examining the language used in therapy sessions to identify patterns of communication.

5. Economics:

  • Studying the patterns of discussion around inflation in financial news articles.
  • Identifying the most prominent economic trends and events discussed in popular media.

6. Education:

  • Identifying the most discussed topics in education-related Twitter conversations among teachers and policymakers.
  • Analyzing the use of technology in education research articles.

7. Communication Studies:

  • Analyzing the most popular topics and themes in TED Talks through transcript analysis.
  • Identifying the most prevalent communication strategies used by political campaign advertisements on social media.

Topic modeling using Voyant

Voyant is an online tool for text analysis that can be used for a variety of tasks, including topic modeling. Here is a concise guide on how to do topic modeling with Voyant:

  1. To begin, upload your text corpus to Voyant. You can do this by either pasting your text or uploading a file in one of several formats, such as TXT, HTML, or PDF.
  2. Next, use the "Topics" tool within the "Document Tools" option to access the topic modeling interface.
  3. Choose the number of topics you want to generate using the slider, ranging from 1 to 200 (default is 25). You can also use the search box to find whole or partial words displayed in the topics.
  4. If necessary, exclude stopwords using the "Options" icon. You can also change the maximum number of terms per document used for topic modeling, but be aware that large corpora may cause problems for the server and your browser.
  5. Once the topic modeling is complete, Voyant will display the topics and associated words. You can click on a topic to view the documents and words most strongly associated with it.



Topic modeling using Python and R

The general process of topic modeling in R and Python includes:

  1. Import the necessary libraries: Import the libraries you need for text processing and topic modeling. Popular options in Python include NLTK, Gensim, and scikit-learn; popular options in R include text2vec, lda, stm, topicmodels, and mallet. More information can be found here.
  2. Load the data.
  3. Preprocess the data: Clean the data by removing stop words, punctuation, and other non-relevant information. This step may also involve stemming or lemmatization to reduce the number of unique words in the data.
  4. Create a document-term matrix: Convert the preprocessed text data into a matrix of word counts, where each row represents a document and each column represents a unique word.
  5. Run the topic modeling algorithm: There are several popular topic modeling algorithms, including Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF). Run the algorithm on the document-term matrix to identify the underlying topics or themes in the data.
  6. Interpret the results: Review the output of the topic modeling algorithm to see which topics or themes emerged, and examine the most relevant words for each topic to understand the underlying concepts.
  7. Refine the model: Depending on the quality of the results, you may need to refine the model by adjusting the algorithm parameters or preprocessing steps. This may involve iterating on steps 3-6 until you achieve the desired results.


