Text Analysis

A guide to text mining tools and methods

Applied Data Science Librarian

Jajwalya Karajgikar
Subjects: Statistics


This guide was created by Rachel Liu, an Assistant for Research Data and Digital Scholarship Text and Data Mining. Rachel is currently a graduate student majoring in Learning Sciences and Technologies.

R / RStudio

A companion to our R/RStudio Libguide, this guide walks you through several text analysis tools available in R.

R is a statistical programming language that can be used for text analysis. RStudio is an integrated development environment (IDE) for working on R projects.

Interested in learning more about R? Check out our upcoming workshops and events!

RPenn Group

Date: Second Thursday of the Month
Time: 12:00pm - 1:00pm
Campus: Van Pelt-Dietrich Library Center
Collaborative Classroom (Room 113)

Packages in R

Data Cleaning Packages

Stringr Package for R:

  • Used for manipulating and working with character strings in R
  • Provides a set of functions that make it easier to perform common string operations such as searching, splitting, and replacing strings
  • Built on top of the stringi package, with consistently named functions that share the str_ prefix
  • Provides a useful alternative to the base R string manipulation functions
  • Example: Text Processing in R
  • Documentation/Tutorial: Documentation, Introduction to Stringr
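
The searching, replacing, and splitting operations listed above can be sketched with a few stringr calls. This is a minimal illustration rather than a recommended workflow; the sample sentences and variable names are invented for the example.

```r
library(stringr)

# Hypothetical sample text, invented for illustration
lines <- c("The quick brown fox.", "A lazy dog sleeps!", "Foxes and dogs.")

# Searching: which strings contain "fox", ignoring case?
has_fox <- str_detect(lines, regex("fox", ignore_case = TRUE))

# Replacing: strip punctuation, then convert to lowercase
clean <- str_to_lower(str_replace_all(lines, "[[:punct:]]", ""))

# Splitting: break each cleaned string into words on whitespace
words <- str_split(clean, "\\s+")

print(has_fox)
print(words[[1]])
```

Each function takes the string as its first argument, which is what lets stringr calls chain together with the pipe operator.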

Tidyverse Package for R:

  • Collection of packages designed for data science and analysis
  • Includes core packages such as dplyr, ggplot2, tidyr, and readr, among others
  • Provides consistent syntax and data structures for working with data
  • Offers tools for data manipulation, visualization, and exploration
  • Facilitates the data cleaning process with functions for handling missing values (such as tidyr's replace_na() and fill()), variable transformation, and more
  • Emphasizes the principles of tidy data, making it easier to work with and analyze data
  • Example: Text Analytics with R: Classification
  • Documentation/Tutorial: Documentation, Text Mining in R Section: Exploratory Text Analysis
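
As a rough sketch of the cleaning steps described above, the following uses dplyr and tidyr on a small data frame. The data, column names, and the choice to fill missing ratings with the mean are all assumptions made for illustration, not a recommendation for real data.

```r
library(dplyr)
library(tidyr)

# Hypothetical review data with missing values and stray whitespace
reviews <- data.frame(
  id     = 1:4,
  text   = c("Great book", NA, "  slow start ", "Great ending"),
  rating = c(5, 3, NA, 4),
  stringsAsFactors = FALSE
)

cleaned <- reviews %>%
  drop_na(text) %>%                                  # drop rows with no text
  mutate(
    rating  = replace_na(rating, mean(rating, na.rm = TRUE)),  # fill missing rating
    text    = tolower(trimws(text)),                 # normalize case and whitespace
    n_chars = nchar(text)                            # derive a new variable
  )

print(cleaned)
```

The pipeline reads top to bottom: remove unusable rows, fill the remaining gaps, then transform the text column.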

Data Pre-Processing Packages

Quanteda package for R:

koRpus Package for R:

  • Used for text analysis in various languages
  • Provides tools for text data preprocessing, analysis, and visualization
  • Includes advanced tools for annotation, lemmatization, and error correction of text data
  • Offers support for multiple languages
  • Provides a variety of plotting functions for creating visualizations of text data
  • Specifically designed for working with text data, so it may be more efficient and user-friendly than more general-purpose R packages
  • Example: An analysis of Peter Pan using the R package koRpus
  • Documentation/Tutorial: Documentation, Using the koRpus Package for Text Analysis

Spacyr Package for R:

  • Provides functionality for tokenizing, lemmatizing, and part-of-speech tagging of text data
  • Includes support for named entity recognition, dependency parsing, and sentence boundary detection
  • Wraps the Python spaCy library; language models, including custom models trained in spaCy itself, are loaded through spacy_initialize()
  • Provides integration with other R packages for advanced text analysis workflows
  • Particularly useful for handling large datasets and complex text analysis tasks
  • Example: Sentiment Analysis on the COVID-19 Update Speeches
  • Documentation/Tutorial: Documentation, A Guide to Using spacyr

Tidytext Package for R:

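  • Tokenizes text into a one-token-per-row data frame with unnest_tokens(), which can then be counted with dplyr (stemming is available through companion packages such as SnowballC)
  • Supports sentiment analysis through bundled lexicons (get_sentiments()) and tidies the output of topic models fitted with packages such as topicmodels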
  • Integrates with other tidyverse packages for streamlined data analysis workflows
  • Utilizes n-grams and TF-IDF to explore word relationships and context
  • Provides pre-processed datasets and extensive documentation with practical examples
  • Example: tidytext: Text mining using tidy tools
  • Documentation/Tutorial: Documentation, Introduction to tidytext
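
The tokenizing, counting, and TF-IDF steps mentioned above might look like the following minimal sketch. The two short documents and the column names (doc, text, word) are invented for the example; stop_words is the stop word list that ships with tidytext.

```r
library(dplyr)
library(tidytext)

# Hypothetical two-document corpus
docs <- data.frame(
  doc  = c("a", "b"),
  text = c("The cat sat on the mat.", "The dog chased the cat."),
  stringsAsFactors = FALSE
)

# One row per word; unnest_tokens() lowercases and strips punctuation by default
tokens <- docs %>% unnest_tokens(word, text)

# Remove stop words, then count words within each document
word_counts <- tokens %>%
  anti_join(stop_words, by = "word") %>%
  count(doc, word, sort = TRUE)

# Add tf, idf, and tf_idf columns
tf_idf <- word_counts %>% bind_tf_idf(word, doc, n)

print(tf_idf)
```

A word that appears in every document, such as "cat" here, receives an idf of zero, so its tf_idf is zero as well.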

Tokenizers Package for R:

  • Provides functions for tokenizing text data, which involves breaking text into smaller units such as words or phrases
  • Supports various tokenization methods, including word, character, and n-gram tokenization
  • Offers the ability to remove punctuation and stopwords from the text during tokenization
  • Provides options for customizing tokenization rules, such as specifying a minimum and maximum token length or using regular expressions
  • Supports tokenization of multiple languages, including English, French, and German
  • Example: Chapter 2 Tokenization
  • Documentation/Tutorial: Documentation, Introduction to the tokenizers Package
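
A short sketch of the tokenization methods described above, using functions from the tokenizers package. The sample sentence and the small stop word vector are made up for illustration.

```r
library(tokenizers)

# Hypothetical sample text
text <- "Text mining turns documents into data. Tokens are the first step."

# Each function returns a list with one element per input document
words     <- tokenize_words(text)[[1]]          # lowercased, punctuation stripped
bigrams   <- tokenize_ngrams(text, n = 2)[[1]]  # two-word sequences
sentences <- tokenize_sentences(text)[[1]]      # sentence boundaries

# Stop words can be dropped during tokenization
no_stops <- tokenize_words(text, stopwords = c("the", "are", "into"))[[1]]

print(words)
print(bigrams)
```

Because the result is a plain list of character vectors, it can be passed directly to other packages or converted to a data frame.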

Topic Modeling Packages

Text2vec Package for R:

  • Used for large-scale text data analysis, especially in the context of machine learning and deep learning models
  • Provides tools for text data preprocessing, including tokenization, normalization, and stopword removal
  • Offers a variety of methods for creating document-term matrices and word embeddings
  • Includes functions for performing dimensionality reduction, topic modeling, and clustering on text data
  • Specifically designed for scalability, so it can handle large text datasets with high efficiency
  • Example: Predictive Modeling With Text Features
  • Documentation/Tutorial: Documentation, Analyzing Texts with the text2vec package

lda Package for R:

  • Conducts interactive topic modeling and text analysis
  • Offers a range of algorithms, including the widely used LDA algorithm
  • Provides text preprocessing functions, such as cleaning, stemming, and stopword removal
  • Provides the ability to customize the number of topics and to incorporate covariates into the model
  • Note: LDAvis Package for R provides interactive visualization tools for exploring and interpreting the results of topic modeling and text analysis
  • Example: lda: Collapsed Gibbs Sampling Methods for Topic Models
  • Documentation/Tutorial: Documentation, Topic Modeling with R

STM Package for R:

Topicmodels Package for R:

  • Fits various topic models, including LDA and CTM
  • Evaluates fitted models with measures such as perplexity (perplexity()) and log-likelihood (logLik())
  • Supports parallel computing for faster model fitting and evaluation
  • Customizes the Gibbs sampling algorithm used in the LDA model
  • Predicts topic distribution of new documents based on fitted models
  • Produces term and topic output that other packages can visualize, for example as word clouds or dendrograms
  • Example: NLP in R: Topic Modelling
  • Documentation/Tutorial: Documentation, Topic modeling
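
The fitting, inspection, and evaluation steps above can be sketched as follows. This is a toy illustration under stated assumptions: the four sentences are invented, the document-term matrix is built with the tm package, and a corpus this small will not produce meaningful topics.

```r
library(tm)
library(topicmodels)

# Hypothetical mini-corpus: two sports sentences, two politics sentences
texts <- c(
  "The striker scored a goal in the football match.",
  "The goalkeeper saved the penalty during the match.",
  "The senate passed the budget bill after a long vote.",
  "Voters backed the new tax bill in the election."
)

# Build a document-term matrix, dropping punctuation and English stop words
dtm <- DocumentTermMatrix(
  VCorpus(VectorSource(texts)),
  control = list(removePunctuation = TRUE, stopwords = TRUE)
)

# Fit a two-topic LDA model; the seed makes the run reproducible
lda_fit <- LDA(dtm, k = 2, control = list(seed = 1234))

top_terms  <- terms(lda_fit, 3)            # top 3 terms per topic
doc_topics <- posterior(lda_fit)$topics    # per-document topic proportions
fit_perp   <- perplexity(lda_fit, dtm)     # lower values indicate better fit

print(top_terms)
print(round(doc_topics, 2))
```

Passing new documents to posterior() through its newdata argument gives topic proportions for text the model has not seen.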

MALLET Package for R:

  • Provides tools for topic modeling and other natural language processing tasks
  • Implements several topic modeling algorithms, including Latent Dirichlet Allocation (LDA) and Hierarchical LDA
  • Supports input data in various formats, including plain text, CSV, and JSON
  • Offers functions for text preprocessing, such as tokenization, stopword removal, and stemming
  • Provides options for customizing topic models, such as specifying the number of topics and tuning hyperparameters
  • Supports advanced features, such as parallel processing and visualizations of topic models
  • Example: mallet.Rmd
  • Documentation/Tutorial: Documentation, Introduction to R mallet

Sentiment Analysis Packages

SentimentAnalysis package for R:

  • Analyzes sentiment of text data using machine learning
  • Classifies text as positive, negative, or neutral
  • Supports multiple languages
  • Includes pre-built classifiers or custom classifiers using user-provided training data
  • Functions for tokenizing, cleaning, and preparing text data
  • Handles various data formats
  • Useful for sentiment analysis in customer feedback, social media monitoring, and market research
  • Example: SentimentAnalysis Vignette
  • Documentation/Tutorial: Documentation, Sentiment Analysis

cleanNLP Package for R:

  • Cleans and preprocesses text data, including removing stop words, stemming, and tokenizing text
  • Performs part-of-speech tagging, named entity recognition, and dependency parsing to extract relevant information from text data
  • Supports multiple languages
  • Integrates with machine learning packages, including caret, for sentiment analysis and text classification
  • Supports various input data types, including text files, character vectors, and data frames
  • Example: cleanNLP: A Tidy Data Model for Natural Language Processing
  • Documentation/Tutorial: Documentation, Introduction to the cleanNLP package

Other Useful Packages

TM Package for R:

  • Provides tools for creating a corpus, cleaning and preprocessing text data, and conducting analyses such as topic modeling, clustering, and sentiment analysis
  • Offers a range of functions for text transformation, including stemming, stopword removal, and n-gram creation
  • Provides visualization tools, such as word clouds and frequency distributions, to aid in data exploration
  • Produces document-term matrices that can feed text classification and prediction methods, such as support vector machines
  • Example: Using the TM package
  • Documentation/Tutorial: Documentation, Introduction to the TM Package Text Mining in R
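
The corpus creation and preprocessing steps described above might be sketched like this. The three sentences are invented for the example, and the order of the cleaning steps is one reasonable choice among several.

```r
library(tm)

# Hypothetical sample documents
texts <- c("Cats chase mice.", "Dogs chase cats!", "Mice eat cheese.")

# Create a corpus from a character vector
corpus <- VCorpus(VectorSource(texts))

# Apply cleaning transformations one at a time
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# Rows are documents, columns are terms
dtm <- DocumentTermMatrix(corpus)

# Total frequency of each term across the corpus
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
print(freq)
```

The resulting term frequencies are the usual input for word clouds, and the document-term matrix is the usual input for topic models.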

LSAfun Package for R:

  • Can be used for tasks such as text classification, information retrieval, and text summarization
  • Implements Latent Semantic Analysis (LSA) in R for text analysis and natural language processing
  • Supports both unsupervised and supervised learning of LSA models
  • Provides tools for dimensionality reduction and feature extraction from text data
  • Includes functions for calculating cosine similarity and building similarity matrices
  • Builds on the basic LSA functionality of the lsa package by adding features such as normalization, weighting, dimensionality reduction, and statistical analysis of the results; it also includes functions for feature selection and topic modeling
  • Example: Latent Semantic Analysis
  • Documentation/Tutorial: Documentation, A Guide to Text Analysis with Latent Semantic Analysis

RWeka Package for R:

  • R package for machine learning with Weka, a popular Java-based machine learning toolkit
  • Supports a wide variety of data mining and machine learning tasks, including classification, regression, clustering, and association rule mining
  • Offers a variety of feature selection and feature extraction methods
  • Provides tools for evaluating model performance and conducting cross-validation
  • Example: RWeka Odds and Ends
  • Documentation/Tutorial: Documentation, Machine learning with R

tsne Package for R:

  • Implements t-distributed stochastic neighbor embedding (t-SNE) for dimensionality reduction and data visualization
  • Reduces high-dimensional data to two or three dimensions for easy visualization and interpretation
  • Supports different distance metrics for measuring similarity between data points, including Euclidean distance and cosine distance
  • Provides options for controlling the number of iterations and perplexity parameters to adjust the quality of the visualization
  • Offers parallelization capabilities for faster computation on large datasets
  • Can handle both numeric and categorical data, making it suitable for various applications
  • Example: Getting started with t-SNE for biologist (R)
  • Documentation/Tutorial: Documentation, How To Make tSNE plot in R

Wordcloud Package for R:

  • Used for generating word clouds, which are visual representations of the most frequently occurring words in a corpus of text
  • Provides functions for creating customizable word clouds, including the ability to change font size, color, and orientation of words
  • Offers options for preprocessing text data, such as removing stopwords and stemming words
  • Supports a variety of input data types, including character vectors, text files, and corpora created with other R packages
  • Provides additional features, such as the ability to mask word clouds with custom shapes and to weight words by their frequency or other user-defined values
  • Example: How to Make a Wordcloud Using R
  • Documentation/Tutorial: Documentation, How to Generate Word Clouds in R

Source: Text mining and wordcloud with R (
