Guides: R for Business Guide: Perform text analysis

Text Mining with R

Text Mining and Sentiment Analysis

Introduction to Text Mining with R

Packages

Data Cleaning Packages

stringr Package for R:

Used for manipulating and working with character strings in R
Provides a set of functions that make it easier to perform common string operations such as searching, splitting, and replacing strings
Built on top of the stringi package, with consistent function names (all prefixed with str_) and a consistent argument order
Provides a useful alternative to the base R string manipulation functions
Example: Text Processing in R
Documentation/Tutorial: Documentation, Introduction to Stringr
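
The searching, replacing, and splitting operations described above can be sketched as follows. This is a minimal illustration on made-up review text (the `reviews` vector is an assumption, not data from the guide), using only `str_detect()`, `str_replace_all()`, `str_to_lower()`, and `str_split()` from stringr.

```r
library(stringr)

# Hypothetical customer reviews, invented for this sketch
reviews <- c("Great product, fast shipping!",
             "Terrible support... would NOT buy again.",
             "Okay value for the price.")

# Searching: which reviews mention shipping or support?
mentions_service <- str_detect(reviews, "ship|support")

# Replacing: strip punctuation, then lowercase the text
cleaned <- str_to_lower(str_replace_all(reviews, "[[:punct:]]", ""))

# Splitting: break each cleaned review into words on whitespace
words <- str_split(cleaned, "\\s+")

print(mentions_service)
print(cleaned)
```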

tidyverse Package for R:

Collection of packages designed for data science and analysis
Includes core packages such as dplyr, ggplot2, tidyr, and readr, among others
Provides consistent syntax and data structures for working with data
Offers tools for data manipulation, visualization, and exploration
Facilitates data cleaning with functions for handling missing values (for example, replace_na() and fill() in tidyr), transforming variables (mutate() in dplyr), and more
Emphasizes the principles of tidy data, making it easier to work with and analyze data
Example: Text Analytics with R: Classification
Documentation/Tutorial: Documentation, Text Mining in R Section: Exploratory Text Analysis
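
As a rough sketch of the cleaning steps mentioned above, the example below fills a missing category, drops rows with no text, and normalizes the remaining comments. The `feedback` table and its column names are invented for illustration; only dplyr and tidyr are loaded rather than the whole tidyverse.

```r
library(dplyr)
library(tidyr)

# Hypothetical feedback table, invented for this sketch
feedback <- tibble(
  id = 1:4,
  region = c("East", NA, "West", "East"),
  comment = c("  Fast delivery ", "Late order", NA, "fast DELIVERY")
)

cleaned <- feedback %>%
  replace_na(list(region = "Unknown")) %>%    # fill missing regions
  drop_na(comment) %>%                         # drop rows that have no text
  mutate(comment = tolower(trimws(comment)))   # normalize case and whitespace

print(cleaned)
```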

Data Pre-Processing Packages

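quanteda Package for R:

Text data pre-processing and mining features including tokenization, stemming, stopword removal, and n-gram extraction; lemmatization is not built in, but it can be approximated by replacing tokens from a user-supplied lemma table
Visualization support including word clouds, plus dendrograms and heatmaps built from its similarity measures; since quanteda version 3 the plotting functions live in the companion package quanteda.textplots
Built-in support for creating document-feature matrices, which can be converted with convert() for use in topic-modeling packages such as topicmodels and stm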
Supports multiple languages including English, Spanish, French, and German
Example: Textual data visualization, Advancing Text Mining with R and quanteda
Documentation/Tutorial: Documentation, Quanteda Tutorials
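
A minimal sketch of the pre-processing pipeline described above, ending in a document-feature matrix. The two documents are made up for illustration, and the particular order of steps (lowercase, remove stopwords, stem) is one reasonable choice rather than a required one.

```r
library(quanteda)

# Hypothetical documents, invented for this sketch
docs <- c(d1 = "Customers love the new mobile app.",
          d2 = "The mobile app crashes and customers complain.")

toks <- tokens(docs, remove_punct = TRUE)      # tokenize, dropping punctuation
toks <- tokens_tolower(toks)                   # lowercase
toks <- tokens_remove(toks, stopwords("en"))   # remove English stopwords
toks <- tokens_wordstem(toks)                  # stem the remaining tokens

dfmat <- dfm(toks)                             # document-feature matrix
print(topfeatures(dfmat, n = 5))
```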

koRpus Package for R:

Used for text analysis in various languages, with language support added through separate packages such as koRpus.lang.en
Provides tools for text data preprocessing, analysis, and visualization
Includes tools for part-of-speech tagging and lemmatization (through the external TreeTagger program), manual correction of tagging results, and readability and lexical diversity measures
Provides a variety of plotting functions for creating visualizations of text data
Specifically designed for working with text data, so it may be more convenient for these tasks than general-purpose R packages
Example: An analysis of Peter Pan using the R package koRpus
Documentation/Tutorial: Documentation, Using the koRpus Package for Text Analysis

spacyr Package for R:

Provides functionality for tokenizing, lemmatizing, and part-of-speech tagging of text data
Includes support for named entity recognition, dependency parsing, and sentence boundary detection
Works as a wrapper around the Python library spaCy, so it requires a Python installation with spaCy and a language model; custom models are trained in spaCy itself rather than through spacyr
Provides integration with other R packages, such as quanteda and tidytext, for advanced text analysis workflows
Particularly useful for handling large datasets and complex text analysis tasks
Example: Sentiment Analysis on the COVID-19 Update Speeches
Documentation/Tutorial: Documentation, A Guide to Using spacyr

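tidytext Package for R:

Tokenizes text into a tidy one-token-per-row format with unnest_tokens(); counting is then done with dplyr, and stemming with a companion package such as SnowballC
Supports sentiment analysis through sentiment lexicons (get_sentiments()) and topic modeling through tidiers for models fitted in packages such as topicmodels
Integrates with other tidyverse packages for streamlined data analysis workflows
Utilizes n-grams and TF-IDF (bind_tf_idf()) to explore word relationships and context
Provides datasets such as a stop word list (stop_words) and extensive documentation with practical examples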
Example: tidytext: Text mining using tidy tools
Documentation/Tutorial: Documentation, Introduction to tidytext
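
The tidy workflow described above (tokenize, remove stop words, count, weight by TF-IDF) might look like the sketch below. The `reviews` table is invented for illustration; `unnest_tokens()`, `stop_words`, and `bind_tf_idf()` come from tidytext, and the counting is done with dplyr.

```r
library(dplyr)
library(tidytext)

# Hypothetical reviews, invented for this sketch
reviews <- tibble(
  id = 1:3,
  text = c("The service was excellent and the staff were friendly",
           "Terrible delay and poor service",
           "Excellent value")
)

# One lowercase word per row, with common stop words removed
tidy_words <- reviews %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")

# Overall word frequencies
word_counts <- count(tidy_words, word, sort = TRUE)

# TF-IDF per review: words that are distinctive for a given document
tf_idf <- tidy_words %>%
  count(id, word) %>%
  bind_tf_idf(word, id, n)

print(word_counts)
```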

tokenizers Package for R:

Provides functions for tokenizing text data, which involves breaking text into smaller units such as words or phrases
Supports various tokenization methods, including word, character, and n-gram tokenization
Offers the ability to remove punctuation and stopwords from the text during tokenization
Provides options for customizing tokenization, such as specifying minimum and maximum n-gram sizes (n_min and n) or splitting on a regular expression with tokenize_regex()
Supports tokenization of multiple languages, including English, French, and German
Example: Chapter 2 Tokenization
Documentation/Tutorial: Documentation, Introduction to the tokenizers Package
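
A short sketch of word and n-gram tokenization with the tokenizers package. The sentence and the stopword list passed to `tokenize_words()` are made up for illustration.

```r
library(tokenizers)

# Hypothetical sentence, invented for this sketch
text <- "Text mining turns unstructured text into data."

# Word tokens: lowercased, punctuation stripped by default
words <- tokenize_words(text)[[1]]

# Bigrams: every pair of adjacent words
bigrams <- tokenize_ngrams(text, n = 2)[[1]]

# Word tokens with a custom stopword list removed
no_stop <- tokenize_words(text, stopwords = c("into", "turns"))[[1]]

print(words)
print(bigrams)
```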

Topic Modeling Packages

text2vec Package for R:

Used for large-scale text data analysis, especially in the context of machine learning and deep learning models
Provides tools for text data preprocessing, including tokenization, normalization, and stopword removal
Offers a variety of methods for creating document-term matrices and word embeddings
Includes functions for performing dimensionality reduction, topic modeling, and clustering on text data
Specifically designed for scalability, so it can handle large text datasets with high efficiency
Example: Predictive Modeling With Text Features
Documentation/Tutorial: Documentation, Analyzing Texts with the text2vec package
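
To make the document-term matrix step concrete, here is a minimal sketch of the usual text2vec pattern: an iterator over tokens, a vocabulary, and a vectorizer. The three documents are invented for illustration.

```r
library(text2vec)

# Hypothetical documents, invented for this sketch
docs <- c("Sales grew in the east region",
          "Sales fell in the west region",
          "Marketing spend grew overall")

# Iterator that lowercases and tokenizes the documents on the fly
it <- itoken(docs, preprocessor = tolower, tokenizer = word_tokenizer,
             progressbar = FALSE)

vocab <- create_vocabulary(it)                    # term counts across documents
dtm <- create_dtm(it, vocab_vectorizer(vocab))    # sparse document-term matrix

print(dim(dtm))
```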

lda Package for R:

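Fits topic models using collapsed Gibbs sampling
Offers several models, including the widely used latent Dirichlet allocation (LDA), supervised LDA (sLDA), and the relational topic model (RTM)
Provides lexicalize() to convert text into the document and vocabulary format the samplers expect; cleaning, stemming, and stopword removal are typically done beforehand with another package
Provides the ability to customize the number of topics, and supervised LDA can relate topics to a document-level response variable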
Note: LDAvis Package for R provides interactive visualization tools for exploring and interpreting the results of topic modeling and text analysis
Example: lda: Collapsed Gibbs Sampling Methods for Topic Models
Documentation/Tutorial: Documentation, Topic Modeling with R

stm Package for R:

Provides data preprocessing functions, such as cleaning, stemming, and stopword removal
Fits the structural topic model, an extension of topic models such as LDA that incorporates document-level metadata
The structural topic model identifies latent topics and their prevalence in each document, and lets covariates (for example, date or author) affect topic prevalence and topic content
Users can preprocess text data to create a document-term matrix
Example: An Introduction to the Structural Topic Model (STM)
Documentation/Tutorial: Documentation, Structural Topic Models: stm R package

topicmodels Package for R:

Fits various topic models, including LDA and CTM
Evaluates model fit using measures such as perplexity() and logLik(); coherence measures are available from other packages
Fits LDA by either variational EM or Gibbs sampling, selected through the method argument
Customizes the Gibbs sampling algorithm used in the LDA model (for example, burn-in, thinning, and number of iterations) through the control argument
Predicts topic distribution of new documents based on fitted models with posterior()
Returns top terms and topic assignments through terms() and topics(), which can be passed to plotting packages such as ggplot2 or wordcloud for data exploration
Example: NLP in R: Topic Modelling
Documentation/Tutorial: Documentation, Topic modeling
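
A minimal sketch of fitting an LDA model with topicmodels. The four documents are made up and far too small for a meaningful model; the point is only to show the shape of the workflow. The document-term matrix is built with tm here, which is one common choice and an assumption of this sketch.

```r
library(tm)
library(topicmodels)

# Hypothetical documents, invented for this sketch
docs <- c("Stock prices rose as investors bought bank shares",
          "Investors sold shares after bank profits fell",
          "The team won the match with a late goal",
          "Fans cheered as the team scored another goal")

corpus <- VCorpus(VectorSource(docs))
dtm <- DocumentTermMatrix(corpus,
                          control = list(tolower = TRUE,
                                         removePunctuation = TRUE,
                                         stopwords = TRUE))

# Two topics; the seed makes the fit reproducible
lda_fit <- LDA(dtm, k = 2, control = list(seed = 123))

top_terms <- terms(lda_fit, 3)            # top 3 terms per topic
doc_topics <- posterior(lda_fit)$topics   # per-document topic proportions

print(top_terms)
```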

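mallet Package for R:

Provides an R interface (through rJava, so a Java installation is required) to the Java-based MALLET toolkit for topic modeling and other natural language processing tasks
The MALLET toolkit implements several topic modeling algorithms, including Latent Dirichlet Allocation (LDA) and Hierarchical LDA; the R interface focuses on LDA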
Supports input data in various formats, including plain text, CSV, and JSON
Offers functions for text preprocessing, such as tokenization, stopword removal, and stemming
Provides options for customizing topic models, such as specifying the number of topics and tuning hyperparameters
Supports advanced features, such as parallel processing and visualizations of topic models
Example: mallet.Rmd
Documentation/Tutorial: Documentation, Introduction to R mallet

Sentiment Analysis Packages

SentimentAnalysis Package for R:

Analyzes sentiment of text data using dictionary-based scoring
Classifies text as positive, negative, or neutral
Supports multiple languages
Includes pre-built dictionaries (such as Harvard-IV, Henry's, Loughran-McDonald, and QDAP) and can generate custom dictionaries from user-provided training data using statistical learning
Functions for tokenizing, cleaning, and preparing text data
Handles various data formats
Useful for sentiment analysis in customer feedback, social media monitoring, and market research
Example: SentimentAnalysis Vignette
Documentation/Tutorial: Documentation, Sentiment Analysis
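
The scoring and classification steps described above might look like this in practice. The three comments are invented for illustration, and the choice of the QDAP dictionary score is an assumption of this sketch; the package returns scores for its other dictionaries in the same data frame.

```r
library(SentimentAnalysis)

# Hypothetical customer comments, invented for this sketch
comments <- c("The new dashboard is great and easy to use.",
              "Delivery was late and the packaging was damaged.",
              "The invoice arrived on Tuesday.")

# One row per document, one column per dictionary-based score
scores <- analyzeSentiment(comments)

# Map the continuous QDAP score to negative / neutral / positive
direction <- convertToDirection(scores$SentimentQDAP)

print(data.frame(comment = comments, direction = direction))
```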

cleanNLP Package for R:

Tokenizes and lemmatizes text data through interchangeable backends (stringi, udpipe, spaCy, and CoreNLP)
Performs part-of-speech tagging, named entity recognition, and dependency parsing to extract relevant information from text data, depending on the backend chosen
Supports multiple languages
Returns annotations as tidy data frames that can feed sentiment analysis and text classification in other packages
Supports various input data types, including text files, character vectors, and data frames
Example: cleanNLP: A Tidy Data Model for Natural Language Processing
Documentation/Tutorial: Documentation, Introduction to the cleanNLP package

Other Useful Packages

tm Package for R:

Provides tools for creating a corpus, cleaning and preprocessing text data, and building the document-term matrices used as input for analyses such as topic modeling, clustering, and sentiment analysis
Offers a range of functions for text transformation, including stemming, stopword removal, and (with a custom tokenizer) n-gram creation
Provides exploration helpers such as findFreqTerms() and findAssocs(); word clouds and frequency plots are drawn with other packages such as wordcloud
Supplies input for text classification and prediction methods, such as support vector machines, which are implemented in other packages
Example: Using the TM package
Documentation/Tutorial: Documentation, Introduction to the TM Package Text Mining in R
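
A minimal sketch of the corpus, cleaning, and document-term matrix steps described above. The three documents are made up for illustration.

```r
library(tm)

# Hypothetical documents, invented for this sketch
docs <- c("The quarterly sales report shows strong growth.",
          "Sales growth slowed in the second quarter.",
          "Customers praised the new sales dashboard!")

corpus <- VCorpus(VectorSource(docs))
corpus <- tm_map(corpus, content_transformer(tolower))       # lowercase
corpus <- tm_map(corpus, removePunctuation)                  # drop punctuation
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop stopwords
corpus <- tm_map(corpus, stripWhitespace)                    # tidy spacing

dtm <- DocumentTermMatrix(corpus)

# Total frequency of each term across all documents
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
print(head(freq))
```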

LSAfun Package for R:

Can be used for tasks such as text classification, information retrieval, and text summarization
Implements Latent Semantic Analysis (LSA) in R for text analysis and natural language processing
Supports both unsupervised and supervised learning of LSA models
Provides tools for dimensionality reduction and feature extraction from text data
Includes functions for calculating cosine similarity and building similarity matrices
Builds on the basic LSA functionality provided by the lsa package by adding features like normalization, weighting, dimensionality reduction, and statistical analysis of the results; it also includes functions for feature selection and topic modeling
Example: Latent Semantic Analysis
Documentation/Tutorial: Documentation, A Guide to Text Analysis with Latent Semantic Analysis

RWeka Package for R:

R package for machine learning with Weka, a popular Java-based machine learning toolkit
Supports a wide variety of data mining and machine learning tasks, including classification, regression, clustering, and association rule mining
Offers a variety of feature selection and feature extraction methods
Provides tools for evaluating model performance and conducting cross-validation
Example: RWeka Odds and Ends
Documentation/Tutorial: Documentation, Machine learning with R

tsne Package for R:

Implements t-distributed stochastic neighbor embedding (t-SNE) for dimensionality reduction and data visualization
Reduces high-dimensional data to two or three dimensions for easy visualization and interpretation
Accepts either a numeric data matrix or a precomputed dist object, so similarity can be measured with Euclidean distance or with another metric computed beforehand
Provides options for controlling the number of iterations and perplexity parameters to adjust the quality of the visualization
Is written in pure R, so it can be slow on large datasets; the separate Rtsne package wraps a faster C++ implementation
Expects numeric input; categorical variables need to be encoded or converted to distances first
Example: Getting started with t-SNE for biologist (R)
Documentation/Tutorial: Documentation, How To Make tSNE plot in R

wordcloud Package for R:

Used for generating word clouds, which are visual representations of the most frequently occurring words in a corpus of text
Provides functions for creating customizable word clouds, including the ability to change font size, color, and orientation of words
Removes standard stopwords automatically when given raw text without frequencies (this relies on the tm package); stemming needs to be done beforehand
Supports a variety of input data types, including character vectors and corpora created with other R packages
Weights words by their frequency or other user-defined values; masking word clouds with custom shapes is offered by the separate wordcloud2 package
Example: How to Make a Wordcloud Using R
Documentation/Tutorial: Documentation, How to Generate Word Clouds in R
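
A small sketch of drawing a word cloud from precomputed frequencies. The words, counts, and colors are invented for illustration; in practice the frequencies would usually come from a document-term matrix or a word count table.

```r
library(wordcloud)

# Hypothetical word frequencies, invented for this sketch
words <- c("sales", "growth", "customer", "price", "quality", "delivery")
freq  <- c(40, 28, 22, 15, 9, 5)

set.seed(42)  # word placement is random, so fix the seed for repeatability
wordcloud(words = words, freq = freq,
          min.freq = 1,                 # include every word
          random.order = FALSE,         # most frequent words in the center
          rot.per = 0.2,                # share of words drawn vertically
          colors = c("grey40", "steelblue", "darkred"))
```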

Source: Text mining and wordcloud with R (https://r-graph-gallery.com/102-text-mining-and-wordcloud.html)

Business & Data Analysis Librarian

Kevin Thomas

He/Him/His

Subjects: Statistics

Credits

Content on this page was collected by Rachel Liu, an Assistant for Research Data and Digital Scholarship Text and Data Mining. Rachel is currently a graduate student majoring in Learning Sciences and Technologies.