
Text Analysis

A guide to text mining tools and methods

About scattertext

scattertext is a free, open-source Python library for building informative visualizations of text. As its name suggests, scattertext creates scatterplots for text: it plots the terms that distinguish one category of documents from another in an interactive HTML scatterplot, so you can see which words and phrases are characteristic of each category.

The Scattertext Google Colab Notebook version of the tutorial is also available if you would like to follow along. 

scattertext is particularly useful when:

  • You want to visualize how words are distributed between two categorical variables.
  • You want to create interactive scatter plots that display distinguishable terms in corpora.
  • You want to find or order characteristic terms or phrases and their associations.

Installation

scattertext (together with its data and models) can be installed from the Python Package Index (PyPI) with pip and setuptools.

Use the following commands to install spaCy and scattertext on your machine:

# [Mac Terminal]
pip3 install -U pip setuptools wheel
pip3 install -U spacy
pip3 install -U scattertext

# [Jupyter Notebook]
import sys
!{sys.executable} -m pip install spacy
!{sys.executable} -m spacy download en_core_web_sm
!{sys.executable} -m pip install scattertext

# [Conda install]
conda install -c conda-forge spacy
python -m spacy download en_core_web_sm
conda install -c conda-forge scattertext

Statistical Model

To create visualizations using scattertext, we also need to install and import spaCy first. spaCy offers statistical models for a variety of languages, which can be installed as individual Python modules. These models are the engines that let spaCy perform several NLP tasks, such as part-of-speech tagging, named entity recognition, and dependency parsing.

You can download these models for the English language by executing the following code:

# [Mac Terminal]
python3 -m spacy download en_core_web_lg
python3 -m spacy download en_core_web_sm

# [Jupyter Notebook]
!python3 -m spacy download en_core_web_sm
!python3 -m spacy download en_core_web_lg

# [Conda install]
conda install -c conda-forge spacy-model-en_core_web_sm

Once downloaded, these models can be loaded with spacy.load('model_name') in Python. You can therefore verify that a model was downloaded successfully by running the following code:

import spacy
nlp = spacy.load('en_core_web_sm')

If the nlp object is created, then it means that spaCy was installed and that models and data were successfully downloaded.
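If loading the model fails, it can help to find out which piece is missing before reinstalling everything. The following is a minimal sketch that uses only the Python standard library; the helper name missing_packages is hypothetical and not part of spaCy or scattertext. It relies on the fact that spaCy models installed with "spacy download" are importable Python packages.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Package names used in this guide; adjust the list to the model you downloaded.
required = ["spacy", "scattertext", "en_core_web_sm"]
missing = missing_packages(required)

if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All required packages are installed.")
```

Anything reported as not installed can then be installed with the matching command from the Installation section.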

Read Strings

For a given input string, you can use spaCy to create a processed object for accessing linguistic annotations:

text = ('For a given input string, you can use spaCy to create a processed object for accessing linguistic annotations.')
text_doc = nlp(text)

The input text string is then converted to an object that spaCy can understand. This method can be used to convert any text into a processed object for future analysis.

Read Text File

You can also convert a .txt file into a processed object. Notice that the .txt file needs to be in the current working directory, or you will have to specify its full path. A quick reminder that you can get the current working directory with os.getcwd() and change it with os.chdir() after importing os.

import os
from google.colab import drive    # Google Colab only; skip these two lines on your own machine
drive.mount('/content/drive')
# os.chdir('Path to directory')   # Change the working directory as needed

file = 'text.txt'
with open(file, encoding='utf-8') as f:    # Close the file automatically after reading
    file_text = f.read()
file_doc = nlp(file_text)

You may assume that variable names ending with the suffix _doc refer to spaCy Doc objects, that is, text that has already been processed by a spaCy model.
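As an alternative to changing the working directory, the file can be located with an explicit path. The sketch below uses only the standard library; the helper name read_text_file is hypothetical, and the temporary file stands in for your own text.txt so that the example runs anywhere.

```python
import tempfile
from pathlib import Path


def read_text_file(path, encoding="utf-8"):
    """Read a whole text file and return its contents as a string."""
    file_path = Path(path).expanduser()
    if not file_path.is_file():
        # Report the absolute path so a wrong working directory is easy to spot.
        raise FileNotFoundError(f"No such file: {file_path.resolve()}")
    return file_path.read_text(encoding=encoding)


# Demonstration with a temporary file; replace sample_path with your own path.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "text.txt"
    sample_path.write_text("scattertext creates scatterplots for text.", encoding="utf-8")
    file_text = read_text_file(sample_path)

print(file_text)
# With a loaded spaCy model, the string can then be processed as usual:
# file_doc = nlp(file_text)
```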

Finding Characteristic Terms and Their Associations

The following code creates a stand-alone HTML file that analyzes words used by Democrats and Republicans in the 2012 party conventions and outputs some notable term associations.

import scattertext as st
import spacy
from pprint import pprint

Next, assemble the data you want to analyze into a Pandas data frame. It should have at least two columns, the text you'd like to analyze, and the category you would like to study. Here, the text column contains convention speeches while the party column contains the party of the speaker.

# Get Dataframe using SampleCorpora from scattertext
convention_df = st.SampleCorpora.ConventionData2012.get_data()
convention_df

We need to turn the data frame into a Scattertext Corpus before analyzing it. To look for differences between parties, set the category_col parameter to 'party', and use the speeches in the text column as the texts to analyze by setting the text_col parameter to 'text'. Finally, pass a spaCy model into the nlp argument and call build() to construct the corpus.

# Create nlp object
nlp = spacy.load("en_core_web_sm")

# Turn it into a Scattertext Corpus
corpus = st.CorpusFromPandas(convention_df,
                             category_col='party',
                             text_col='text',
                             nlp=nlp).build()

To see characteristic terms in the corpus and the terms most associated with Democrats and Republicans, we can use scaled F-scores. First, we print the 10 terms that most differentiate the corpus from a general English corpus.

print(list(corpus.get_scaled_f_scores_vs_background().index[:10]))

Then, we can print out the terms that are most associated with Democrats:

term_freq_df = corpus.get_term_freq_df()
term_freq_df['Democratic Score'] = corpus.get_scaled_f_scores('democrat')
print(list(term_freq_df.sort_values(by='Democratic Score', ascending=False).index[:10]))
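To give a feel for what a scaled F-score rewards, here is a simplified, standard-library sketch of the general idea: a term scores highly for a category when it is both precise (mostly used by that category) and frequent within it. Both quantities are rescaled with a normal CDF and combined with a harmonic mean. The function names and toy counts are hypothetical, and scattertext's own implementation may differ in its details.

```python
from statistics import NormalDist, mean, pstdev


def normal_cdf_scale(values):
    """Map each value to (0, 1) with the CDF of a normal fitted to the values."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:  # All values identical: nothing to distinguish.
        return [0.5] * len(values)
    dist = NormalDist(mu, sigma)
    return [dist.cdf(value) for value in values]


def scaled_f_scores(category_counts, other_counts):
    """Harmonic mean of CDF-scaled precision and CDF-scaled in-category frequency.

    Assumes every term occurs at least once in the category or outside it.
    """
    terms = list(category_counts)
    category_total = sum(category_counts.values())
    precision = [category_counts[t] / (category_counts[t] + other_counts.get(t, 0))
                 for t in terms]
    frequency = [category_counts[t] / category_total for t in terms]
    scores = {}
    for term, p, f in zip(terms, normal_cdf_scale(precision), normal_cdf_scale(frequency)):
        scores[term] = 2 * p * f / (p + f) if p + f > 0 else 0.0
    return scores


# Toy term counts (made up for illustration only).
democrat_counts = {"jobs": 40, "freedom": 5, "the": 200, "obama": 30}
republican_counts = {"jobs": 10, "freedom": 30, "the": 200, "obama": 2}

scores = scaled_f_scores(democrat_counts, republican_counts)
for term in sorted(scores, key=scores.get, reverse=True):
    print(term, round(scores[term], 3))
```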

Visualizing Term Associations

Below is an example of using Scattertext to visualize terms used in the 2012 American political conventions. The 2000 most party-associated unigrams are displayed as points in the scatter plot. Their x- and y-coordinates are the dense ranks of their usage by Republican and Democratic speakers, respectively.

 

import spacy
import pandas as pd
import scattertext as st
from pprint import pprint

# Get Dataframe using SampleCorpora from scattertext
convention_df = st.SampleCorpora.ConventionData2012.get_data()
convention_df

To create visualizations using scattertext, we will first need to create a scattertext corpus of the dataset.

We will set category_col to 'party' because we are comparing Democratic and Republican speakers, and the text column, which contains the convention speeches, will be used as text_col.

# Create nlp object
nlp = spacy.load("en_core_web_sm")

# Convert to scattertext corpus
corpus = st.CorpusFromPandas(convention_df,
                             category_col='party',
                             text_col='text',
                             nlp=nlp).build()

# Create html visualization
html = st.produce_scattertext_explorer(corpus,
                                       category='democrat',
                                       category_name='Democratic',
                                       not_category_name='Republican',
                                       width_in_pixels=1000,
                                       metadata=convention_df['speaker'])


# Open the html for interactive visualization
# Please run the below code in your interactive window (like Python shell) if you do not see the HTML opening up
open("Convention-Visualization.html", 'wb').write(html.encode('utf-8'))
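Writing the file does not by itself display it. The sketch below, which uses only the standard library, saves an HTML string and can optionally open it in the default browser. The helper name save_html is hypothetical, and the placeholder string stands in for the html value returned by scattertext.

```python
import tempfile
import webbrowser
from pathlib import Path


def save_html(html, filename="Convention-Visualization.html", open_in_browser=False):
    """Write an HTML string to disk as UTF-8 and return the absolute path."""
    output_path = Path(filename).resolve()
    output_path.write_text(html, encoding="utf-8")
    if open_in_browser:
        # Ask the operating system's default browser to open the saved file.
        webbrowser.open(output_path.as_uri())
    return output_path


# Demonstration with a placeholder page written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = save_html("<html><body>placeholder</body></html>",
                           Path(tmp_dir) / "demo.html")
    saved_text = saved_path.read_text(encoding="utf-8")

print(saved_path.name)
```

On your own machine, calling save_html(html, open_in_browser=True) after creating the visualization would write Convention-Visualization.html to the current working directory and try to open it.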

Finding Phrase Associations

scattertext can also be used to visualize the category association of a variety of different phrase types. The word "phrase" denotes any single or multi-word collocation.

To use this feature, install pytextrank on your local machine before continuing.

# Please run on your local machine
# [Mac Terminal]
pip3 install pytextrank

# [Jupyter Notebook]
!{sys.executable} -m pip install pytextrank

# [Conda install]
conda install -c conda-forge pytextrank

Next, we need to build a corpus as normal. Note that adding PyTextRank to the spaCy pipeline is not needed, as it will be run separately by the PyTextRankPhrases object. We will reduce the number of phrases displayed in the chart to 2000 using the AssociationCompactor. The phrases generated will be treated like non-textual features since their document scores will not correspond to word counts.

import pytextrank, spacy
import scattertext as st
import numpy as np
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textrank", last=True)

convention_df = st.SampleCorpora.ConventionData2012.get_data().assign(
    parse=lambda df: df.text.apply(nlp),
    party=lambda df: df.party.apply({'democrat': 'Democratic', 'republican': 'Republican'}.get)
)

corpus = st.CorpusFromParsedDocuments(
    convention_df,
    category_col='party',
    parsed_col='parse',
    feats_from_spacy_doc=st.PyTextRankPhrases()).build()

Note that the terms present in the corpus are named entities, and, as opposed to frequency counts, their scores are the eigencentrality scores assigned to them by the TextRank algorithm. Running corpus.get_metadata_freq_df('') will return, for each category, the sums of terms' TextRank scores. The dense ranks of these scores will be used to construct the scatter plot.

term_category_scores = corpus.get_metadata_freq_df('')

Visualizing with Phrasemachine to Find Phrases

Phrasemachine from AbeHandler (Handler et al. 2016) uses regular expressions over sequences of part-of-speech tags to identify noun phrases. This has the advantage over using spaCy's NP-chunking in that it tends to isolate meaningful, large noun phrases which are free of appositives.

As opposed to PyTextRank, we'll just use counts of these phrases, treating them like any other term.

To use phrasemachine to find phrases, install the empath package on your local machine by executing the following code:

# Please run on your local machine
# [Mac Terminal]
pip3 install empath

# [Jupyter Notebook]
!{sys.executable} -m pip install empath

# [Conda install]
# Run inside the active conda environment
python -m pip install empath

import spacy
from scattertext import SampleCorpora, PhraseMachinePhrases, dense_rank, RankDifference, AssociationCompactor, produce_scattertext_explorer
from scattertext.CorpusFromPandas import CorpusFromPandas

# Convert to scattertext corpus
corpus = (CorpusFromPandas(SampleCorpora.ConventionData2012.get_data(),
    category_col='party',
    text_col='text',
    feats_from_spacy_doc=PhraseMachinePhrases(),
    nlp=spacy.load("en_core_web_sm")).build().compact(AssociationCompactor(4000)))

# Create html visualization
html = produce_scattertext_explorer(corpus,
    category='democrat',
    category_name='Democratic',
    not_category_name='Republican',
    minimum_term_frequency=0,
    pmi_threshold_coefficient=0,
    transform=dense_rank,
    metadata=corpus.get_df()['speaker'],
    term_scorer=RankDifference(),
    width_in_pixels=1000)

# Open the html for interactive visualization
# Please run the below code in your interactive window (like Python shell) if you do not see the HTML opening up
open("Convention-Visualization.html", 'wb').write(html.encode('utf-8'))

Ordering Terms by Corpus Characteristicness

Often the terms of most interest are ones that are characteristic of the corpus as a whole. These are terms that occur frequently in all sets of documents being studied but are relatively infrequent compared to general term frequencies.

We can produce a plot with a characteristic score on the x-axis and class-association scores on the y-axis using the function produce_characteristic_explorer.

Corpus characteristicness is the difference in dense term ranks between the words in all of the documents in the study and a general English-language frequency list. See this Talk on Term-Class Association Scores for a more thorough explanation.
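The rank-difference idea can be illustrated with a few lines of standard-library Python. This is a simplified sketch: the function names and toy counts are hypothetical, ranks are divided by the highest rank so the two lists are comparable, and scattertext's own scoring may differ in its details.

```python
def dense_rank(values):
    """Rank values from 1 upward; ties share a rank and no ranks are skipped."""
    rank_of = {value: rank for rank, value in enumerate(sorted(set(values)), start=1)}
    return [rank_of[value] for value in values]


def characteristic_scores(corpus_counts, background_counts):
    """Scaled dense rank in the corpus minus scaled dense rank in the background."""
    terms = list(corpus_counts)
    corpus_ranks = dense_rank([corpus_counts[t] for t in terms])
    background_ranks = dense_rank([background_counts.get(t, 0) for t in terms])
    max_corpus, max_background = max(corpus_ranks), max(background_ranks)
    return {
        term: c / max_corpus - b / max_background
        for term, c, b in zip(terms, corpus_ranks, background_ranks)
    }


# Toy counts (made up): term frequency in the study corpus and in general English.
corpus_counts = {"the": 500, "obama": 120, "romney": 110, "cat": 1}
background_counts = {"the": 1_000_000, "obama": 50, "romney": 40, "cat": 9000}

scores = characteristic_scores(corpus_counts, background_counts)
for term in sorted(scores, key=scores.get, reverse=True):
    print(term, scores[term])
```

In this toy example, "obama" and "romney" rank much higher in the corpus than in the background list, so they come out as characteristic, while "the" is common everywhere and "cat" is rarer in the corpus than in general English.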

import scattertext as st

# Convert to scattertext corpus
corpus = (st.CorpusFromPandas(st.SampleCorpora.ConventionData2012.get_data(),
    category_col='party',
    text_col='text',
    nlp=st.whitespace_nlp_with_sentences)
    .build()
    .get_unigram_corpus()
    .compact(st.ClassPercentageCompactor(term_count=2,
    term_ranker=st.OncePerDocFrequencyRanker)))

# Create html visualization
html = st.produce_characteristic_explorer(
    corpus,
    category='democrat',
    category_name='Democratic',
    not_category_name='Republican',
    metadata=corpus.get_df()['speaker']
)

# Open the html for interactive visualization
# Please run the below code in your interactive window (like Python shell) if you do not see the HTML opening up
open("Convention-Visualization.html", 'wb').write(html.encode('utf-8'))

Using Cohen's d or Hedge's r to Visualize Effect Size

Cohen's d is a popular metric used to measure effect size. The definitions of Cohen's d and Hedge's r from (Shinichi and Cuthill 2017) are implemented in Scattertext.

convention_df = st.SampleCorpora.ConventionData2012.get_data()
corpus = (st.CorpusFromPandas(convention_df,
    category_col='party',
    text_col='text',
    nlp=st.whitespace_nlp_with_sentences).build().get_unigram_corpus())

We can create a term scorer object to examine the effect sizes and other metrics.

term_scorer = st.CohensD(corpus).set_categories('democrat', ['republican'])
term_scorer.get_score_df().sort_values(by='cohens_d', ascending=False).head()

Our calculation of Cohen's d is not directly based on term counts. Rather, we divide each document's term counts by the total number of terms in the document before calculating the statistics. m1 and m2 are, respectively, the mean proportions of words in speeches made by Democrats and Republicans that were the term in question. The effect size (cohens_d) is the difference between these means divided by the pooled standard deviation. cohens_d_se is the standard error of the statistic, while cohens_d_z and cohens_d_p are the Z-scores and p-values indicating the statistical significance of the effect. Corresponding columns are present for Hedge's r, an unbiased version of Cohen's d.

html = st.produce_frequency_explorer(
    corpus,
    category='democrat',
    category_name='Democratic',
    not_category_name='Republican',
    term_scorer=st.CohensD(corpus),
    metadata=convention_df['speaker'],
    grey_threshold=0)


# Open the html for interactive visualization
# Please run the below code in your interactive window (like Python shell) if you do not see the HTML opening up
open("Convention-Visualization.html", 'wb').write(html.encode('utf-8'))
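The effect-size calculation described in this section can be sketched for a single term with the standard library alone. The function names and toy numbers are hypothetical, and the small-sample correction shown is a commonly used approximation; scattertext's own implementation may differ in its details.

```python
from math import sqrt
from statistics import mean, variance


def term_proportions(term_counts, doc_lengths):
    """Divide each document's count of the term by that document's total term count."""
    return [count / length for count, length in zip(term_counts, doc_lengths)]


def cohens_d(group1, group2):
    """Difference between group means divided by the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    pooled_variance = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    return (mean(group1) - mean(group2)) / sqrt(pooled_variance)


def small_sample_correction(d, n1, n2):
    """Shrink d slightly to reduce its upward bias in small samples."""
    return d * (1 - 3 / (4 * (n1 + n2 - 2) - 1))


# Toy data (made up): occurrences of one term in three speeches per party,
# with every speech 100 terms long.
democrat = term_proportions([10, 20, 30], [100, 100, 100])
republican = term_proportions([0, 10, 20], [100, 100, 100])

d = cohens_d(democrat, republican)
corrected = small_sample_correction(d, len(democrat), len(republican))
print(round(d, 3), round(corrected, 3))
```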

Developing and Using Bespoke Word Representations

Scattertext can interface with Gensim Word2Vec models. For example, here's a snippet from demo_gensim_similarity.py which illustrates how to train and use a word2vec model on a corpus. Note the similarities produced reflect quirks of the corpus, e.g., "8" tends to refer to the 8% unemployment rate at the time of the convention.

import spacy
from gensim.models import word2vec
from scattertext import SampleCorpora, word_similarity_explorer_gensim, Word2VecFromParsedCorpus
from scattertext.CorpusFromParsedDocuments import CorpusFromParsedDocuments

# Create nlp object
nlp = spacy.load("en_core_web_sm")

# Get Dataset
convention_df = SampleCorpora.ConventionData2012.get_data()
convention_df['parsed'] = convention_df.text.apply(nlp)

# Convert to scattertext corpus
corpus = CorpusFromParsedDocuments(convention_df, category_col='party', parsed_col='parsed').build()

# Train and use a word2vec model on a corpus
# (gensim 4 and later name these parameters vector_size and epochs; older releases used size and iter)
model = word2vec.Word2Vec(vector_size=300,
    alpha=0.025,
    window=5,
    min_count=5,
    max_vocab_size=None,
    sample=0,
    seed=1,
    workers=1,
    min_alpha=0.0001,
    sg=1,
    hs=1,
    negative=0,
    cbow_mean=0,
    epochs=1,
    null_word=0,
    trim_rule=None,
    sorted_vocab=1)

# Create html visualization
html = word_similarity_explorer_gensim(corpus,
    category='democrat',
    category_name='Democratic',
    not_category_name='Republican',
    target_term='jobs',
    minimum_term_frequency=5,
    pmi_threshold_coefficient=4,
    width_in_pixels=1000,
    metadata=convention_df['speaker'],
    word2vec=Word2VecFromParsedCorpus(corpus, model).train(),
    max_p_val=0.05,
    save_svg_button=True)


# Open the html for interactive visualization
# Please run the below code in your interactive window (like Python shell) if you do not see the HTML opening up
open("Convention-Visualization.html", 'wb').write(html.encode('utf-8'))

Visualizing Any Kind of Term Score

We can use Scattertext to visualize alternative types of word scores and ensure that 0 scores are greyed out. Use the sparse_explorer function to accomplish this, and see its source code for more details.

import spacy
from scattertext import SampleCorpora, sparse_explorer
from scattertext.CorpusFromParsedDocuments import CorpusFromParsedDocuments
from sklearn.linear_model import Lasso

# Create nlp object
nlp = spacy.load("en_core_web_sm")

convention_df = SampleCorpora.ConventionData2012.get_data()
convention_df['parsed'] = convention_df.text.apply(nlp)
corpus = CorpusFromParsedDocuments(convention_df, category_col='party', parsed_col='parsed').build()

# Create html visualization
html = sparse_explorer(corpus,
    category='democrat',
    category_name='Democratic',
    not_category_name='Republican',
    scores = corpus.get_regression_coefs('democrat', Lasso(max_iter=10000)),
    minimum_term_frequency=5,
    pmi_threshold_coefficient=4,
    width_in_pixels=1000,
    metadata=convention_df['speaker'])

# Open the html for interactive visualization
# Please run the below code in your interactive window (like Python shell) if you do not see the HTML opening up
open("Convention-Visualization.html", 'wb').write(html.encode('utf-8'))

