
Text Analysis

A guide to text mining tools and methods

About TextBlob

TextBlob is a free, open-source Python library for processing textual data. It is a powerful package that simplifies common natural language processing tasks and derives in-depth information from text. Like spaCy, its features give insight into a text's grammatical structure.

TextBlob is particularly useful for the following tasks:

  • Noun phrase extraction: TextBlob explores a text's grammatical structure through linguistic annotations and extracts noun phrases
  • Part-of-speech tagging: TextBlob identifies the grammatical role each word plays in a sentence
  • Sentiment analysis: TextBlob determines whether the input text has a positive, negative, or neutral tone
  • Tokenization: TextBlob breaks the input text into linguistically meaningful units (tokens) for further analysis
  • Word and phrase frequencies: TextBlob gives insight into word patterns in a text
  • Lemmatization: TextBlob reduces inflected forms of a word to its root form, called a lemma
  • Spelling correction: TextBlob can correct misspellings in an input text document

The TextBlob Google Colab Notebook version of the tutorial is also available if you would like to follow along. 

Installation

TextBlob and the necessary NLTK corpora can be installed from the Python Package Index (PyPI) using pip.

Use the following commands to install TextBlob on your machine:

# [Mac Terminal]
pip3 install -U textblob
python3 -m textblob.download_corpora

# [Jupyter Notebook]
import sys
!{sys.executable} -m pip install textblob
!{sys.executable} -m textblob.download_corpora

# [Conda install]
conda install -c conda-forge textblob
python -m textblob.download_corpora

If you only intend to use TextBlob’s default models (no model overrides), you can pass the lite argument to download only those corpora needed for basic functionality.

# [Mac Terminal]
python -m textblob.download_corpora lite

# [Jupyter Notebook]
import sys
!{sys.executable} -m textblob.download_corpora lite

# [Conda install]
python -m textblob.download_corpora lite

Read Strings

For a given input string, you can use TextBlob to create a processed object for accessing linguistic annotations:

from textblob import TextBlob

text = 'For a given input string, you can use TextBlob to create a processed object for accessing linguistic annotations.'
text_doc = TextBlob(text)

The input text string is then converted to an object that TextBlob can understand. This method can be used to convert any text into a processed object for future analysis.

Read Text File

You can also convert a .txt file into a processed object. Notice that the .txt file needs to be in the current working directory, or you will have to specify its full path. A quick reminder that you can get the current working directory with os.getcwd() and change it with os.chdir() after importing os.

import os
from textblob import TextBlob
from google.colab import drive    # Only needed when running in Google Colab

drive.mount('/content/drive')    # On your own machine, change the working directory with os.chdir('path/to/directory') instead

file = 'text.txt'
file_text = open(file).read()
file_doc = TextBlob(file_text)

You may assume that variable names ending with the suffix _doc are processed TextBlob objects.

Noun Phrase Extraction

Noun phrase extraction pulls the noun phrases out of a given text. You can use the extracted phrases to perform additional analysis.

In TextBlob, noun phrases are accessed through the noun_phrases property.

from textblob import TextBlob

# Create a TextBlob object
output = TextBlob("Apple's name was inspired by Steve Jobs' visit. His visit was to an apple farm while on a fruitarian diet.")

# Extract noun phrases from object
output.noun_phrases

Tokenization

Tokenization refers to the process of segmenting input text into words, punctuation marks, and so on. It identifies the basic units of your text, called tokens. You can use the following code for this purpose:

from textblob import TextBlob

# Create a TextBlob object
output = TextBlob("Apple's name was inspired by Steve Jobs' visit. His visit was to an apple farm while on a fruitarian diet.")

# Extract words from object
output.words

TextBlob can also extract sentences in the input text document:

# Extract sentences from object

output.sentences

Sentence objects have the same properties and methods as TextBlob objects. For example, you can perform simple sentiment analysis on each sentence:

# Perform a simple sentiment analysis

for sentence in output.sentences:
    print(sentence.sentiment)
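To make the idea of tokenization concrete, here is a rough standard-library approximation of word and sentence segmentation. This is not TextBlob's actual tokenizer (TextBlob delegates to NLTK's trained tokenizers); the patterns below are simplified assumptions for illustration only.

```python
import re

text = ("Apple's name was inspired by Steve Jobs' visit. "
        "His visit was to an apple farm while on a fruitarian diet.")

# Word-like tokens: runs of letters/apostrophes, or single punctuation marks
tokens = re.findall(r"[A-Za-z']+|[.,!?;]", text)
print(tokens[:5])    # ["Apple's", 'name', 'was', 'inspired', 'by']

# Sentences: split on terminal punctuation followed by whitespace
sentences = re.split(r"(?<=[.!?])\s+", text)
print(len(sentences))    # 2
```

Real tokenizers handle many cases this sketch misses (abbreviations like "Dr.", hyphenation, numbers), which is why TextBlob relies on NLTK's models rather than regular expressions.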

Sentiment Analysis

TextBlob's sentiment analysis feature allows us to determine whether the input text has a positive, negative, or neutral tone.

The sentiment property in TextBlob returns a named tuple of the form Sentiment(polarity, subjectivity). The polarity score is a float within the range [-1.0, 1.0], where -1.0 is most negative and 1.0 is most positive. The subjectivity is a float within the range [0.0, 1.0], where 0.0 is very objective and 1.0 is very subjective.

from textblob import TextBlob

# Create a TextBlob object
output = TextBlob("Apple's name was inspired by Steve Jobs' visit. His visit was to an apple farm while on a fruitarian diet.")

# Perform sentiment analysis
output.sentiment
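Because the result is a named tuple, you can read its fields by name or by position. The following stand-alone sketch reproduces the shape of the returned value with made-up scores (they are not computed by TextBlob; they are placeholders for illustration):

```python
from collections import namedtuple

# The shape of the value returned by TextBlob's .sentiment property
# (the 0.1 and 0.45 scores below are illustrative, not real output)
Sentiment = namedtuple('Sentiment', ['polarity', 'subjectivity'])
result = Sentiment(polarity=0.1, subjectivity=0.45)

print(result.polarity)       # access by field name
print(result[1])             # or by position, like a plain tuple
assert -1.0 <= result.polarity <= 1.0
assert 0.0 <= result.subjectivity <= 1.0
```

With a real TextBlob object, the same field access works: output.sentiment.polarity and output.sentiment.subjectivity.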

 

Word Frequency

Word frequency is an analysis that gives you insights into word patterns, such as common words or unique words in the text:

from textblob import TextBlob

# Create a TextBlob object
output = TextBlob("Apple's name was inspired by Steve Jobs' visit. His visit was to an apple farm while on a fruitarian diet.")

# Word count for a specific string in the TextBlob object
output.word_counts['his']

You can specify whether or not the count should be case-sensitive (the default is False):

# Word count for a specific string in the TextBlob object, case sensitive
output.words.count('his', case_sensitive=True)
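The underlying idea of a word-frequency table can also be sketched with nothing but the standard library. This is a simplified approximation (whitespace splitting plus punctuation stripping), not TextBlob's tokenization:

```python
import string
from collections import Counter

text = ("Apple's name was inspired by Steve Jobs' visit. "
        "His visit was to an apple farm while on a fruitarian diet.")

# Lowercase, strip surrounding punctuation from each token, then count
words = [w.strip(string.punctuation) for w in text.lower().split()]
counts = Counter(words)

print(counts['visit'])        # 2 -- "visit." and "visit" both normalize to 'visit'
print(counts.most_common(2))  # the two most frequent tokens
```

Counter.most_common() is a convenient way to surface the dominant vocabulary of a text once the tokens are normalized.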

Part-Of-Speech (POS) Tagging

Part-of-speech (POS) tagging identifies the grammatical role each word plays in a sentence. In other words, it determines which category each word belongs to (noun, pronoun, adjective, verb, adverb, preposition, conjunction, or interjection). POS tags are useful when you want to assign a syntactic category to each word of the text for further analysis. Part-of-speech tags are accessed through the tags property in TextBlob.

from textblob import TextBlob

# Create a TextBlob object
output = TextBlob("Apple's name was inspired by Steve Jobs' visit to an apple farm while on a fruitarian diet.")

# Print out POS tagging
output.tags

Word Inflection

Once we break a TextBlob object into words with TextBlob.words, each item becomes a Word object, and we can apply useful methods to those objects, such as word inflection, that is, changing the form of a word.

from textblob import Word
from textblob import TextBlob

# Create a TextBlob object
output = TextBlob("Apple's name was inspired by Steve Jobs' visit to an apple farm while on a fruitarian diet.")

# Break the TextBlob object into individual words
print(output.words)

# Convert the word at index 10 to its plural form
output.words[10].pluralize()

 

In fact, TextBlob objects behave like Python strings, so you can use Python’s slicing syntax:

output[0:5]

You can use common string methods to change text to upper or lower cases:

output.upper()
output.lower()

You can make comparisons between TextBlobs and strings.

output[0:5] > output[10:12]

N-grams

The TextBlob.ngrams() method returns a list of sequences of n successive words. These sequences can later be used to predict the most probable word that might follow them, which is particularly useful in speech recognition, machine translation, and predictive text input.

from textblob import TextBlob

# Create a TextBlob object
output = TextBlob("Apple's name was inspired by Steve Jobs' visit. His visit was to an apple farm while on a fruitarian diet.")

# Returns a list of tuples of 3 successive words from object
output.ngrams(n=3)
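The sliding-window idea behind ngrams() can be sketched in a few lines of plain Python. This is an illustrative re-implementation over a whitespace-split token list, not TextBlob's internal code:

```python
def ngrams(tokens, n):
    """Return every window of n successive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "His visit was to an apple farm".split()
print(ngrams(tokens, 3))
# [('His', 'visit', 'was'), ('visit', 'was', 'to'), ('was', 'to', 'an'),
#  ('to', 'an', 'apple'), ('an', 'apple', 'farm')]
```

Note that a text of k tokens yields k - n + 1 n-grams, which is why the 7-token sentence above produces 5 trigrams.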

Spelling Correction

TextBlob's correct() method performs spelling correction on the whole text. Notice that we intentionally misspelled "Apple's" and "name" in the demonstration.

from textblob import TextBlob

# Create a TextBlob object
output = TextBlob("Appple's namee was inspired by Steve Jobs' visit. His visit was to an apple farm while on a fruitarian diet.")

# Correct spelling
output.correct()

The Word.spellcheck() method returns a list of (word, confidence) tuples with spelling suggestions. Notice that we intentionally misspelled "misspellled"; a confidence score of 1.0 means TextBlob is certain of the suggested correction.

from textblob import Word

# Create word object
word = Word('misspellled')

# Perform spellcheck
word.spellcheck()

Summary of the Text

A simple summarization strategy, suggested by Analytics Vidhya, is to extract a list of nouns from the text to give the reader a general idea of what the text is about.

from textblob import TextBlob, Word
import random    # random is a built-in Python module

# Create a TextBlob object
output = TextBlob("An n-gram is a collection of n successive items in a text document that may include words, numbers, symbols, and punctuation. N-gram models are useful in many text analytics applications where sequences of words are relevant, such as in sentiment analysis, text classification, and text generation.")

# Create a list of extracted nouns
nouns = list()
for word, tag in output.tags:
    if tag == 'NN':    # 'NN' means the word is classified as a singular noun by TextBlob
        nouns.append(word)

# Randomly sample 5 nouns from the list to give a general idea of the text
for item in random.sample(nouns, 5):
    word = Word(item)
    print(word.pluralize())

Language Detection and Translation

Once we convert an input text to a TextBlob object, we can apply the detect_language() method to detect which language it is written in. This feature can be particularly helpful after you have extracted text from images, as we introduced in the importing files tutorial. Note that detect_language() and translate() depend on the Google Translate API and have been deprecated and removed in recent versions of TextBlob; if they are unavailable in your installation, use a dedicated library such as googletrans instead.

Once we know the language the text is written in and the language we would like to translate it to, we can use the following code:

from textblob import TextBlob

# Create a TextBlob object
output = TextBlob("An n-gram is a collection of n successive items in a text document that may include words, numbers, symbols, and punctuation. N-gram models are useful in many text analytics applications where sequences of words are relevant, such as in sentiment analysis, text classification, and text generation.")

# Detect language of the input text
output.detect_language()

# Translate to Esperanto ('eo')
output.translate(from_lang='en', to='eo')

You can also translate text into 100+ languages! The following snippet lists all supported language codes:

import sys
!{sys.executable} -m pip install googletrans
from googletrans import LANGUAGES
Language_codes = dict(map(reversed, LANGUAGES.items()))

Language_codes
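The dict(map(reversed, ...)) idiom above flips a dictionary's keys and values, turning the code-to-name mapping into a name-to-code lookup. Here is the same idiom on a small hypothetical sample, so it runs without googletrans installed:

```python
# A tiny sample in the same shape as googletrans' LANGUAGES mapping
sample = {'en': 'english', 'eo': 'esperanto', 'ar': 'arabic'}

# reversed() flips each (code, name) pair, and dict() rebuilds the mapping
flipped = dict(map(reversed, sample.items()))
print(flipped['english'])    # 'en'
print(flipped['arabic'])     # 'ar'
```

This works because dict() accepts any iterable of two-element pairs, and reversed() applied to a (key, value) tuple yields (value, key).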
