
Text Analysis

A guide to text mining tools and methods

About NLTK

NLTK is a free, open-source library for advanced Natural Language Processing (NLP) in Python. It can help you preprocess textual data and extract meaningful information from it. Because of its powerful features, NLTK has been called “a wonderful tool for teaching, and working in, computational linguistics using Python” and “an amazing library to play with natural language.”

The NLTK Google Colab Notebook version of the tutorial is also available if you would like to follow along. 

Input data is often unstructured, so you will usually need to preprocess it before doing any analysis. In this tutorial, we will first go over basic text preprocessing techniques that NLTK supports, and then introduce some of NLTK's more advanced features:

  • Sentence detection and tokenization: NLTK can break the input text into linguistically meaningful or basic units for further analysis.
  • Stop word removal: NLTK can remove the most common words in English so that they do not distort tasks such as word frequency analysis.
  • Part-of-speech tagging: NLTK analyzes the grammatical role each word plays in a sentence.
  • Word frequencies: NLTK can give insights into word patterns in the text.
  • Lemmatization: NLTK can reduce the inflected forms of a word to its dictionary form, called a lemma.
  • Chunking and chinking: NLTK allows you to identify or exclude phrases matching specific patterns in a textual input.

Installation

NLTK requires Python version 3.7, 3.8, 3.9, or 3.10.

NLTK can be installed using the Python Package Index and setuptools. Use the following command to install NLTK with pip on your machine:

# [Mac Terminal]
pip3 install nltk

# [Jupyter Notebook]
import sys
!{sys.executable} -m pip install nltk

# [Conda install]
conda install -c anaconda nltk

(Optional) NumPy, Matplotlib, and svgling are also needed to create visualizations for named entity recognition. You can install these three packages by executing the following commands on your local machine:

# [Mac Terminal]
pip3 install numpy
pip3 install matplotlib
pip3 install svgling


# [Jupyter Notebook]
import sys
!{sys.executable} -m pip install numpy matplotlib
!{sys.executable} -m pip install svgling


# [Conda install]
conda install -c anaconda numpy
conda install -c anaconda matplotlib
conda install -c anaconda svgling
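
To verify that the installation succeeded, you can import the package and print its version:

# [Python]
import nltk
print(nltk.__version__)    # Prints the installed NLTK version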

Read Strings

If you would like to analyze a given text, you can simply store it as a string object. Unlike spaCy, you do not need to create a processed object to access linguistic annotations in NLTK.

text = ('For a given input string, you can use spaCy to create a processed object for accessing linguistic annotations.')

The input text string is something that NLTK functions can understand. You are now ready to use this string for future analysis.

Read Text File

You can also access information stored in a .txt file. Notice that the .txt file needs to be in the current working directory, or you will have to specify its full path. A quick reminder that you can get the current working directory with os.getcwd() and change it with os.chdir() after importing os. Notice that you do not have to convert the file into a processed object for future analysis.

import os
from google.colab import drive

drive.mount('/content/drive')    # Colab only; on your own machine, change the working
                                 # directory as needed with os.chdir('Path to directory')
file = 'text.txt'
with open(file) as f:            # A context manager closes the file automatically
    file_text = f.read()
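
If you are working on your own machine rather than in Colab, you can skip the Drive mount entirely. A minimal sketch (the directory path below is a placeholder; substitute your own):

import os

os.chdir('/path/to/your/directory')    # Placeholder path; point this at your own folder
with open('text.txt', encoding='utf-8') as f:
    file_text = f.read()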

Sentence Detection (Sentence Boundary Detection)

Sentence boundary detection locates the start and end of each sentence in a given text, dividing the text into linguistically meaningful units for tasks such as part-of-speech tagging and entity extraction. In NLTK, you can use sent_tokenize() to split the input text into sentences.

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download('punkt')    # Use nltk downloader to download resource "punkt"
output = ("Apple's name was inspired by Steve Jobs' visit. His visit was to an apple farm while on a fruitarian diet.")
# Create a string object. By default, the function breaks sentences by periods.
# Customize text or read files as needed

# Tokenize output (sentence-level)
sentences = sent_tokenize(output)

# Print the output with one sentence per line
for sentence in sentences:
    print(sentence)

Tokenization

Tokenization refers to the process of segmenting input text into words, punctuation marks, and other basic units, called tokens. You can use the following code to achieve this:

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('punkt')    # Use nltk downloader to download resource "punkt"

output = ("Apple's name was inspired by Steve Jobs' visit. His visit was to an apple farm while on a fruitarian diet.")
# Create a string object. By default, the function breaks sentences by periods
# Customize text or read files as needed

# Tokenize output (word-level)
words = word_tokenize(output)

# Print the output with one token per line
for word in words:
    print(word)
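
Note that word_tokenize() follows the Penn Treebank conventions: punctuation marks become their own tokens and contractions are split. For example:

from nltk.tokenize import word_tokenize

# Punctuation and contractions become separate tokens
print(word_tokenize("Don't hesitate to ask questions!"))
# ['Do', "n't", 'hesitate', 'to', 'ask', 'questions', '!']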

 

Stop Words Removal

Stop words are the most common words in a language, such as "the", "who", "too", and "is". We usually remove stop words because they are not significant in many text mining tasks, such as word frequency analysis. You can identify and remove stop words using NLTK's list of stop words after tokenizing the text.

nltk.download("stopwords")
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('punkt')    # Use nltk downloader to download resource "punkt"

output = ("Apple's name was inspired by Steve Jobs' visit. His visit was to an apple farm while on a fruitarian diet.")
# Create a string object. By default, the function breaks sentences by periods.
# Customize text or read files as needed

# Get a list of stop words in English
stop_words = set(stopwords.words("english"))

# Print non-stop words
words = word_tokenize(output)
for token in words:
    if token not in stop_words:
        print(token)
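
One caveat: the entries in NLTK's stop word list are all lowercase, so capitalized tokens such as "His" in the sample above will not match "his". A minimal adjustment is to compare case-insensitively:

# Lowercase each token before checking it against the stop word list
for token in words:
    if token.lower() not in stop_words:
        print(token)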

Stemming

Stemming refers to a text processing task that reduces words to their root. For example, the words "adventure", "adventurer", and "adventurous" share the root "adventur". Stemming allows us to reduce the complexity of the textual data so that we do not have to worry about the details of how each word is used.

import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
nltk.download('punkt')    # Use nltk downloader to download resource "punkt"

output = ("Please share with us the adventurous adventures of adventurer Tom")
# Create a string object. By default, the function breaks sentences by periods. 
# Customize text or read files as needed

# Create a stemmer object using PorterStemmer()
stemmer = PorterStemmer()

# Tokenize the text
words = word_tokenize(output)

# Print stemmed words
for word in words:
    print(stemmer.stem(word))

Take a look at this example as introduced by Real Python:

import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
nltk.download('punkt')    # Use nltk downloader to download resource "punkt"

output = ("The crew of the USS Discovery discovered many discoveries. Discovering is what explorers do")
# Create a string object. By default, the function breaks sentences by periods. 
# Customize text or read files as needed

# Create a stemmer object using PorterStemmer()
stemmer = PorterStemmer()

# Tokenize the text
words = word_tokenize(output)

# Print stemmed words
for word in words:
    print(stemmer.stem(word))

Note that 'discovery' was stemmed to 'discoveri' while 'discovering' was stemmed to 'discov'. Why does this happen? Stemming has two major flaws: understemming and overstemming.

  • Understemming: understemming happens when two related words that should be reduced to the same root are not stemmed to the same root.

  • Overstemming: overstemming happens when two unrelated words are reduced to the same stem when they should not be.

We therefore need to be careful when analyzing stemmed words.
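
A quick way to see both flaws is with the classic "univers-" and "alumn-" examples. A short sketch with PorterStemmer:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Overstemming: three distinct words collapse into one stem
print([stemmer.stem(w) for w in ["universal", "university", "universe"]])
# ['univers', 'univers', 'univers']

# Understemming: two related words receive different stems
print([stemmer.stem(w) for w in ["alumnus", "alumni"]])
# ['alumnu', 'alumni']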

Word Frequency

Word frequency is an analysis that gives you insights into word patterns, such as common words or unique words in the text. With NLTK's frequency distribution feature, you can check which words show up most frequently in your text.

import nltk
from nltk import FreqDist
from nltk.tokenize import word_tokenize
nltk.download('punkt')    # Use nltk downloader to download resource "punkt"

output = ("The crew of the USS Discovery discovered many discoveries. Discovering is what explorers do")
# Create a string object. By default, the function breaks sentences by periods. 
# Customize text or read files as needed

# Tokenize the text
words = word_tokenize(output)

# Find the frequency distribution in the given text
frequency_distribution = FreqDist(words)
print(frequency_distribution)

# Print the most common 10 words
print(frequency_distribution.most_common(10))

# Visualize word frequencies
frequency_distribution.plot(10, cumulative=True)
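
Function words such as "the" tend to dominate raw counts. If you want the distribution to reflect content words instead, one option is to drop stop words and punctuation before counting. A minimal sketch, combining FreqDist with the stop word list from earlier:

import nltk
from nltk import FreqDist
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
nltk.download('punkt')

output = "The crew of the USS Discovery discovered many discoveries. Discovering is what explorers do"
stop_words = set(stopwords.words("english"))

# Keep only alphabetic tokens that are not stop words before counting
content_words = [w for w in word_tokenize(output)
                 if w.isalpha() and w.lower() not in stop_words]
print(FreqDist(content_words).most_common(5))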

Part-Of-Speech (POS) Tagging

Part-of-speech (POS) tagging analyzes the grammatical role each word plays in a sentence. In other words, it determines the category to which each word belongs (noun, pronoun, adjective, verb, adverb, preposition, conjunction, or interjection). POS tags are useful when you want to assign a syntactic category to each word of the text for future analysis.

import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')                         # Use nltk downloader to download resource "punkt"
nltk.download('averaged_perceptron_tagger')    # Use nltk downloader to download resource "averaged_perceptron_tagger"

output = ("Please share with us the adventurous adventures of adventurer Tom")
# Create a string object. By default, the function breaks sentences by periods. 
# Customize text or read files as needed

# Tokenize the text
words = word_tokenize(output)

# Print word and tag pairs
print(nltk.pos_tag(words))

In this output, JJ is the tag for adjectives, NN for nouns, RB for adverbs, PRP for pronouns, and VB for verbs.
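
If you run into a tag you do not recognize, NLTK can print its definition (this assumes the "tagsets" resource has been downloaded):

import nltk
nltk.download('tagsets')       # Documentation for the standard tag sets

nltk.help.upenn_tagset('JJ')   # Prints the definition and examples for one tag
nltk.help.upenn_tagset()       # With no argument, prints the entire tag set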

Lemmatization

Lemmatization is the process of reducing the inflected forms of a word to a base form that belongs to the language. This base form, or root word, is called a lemma. For example, "visits", "visiting", and "visited" are all forms of the lemma "visit". Lemmatization also normalizes grammatical number (car vs. cars).

Lemmatization is an important step because it helps you reduce the inflected forms of a word so that they can be analyzed in the text more efficiently.

To perform lemmatization, use the NLTK function WordNetLemmatizer() on the tokenized object:

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
nltk.download('punkt')      # Use nltk downloader to download resource "punkt"
nltk.download('wordnet')    # Use nltk downloader to download resource "wordnet"
nltk.download('omw-1.4')    # Use nltk downloader to download resource "omw-1.4"

output = ("Apple's name was inspired by Steve Jobs' visits. His visits was to an apple farm while on a fruitarian diet.")
# Create a string object. By default, the function breaks sentences by periods. 
# Customize text or read files as needed

# Create lemmatizer
lemmatizer = WordNetLemmatizer()

# Tokenize the text
words = word_tokenize(output)

# Print lemmatized words
for word in words:
    print(lemmatizer.lemmatize(word))
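
By default, lemmatize() treats every word as a noun, which is why verb forms such as "inspired" come through unchanged in the output above. Passing the part of speech through the pos parameter gives better results:

# pos="v" tells the lemmatizer to treat the word as a verb
print(lemmatizer.lemmatize("visits"))             # visit (noun, the default)
print(lemmatizer.lemmatize("visiting"))           # visiting (unchanged as a noun)
print(lemmatizer.lemmatize("visiting", pos="v"))  # visit
print(lemmatizer.lemmatize("was", pos="v"))       # be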

 

Chunking

Unlike tokenization, which identifies every single word and sentence, chunking identifies phrases in a textual input: a word or group of words that works as a unit to perform a grammatical function.

The following are all examples of phrases:

"A diet"

"A fruitarian diet"

"A meaningful fruitarian diet"

import nltk
from nltk.tokenize import word_tokenize
from IPython.display import display
import svgling    # Enables SVG rendering of NLTK trees in notebooks
nltk.download('punkt')                         # Use nltk downloader to download resource "punkt"
nltk.download('averaged_perceptron_tagger')    # Use nltk downloader to download resource "averaged_perceptron_tagger"

output = ("Apple's name was inspired by Steve Jobs' visits. His visits was to an apple farm while on a fruitarian diet.")
# Create a string object. By default, the function breaks sentences by periods. 
# Customize text or read files as needed

# Tokenize the text
words = word_tokenize(output)

# POS tag the text
tag = nltk.pos_tag(words)

 

To perform chunking, you first have to define a chunk grammar that tells Python which grammatical pattern you would like to extract.

# Define grammar
grammar = "NP: {<DT>?<JJ>*<NN>}"

The rule we defined means that the pattern we are extracting is a noun phrase (NP) that:

  • starts with an optional (?) determiner (<DT>)
  • can have any number (*) of adjectives (<JJ>)
  • ends with a noun (<NN>)

Likewise, the rule "NP: {<DT>*<JJ>*<NN>}" describes a noun phrase that:

  • starts with any number of (*) determiners (<DT>)
  • can have any number (*) of adjectives (<JJ>)
  • ends with a noun (<NN>)


# Create chunk parser object
chunk_parser = nltk.RegexpParser(grammar)


# Create a tree diagram for the chunking
tree = chunk_parser.parse(tag)
display(tree)
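
Beyond drawing the tree, you can iterate over just the noun phrase chunks. A short sketch using the tree produced above:

# Print only the NP chunks found by the parser
for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
    print(" ".join(word for word, pos in subtree.leaves()))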

Chinking

Chinking is often used together with chunking. While chunking identifies a pattern to include, chinking excludes one.

import nltk
from nltk.tokenize import word_tokenize
from IPython.display import display
import svgling    # Enables SVG rendering of NLTK trees in notebooks
nltk.download('punkt')                         # Use nltk downloader to download resource "punkt"
nltk.download('averaged_perceptron_tagger')    # Use nltk downloader to download resource "averaged_perceptron_tagger"

output = ("Apple's name was inspired by Steve Jobs' visits. His visits was to an apple farm while on a fruitarian diet.")
# Create a string object. By default, the function breaks sentences by periods. 
# Customize text or read files as needed

# Tokenize the text
words = word_tokenize(output)

# POS tag the text
tag = nltk.pos_tag(words)

The only difference is how we define the grammar rules. We use curly braces facing outward to tell Python which patterns we want to exclude. In the following example, }<JJ>{ has curly braces facing outward, so we want to exclude adjectives (<JJ>). Note that a chink rule only removes material from chunks that already exist, so the grammar first chunks everything with {<.*>+} and then chinks the adjectives out.

# Define grammar: chunk everything, then chink (exclude) adjectives
grammar = """
Chunk: {<.*>+}
       }<JJ>{"""

# Create a chunk parser object
chunk_parser = nltk.RegexpParser(grammar)

# Create a tree diagram for the chunking
tree = chunk_parser.parse(tag)
display(tree)

Named Entity Recognition (NER)

A named entity is a real-world object referred to by a proper name, for example, a person, a film, a book title, or a song. NLTK can recognize these named entities in a document by asking a pretrained model for a prediction. Because a model's performance depends on the examples it was trained on, NER might not always work perfectly, and you might need to adjust it for your use case.

Named entity recognition can be accomplished by applying nltk.ne_chunk() to a POS-tagged text:

import nltk
from nltk.tokenize import word_tokenize
from IPython.display import display
import svgling    # Enables SVG rendering of NLTK trees in notebooks
nltk.download('punkt')                         # Use nltk downloader to download resource "punkt"
nltk.download('averaged_perceptron_tagger')    # Use nltk downloader to download resource "averaged_perceptron_tagger"
nltk.download('maxent_ne_chunker')             # Use nltk downloader to download resource "maxent_ne_chunker"
nltk.download('words')                         # Use nltk downloader to download resource "words"

output = ("Apple's name was inspired by Steve Jobs' visits. His visits was to an apple farm while on a fruitarian diet.")
# Create a string object. By default, the function breaks sentences by periods. 
# Customize text or read files as needed

# Tokenize the text
words = word_tokenize(output)

# POS tag the text
tag = nltk.pos_tag(words)

# Chunk named entities into a tree
tree = nltk.ne_chunk(tag)
display(tree)
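
Instead of drawing the tree, you can also pull the labeled entities out of it. A minimal sketch; PERSON, ORGANIZATION, and GPE (geo-political entity) are among the labels the default chunker assigns:

# Extract the labeled entities from the NER tree
for subtree in tree.subtrees():
    if subtree.label() in ("PERSON", "ORGANIZATION", "GPE"):
        entity = " ".join(word for word, pos in subtree.leaves())
        print(subtree.label(), entity)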
