
Text Analysis

A guide to text mining tools and methods

About spaCy

spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python. It is designed for production use and helps users process and understand large volumes of text. It has a wide range of applications in information extraction, natural language understanding, and text pre-processing.

spaCy is also a powerful package for simplifying data and deriving in-depth information from input text. Its features give insights into a text’s grammatical structure, which can be particularly helpful for the following tasks.

  • Sentence detection and Tokenization: spaCy can break the input text into linguistically meaningful or basic units for future analyses.
  • Stop word removal: spaCy can remove common English words so that they do not distort tasks such as word frequency analysis.
  • Text summarization: through its lemmatization and named entity recognition features, spaCy can reduce ambiguity, summarize text, and extract the most relevant information, such as a person, location, or company, for analysis.
  • Language translation: spaCy and other natural language processing tools can use deep learning to translate speech and text into different languages, even for specialized fields and domains.
  • Dependency and similarity parsing: spaCy assigns syntactic dependency labels that describe how the words in a given text relate to each other.
  • Rule-based matching: spaCy can extract specified patterns in the text, such as full names, phone numbers, and birthdays.

spaCy comes with a built-in visualizer called displaCy that visualizes a dependency parse or named entities in a browser or a Jupyter notebook.
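
A later section shows displaCy inside a Jupyter notebook; as a minimal sketch of the same idea run outside a notebook (the example sentence is illustrative), displacy.render() can also return the markup as a string:

import spacy
from spacy import displacy

load_model = spacy.load("en_core_web_sm")
nlp = load_model("spaCy ships with a built-in visualizer.")

# Outside a notebook, render() returns the markup as a string (page=True wraps it in a full HTML page);
# in a Jupyter notebook, pass jupyter=True to draw the visualization inline instead
html = displacy.render(nlp, style="dep", page=True, jupyter=False)
print(html[:200])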

The spaCy Google Colab Notebook version of the tutorial is also available if you would like to follow along. 

Installation

spaCy (together with its data and models) can be installed from the Python Package Index using pip and setuptools.

Use the following commands to install spaCy on your machine:

# [Mac Terminal]
pip3 install -U pip setuptools wheel
pip3 install -U spacy

# [Jupyter Notebook]
import sys
!{sys.executable} -m pip install spacy
!{sys.executable} -m spacy download en_core_web_sm

# [Conda install]
conda install -c conda-forge spacy
python -m spacy download en_core_web_sm
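
After installation, you can optionally confirm that spaCy and any downloaded models are compatible with your environment, for example with spaCy's built-in info and validate commands:

# [Verify the installation - optional]
python -m spacy info        # show the installed spaCy version and environment details
python -m spacy validate    # check that downloaded models match the installed spaCy version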

Statistical Models

spaCy offers statistical models for a variety of languages, which can be installed as individual modules in Python. These models are the powerful engines of spaCy that perform several NLP-related tasks, such as part-of-speech tagging, named entity recognition, and dependency parsing.

You can download these models for the English language by executing the following code:

# [Mac Terminal]
python3 -m spacy download en_core_web_lg
python3 -m spacy download en_core_web_sm


# [Jupyter Notebook]
!python3 -m spacy download en_core_web_sm
!python3 -m spacy download en_core_web_lg

# [Conda install]
conda install -c conda-forge spacy-model-en_core_web_sm
conda install -c "conda-forge/label/broken" spacy-model-en_core_web_sm
conda install -c "conda-forge/label/cf202003" spacy-model-en_core_web_sm

Once downloaded, those models can be opened via spacy.load('model_name') in Python. You can therefore verify that a model was downloaded successfully by running the following code:

import spacy
load_model = spacy.load('en_core_web_sm')

If the model loads and the object is created without errors, spaCy is installed and the model data were downloaded successfully.
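
As a further sanity check, you can list the processing components included in the loaded pipeline; a minimal sketch:

import spacy
load_model = spacy.load('en_core_web_sm')

# Print the pipeline components included in the model (e.g., tagger, parser, ner)
print(load_model.pipe_names)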

Read Strings

For a given input string, you can use spaCy to create a processed object for accessing linguistic annotations:

text = ('For a given input string, you can use spaCy to create a processed object for accessing linguistic annotations.')
nlp = load_model(text)

The input text string is then converted to an object that spaCy can understand. This method can be used to convert any text into a processed object for future analysis.

Read Text File

You can also convert a .txt file into a processed object. Notice that the .txt file needs to be in the current working directory, or you will have to specify its full path. A quick reminder that you can get the current working directory with os.getcwd() and change it with os.chdir() after importing os.

import os

# The next two lines only apply in Google Colab; on your own machine, change the
# working directory instead with os.chdir('Path to directory')
from google.colab import drive
drive.mount('/content/drive')

file = 'text.txt'
file_text = open(file).read()
nlp = load_model(file_text)

You may assume that variable names ending with the suffix _doc are spaCy Doc objects, i.e., processed documents.
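
If your file lives outside the working directory, you can also pass its full path and an explicit encoding when reading it. The sketch below assumes a hypothetical file name, text.txt; adjust the path for your own data:

import os
import spacy

load_model = spacy.load('en_core_web_sm')

# Hypothetical path - replace with the location of your own file
file_path = os.path.join(os.getcwd(), 'text.txt')

with open(file_path, encoding='utf-8') as f:
    file_text = f.read()

file_doc = load_model(file_text)
print(len(file_doc))    # number of tokens, as a quick sanity check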

Sentence Detection (Sentence Boundary Detection)

Sentence Boundary Detection locates the start and end of sentences in a given text. You can divide a text into linguistically meaningful units to perform tasks such as part of speech tagging and entity extraction. In spaCy, the property sents can be used to extract sentences in a given input text:

import spacy
load_model = spacy.load('en_core_web_sm')

# Customize text or read files as needed
# Create an nlp object
nlp = load_model("Apple's name was inspired by Steve Jobs’ visit. His visit was to an apple farm while on a fruitarian diet.")

# Create a list of the sentences contained in nlp. By default, spaCy determines sentence boundaries from its dependency parse (typically at sentence-final punctuation such as periods).
sentences = list(nlp.sents)

for sentence in sentences:
    # Print each sentence in the nlp with one sentence a line
    print(sentence)

In the previous example, spaCy correctly identifies the English sentences, splitting them at sentence-final periods. You can also customize the delimiter used for sentence detection. The following example sets the semicolon as the delimiter by adjusting the input text string and the separator checked in the custom_boundaries() function:

import spacy
from spacy.language import Language

load_model = spacy.load('en_core_web_sm')
# Customize text as needed
text = ('Apple is red; apple is juicy; apple is sweet.')

@Language.component("component")
def custom_boundaries(doc):
    for token in doc[:-1]:
        if token.text == ';':    # Customize delimiters for sentence detection
            doc[token.i+1].is_sent_start = True
    return doc
load_model.add_pipe("component", before='parser')

nlp = load_model(text)
for sentence in nlp.sents:
    print(sentence)

Tokenization

Tokenization is the process of segmenting input text into units such as words and punctuation marks. It allows you to identify the basic units in your text, which are called tokens. You can use the following code for this purpose:

import spacy
load_model = spacy.load("en_core_web_sm")

# Create an nlp object
# Customize text as needed
nlp = load_model("Apple's name was inspired by his visit to an apple farm while on a fruitarian diet.")

# Iterate over the tokens
for token in nlp:
    # Print tokens
    print(token.text)

spaCy provides various attributes for the token class. Instead of printing token.text each time, you can also do:

for token in nlp:
    print(token, token.is_alpha, token.is_punct, token.is_space)

Here, is_alpha checks whether the token consists of alphabetic characters, is_punct whether it is a punctuation symbol, and is_space whether it is whitespace.
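
Other token attributes follow the same pattern. For instance, the short sketch below (with an illustrative sentence) prints whether each token looks like a number and its orthographic shape:

import spacy
load_model = spacy.load("en_core_web_sm")

nlp = load_model("Apple was founded on April 1, 1976.")

for token in nlp:
    # like_num flags tokens that resemble numbers; shape_ abstracts the orthographic form (e.g., Xxxxx, dd)
    print(token.text, token.like_num, token.shape_)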

Stop Words Removal

Stop words are the most common words in a language. Examples of stop words are the, who, too, and is. We usually remove the stop words because they are not significant in many text mining tasks such as word frequency analysis. You can remove stop words by using spaCy’s list of stop words for the English language:

import spacy
load_model = spacy.load("en_core_web_sm")

# Create an nlp object
nlp = load_model("Apple's name was inspired by his visit to an apple farm while on a fruitarian diet.")

for token in nlp:
    if not token.is_stop:
        print(token)
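
You can also inspect spaCy’s built-in stop word list and, if needed, extend it with domain-specific terms. A minimal sketch (the added stop word is just an example):

import spacy
load_model = spacy.load("en_core_web_sm")

# Inspect the built-in English stop word list
stop_words = load_model.Defaults.stop_words
print(len(stop_words))

# Add a domain-specific stop word (illustrative) and mark it on the vocabulary
load_model.Defaults.stop_words.add("farm")
load_model.vocab["farm"].is_stop = True

nlp = load_model("Apple's name was inspired by his visit to an apple farm.")
print([token.text for token in nlp if not token.is_stop])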

Linguistic Annotations

You can explore the text’s grammatical structure through spaCy’s linguistic annotations, which detect each word’s type (e.g., noun, verb, adjective) and its syntactic role in the sentence.

import spacy
load_model = spacy.load("en_core_web_sm")

# Create an nlp object
nlp = load_model("Apple's name was inspired by his visit to an apple farm while on a fruitarian diet.")

# Iterate over the tokens
for token in nlp:
    # Print each token, its part-of-speech tag, and its syntactic dependency label
    print(token.text, token.pos_, token.dep_)

Executing the code above returns the processed document, in which the text is split into individual words, each annotated with its part-of-speech tag and dependency label.

Word Frequency

Word frequency analysis gives you insight into word patterns, such as the common or unique words in a text:

import spacy
from collections import Counter

load_model = spacy.load('en_core_web_sm')
text = ('Apple\'s name was inspired by his visit to an apple farm while he was on a fruitarian diet')
text_doc = load_model(text)

# Remove stop words and punctuation symbols
words = [token.text for token in text_doc if not token.is_stop and not token.is_punct]
word_freq = Counter(words)

# 5 commonly occurring words with their frequencies
common_words = word_freq.most_common(5)
print(common_words)

# Unique words
unique_words = [word for (word, freq) in word_freq.items() if freq == 1]
print(unique_words)
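
Because different surface forms of the same word (for example, apple and apples) are counted separately, it can help to count lemmas instead of raw tokens; a small variation on the code above:

import spacy
from collections import Counter

load_model = spacy.load('en_core_web_sm')
text_doc = load_model('Apple\'s name was inspired by his visit to an apple farm while he was on a fruitarian diet')

# Count lemmas (dictionary forms) rather than surface forms
lemmas = [token.lemma_.lower() for token in text_doc if not token.is_stop and not token.is_punct]
lemma_freq = Counter(lemmas)
print(lemma_freq.most_common(5))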

Part-Of-Speech (POS) Tagging

Part-of-speech (POS) tagging analyzes the grammatical role each word plays in a sentence. In other words, it determines the category to which each word belongs (noun, pronoun, adjective, verb, adverb, preposition, conjunction, or interjection). POS tags are useful when you want to assign a syntactic category to each word of the text for further analysis.

import spacy
load_model = spacy.load('en_core_web_sm')

# Create an nlp object
nlp = load_model("Apple's name was inspired by his visit to an apple farm while on a fruitarian diet.")

# Iterate over the tokens
for token in nlp:
    # Print the token and its part-of-speech tag
    print(token, token.tag_, token.pos_, spacy.explain(token.tag_))

We can use spaCy’s built-in displaCy visualizer to visualize the processed document. displaCy can take either a single Doc or a list of Doc objects as its first argument. Because a plain Python console cannot display web content, the easiest way to see the visualization is in a Jupyter notebook, using displacy.render (displacy.serve instead starts a local web server and renders the result in a browser).

# Install spaCy and the English model in a Jupyter notebook (skip if already installed)
import sys
!{sys.executable} -m pip install spacy
!{sys.executable} -m spacy download en_core_web_sm
import spacy

# Import displacy from spacy
from spacy import displacy
load_model = spacy.load('en_core_web_sm')
nlp = load_model("Apple's name was inspired by his visit to an apple farm while on a fruitarian diet.")

# Visualize nlp
displacy.render(nlp, style="dep", jupyter=True)

Dependency Parsing

Dependency parsing is the process of extracting the dependency parse of a sentence to represent its grammatical structure. It defines the relationships between headwords and their dependents, helping you understand what role a word plays in the text and how words relate to each other.

In dependency parsing, the head of the sentence is called the root and has no dependency of its own. The main verb or action is usually the root and is denoted by the dependency tag ROOT. All other words are directly or indirectly connected to it.

To use dependency parsing and explore the relationships between words:

import spacy
load_model = spacy.load('en_core_web_sm')
nlp = load_model("Apple's name was inspired by his visit to an apple farm while on a fruitarian diet.")

for token in nlp:
    # Print the tokens and their dependency tag
    print(token.text, "-->", token.dep_)

You can find out what other tags stand for by executing the code below; a tuple of short descriptions, one per dependency tag, is returned:

spacy.explain("nsubj"), spacy.explain("ROOT"), spacy.explain("aux"), spacy.explain("advcl"), spacy.explain("dobj")
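
Beyond the dependency labels, each token also records its syntactic head and its children, so you can walk the parse tree directly; a minimal sketch:

import spacy
load_model = spacy.load('en_core_web_sm')
nlp = load_model("Apple's name was inspired by his visit to an apple farm while on a fruitarian diet.")

# The root of the sentence carries the dependency tag ROOT and is its own head
root = [token for token in nlp if token.dep_ == "ROOT"][0]
print("Root:", root.text)

# Each token points to its syntactic head and to its children in the parse tree
for token in nlp:
    print(token.text, "-> head:", token.head.text, "| children:", [child.text for child in token.children])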

Lemmatization

Lemmatization is the process of reducing the inflected forms of a word to a base form that still belongs to the language. This reduced form, or root word, is called a lemma. For example, “organizes”, “organized”, and “organizing” are all forms of the lemma “organize”. Lemmatization also reduces inflection for number (e.g., cars → car).

Lemmatization is an important step because reducing the inflected forms of words lets you analyze the text more efficiently.

To perform lemmatization, use the spaCy attribute lemma_ on the tokenized object. This attribute has the lemmatized form of a token:

import spacy
load_model = spacy.load('en_core_web_sm')

nlp = load_model("Apple's name was inspired by his visit to an apple farm while on a fruitarian diet.")

for token in nlp:
    # Print the token and its lemma
    print(token.text, "-->", token.lemma_)

Named Entity Recognition (NER)

A named entity is an object’s assigned name, for example, a person’s name, a film, a book title, or a song’s name. spaCy can recognize named entities in a document by asking the model for a prediction. Because model performance depends on the examples the model was trained on, NER might not always work perfectly, and you might need to fine-tune the model for your own use case.

Named entity recognition is performed by accessing the ents property of the processed document:

import spacy

load_model = spacy.load("en_core_web_sm")
nlp = load_model("Apple's name was inspired by his visit to an apple farm while on a fruitarian diet.")

for ent in nlp.ents:
    # For each identified named entity, Python will print out the text, its starting position, ending position, and named entity label
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
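
As with dependency tags, spacy.explain() can describe what an entity label stands for, and collections.Counter can summarize how often each label appears; for example (with an illustrative sentence):

import spacy
from collections import Counter

load_model = spacy.load("en_core_web_sm")
nlp = load_model("Steve Jobs co-founded Apple in Cupertino, California in 1976.")

# Describe an entity label
print(spacy.explain("GPE"))

# Count how many entities of each label were found
print(Counter(ent.label_ for ent in nlp.ents))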

Entity Detection

Entity detection, a more advanced form of language processing, identifies entities in a text such as specific locations (GPE), date-related words (DATE), important numbers (CARDINAL), and specific individuals (PERSON). It is helpful when you want to identify key information in a text.

We use the label_ attribute to grab the label of each entity detected in the text. You can also visualize those entities with spaCy’s displaCy visualizer.

import spacy
from spacy import displacy
load_model = spacy.load("en_core_web_sm")
nlp = load_model("""The Amazon rainforest, alternatively the Amazon Jungle, also known in English as Amazonia, is a moist broadleaf tropical rainforest in the Amazon biome that covers most of the Amazon basin of South America. """)

for i in nlp.ents:
    # For each identified named entity, Python will print out the text, and entity label
    print(i, i.label_)

 

Using displaCy we can also visualize our input text, with each identified entity highlighted by color and labeled. We’ll use style = "ent" to tell displaCy that we want to visualize entities here:

displacy.render(nlp, style = "ent", jupyter = True)

Similarity

Similarity is determined by comparing word vectors, or “word embeddings”: multi-dimensional meaning representations of words. spaCy provides built-in integration of dense, real-valued vectors representing distributional similarity information.

import spacy
load_model = spacy.load("en_core_web_lg")

nlp = load_model("dog cat banana afskfsd")

for token in nlp:
    # Print the token text, the boolean value of whether the token is part of the model’s vocabulary, dimensions, and the boolean value of whether the token is out-of-vocabulary
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)

Note that the words “dog”, “cat” and “banana” are all common in English, so they’re part of the model’s vocabulary, and come with a vector. On the other hand, the word “afskfsd” is a lot less common and out-of-vocabulary, so its vector representation consists of 300 dimensions of 0, meaning it’s practically nonexistent.
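
You can confirm this by inspecting the vector itself: for an out-of-vocabulary token, the vector still has the model’s dimensionality (300 for en_core_web_lg) but contains only zeros. A short check:

import spacy
load_model = spacy.load("en_core_web_lg")
nlp = load_model("dog cat banana afskfsd")

oov_token = nlp[3]    # "afskfsd"
# The vector length matches the model's dimensionality, but every value is 0 for an OOV token
print(len(oov_token.vector), oov_token.vector.min(), oov_token.vector.max())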

spaCy is able to compare two objects and make a prediction of how similar they are. Predicting similarity is useful for building recommendation systems or flagging duplicates:

import spacy
load_model = spacy.load("en_core_web_lg")    # make sure to use larger package!

nlp1 = load_model("I like salty fries and hamburgers.")
nlp2 = load_model("Fast food tastes very good.")

# Similarity of two documents
print(nlp1, "<->", nlp2, nlp1.similarity(nlp2))

# Similarity of tokens and spans
french_fries = nlp1[2:4]
burgers = nlp1[5]    # the "hamburgers" token from the first document
print(french_fries, "<->", burgers, french_fries.similarity(burgers))

In this case, the model tells us the similarity between two documents and between tokens and spans. You can also determine words’ similarities within one sentence.

import spacy

load_model = spacy.load("en_core_web_lg")
nlp = load_model("dog cat banana")

for token1 in nlp:
    for token2 in nlp:
        print(token1.text, token2.text, token1.similarity(token2))

From the results, we see that dog and cat are much more similar to each other than dog and banana (similarity score = 0.209), while each token compared with itself scores 1.00.

Rule-Based Matching

Rule-based matching extracts information from unstructured text by identifying tokens and phrases according to patterns (such as lowercase form) and grammatical features (such as part of speech). It can also use regular expressions to extract entities (such as phone numbers) from unstructured text. With rule-based matching, you can, for example, extract a first and last name, which are typically consecutive proper nouns:

import spacy
from spacy.matcher import Matcher

load_model = spacy.load('en_core_web_sm')

text_doc = load_model('Steven Jobs (February 24, 1955 - October 5, 2011) was an American entrepreneur, industrial designer, business magnate, media proprietor, and investor. ')
matcher = Matcher(load_model.vocab)

def extract_full_name(text_doc):
    pattern = [{'POS': 'PROPN'}, {'POS': 'PROPN'}]
    matcher.add('FULL_NAME', [pattern])
    matches = matcher(text_doc)
    for match_id, start, end in matches:
        span = text_doc[start:end]
        return span.text

print(extract_full_name(text_doc))

The pattern is a list of dictionaries that defines the combination of tokens to be matched. In our case, we want both tokens’ POS tags to be PROPN (proper noun). The pattern is then added to the Matcher under the match ID “FULL_NAME”.
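
The same approach extends to other patterns built from token attributes. As a sketch, the following pattern is intended to match date-like spans such as “February 24, 1955” (a word, a number, a comma, and another number); the pattern and match ID are illustrative:

import spacy
from spacy.matcher import Matcher

load_model = spacy.load('en_core_web_sm')
text_doc = load_model('Steven Jobs (February 24, 1955 - October 5, 2011) was an American entrepreneur.')

matcher = Matcher(load_model.vocab)

# A word, a digit token, a comma, and another digit token
date_pattern = [{'IS_ALPHA': True}, {'IS_DIGIT': True}, {'TEXT': ','}, {'IS_DIGIT': True}]
matcher.add('DATE_LIKE', [date_pattern])

for match_id, start, end in matcher(text_doc):
    print(text_doc[start:end].text)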
