
Text Analysis

A guide to text mining tools and methods

Importing Files

To access information stored in a file, you must first import it into Python; until you do, Python has no way to refer to the file's contents. In practice you will often have to work with several different file formats, so this tutorial briefly introduces some of the most common ones, along with the correct way to read each of them in Python.

The Importing Data Google Colab Notebook version of the tutorial is also available if you would like to follow along. 

Local Machine

In Python, you can get and change the current working directory using os.getcwd() and os.chdir(). But before we do this, we need to import the os module, which is in the standard library, so no additional installation is needed.

Running the following code as written produces an error message, because "path-to-desired-working-directory" is not a valid path. Copy the code to your local machine and replace the placeholder with your desired directory before running it.

import os

os.getcwd()                                      # Get the current working directory.
os.chdir('path-to-desired-working-directory')    # Change the current working directory.
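
As a minimal sketch of the same two calls with a path that is guaranteed to exist, the example below changes into a temporary folder and then returns to the starting directory. The temporary folder is only a stand-in for your own data folder, and the variable names are illustrative assumptions.

```python
import os
import tempfile

original_dir = os.getcwd()                    # Remember where we started

with tempfile.TemporaryDirectory() as tmp:    # Stand-in for your own data folder
    target = os.path.realpath(tmp)            # Resolve any symbolic links in the path
    if os.path.isdir(target):                 # Only change directory if the path exists
        os.chdir(target)
    changed_to = os.getcwd()                  # Where we are now
    os.chdir(original_dir)                    # Go back before the folder is deleted

print(changed_to)
print(os.getcwd() == original_dir)
```

Checking os.path.isdir() first avoids the FileNotFoundError that os.chdir() raises for a path that does not exist.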

Google Colab

Assume that we are importing a file called "test.txt" from your local machine. To import it into Google Colab, first upload the file to a folder in your Google Drive.

Then, allow Colab to mount your Google Drive for file reading and writing by executing the following code. The code will immediately ask for authorization. Click on "Connect to Google Drive" when the dialogue box pops up, select the account, and grant permission to mount your drive.

from google.colab import drive
drive.mount('/content/drive')

Connect to the directory containing your uploaded file. To find that directory, import the os module introduced earlier. You can then use os.listdir() to list the files in your working directory and os.chdir() to change your working directory to the place where you uploaded "test.txt".

import os

print(sorted(os.listdir()))    # List the contents of the current working directory;
                               # pass a path to os.listdir() to list another directory.
os.chdir('path-to-desired-working-directory')    # Change working directory

# Now you are ready to import "test.txt"

Importing Files (Reading Text Files)

Text files are one of the most common file formats for storing data. In Python, you can use the open() function to read .txt files.

Notice that the open() function takes two input parameters: the file path (or the file name if the file is in the current working directory) and the file access mode. If you omit the mode, it defaults to 'r'. The most common modes for opening a file are:

  • open('path','r'): opens an existing file in read mode
  • open('path','w'): opens or creates a text file in write mode, erasing any existing content
  • open('path','a'): opens or creates a file in append mode, so new text is added at the end
  • open('path','r+'): opens an existing file in both read and write mode
  • open('path','w+'): opens or creates a file in both read and write mode, erasing any existing content
  • open('path','a+'): opens or creates a file in both read and append mode

text_file = open('path-to-file','r')    # Full path to the txt file, or simply the file name 
                                        # if the file is in your current working directory
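
To make the difference between the write, append, and read modes concrete, here is a small sketch that works on a scratch file in a temporary folder. The file name demo.txt and the variable names are assumptions for illustration only. It also uses the with statement, which closes the file automatically at the end of each block.

```python
import os
import tempfile

# Scratch file so the example does not touch your real data
folder = tempfile.mkdtemp()
path = os.path.join(folder, 'demo.txt')

with open(path, 'w') as f:      # 'w' creates the file (or erases an existing one)
    f.write('first line\n')

with open(path, 'a') as f:      # 'a' keeps existing content and writes at the end
    f.write('second line\n')

with open(path, 'r') as f:      # 'r' reads; the file must already exist
    after_append = f.read()

with open(path, 'w') as f:      # Opening with 'w' again discards the two lines above
    f.write('replaced\n')

with open(path, 'r') as f:
    after_rewrite = f.read()

print(repr(after_append))
print(repr(after_rewrite))
```

Because 'w' silently erases existing content, 'a' or 'r+' is usually the safer choice when the file already holds data you want to keep.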

After opening the file in read mode, you can use the following methods to access or examine the information stored in the file:

  • .read(<number>): reads the complete contents of the file as one string. If a number n is given, it reads at most the next n characters instead (n bytes if the file was opened in binary mode).
  • .readline(<number>): reads a single line from the file. If a number n is given, it reads at most n characters of that line. It is usually used in loops.
  • .readlines(): reads all remaining lines in the file and returns them as a list of strings, one per line. It does not print anything by itself.

text_file.read()    # Read complete information from the file
text_file.close()   # Close the file when you are done with it
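
The three methods can be compared side by side on a small scratch file. This is a sketch under the assumption of a three-line file created just for the demonstration; the file and variable names are illustrative.

```python
import os
import tempfile

# Create a three-line scratch file to read from
folder = tempfile.mkdtemp()
path = os.path.join(folder, 'demo.txt')
with open(path, 'w') as f:
    f.write('alpha\nbeta\ngamma\n')

with open(path, 'r') as f:
    first_three = f.read(3)        # Reads the first 3 characters: 'alp'
    rest_of_line = f.readline()    # Reads up to the end of the current line: 'ha\n'
    remaining = f.readlines()      # Reads the remaining lines: ['beta\n', 'gamma\n']

with open(path, 'r') as f:
    # Looping over the file object yields one line at a time
    stripped = [line.rstrip('\n') for line in f]

print(first_three, repr(rest_of_line), remaining, stripped)
```

Each call continues from where the previous one stopped, which is why the second block reopens the file to start again from the beginning.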

Importing Files (Reading CSV Files)

CSV (Comma-Separated Values) files are another of the most common file formats that data scientists work with. As the name suggests, these files use a comma (“,”) as the delimiter that separates the values.

We usually use the Pandas library to read CSV files:

import pandas as pd

dataframe = pd.read_csv('path-to-file')    # Full path to the CSV file, or simply the file name 
                                           # if the file is in your current working directory

Sometimes a dataset is stored in a .csv file, but its values are separated by a delimiter other than a comma. Possible delimiters include semicolons, colons, tabs, vertical bars, etc. In such cases, we need to pass the sep parameter to the read_csv() function so that the values are read properly:

dataframe = pd.read_csv('path-to-file', sep = ';')

Notice that we use '|' to represent the vertical-bar separator and '\t' to represent the tab separator.
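
The effect of the sep parameter can be seen without any file on disk by reading from in-memory text. This is a minimal sketch; the sample data and variable names are made up for illustration, and io.StringIO simply stands in for a real file.

```python
import io
import pandas as pd

# Made-up sample data using two different delimiters
pipe_text = 'name|count\nalpha|1\nbeta|2\n'
tab_text = 'name\tcount\nalpha\t1\nbeta\t2\n'

pipe_df = pd.read_csv(io.StringIO(pipe_text), sep='|')     # Vertical-bar separated
tab_df = pd.read_csv(io.StringIO(tab_text), sep='\t')      # Tab separated
wrong_df = pd.read_csv(io.StringIO(pipe_text))             # Default sep=',' finds no commas

print(pipe_df)
print(wrong_df.shape)    # Everything lands in a single column
```

When a dataframe unexpectedly has only one column, a mismatched delimiter is a likely cause.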

Importing Files (Reading Excel Files)

The Pandas library also has a function, read_excel(), to read Excel files:

dataframe = pd.read_excel('path-to-file')    # Full path to the Excel file, or simply the file name 
                                             # if the file is in your current working directory

An Excel file often contains multiple sheets. You can read data from any sheet by providing its name in the sheet_name parameter of the read_excel() function:

dataframe = pd.read_excel('path-to-file', sheet_name='sheet-name')

You can get the list of dataframe headers using the columns property of the dataframe object.

print(dataframe.columns.tolist())

Sometimes, you might want to use a single column of data for analysis. To do this, you can select the column and convert it into a list of values.

print(dataframe['column-name'].tolist())

Importing Files (Extracting from Zip Files)

To open a ZIP archive, you first need to import the zipfile module, which is also part of the standard library, so no additional installation is needed.

from zipfile import ZipFile    # Import zipfile


file = 'path-to-file'          # Full path to the zip file, or simply the file name 
                               # if the file is in your current working directory
zip_file = ZipFile(file, 'r')  # Open the zip file in read mode
zip_file.printdir()            # List the files in the zip file
zip_file.extractall()          # Extract all files into the current working directory

Once you run the above code, the extracted files appear in your current working directory, because extractall() was called without a destination path.
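
If you would rather extract into a specific folder, extractall() accepts a path argument. The sketch below first builds a tiny archive in a temporary folder so that it can run anywhere; the archive, the file names, and the variable names are assumptions for illustration.

```python
import os
import tempfile
from zipfile import ZipFile

folder = tempfile.mkdtemp()                       # Scratch folder for the demonstration
archive = os.path.join(folder, 'demo.zip')
target = os.path.join(folder, 'extracted')        # Where the files should end up

with ZipFile(archive, 'w') as zf:                 # Build a tiny archive to work with
    zf.writestr('notes.txt', 'hello from the archive')

with ZipFile(archive, 'r') as zf:                 # The archive is closed after this block
    names = zf.namelist()                         # File names as a Python list
    zf.extractall(path=target)                    # Extract into the chosen folder

with open(os.path.join(target, 'notes.txt')) as f:
    content = f.read()

print(names)
print(content)
```

Unlike printdir(), which only prints a listing, namelist() returns the names so that you can loop over them in your own code.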

Importing Files (Working with JSON Files)

JSON (JavaScript Object Notation) files store data as key-value pairs within curly braces {}, which is similar to how a dictionary stores information in Python. The major benefit of JSON files is that they are language-independent, meaning they can be read and written from virtually any programming language.

Python has a built-in json module to read JSON files. Its json.load() function takes an open JSON file and returns the corresponding Python object, which is a dictionary when the top level of the file is a JSON object.

import json


file = open('path-to-file')    # Full path to the json file, or simply the file name 
                               # if the file is in your current working directory
json_file = json.load(file)    # Returns JSON object as a dictionary

Once you have executed the code above, you can convert it into a Pandas dataframe using the pandas.DataFrame() function:

dataframe = pd.DataFrame(json_file)

You can also load the JSON file directly into a dataframe using the pandas.read_json() function:

dataframe = pd.read_json('path-to-file')

Now, the JSON data can be manipulated just like any other Pandas dataframe.
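
To see how json.load() maps JSON values onto Python types, the following sketch writes a small made-up record to a scratch file and reads it back. The field names and values are invented for illustration.

```python
import json
import os
import tempfile

folder = tempfile.mkdtemp()
path = os.path.join(folder, 'demo.json')

# Made-up record covering the common JSON value types
with open(path, 'w') as f:
    f.write('{"title": "sample", "pages": 12, "tags": ["a", "b"], '
            '"reviewed": true, "notes": null}')

with open(path) as f:              # The file is closed automatically after the block
    record = json.load(f)

print(type(record))                # A JSON object becomes a Python dict
print(record['tags'])              # A JSON array becomes a Python list
print(record['reviewed'], record['notes'])   # true becomes True, null becomes None
```

Knowing these conversions helps when a later step expects a particular Python type, for example a list of tags.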

Importing Files (Working with Pickle Files)

Pickle files store the serialized form of Python objects, including lists, dictionaries, tuples, etc. The objects are converted to a byte stream before being stored in the Pickle file. Storing Python objects as Pickle files allows you to continue working with them later on. Pickle files are particularly useful when you are training a machine learning model and want to save it for making predictions later.

To open a Pickle file, you must de-serialize it first. Fortunately, Python's built-in pickle module provides the pickle.load() function for loading a Pickle file.

Notice that, to read a binary file, you need to provide an 'rb' parameter when you open the pickle file with Python’s open() function.

import pickle

file = open('path-to-file','rb')    # Full path to the Pickle file, or simply the file name 
                                    # if the file is in your current working directory
pickle_file = pickle.load(file)     # Load Pickle file

Once you have executed the code above, you can also convert it into a Pandas dataframe using the pandas.DataFrame() function:

dataframe = pd.DataFrame(pickle_file)
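
A full round trip shows both halves of the process: pickle.dump() writes an object with the 'wb' mode, and pickle.load() reads it back with 'rb'. This is a minimal sketch; the dictionary and the file name are made up for illustration.

```python
import os
import pickle
import tempfile

folder = tempfile.mkdtemp()
path = os.path.join(folder, 'demo.pkl')

# Made-up object to save
word_counts = {'text': 3, 'mining': 2, 'tokens': ('a', 'b')}

with open(path, 'wb') as f:        # 'wb' = write binary
    pickle.dump(word_counts, f)    # Serialize the object into the file

with open(path, 'rb') as f:        # 'rb' = read binary
    restored = pickle.load(f)      # De-serialize it back into a Python object

print(restored == word_counts)     # Same contents
print(restored is word_counts)     # But a new, separate object
```

The restored object keeps its Python types, so the tuple stored under 'tokens' comes back as a tuple rather than a list.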

Importing Files (Extracting Information from Images)

Extracting information from images is another useful technique in text analysis.

 

Luckily, Optical Character Recognition (OCR), a computer-based approach to converting images of text into machine-encoded text, makes this possible. pytesseract is a Python library that performs OCR tasks, allowing the extraction of text from images.

 

pytesseract is a wrapper around the Tesseract OCR engine, so to use it in Python we also need Tesseract itself. Since we are working with images, we also need to install the Pillow library, which performs image processing in Python.

 

To do this, we need to first install Tesseract on your operating system.

(1) For Windows: the latest version of the Tesseract installer can be found in this Colab notebook. Simply download the .exe file and install it on your computer.

(2) For Mac: the easiest way to install Tesseract OCR on Mac is by using Homebrew. You can download Homebrew through the site. After it is installed, you can type the following code in the console to download Tesseract:

# Please run them on your local machine

brew install tesseract

The next step would be to install the packages using the following code.

# Please run them on your local machine

pip install pytesseract
pip install pillow

(3) For Google Colab:

!sudo apt install tesseract-ocr
!pip install pytesseract

In this tutorial, we will only use simple images with horizontally aligned text. Note that the following code may not work well for images that require additional image processing. On your local machine:

from PIL import Image
from pytesseract import pytesseract


path_to_tesseract = 'path-to-tesseract.exe'    # Define path to the Tesseract executable
pytesseract.tesseract_cmd = path_to_tesseract  # Point tesseract_cmd to that executable

image = Image.open('path-to-image')            # Open image with Pillow
text = pytesseract.image_to_string(image)      # Extract text from image
print(text)

On Google Colab

from PIL import Image
from pytesseract import pytesseract
from google.colab import files

uploaded = files.upload()                    # Upload the image from your local machine
image = Image.open('path-to-image')          # Open image with Pillow
text = pytesseract.image_to_string(image)    # Extract text from image

print(text)

Importing Files (Web Scraping)

Web scraping refers to the task of extracting data from a website. Python has powerful packages for retrieving data from websites for later analysis.

The get() function in the requests package takes a URL as its parameter and returns the server's response. Behind the scenes, requests builds the HTTP request, sends it to the server, receives the HTML response, and stores it in a Python object.

import requests

url = "path-to-website"
response = requests.get(url)    # Generate response object
text = response.text            # Return the HTML of webpage as string
print(text)