Research Data & Digital Scholarship (RDDS)

Text Analysis

A guide to text mining tools and methods

Programming Languages for Natural Language Processing

For a more general overview of Python, see UPenn Libraries Python Libguide.

The spreadsheet is a reference for the key steps involved in TDM (Text Data Management): Normalization, Noise Removal, Tokenization, Word-level Analysis, Word Association Analysis, Advanced Analysis, and Data Visualization. It also introduces R and Python packages that can be used to carry out each of these steps.
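The first few steps listed above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the workflow from the spreadsheet; the function names, the sample sentence, and the tiny stop-word list are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration only; real projects usually
# take a fuller list from a dedicated NLP package.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "on"}


def normalize(text):
    """Normalization: lowercase the text and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def remove_noise(text):
    """Noise removal: drop HTML-like tags, then replace every character
    that is not a lowercase letter, digit, or whitespace with a space.
    Assumes the text has already been lowercased by normalize()."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"[^a-z0-9\s]", " ", text)


def tokenize(text):
    """Tokenization: split the cleaned text on whitespace into word tokens."""
    return text.split()


def word_frequencies(tokens):
    """Word-level analysis: count tokens that are not stop words."""
    return Counter(t for t in tokens if t not in STOPWORDS)


sample = "<p>The cat sat on the mat. The CAT slept!</p>"
tokens = tokenize(remove_noise(normalize(sample)))
freqs = word_frequencies(tokens)
print(tokens)
print(freqs.most_common(1))
```

The order of the steps matters in this sketch: the noise-removal pattern only keeps lowercase letters, so it relies on normalization having run first.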

Glossary

Confused about certain words in the context of text analysis? Read on to learn more about TDM methods and techniques.

Missing any terms? Please let us know at LibraryRDDS@pobox.upenn.edu

Infographic of the TDM methods

A glossary of the text analysis terms you are most likely to come across

  • API (Application Programming Interface): Software intermediary that allows two applications to talk to each other; in our case, it is used to access the features or data of an operating system, application, or other service.
  • Lemmatization: Identifying the base form of a word, such as "run" for run, ran, and running
  • Named Entity Recognition: Identifying proper names in a corpus
  • Natural Language Processing: Ability of a machine or program to understand human text or speech
  • N-grams: Contiguous sequences of n items (syllables, letters, words, etc.) taken from a sample of text; probabilistic n-gram models in computational linguistics use them to estimate which sequences can be expected
  • Parts of Speech Tagging: Identifying the syntactic role of a word
  • Stylometric Analysis: The quantitative study of literary style based on the observation that authors tend to write in relatively consistent, recognizable and unique ways.
  • Relation Extraction: Identifying the relationships between entities such as "daughter of" or "town in ? state"
  • Stemming: Applying processing rules that strip word endings to approximate the base form of a word
  • Tokenization: Process of separating a string of characters into tokens, which may be words, phrases, or sentences. Punctuation is often removed in the process.
  • Text preprocessing: Cleaning, normalizing, and preparing text data for analysis by removing stopwords, punctuation, extra whitespace, etc.
  • Topic modeling: Coding texts into meaningful categories (topics), typically by detecting groups of words that tend to occur together across a corpus
  • Lemmatization: Application of a dictionary that allows a system to consider variations of a term by using the dictionary entries to normalize words, replacing morphological variations with their root (for example, replacing 'gave' and 'gives' with 'give'); more sophisticated than stemming but addresses the same issue (Welbers 2016)
  • Stemming: Technique used to reduce words to their root form by removing their endings (e.g., searching for hospital* to retrieve records containing the words hospital, hospitalized, hospitalised, hospitals, etc.)
  • Specificity: In search filter development and diagnostics, the proportion of actual negatives that are correctly identified (true negatives divided by the sum of true negatives and false positives); more false positives lower both specificity and precision, but the two measures are calculated differently
  • TF-IDF (Term Frequency-Inverse Document Frequency): Weighting scheme used in word frequency analysis for storing text as weighted vectors; words with high frequency receive high weight unless they also have a high document frequency (e.g., stop words); "for high document frequency words, the competing effects cancel each other and give the word a low weight"
  • Categorization: Text categorization is the assignment of labels, typically from a predefined set, to a text document; one approach is based on hand coding, another on machine learning (two types of machine learning approaches: classification and clustering)
  • Classification: Categorization of documents using supervised machine learning; training set and predetermined labels are provided to train a classifier to correctly assign labels to uncategorized documents (Shatkay 2012)
  • Clustering: Categorization of documents using unsupervised machine learning; goal is to produce clusters of documents that are similar to each other according to some criterion, with different clusters for dissimilar documents (Shatkay 2012)
  • Collocates: A word’s collocates are words that appear next to or near it (Glanville 2016) 
  • Geographical Analysis: Using mapping tools along with text analysis to plot terms in space
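
Several of the glossary terms above describe small calculations, and seeing them in code may help. The following is a minimal sketch in plain Python, not a reference implementation: the function names and the toy documents are assumptions made for illustration, and the TF-IDF weight used here (term frequency times the log of total documents over document frequency) is only one common variant. Text analysis packages often apply different smoothing or scaling.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """N-grams: return the contiguous n-token sequences in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(docs):
    """TF-IDF: weight each term in each tokenized document by
    tf * log(N / df), where tf is the term's share of the document,
    N is the number of documents, and df is the number of documents
    that contain the term."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return weights


def specificity(tn, fp):
    """Specificity: true negatives / (true negatives + false positives)."""
    return tn / (tn + fp)


def precision(tp, fp):
    """Precision: true positives / (true positives + false positives)."""
    return tp / (tp + fp)


# Toy corpus, already tokenized (hypothetical example data).
docs = [
    ["the", "hospital", "admitted", "the", "patient"],
    ["the", "patient", "left", "the", "clinic"],
    ["the", "clinic", "closed"],
]

print(ngrams(docs[0], 2))
weights = tf_idf(docs)
print(weights[0])
print(specificity(tn=90, fp=10), precision(tp=40, fp=10))
```

In this toy corpus "the" appears in every document, so its weight is zero even though it is the most frequent word, which is the cancelling effect the TF-IDF entry describes. The last line shows that the same number of false positives affects specificity and precision differently because the two measures use different denominators.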