
Text Analysis at Penn Libraries

A guide to text mining tools and methods

TDM Glossary

Text Analysis & Natural Language Processing Terminology

Confused about certain words in the context of text analysis? Read on to learn more about text and data mining (TDM) methods and techniques.

Missing any terms? Please let us know at LibraryRDDS@pobox.upenn.edu

A glossary of Text Analysis terms that you may come across most frequently

  • API (Application Programming Interface): Software intermediary that allows two applications to talk to each other; in our case, it is used to access the features or data of an operating system, application, or other service.
  • Lemmatization: Identifying the base form of a word, such as "run" for runs, ran, and running
  • Named Entity Recognition: Identifying proper names in a corpus
  • Natural Language Processing: Ability of a machine or program to understand human text or speech
  • N-grams: Contiguous sequences of n items (syllables, letters, words, etc.) in a sample of text; in computational linguistics, n-gram counts are the basis of probabilistic models that estimate which sequences can be expected
  • Parts of Speech Tagging: Identifying the syntactic role of a word
  • Stylometric Analysis: The quantitative study of literary style based on the observation that authors tend to write in relatively consistent, recognizable and unique ways.
  • Relation Extraction: Identifying the relationships between entities such as "daughter of" or "town in ? state"
  • Stemming: Processing rules to identify the base form of a word
  • Tokenization: Process of separating a string of characters into tokens, which may be words, phrases, or sentences. Punctuation is often removed in the process.
  • Text preprocessing: Cleaning, normalizing, and preparing text data for analysis by removing stopwords, punctuation, extra whitespace, etc.
  • Topic modeling: Coding texts into meaningful categories
  • Lemmatization: Application of a dictionary that allows a system to consider variations of a term by using the dictionary entries to normalize words by replacing morphological variations with their root (for example, replacing 'gave' and 'give' with 'give'); more sophisticated than stemming but addresses the same issue (Welbers 2016)
  • Stemming: Technique used to reduce words to their root form by removing their endings (e.g., searching for hospital* to retrieve records containing the words hospital, hospitalized, hospitalised, hospitals, etc.)
  • Specificity: In search filter development and diagnostics, refers to the percentage of true negatives (true negatives divided by the sum of true negatives and false positives); the more false positives, the worse the specificity and precision are, but these two measures are calculated differently
  • TF-IDF (Term Frequency-Inverse Document Frequency): Weighting scheme used in word frequency analysis for storing text as weighted vectors; words with high frequency receive high weight unless they also have a high document frequency (e.g., stop words); "for high document frequency words, the competing effects cancel each other and give the word a low weight"
  • Categorization: Text categorization is the assignment of labels, typically from a predefined set, to a text document; one approach is based on hand coding, another on machine learning (two types of machine learning approaches: classification and clustering)
  • Classification: Categorization of documents using supervised machine learning; training set and predetermined labels are provided to train a classifier to correctly assign labels to uncategorized documents (Shatkay 2012)
  • Clustering: Categorization of documents using unsupervised machine learning; goal is to produce clusters of documents that are similar to each other according to some criterion, with different clusters for dissimilar documents (Shatkay 2012)
  • Collocates: A word’s collocates are words that appear next to or near it (Glanville 2016) 
  • Geographical Analysis: Using mapping tools along with text analysis to plot terms in space
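
To make the Tokenization and N-grams entries above more concrete, here is a minimal Python sketch. It is one simple way to do it, not the method of any particular tool: the function names `tokenize` and `ngrams` and the sample sentence are made up for illustration, and the tokenizer assumes plain English text where lowercasing and dropping punctuation are acceptable.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())

def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical sample text, used only for illustration
tokens = tokenize("The cat sat on the mat. The cat slept.")
bigrams = ngrams(tokens, 2)

# Counting bigrams shows which word pairs recur in the sample
bigram_counts = Counter(bigrams)
```

Counting the bigrams in this way is also a rough starting point for finding collocates, since frequently repeated pairs are words that tend to appear next to each other.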
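
The TF-IDF entry above describes two competing effects: frequency within a document raises a word's weight, while appearing in many documents lowers it. The sketch below shows one common textbook variant (relative term frequency multiplied by the natural log of the inverse document frequency). Other tools use different smoothing and normalization, so the exact numbers will differ; the function name `tf_idf` and the three tiny documents are assumptions for illustration only.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs is a list of token lists; returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            # Relative term frequency times log inverse document frequency
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

# Hypothetical, already tokenized documents
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]
weights = tf_idf(docs)
```

In this sample, "the" occurs in every document, so its inverse document frequency is log(3/3) = 0 and its weight is zero, while "sat" occurs in only one document and receives the highest weight in the first document.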
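
The Specificity entry above notes that specificity and precision both get worse as false positives increase but are calculated differently. A short sketch of the two formulas follows; the counts are invented purely to show the arithmetic and do not come from any real search filter.

```python
def specificity(true_negatives, false_positives):
    """Share of truly irrelevant records that were correctly excluded."""
    return true_negatives / (true_negatives + false_positives)

def precision(true_positives, false_positives):
    """Share of retrieved records that are actually relevant."""
    return true_positives / (true_positives + false_positives)

# Hypothetical counts for a search filter evaluation
tp, fp, tn = 40, 10, 90

spec = specificity(tn, fp)   # 90 / (90 + 10)
prec = precision(tp, fp)     # 40 / (40 + 10)
```

Both measures share the false positive count in the denominator, which is why more false positives lower both, but specificity is anchored on true negatives and precision on true positives.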