Building a corpus is complex, so researchers just getting started with text mining may want to rely on corpora that others have compiled and rigorously validated. The corpora listed on this page were built with particular goals or research questions in mind, and are likely to provide solid foundations for a first text mining project. Some are very large and can be broken into subsets useful for answering questions about narrower time spans, genres, or geographic regions. Researchers interested in building entirely new corpora should look at Sources of Text Data instead.
Penn Libraries is committed to providing research support and services for text and data mining (TDM) projects to students and researchers across campus. To keep these services accessible to a large audience, we limit their scope.
We can provide
We generally cannot provide
Depending on vendor policies and our own support capacity, we may be able to provide some of the above services on a case-by-case basis, but we can make no guarantees. These limits have been adapted from the Northeastern University TDM guide.