DIGITAL SCHOLARSHIP GUIDES

Introduction to text as data

Text data and analysis

Computational text analysis is “the process of deriving information by way of statistical pattern learning” from a body of text, often called a corpus (plural: corpora). Text analysis methods allow us to find patterns in large amounts of text that might not be apparent through close reading alone.
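As a rough illustration of what finding patterns in a corpus can mean in practice, the sketch below counts word frequencies across a tiny corpus in Python. The two sample texts and the function names (tokenize, word_counts) are invented for this example; a real project would load full documents and would probably use a more careful tokenizer.

```python
import re
from collections import Counter

# A tiny, made-up corpus; in practice each entry would be a full document.
corpus = {
    "doc1": "The whale surfaced. The crew watched the whale.",
    "doc2": "A garden in spring: the roses and the bees.",
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def word_counts(documents):
    """Return a Counter of word frequencies summed over all documents."""
    counts = Counter()
    for text in documents.values():
        counts.update(tokenize(text))
    return counts

counts = word_counts(corpus)

# Show the three most frequent words and how often each occurs.
print(counts.most_common(3))
```

Raw counts like these are only a starting point, since very common words such as "the" tend to dominate; many analyses filter them out or weight them down before looking for patterns.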

Miriam Posner provides a helpful overview of common text analysis methods being used in digital humanities settings in the following video:

How to find text data

Examples of text as data include:

  • Books (digital editions or print copies)
  • Newspaper articles 
  • Journal articles
  • Social media content

Some texts are readily available for analysis or download (e.g., digital collections in HathiTrust). Others may need to be scanned and processed with Optical Character Recognition (OCR) software (e.g., physical collections), and others still may need to be collected through web scraping or an API (Application Programming Interface).
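For texts gathered from the web, one early step is usually to separate the readable text from the page markup. The sketch below shows one possible way to do that with Python's standard library. The sample HTML string, the TextExtractor class, and the html_to_text function are made up for illustration; the example does not download anything, and real pages often need more careful handling.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect readable text from HTML, skipping script and style content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    """Return the text found in an HTML string, joined with single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

# Hypothetical page content standing in for a downloaded article.
sample_html = """
<html><head><title>Example</title><style>p {color: red;}</style></head>
<body><h1>Town News</h1><p>The council met on Tuesday.</p>
<script>console.log("ignored");</script></body></html>
"""

text = html_to_text(sample_html)
print(text)
```

Before scraping any site, check its terms of use, and prefer an official API or bulk download where one is offered.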

For any texts that you work with, you should also consider copyright and license restrictions, which vary depending on where you collect the texts. For example, some newspapers and articles available in Cornell University Library’s databases have restrictions on what you may text mine from their collections and how.

Is text analysis right for your project?

If you are interested in finding patterns in a large volume of texts, text analysis may be the right method for your project. If you want to perform a close reading analysis to derive meaning from a large body of texts, you might be better off using your skills to read and manually code the texts. If you are unsure of whether text analysis is right for your project, contact the Digital CoLab for support. 

Note that text analysis is one methodology for exploring a research question. To produce robust research, it is helpful to triangulate the results of any text analysis project with different data sources or methodologies.

Resources on text mining and analysis

Text Analysis with R for Students of Literature (2nd Ed.) by Matthew L. Jockers & Rosamond Thalken
ISBN: 9783030396435
Publication Date: 2020
Text Analysis with R provides a practical introduction to computational text analysis using the open source programming language R. Each chapter builds on its predecessor as readers move from small scale “microanalysis” of single texts to large scale “macroanalysis” of text corpora, and each concludes with a set of practice exercises that reinforce and expand upon the chapter lessons. The book’s focus is on making the technical palatable and making the technical useful and immediately gratifying. Text Analysis with R is written with students and scholars of literature in mind but will be applicable to other humanists and social scientists wishing to extend their methodological toolkit to include quantitative and computational approaches to the study of text.

Text mining: a guidebook for the social sciences by Ignatow, Gabe; Mihalcea, Rada
Call Number: H61.3.I395 2017
ISBN: 9781483369358
Publication Date: 2017
A SAGE Publications Research Methods resource, this work surveys various approaches to text mining from social science and humanities perspectives. It covers the fundamentals of text mining and introduces methods for compiling and analyzing a corpus. Available online and in print editions at Cornell University Library.
