{"id":44,"date":"2023-02-01T01:01:41","date_gmt":"2023-02-01T01:01:41","guid":{"rendered":"https:\/\/live-digitalscholarship-library-cornell-edu.pantheonsite.io\/?p=44"},"modified":"2026-02-17T18:26:08","modified_gmt":"2026-02-17T18:26:08","slug":"text-as-data","status":"publish","type":"post","link":"https:\/\/digitalscholarship.library.cornell.edu\/?p=44","title":{"rendered":"Introduction to text as data"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\">Text data and analysis<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Computational text analysis is \u201c<a href=\"https:\/\/cmu-lib.github.io\/dhlg\/topics\/\">the process of deriving information by way of statistical pattern learning<\/a>\u201d from a body of text, often called a corpus (or corpora, for multiple bodies of text). Text analysis methods allow us to find patterns in large amounts of text that might not be clear to us from close reading alone.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Miriam Posner provides a helpful overview of common text analysis methods used in digital humanities settings in the following video:<\/p>\n\n\n\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<iframe title=\"Text Analysis: A Walking Tour of What People are Using in Digital Humanities Right Now\" width=\"800\" height=\"450\" src=\"https:\/\/www.youtube.com\/embed\/X5eAULsBm0w?start=929&#038;feature=oembed\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe>\n<\/div><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">How to find text data<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Examples of text as data include:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Books (digital editions or print copies)<\/li>\n\n\n\n<li>Newspaper 
articles<\/li>\n\n\n\n<li>Journal articles<\/li>\n\n\n\n<li>Social media content<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Some texts are readily available for analysis or download (e.g., digital collections in\u00a0<a href=\"https:\/\/guides.library.cornell.edu\/text-as-data\/hathitrust\"><strong>HathiTrust<\/strong><\/a>); others may need to be scanned with\u00a0<a href=\"https:\/\/guides.library.cornell.edu\/text-as-data\/ocr\"><strong>Optical Character Recognition<\/strong><\/a>\u00a0software (e.g., physical collections); and others still may need to be collected through web scraping or an\u00a0API\u00a0(Application Programming Interface).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For any texts that you work with, you should also consider\u00a0<a href=\"https:\/\/guides.library.cornell.edu\/text-corpora\/copyright\"><u>copyright &amp; license restrictions<\/u><\/a>, which vary depending on\u00a0where you collect the texts. For example, some newspapers and articles available in\u00a0<a href=\"https:\/\/catalog.library.cornell.edu\/databases\">Cornell University Library\u2019s databases<\/a>\u00a0have restrictions on what you can text mine from their collections and how.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Is text analysis right for your project?<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">If you are interested in finding patterns in a large volume of texts, text analysis may be the right method for your project. If you want to perform a close reading analysis to derive meaning from a large body of texts, you might be better off using your skills to read and&nbsp;<a href=\"https:\/\/methods.sagepub.com\/book\/analyzing-qualitative-data\/n4.xml\">manually code<\/a>&nbsp;the texts. 
If you are unsure of whether text analysis is right for your project,&nbsp;<a href=\"https:\/\/digitalscholarship.library.cornell.edu\/contact\">contact the Digital CoLab<\/a>&nbsp;for support.&nbsp;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Note that text analysis is one methodology for exploring a research question. To produce robust research, it is helpful to triangulate the results of any text analysis project with different data sources or methodologies.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Resources on text mining and analysis<\/h3>\n\n\n\n<div class=\"wp-block-group is-nowrap is-layout-flex wp-container-core-group-is-layout-7387b849 wp-block-group-is-layout-flex\">\n<div class=\"wp-block-group is-nowrap is-layout-flex wp-container-core-group-is-layout-7387b849 wp-block-group-is-layout-flex\">\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-7387b849 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:100%\">\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-7387b849 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:100%\">\n<ul id=\"s-lg-link-list-67256803\" class=\"wp-block-list\">\n<li><a href=\"https:\/\/melaniewalsh.github.io\/Intro-Cultural-Analytics\/welcome.html\">Introduction to Cultural Analytics and Python (Melanie Walsh)<\/a>This resource, built with Jupyter Book and intended to engage learners with no prior programming experience, is a critical deep-dive into learning what text analysis is and how to perform a variety of text mining and analysis techniques. The workbook features Python code snippets and exercises to put skills into practice. 
There are also resources for analyzing texts in non-English languages, including Spanish, Chinese, Russian, Portuguese and Danish.<\/li>\n<\/ul>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-7387b849 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:100%\"><\/div>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-7387b849 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:25%\">\n<div class=\"wp-block-group is-content-justification-center is-nowrap is-layout-flex wp-container-core-group-is-layout-f56f9fcf wp-block-group-is-layout-flex\">\n<figure class=\"wp-block-image size-full is-resized\"><img fetchpriority=\"high\" decoding=\"async\" width=\"265\" height=\"400\" src=\"https:\/\/live-digitalscholarship-library-cornell-edu.pantheonsite.io\/wp-content\/uploads\/2023\/12\/t1.jpeg\" alt=\"\" class=\"wp-image-45\" style=\"width:135px;height:auto\" srcset=\"https:\/\/digitalscholarship.library.cornell.edu\/wp-content\/uploads\/2023\/12\/t1.jpeg 265w, https:\/\/digitalscholarship.library.cornell.edu\/wp-content\/uploads\/2023\/12\/t1-199x300.jpeg 199w\" sizes=\"(max-width: 265px) 100vw, 265px\" \/><\/figure>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:75%\">\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/catalog.library.cornell.edu\/catalog\/12338287\">Text Analysis with R for Students of Literature (2nd Ed.)\u00a0by\u00a0Matthew L. Jockers &amp; Rosamond Thalken<\/a><br>ISBN: 9783030396435<br>Publication Date: 2020<br>Text Analysis with R provides a practical introduction to computational text analysis using the open source programming language R. 
Each chapter builds on its predecessor as readers move from small scale \u201cmicroanalysis\u201d of single texts to large scale \u201cmacroanalysis\u201d of text corpora, and each concludes with a set of practice exercises that reinforce and expand upon the chapter lessons. The book\u2019s focus is on making the technical palatable and making the technical useful and immediately gratifying. Text Analysis with R is written with students and scholars of literature in mind but will be applicable to other humanists and social scientists wishing to extend their methodological toolkit to include quantitative and computational approaches to the study of text.<\/p>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-7387b849 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:25%\">\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img decoding=\"async\" width=\"323\" height=\"400\" src=\"https:\/\/live-digitalscholarship-library-cornell-edu.pantheonsite.io\/wp-content\/uploads\/2023\/12\/t2.jpeg\" alt=\"\" class=\"wp-image-46\" style=\"width:161px;height:auto\" srcset=\"https:\/\/digitalscholarship.library.cornell.edu\/wp-content\/uploads\/2023\/12\/t2.jpeg 323w, https:\/\/digitalscholarship.library.cornell.edu\/wp-content\/uploads\/2023\/12\/t2-242x300.jpeg 242w\" sizes=\"(max-width: 323px) 100vw, 323px\" \/><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:75%\">\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/catalog.library.cornell.edu\/catalog\/13262926\">Text mining: a guidebook for the social sciences\u00a0by\u00a0Ignatow, Gabe; Mihalcea, Rada<\/a><br>Call Number: H61.3.I395 2017<br>ISBN: 9781483369358<br>Publication Date: 2017<br>A SAGE Publications Research Methods resource, this work overviews various approaches to text mining from 
social sciences and humanities disciplinary perspectives. It covers the fundamentals of text mining and introduces methods for compiling and analyzing a corpus. Available online and in print at Cornell University Library.<\/p>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Text data and analysis Computational text analysis is \u201cthe process of deriving information by way of statistical pattern learning\u201d from a body of text, often called a corpus (or corpora, for multiple bodies of text). Text analysis methods allow us to find patterns in large amounts of text that might not be clear to us [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_exactmetrics_skip_tracking":false,"footnotes":""},"categories":[3],"tags":[],"class_list":["post-44","post","type-post","status-publish","format-standard","hentry","category-text-as-data"],"_links":{"self":[{"href":"https:\/\/digitalscholarship.library.cornell.edu\/index.php?rest_route=\/wp\/v2\/posts\/44","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/digitalscholarship.library.cornell.edu\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/digitalscholarship.library.cornell.edu\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/digitalscholarship.library.cornell.edu\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/digitalscholarship.library.cornell.edu\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=44"}],"version-history":[{"count":10,"href":"https:\/\/digitalscholarship.library.cornell.edu\/index.php?rest_route=\/wp\/v2\/posts\/44\/revisions"}],"predecessor-version":[{"id":1411,"href":"https:\/\/digitalscholarship.library.cornell.edu\/index.php?rest_route=\/wp\/v2\/posts\/44\/revisions\/1411"}],"wp:attachment":[{"href":"https:\/\/digitalscholarship.libra
ry.cornell.edu\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=44"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/digitalscholarship.library.cornell.edu\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=44"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/digitalscholarship.library.cornell.edu\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=44"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}