{"id":26,"date":"2023-02-02T01:02:45","date_gmt":"2023-02-02T01:02:45","guid":{"rendered":"https:\/\/live-digitalscholarship-library-cornell-edu.pantheonsite.io\/?p=26"},"modified":"2026-02-17T18:26:13","modified_gmt":"2026-02-17T18:26:13","slug":"building-your-corpus","status":"publish","type":"post","link":"https:\/\/digitalscholarship.library.cornell.edu\/?p=26","title":{"rendered":"Building your corpus"},"content":{"rendered":"\n<h3 class=\"wp-block-heading\">Understanding your research goals<\/h3>\n\n\n\n<p>In computational&nbsp;text analysis, text is your data, and&nbsp;the text corpus is your dataset. The most important part of any data analysis is knowing the data you are working with: the context in which it was collected, its strengths, its limitations,&nbsp;why it has value&nbsp;and how you relate to it.<\/p>\n\n\n\n<p>Have a healthy amount of skepticism&nbsp;as you move through the iterative process of crafting corpora and performing&nbsp;different text analysis methods. The analysis will always be influenced by what texts you choose to include (or not include)&nbsp;and the perspectives that you and your team bring to the research process.<\/p>\n\n\n\n<p>As you begin to build your corpus, consider:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>What is the main goal of your&nbsp;research? What texts do you&nbsp;anticipate needing for this project?<\/li>\n\n\n\n<li>What kinds of patterns are you interested in exploring, and why?<\/li>\n\n\n\n<li>Whose perspectives are incorporated into this text corpus? 
What historical and social contexts informed the creation of the texts?&nbsp;How might&nbsp;this impact the&nbsp;analysis?<\/li>\n\n\n\n<li>Positionality:&nbsp;How do you&nbsp;(or the research team) relate to the concepts reflected in these texts?<\/li>\n\n\n\n<li>What assumptions do you&nbsp;have about the texts and the computational methodologies you&#8217;d like to use&nbsp;for analysis?<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Further reading<\/h3>\n\n\n\n<ul class=\"wp-block-list\" id=\"s-lg-link-list-67256665\">\n<li><a href=\"https:\/\/towardsdatascience.com\/a-dataset-is-a-worldview-5328216dd44d\">&#8220;A Dataset is a Worldview,&#8221; by Hannah Davis (2020)<\/a>: This brief article outlines helpful ideas for critically framing data as inherently subjective.<\/li>\n\n\n\n<li><a href=\"https:\/\/www.youtube.com\/watch?v=X5eAULsBm0w&amp;t=43s\">Text Analysis: A Walking Tour of What People are Using in Digital Humanities Right Now<\/a>: Miriam Posner explores commonly used text analysis methodologies and cautions that the nuanced relationships between words may not always be apparent with distant reading approaches.<\/li>\n<\/ul>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-28f84493 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:25%\">\n<figure class=\"wp-block-image aligncenter is-resized\"><img decoding=\"async\" width=\"125\" height=\"188\" src=\"https:\/\/live-digitalscholarship-library-cornell-edu.pantheonsite.io\/wp-content\/uploads\/2023\/12\/dig1-e1701894583255.jpeg\" alt=\"The Digital Humanities Coursebook by Johanna Drucker\" class=\"wp-image-28\" style=\"width:152px;height:auto\"\/><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:75%\">\n<p>The Digital Humanities Coursebook&nbsp;by&nbsp;Johanna Drucker<br>ISBN: 
9781003106531<br>Publication Date: 2021-03-24<br>Chapter 7: &#8220;Data Mining and Analysis&#8221; provides an overview of key concepts and histories in computational text analysis methodologies. It introduces critical approaches to thinking about social implications of data. It also features several exercises for text analysts of all skill levels.<\/p>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-28f84493 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:100%\">\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-28f84493 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:25%\">\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img decoding=\"async\" width=\"125\" height=\"179\" src=\"https:\/\/live-digitalscholarship-library-cornell-edu.pantheonsite.io\/wp-content\/uploads\/2023\/12\/dig2-e1701894601292.jpeg\" alt=\"The Digital Black Atlantic by Roopika Risam (Editor); Kelly Baker Josephs (Editor)\" class=\"wp-image-27\" style=\"width:157px;height:auto\"\/><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:66.66%\">\n<p>The Digital Black Atlantic&nbsp;by&nbsp;Roopika Risam (Editor); <br>Kelly Baker Josephs (Editor)<br>ISBN: 9781452965307<br>Publication Date: 2021-03-16<br>Chapter 7: &#8220;Text Analysis for Thought in the Black Atlantic&#8221; assesses the limitations of text analysis methodologies as well as assumptions in understanding the meaning of words, centering perspectives from digital African diaspora studies.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Understanding your research goals In computational&nbsp;text analysis, text is your data, and&nbsp;the text corpus is 
your dataset. The most important part of any data analysis is knowing the data you are working with: the context in which it was collected, its strengths, its limitations,&nbsp;why it has value&nbsp;and how you relate to it. Have a healthy [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_exactmetrics_skip_tracking":false,"_exactmetrics_sitenote_active":false,"_exactmetrics_sitenote_note":"","_exactmetrics_sitenote_category":0,"footnotes":""},"categories":[3],"tags":[],"class_list":["post-26","post","type-post","status-publish","format-standard","hentry","category-text-as-data"],"_links":{"self":[{"href":"https:\/\/digitalscholarship.library.cornell.edu\/index.php?rest_route=\/wp\/v2\/posts\/26","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/digitalscholarship.library.cornell.edu\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/digitalscholarship.library.cornell.edu\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/digitalscholarship.library.cornell.edu\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/digitalscholarship.library.cornell.edu\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=26"}],"version-history":[{"count":9,"href":"https:\/\/digitalscholarship.library.cornell.edu\/index.php?rest_route=\/wp\/v2\/posts\/26\/revisions"}],"predecessor-version":[{"id":488,"href":"https:\/\/digitalscholarship.library.cornell.edu\/index.php?rest_route=\/wp\/v2\/posts\/26\/revisions\/488"}],"wp:attachment":[{"href":"https:\/\/digitalscholarship.library.cornell.edu\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=26"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/digitalscholarship.library.cornell.edu\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=26"},{"taxonomy":"post_tag","embeddable":true,"href":
"https:\/\/digitalscholarship.library.cornell.edu\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=26"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}