2.1 Introduction to Humanities Analytics » Chapter 1 Overview
What you will learn in this chapter
By exploring three case studies in humanities analytics scholarship, (1) you will begin to understand how correlations among words and patterns of words can be used to support a scholarly argument, and (2) you will be able to compare the correlative features used in different studies and why those features were chosen. In the assignment, (3) you will propose how you might apply correlations to your own work to support (or refute) an argument of interest.
Key terms to keep in mind
Signal When a fact X is a signal of a fact Y, we mean simply that knowing X tells us something about Y, or reduces our uncertainty about it. This usage contrasts a little with the standard use, where a signal often indicates some kind of intentionality (X is about Y), or agency (a person uses X deliberately to inform you about Y), or causality (X signals Y only if, for example, X precedes Y in time).
Two examples of how signal is used in the broader sense:
(1) "Zip code is a signal of income." This means that if I know the zip code (postal code) that you live in, I gain some information about your income. I won't necessarily know your income precisely, but it will lead me to refine my beliefs about your income. For example, if your zip code is associated with Greenwich, Connecticut (a fancy part of the East Coast of the USA), I'll consider it more likely than usual that your income is high.
If zip code is a signal of income, it means that "in general" knowledge of zip code helps improve the accuracy of the belief about income; however, signals can be imperfect. An imperfect signal may give only a little information about everyone, or it may work very well in some cases and not at all in others.
(2) "Income is a signal of zip code." Signal relationships are (usually) symmetric: if knowledge of X tells you about Y, then knowledge of Y tells you about X. Knowing that someone has a high income, for example, tells you that they're more likely to live in one of a small number of zip codes usually located in fancy parts of major coastal cities, or in vacation spots nearby.
In information theory (see "information-theoretic" below), this symmetry is enforced precisely; if you measure the strength of the signal that X gives you for Y, it's precisely equal to the strength of the signal that Y gives you for X.
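To make the symmetry concrete, here is a minimal sketch that measures signal strength as mutual information (the standard information-theoretic quantity for this purpose) in both directions. The eight people and their labels are hypothetical toy data invented for illustration, and the function name mutual_information is our own, not part of any library.

```python
from collections import Counter
from math import log2


def mutual_information(pairs):
    """Mutual information, in bits, between the two coordinates of `pairs`."""
    n = len(pairs)
    joint = Counter(pairs)                  # counts of each (x, y) combination
    left = Counter(x for x, _ in pairs)     # counts of each x on its own
    right = Counter(y for _, y in pairs)    # counts of each y on its own
    total = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p(x, y) / (p(x) * p(y)), written in terms of raw counts
        ratio = count * n / (left[x] * right[y])
        total += p_xy * log2(ratio)
    return total


# Hypothetical toy data: (zip code label, income band) for eight people.
people = [
    ("greenwich", "high"), ("greenwich", "high"),
    ("greenwich", "high"), ("greenwich", "low"),
    ("elsewhere", "low"), ("elsewhere", "low"),
    ("elsewhere", "low"), ("elsewhere", "high"),
]

zip_to_income = mutual_information(people)
income_to_zip = mutual_information([(income, z) for z, income in people])

print(f"zip code -> income: {zip_to_income:.4f} bits")
print(f"income -> zip code: {income_to_zip:.4f} bits")
```

The two printed numbers agree: swapping the roles of X and Y does not change the measured strength of the signal. The value is well below one bit, which reflects an imperfect signal in this toy data.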
Prediction X predicts Y when X is a signal of Y. Prediction can be retrodiction, meaning that we might say that X predicts Y even when X comes after Y. In general, we talk about prediction from the point of view of an omniscient observer.
Correlation A correlation is a relationship between two quantities such that the value of one is a signal of the value of the other. It's an example of a signal relationship. The word correlation is used in a wide variety of senses, and often simply means that there is a "co-relation" between the two things in question.
Very often, when people use the word correlated, they mean "linear correlation", which means that the signal has a particular form. A "positive correlation" in this sense means that if X is high (or higher than usual), then Y is high (or higher than usual), and that if X is low (or lower than usual), then Y is low (or lower than usual).
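As a rough illustration of "linear correlation" in this sense, the sketch below computes the Pearson correlation coefficient, the usual measure of linear correlation, by hand. The numbers are hypothetical and chosen only to show the pattern; the variable names are ours.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How X and Y deviate from their "usual" (mean) values together
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / sqrt(spread_x * spread_y)


# Hypothetical measurements for five people.
x_values = [0, 1, 2, 4, 6]
y_values = [20, 25, 40, 55, 80]

r_positive = pearson_r(x_values, y_values)
# Flipping the sign of Y turns "high goes with high" into "high goes with low".
r_negative = pearson_r(x_values, [-y for y in y_values])

print(f"positive correlation: {r_positive:.3f}")
print(f"negative correlation: {r_negative:.3f}")
```

A value near +1 means that when X is higher than usual, Y tends to be higher than usual as well; a value near -1 means the opposite; a value near 0 means there is little linear signal, although a non-linear signal may still exist.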
Signal, correlation, and prediction are all roughly synonymous terms, but tend to be used in different contexts. For example, we tend to talk in terms of signals when there are many different features of a system, and we're interested in their relationships. We tend to talk in terms of correlations when those signals are quantitative in form. And we tend to talk in terms of prediction when there's something in particular we care about especially, and we're interested in how different signals combine to tell us about it.
For example, income, exam performance, zip code, and race are all signals of each other. But we might be very interested in asking how "income, race, and zip code" combine to help us predict performance on an exam.
Information-theoretic As we'll discuss later in the course, pioneering work by Claude Shannon (at AT&T Bell Labs), coupled with insights gleaned from wartime code-breaking, led to the development of a new science of signals, patterns, and prediction, called Information Theory. An information-theoretic account of an archive, then, is an investigation built around the idea of discovering patterns and signals, and understanding the underlying events in terms of how those patterns and signals interact.
Operationalization Turning ideas into something we can measure in a data set. In any study, we are often in the business of taking a "thick", culturally situated and complex idea, and trying to capture some aspect of it in a quantitative form.
To take an example from well before the computer era, you might operationalize membership in a particular social class by tracking visible markers (income, make and model of car, media consumption). What it means to be a member of this or that class is a complex, interpretative matter; but tracking how many times a person has been to the opera is not. You can count the latter, and (the bargain goes) facts about those numbers may illuminate facts about the deeper concepts. For example, counting opera-going might be used to measure how immigrants move up the social class ladder across generations.
Crucially, operationalization is not definition. A good operationalization does not redefine the concept of interest (it does not say "to be a member of the Russian intelligentsia is just to have gone to the opera at least once"). Rather, it makes an argument for why the concept, as best understood, may lead to certain measurable consequences, and why those measurements might provide a signal of the underlying concept.
Operationalizations get pretty sophisticated. For example, a common technique in natural language processing is to operationalize certain semantic concepts (e.g., "synonym") in terms of where words occur (two words that tend to appear in similar contexts are more likely to be synonyms, etc). This is the idea behind word2vec. [For more on operationalization in a natural-language processing context, see the suggested reading for Chapter 1.]
A good operationalization can provide a completely new window onto a long-standing debate. Conversely, a bad operationalization can mislead and misrepresent what's really going on. For a higher-stakes example than opera, consider the way Western culture operationalized "intelligence" as performance on a timed test of geometric reasoning.
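To give a feel for how a semantic idea can be operationalized through word positions, here is a minimal count-based sketch. It is not word2vec itself, which learns its vectors by training a small neural network; it simply counts the words that appear near each word and compares those counts with cosine similarity. The five-sentence corpus, the window size, and the function names are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def context_vectors(sentences, window=2):
    """Map each word to counts of the words seen within `window` positions of it."""
    vectors = {}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            before = tokens[max(0, i - window):i]
            after = tokens[i + 1:i + 1 + window]
            vectors.setdefault(word, Counter()).update(before + after)
    return vectors


def cosine(a, b):
    """Cosine similarity between two count vectors stored as Counters."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = sqrt(sum(value * value for value in a.values()))
    norm_b = sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical toy corpus; a real study would use a large archive.
corpus = [
    "the physician examined the patient carefully",
    "the doctor examined the patient carefully",
    "the physician prescribed a new medicine",
    "the doctor prescribed a new medicine",
    "the opera opened in the old theatre",
]

vectors = context_vectors(corpus)
sim_synonyms = cosine(vectors["physician"], vectors["doctor"])
sim_unrelated = cosine(vectors["physician"], vectors["opera"])

print(f"physician vs doctor: {sim_synonyms:.2f}")
print(f"physician vs opera:  {sim_unrelated:.2f}")
```

In this toy corpus "physician" and "doctor" never appear in the same sentence, yet they score as highly similar because they keep the same company. The unrelated pair still scores above zero because both words sit next to "the"; real pipelines typically down-weight such common words, which is one reason an operationalization needs an argument behind it rather than just a count.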
Hermeneutic circle In traditional humanities scholarship, the hermeneutic circle refers to the way in which we understand some part of a text in terms of our ideas about its overall structure and meaning -- and also, in a cyclic fashion, update our beliefs about the overall structure and meaning of the text in response to particular moments.
You can think of a similar process happening in the case of operationalization. We operationalize a concept by picking out some feature of it -- often a small, contingent feature -- and seeing what we can learn. In a good investigation, however, what we learn by looking at that quantified feature then feeds back to influence how we understand the concept we began with.
We might, for example, learn that our operationalization, which seemed so justified at the start, only seemed justified because we didn't understand the concept well enough. Or, happily, the quantitative results might deepen our understanding of the concept, which, in turn, suggests new ways to refine and improve the operationalization.