Complexity Explorer Santa Fe Institute

Foundations & Applications of Humanities Analytics (fall 2021)

8.1 Case Study: Capitalism & Democracy » Test Your Knowledge part 1: Explanations

Q1. How do we operationalize "attention to a concept" in a certain year?

A.  The cumulative amount of time people spend thinking about the concept that year.
B.  The number of times the word appears in the newspaper that year.
C.  The number of articles that contain the word that appear in the newspaper that year.
D.  The fraction of articles that contain the word that appear in the newspaper that year.

Correct Answer: (D)   One way to think about this quantity is that it corresponds (roughly) to "the probability that a person, opening the newspaper at random, will see an article containing the word." (C) is close, but not correct: a raw count of articles containing the word is sensitive to the total number of articles published that year, and that total may change dramatically with the publication schedule. For example, imagine that the newspaper starts printing a Sunday section: the number of articles containing the word might increase by 13%, but at the same time, the total number of articles might also increase by 13%. It would be unfounded to say that people are paying more attention to the word/concept when the count has not been related to anything else. (B) is also close, but not correct; it is essentially answer (C), except that it counts occurrences of the word rather than articles containing it. Like (C), and unlike the correct answer (D), (B) does not relate its count to anything else. Choosing (D) as the strategy for the study presented also assumes that the "article level", rather than the "word level", is the right level of analysis. Finally, (A) would be a lovely thing to have, but we cannot measure it directly, and so it is not an operationalization.
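The Sunday-section example can be made concrete with a small sketch. The counts below are invented purely for illustration (they are not figures from the lecture or from the newspaper), and the function name `attention_fraction` is a hypothetical helper, not something from the course materials.

```python
def attention_fraction(articles_with_word, total_articles):
    """Fraction of a year's articles that contain the word (answer D)."""
    if total_articles <= 0:
        raise ValueError("total_articles must be positive")
    return articles_with_word / total_articles


# Hypothetical counts, invented for illustration only.
before = {"with_word": 1000, "total": 50000}
# A new Sunday section raises both counts by 13%.
after = {"with_word": 1130, "total": 56500}

# Answer (C): the raw article count rises by about 13%.
raw_change = after["with_word"] / before["with_word"] - 1

# Answer (D): the fraction of articles is the same in both years.
frac_before = attention_fraction(before["with_word"], before["total"])
frac_after = attention_fraction(after["with_word"], after["total"])

print(f"raw count change: {raw_change:.0%}")
print(f"fraction before: {frac_before:.3f}, after: {frac_after:.3f}")
```

Under these assumed numbers the raw count suggests a surge in attention, while the fraction shows that nothing has changed relative to the size of the newspaper.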


Q2. What was the "and" and the "the" stuff discussed in the lecture [11:25]?

A. For some reason, the New York Times doesn't want to tell us how many articles are in the database for a particular year, so we cheat the system by searching for a common word, assuming that the number of times we find that common word corresponds to the total number of articles that year.
B. Looking at the weird articles that contain "and" but not "the", or vice versa, helps us understand the archive and spot weird complications in the newspaper's corpus over time.
C. Looking at the different counts for "and" and "the" gives us a sense of the kinds of uncertainty in measuring our operationalization.
D. All of the above.

Correct Answer: (D)   Yes! We found a way to trick the New York Times search engine, but are we actually winning? In testing out the trick (by trying different "very common words" such as "and" and "the"), we learn about the quantitative errors (subtle differences between the resulting fractions), and as a bonus we learn useful idiosyncrasies of the system. Idiosyncrasies really matter: as any historian will tell you, 99% of history is just idiosyncrasies glued together.
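One way to see the uncertainty that option (C) refers to is to compute the same fraction with each common word as the stand-in for "all articles" and compare the results. This is a minimal sketch under stated assumptions: the hit counts are invented, and `fraction_estimates` is a hypothetical helper rather than anything provided by the New York Times search engine or the course.

```python
def fraction_estimates(hits_for_word, hits_for_common_words):
    """Return one attention fraction per common-word baseline.

    Each common word's hit count is treated as an estimate of the
    total number of articles published that year.
    """
    return {
        common: hits_for_word / total
        for common, total in hits_for_common_words.items()
    }


# Hypothetical search-hit counts for a single year (illustration only).
baselines = {"and": 98000, "the": 100000}
estimates = fraction_estimates(2000, baselines)

# The gap between the estimates is a rough gauge of measurement error.
spread = max(estimates.values()) - min(estimates.values())

for common, value in estimates.items():
    print(f'baseline "{common}": {value:.4f}')
print(f"spread between baselines: {spread:.4f}")
```

If the spread is small compared with the year-to-year changes being studied, the choice of common word matters little; if it is large, the articles that contain one common word but not the other are worth inspecting by hand.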