Complexity Explorer Santa Fe Institute

Foundations & Applications of Humanities Analytics (fall 2021)

Lead instructors: David Kinney & Simon DeDeo

This course is no longer in session.

14.1 Application: Blurbs » Chapter 9 Overview

What you will learn in this chapter

You will be walked through an entire miniature project in humanities analytics, from its inception, through the posing of a research question and the choice of methodology, to data collection and the initial analysis of results. The project involves producing two topic models: one of user preferences on the website bookcrossing.com, and one of the marketing blurbs on the back covers of books accessed through the same website. Using these two topic models, we are able to identify a small but comprehensive set of reader types, as well as a small but comprehensive set of blurb types. We are able to evaluate the degree of similarity between different reader types, and the degree of similarity between different blurb types. Finally, we are able to evaluate the extent to which salient marketing blurb types are (and are not) matched to different reader types in the context of the bookcrossing.com dataset.


Key terms to keep in mind


bookcrossing.com   A website where people can “register” books they own as a prelude to leaving them (in cafés, hotels, etc.) for others to pick up. It is the source of our data on what books people possess.


Culture Industry   The network of institutions that collectively produce “popular” culture artifacts like genre fiction, television series, and mass market films. A term borrowed from Theodor Adorno and Max Horkheimer’s 1947 book, where it was used in a pejorative fashion: the culture industry produces the kind of books that a critic like Harold Bloom would not consider “literature” at all.


Experience machines   The idea that a book is valued primarily for the experiences it generates while being read. An informal term from Simon, borrowed from Robert Nozick’s 1974 book, where “experience machine” referred to a complete simulated reality.


Blurb   The text on the back of a book, used as part of the book’s marketing. A blurb might include, for example, an appealing plot summary, quotes from positive reviews, or information about the author.


Taxonomy of Desires   The different kinds of needs that readers might have, and that express themselves in the books they purchase and consume.


Distribution   A (probability) distribution is a list of objects with associated probabilities. For example, a distribution over coin toss outcomes might be “50% heads, 50% tails”. In this lecture, we consider distributions over books; for example, a topic in the palette of needs (see below) is a distribution over the 48,000 books in the dataset, where some books are more likely than others.
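As a small illustration of this definition, a distribution can be written down as a mapping from objects to probabilities that are non-negative and sum to one. The sketch below is only a toy: the book titles, the probabilities, and the helper names `is_distribution` and `draw` are invented for illustration and are not taken from the course materials or the bookcrossing.com data.

```python
import random

# The coin example from the definition: outcomes mapped to probabilities.
coin = {"heads": 0.5, "tails": 0.5}

# Hypothetical toy "need": a distribution over three invented titles.
# (The topics in the lecture range over about 48,000 books instead.)
toy_need = {"Book A": 0.5, "Book B": 0.3, "Book C": 0.2}

def is_distribution(dist, tol=1e-9):
    """Return True if all probabilities are non-negative and sum to one."""
    return all(p >= 0 for p in dist.values()) and abs(sum(dist.values()) - 1.0) < tol

def draw(dist, n, seed=0):
    """Draw n outcomes at random, each chosen with its associated probability."""
    rng = random.Random(seed)
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=n)

sample = draw(toy_need, 100)
```

Saying that “some books are more likely than others” under a topic means that, in a long run of draws like `sample`, the higher-probability titles turn up more often.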


Palette of Needs   A collection of distributions over books, found by a topic model analysis of the book ownership patterns revealed by the bookcrossing.com website. In our analysis, the “palette” has ten distinct distributions, which we interpret as a taxonomy of desires (see above).


Palette of Blurbs   A collection of distributions over words, corresponding to the different types of marketing language a publisher might include on the back of a book.


Topic Model   The formal name for the statistical model used in this lecture, for both the “palette of needs” analysis, and the “palette of blurbs” analysis.


Back-Inference   The lecture describes the topic model for both the palette of needs and the palette of blurbs in a “generative” fashion, meaning that (for example) we describe how it decomposes a reader’s book-registering patterns into a combination of needs, which are distributions over books. The magic of the topic model is its ability to “back infer” the topics (needs) that best describe the data. We say “back infer” because the topic model goes from the data to the underlying model (rather than inferring, for example, what data might be generated by a model we already know).
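The “generative” direction can be sketched in a few lines of Python. This is a toy under stated assumptions, not the model fitted in the lecture: the two needs, the titles, the reader's mixture, and the function names `pick` and `generate_reader` are all hypothetical. It assumes the common topic-model story in which each registered book is produced by first picking a need and then picking a book from that need's distribution.

```python
import random

# Hypothetical toy palette: two "needs", each a distribution over invented titles.
needs = {
    "need_1": {"Mystery X": 0.7, "Thriller Y": 0.3},
    "need_2": {"Romance Z": 0.8, "Memoir W": 0.2},
}

def pick(dist, rng):
    """Draw one outcome from a distribution given as {outcome: probability}."""
    outcomes = list(dist)
    return rng.choices(outcomes, weights=[dist[o] for o in outcomes], k=1)[0]

def generate_reader(mixture, n_books, seed=0):
    """Generative direction: for each book, pick a need, then a book from that need."""
    rng = random.Random(seed)
    books = []
    for _ in range(n_books):
        need = pick(mixture, rng)              # which need drives this registration
        books.append(pick(needs[need], rng))   # which book that need produces
    return books

# A hypothetical reader driven mostly by need_1.
reader_mixture = {"need_1": 0.9, "need_2": 0.1}
registered = generate_reader(reader_mixture, n_books=20)
```

Back-inference runs the other way: given only lists like `registered` for many readers, the topic model estimates both the `needs` and each reader's mixture. That fitting step is not implemented in this sketch.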