Big Data Becomes a Mirror


Book review of ‘Uncharted,’ by Erez Aiden and Jean-Baptiste Michel, in the New York Times: “Why do English speakers say “drove” rather than “drived”?

As graduate students at the Harvard Program for Evolutionary Dynamics about eight years ago, Erez Aiden and Jean-Baptiste Michel pondered the matter and decided that something like natural selection might be at work. In English, the “-ed” past-tense ending of Proto-Germanic, like a superior life form, drove out the Proto-Indo-European system of indicating tenses by vowel changes. Only the small class of verbs we know as irregular managed to resist.

To test this evolutionary premise, Mr. Aiden and Mr. Michel wound up inventing something they call culturomics, the use of huge amounts of digital information to track changes in language, culture and history. Their quest is the subject of “Uncharted: Big Data as a Lens on Human Culture,” an entertaining tour of the authors’ big-data adventure, whose implications they wildly oversell….

Invigorated by the great verb chase, Mr. Aiden and Mr. Michel went hunting for bigger game. Given a large enough storehouse of words and a fine filter, would it be possible to see cultural change at the micro level, to follow minute fluctuations in human thought processes and activities? Tiny factoids, multiplied endlessly, might assume imposing dimensions.

By chance, Google Books, the megaproject to digitize every page of every book ever printed — all 130 million of them — was starting to roll just as the authors were looking for their next target of inquiry.

Meetings were held, deals were struck and the authors got to it. In 2010, working with Google, they perfected the Ngram Viewer, which takes its name from the computer-science term for a word or phrase. This “robot historian,” as they call it, can search the 30 million volumes already digitized by Google Books and instantly generate a usage-frequency timeline for any word, phrase, date or name, a sort of stock-market graph illustrating the ups and downs of cultural shares over time.

Mr. Aiden, now director of the Center for Genome Architecture at Rice University, and Mr. Michel, who went on to start the data-science company Quantified Labs, play the Ngram Viewer (books.google.com/ngrams) like a Wurlitzer…

The Ngram Viewer delivers the what and the when but not the why. Take the case of specific years. All years get attention as they approach, peak when they arrive, then taper off as succeeding years occupy the attention of the public. Mentions of the year 1872 had declined by half in 1896, a slow fade that took 23 years. The year 1973 completed the same trajectory in less than half the time.

“What caused that change?” the authors ask. “We don’t know. For now, all we have are the naked correlations: what we uncover when we look at collective memory through the digital lens of our new scope.” Someone else is going to have to do the heavy lifting.”
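
The "decline by half" measurement the review mentions can be made concrete with a small sketch. It assumes the measure is the time between the year in which mentions peak and the first later year in which they fall to half that peak or below; the review does not spell out the exact definition. The function name `years_to_half_peak` and the frequencies are invented for illustration and are not Ngram Viewer data.

```python
def years_to_half_peak(series):
    """Given {year: frequency}, return (peak_year, half_year, elapsed).

    half_year is the first year after the peak whose frequency is at or
    below half the peak value. If that never happens in the data,
    half_year and elapsed are None.
    """
    years = sorted(series)
    # Year with the highest frequency (earliest one wins on ties).
    peak_year = max(years, key=lambda y: series[y])
    half = series[peak_year] / 2
    for y in years:
        if y > peak_year and series[y] <= half:
            return peak_year, y, y - peak_year
    return peak_year, None, None


# Toy, invented usage frequencies (arbitrary units), NOT real data.
toy = {1900: 4.0, 1901: 10.0, 1905: 8.0, 1910: 6.0, 1915: 5.0, 1920: 3.0}

peak_year, half_year, elapsed = years_to_half_peak(toy)
print(peak_year, half_year, elapsed)
```

Running the same function over two different series is enough to compare how quickly each one fades, which is the "what and the when" the tool provides; it says nothing about the why.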