The Data Know

Another sign of the web's hidden, untapped power surfaced last week. The New York Times reported that Google is developing methods of analyzing search data to track regional outbreaks of the flu. From the Times:

There is a new common symptom of the flu, in addition to the usual aches, coughs, fevers and sore throats. Turns out a lot of ailing Americans enter phrases like “flu symptoms” into Google and other search engines before they call their doctors.

That simple act, multiplied across millions of keyboards in homes around the country, has given rise to a new early warning system for fast-spreading flu outbreaks, called Google Flu Trends.

The report was another symptom of what is becoming a much bigger story. In June, Wired editor Chris Anderson published a controversial polemic defining what he called the “Petabyte Age.” He wrote:

Sixty years ago, digital computers made information readable. Twenty years ago, the Internet made it reachable. Ten years ago, the first search engine crawlers made it a single database. Now Google and like-minded companies are sifting through the most measured age in history, treating this massive corpus as a laboratory of the human condition. They are the children of the Petabyte Age.

The lesson is that raw digital data, of which there is now an overwhelming abundance, contains information. The challenge is to recognize and interpret the patterns hidden within it.

Anderson’s argument is that this approach could signal the end of the scientific method, which relies on hypothesis, testing, and reproducibility. In the Petabyte Age, he says, researchers are moving away from understanding things in a mechanistic way, and turning to statistics to provide models of how they work, even if the details remain inscrutable.

Reading the Times story reminded me of a recent conversation with Tony Jebara, a computer scientist at Columbia University who specializes in machine learning. His academic work has focused on computer vision and facial recognition; outside the university, he is also chief scientist and co-founder of Sense Networks.

Sense has developed a platform that collects data from GPS, WiFi positioning, cell phones, and RFID in real time. It then analyzes this information to predict things like consumer movements in shopping districts over the course of a day, or which bars in a city are most active on a particular night. The application could become a powerful tool for analyzing how communities function, purely on the basis of some very basic, anonymous information.
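To make the idea concrete, here is a minimal sketch in Python of how anonymous location pings might be turned into a ranking of the busiest venues at a given hour. It is not Sense Networks' method, which the company has not described here. The ping records, the venue names, and the `busiest_venues` function are all invented for illustration, and a real system would have to infer venues from raw GPS or WiFi positions rather than receive them ready-labeled.

```python
from collections import Counter

# Hypothetical anonymous pings: (device_token, venue, hour_of_day).
# All values are made up for illustration; none come from real data.
PINGS = [
    ("a1", "bar_north", 22),
    ("b2", "bar_north", 22),
    ("c3", "bar_south", 22),
    ("b2", "bar_north", 22),  # repeat ping from the same device, same hour
    ("a1", "bar_north", 23),
    ("d4", "bar_south", 23),
    ("e5", "bar_south", 23),
]


def busiest_venues(pings, hour):
    """Rank venues by the number of distinct devices seen during `hour`."""
    # Deduplicate so a device that pings repeatedly counts only once per venue.
    seen = {(token, venue) for token, venue, h in pings if h == hour}
    counts = Counter(venue for _, venue in seen)
    # most_common() returns (venue, count) pairs, highest count first.
    return counts.most_common()


if __name__ == "__main__":
    for hour in (22, 23):
        print(hour, busiest_venues(PINGS, hour))
```

Even this toy version shows the pattern the article describes: nothing in the input says which bar is popular, yet the ranking falls out of simple counts of where anonymous devices happen to be.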

Such work offers a glimpse of the opportunities opened up by this new kind of science, which the networks of the web have made possible. The computer scientists behind Google Flu Trends beat epidemiologists to the punch with a search engine, working from the assumption that the patterns emerging from individuals’ simplest activities mean something. And where questions about how communities behave would once have been answered by an anthropologist, we now have evidence that automated algorithms can reveal group dynamics from the simple facts of people’s most mundane, day-to-day actions.