u500k.erinye.com - the AOL data analyzed

What's up with this?

In July 2006, AOL released 2 GB of log files from its search engine, covering the searches of about 658,000 US AOL users from March to May 2006. The log files do not contain users' screen names; instead, each user is represented by a uniquely assigned ID number. The search queries are uncensored. This means that while no AOL user can be identified directly, it is in some cases possible to guess the real person behind an ID from what that user searched for. This is the kind of data the US government wanted from Google and which Google refused to provide.
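
To make the structure concrete, here is a minimal sketch of how the queries belonging to one ID could be collected into a history. It assumes the log files are tab-separated with a header row naming the columns AnonID, Query, QueryTime, ItemRank and ClickURL; that layout is an assumption here, and the sample rows are made up for illustration, not taken from the real logs.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in the assumed tab-separated layout; not real log data.
SAMPLE = (
    "AnonID\tQuery\tQueryTime\tItemRank\tClickURL\n"
    "1001\tcheap flights\t2006-03-01 10:15:00\t\t\n"
    "1001\tpizza delivery springfield\t2006-03-02 19:30:00\t1\thttp://example.com\n"
    "1002\tweather\t2006-03-01 08:00:00\t\t\n"
)


def queries_by_id(lines):
    """Group query strings by anonymous ID, preserving file order."""
    histories = defaultdict(list)
    # QUOTE_NONE: treat quote characters in queries as ordinary text.
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        histories[row["AnonID"]].append(row["Query"])
    return dict(histories)


histories = queries_by_id(io.StringIO(SAMPLE))
for anon_id, queries in histories.items():
    print(anon_id, queries)
```

Reading such a history top to bottom is what makes guessing an identity possible in some cases: a single query says little, but all queries of one ID taken together can say a lot.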

AOL put it on the internet for everyone to see and download under the following license:

500k User Session Collection
----------------------------------------------
This collection is distributed for NON-COMMERCIAL RESEARCH USE ONLY. 
Any application of this collection for commercial purposes is STRICTLY PROHIBITED.

... parts snipped ...

Please reference the following publication when using this collection:

G. Pass, A. Chowdhury, C. Torgeson,  "A Picture of Search"  The First 
International Conference on Scalable Information Systems, Hong Kong, June, 
2006.

Copyright (2006) AOL

The full text of the license is also available. The article "A Picture of Search" appears to be available from Dr. Abdur Chowdhury's web page.

The release was noticed in August 2006, which led to the archive being removed from the AOL reSearch web site. However, it had already found its way into the blogosphere.

Meanwhile, AOL has publicly apologized. However, other search data with similar content, though much less of it, remains publicly available on the AOL reSearch test collections web site.

So ... what's up with this?

I've looked at the data from different angles, analyzed some of the query histories, and put together some overview figures.

Conclusion and the Future

  1. It seems very hard to link any user ID to a real person, except in some very special cases.
  2. In many of those special cases, finding out the real person behind the searches won't offer any new information because the searches aren't interesting in any regard.
  3. It seems easy to link a user ID to a vague location, e.g. a state or the nearest major city.

Stay tuned for further updates as I explore new methods of evaluating the data!
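
As an illustration of point 3 above, one simple way to guess a vague location is to count how often place names show up in the queries of a single ID. The sketch below is not the method used for the analysis on this site; the tiny place list, the function name and the sample queries are all hypothetical, and a real attempt would need a proper gazetteer and handling of multi-word names.

```python
from collections import Counter

# Hypothetical, tiny place list for illustration only.
PLACES = {
    "springfield": "Springfield",
    "chicago": "Chicago",
    "illinois": "Illinois",
}


def guess_locations(queries, places=PLACES):
    """Return (place, mention count) pairs, most frequently mentioned first."""
    counts = Counter()
    for query in queries:
        for word in query.lower().split():
            if word in places:
                counts[places[word]] += 1
    return counts.most_common()


# Made-up query history of a single ID.
sample = [
    "pizza delivery springfield",
    "springfield dmv hours",
    "chicago cubs tickets",
    "weather",
]
ranked = guess_locations(sample)
print(ranked)
```

A place that turns up again and again in everyday queries (deliveries, opening hours, local offices) is a plausible guess for where the user lives, while a single mention may just be a travel plan. That is why the result stays vague.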

Other AOL Data Analysis Web Sites