u500k.erinye.com - the AOL data analyzed

Random Facts about the Dataset

Habits of AOL Users by Percentage

Constraints

What about the Skewness?

Why is the median so much lower than the mean, and what does that tell us? And what does "mean" mean? The mean is the average number of queries, that is, all queries summed up and divided by the number of users. As shown here, a single user accounts for roughly 280,000 recorded queries, which is clearly a lot more than the mean of 55. That count also exceeds the count of the second most active user by two orders of magnitude. Many of the other AOL accounts appear to be shared accounts, too. Records like these skew the distribution of query counts. The following illustration shows how many users had more than a given number of queries recorded. Note two things: first, the counts are cumulative, so users with more than two queries are also counted as users with more than one query; second, the y scale is logarithmic.

[Figure: users and their number of queries (cumulative, logarithmic y scale)]
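The effect of a single heavy account on the mean can be reproduced with a small sketch. The per-user counts below are hypothetical toy values, not taken from the AOL data, and `users_with_more_than` is an illustrative helper name; only the shape of the result matters.

```python
from statistics import mean, median

# Hypothetical per-user query counts; illustrative only, not AOL figures.
query_counts = [1, 3, 5, 8, 17, 17, 20, 40, 90, 280_000]

def users_with_more_than(counts, threshold):
    """Number of users with strictly more than `threshold` recorded queries."""
    return sum(1 for c in counts if c > threshold)

# One extreme account pulls the mean far above the median.
print("mean:  ", mean(query_counts))
print("median:", median(query_counts))

# Cumulative counts: a user with more than two queries is also
# counted among the users with more than one query.
for threshold in (0, 1, 2, 17, 100):
    print(threshold, users_with_more_than(query_counts, threshold))
```

Because the counts are cumulative, the series can never increase as the threshold grows, which is why such a curve only falls from left to right.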

The median, on the other hand, is the number of queries that has roughly as many users above it as below it. In that sense, the median represents the typical number of queries recorded for any of the 658,000 users. The following graph shows the same information on a linear scale for query counts around the median, up to 30. Unsurprisingly, every user had more than zero queries recorded. The green line marks 50% of the total number of users. The blue line shows that only 77% of all users had at least 5 queries recorded.

[Figure: users and their number of queries (linear scale, up to 30 queries)]
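Reading the median off such a plot amounts to finding where the share of users drops below 50%. A minimal sketch follows, again on hypothetical toy counts and with illustrative function names; it assumes a non-empty list in which every user has at least one query.

```python
# Hypothetical per-user query counts; illustrative only, not AOL figures.
query_counts = [1, 3, 5, 8, 17, 17, 20, 40, 90, 280_000]

def share_with_at_least(counts, k):
    """Fraction of users with k or more recorded queries."""
    return sum(1 for c in counts if c >= k) / len(counts)

def largest_k_covering_half(counts):
    """Largest k such that at least half of the users have k or more queries."""
    k = 1
    while share_with_at_least(counts, k + 1) >= 0.5:
        k += 1
    return k

print(share_with_at_least(query_counts, 5))
print(largest_k_covering_half(query_counts))
```

For this toy list the crossing point coincides with the median, which is the role the green 50% line plays in the graph.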

Implications

One implication is that any research requiring more than 17 non-unique searches per user rules out more than half of the users (roughly 9,000 of whom were recorded with exactly 17 searches). Suppose you want to build profiles. The famous user 4417749 had 454 of her queries recorded. The plot above shows that only about 10,000 users had at least this many queries recorded. Even going down to a minimum of 200 queries only results in a four-fold increase in the number of users to evaluate. Additional constraints limit this figure further. For example, only about 800 users had at least 454 queries recorded and at the same time entered queries as relevant to US states as those of user 4417749, who searched a lot for Georgia. Lowering these limits again results in only about four times as many users.

These numbers may still sound like a lot, until you realize that they amount to, with optimistic limits, only about 6 percent of the total number of unique users whose search tracks were made available by AOL.
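The rough size of this share can be checked with the rounded figures quoted above. This is only a sketch: it assumes that the "optimistic limits" correspond to the 200-query minimum, that is, about four times the 10,000 users with at least 454 queries.

```python
# Rounded figures quoted in the text.
total_users = 658_000
users_with_454_or_more = 10_000

# Assumption: the optimistic limit is the 200-query minimum,
# described as a four-fold increase.
users_with_200_or_more = 4 * users_with_454_or_more

share = users_with_200_or_more / total_users
print(f"{share:.1%}")
```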

Looking at Unique Queries

In the AOL dataset, a query was recorded again whenever the user clicked the "next page" button or clicked a link (because exit links were recorded). This means that many repeated queries actually stem from a single search, which must be taken into account in some cases.
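The difference between recorded and unique queries per user can be sketched as follows. The rows are hypothetical `(user, query)` pairs and `count_queries` is an illustrative name; the real log has more fields, which are left out here.

```python
from collections import defaultdict

# Hypothetical log rows: (user id, query text). Not real AOL records.
rows = [
    ("u1", "georgia homes"),
    ("u1", "georgia homes"),  # repeated, e.g. after a "next page" click
    ("u1", "dog food"),
    ("u2", "weather"),
]

def count_queries(rows):
    """Return (recorded queries per user, unique queries per user)."""
    total = defaultdict(int)
    unique = defaultdict(set)
    for user, query in rows:
        total[user] += 1
        unique[user].add(query)
    return dict(total), {user: len(qs) for user, qs in unique.items()}

total, unique = count_queries(rows)
print(total)
print(unique)
```

Treating identical query strings as one search is a simplification: a user may deliberately repeat the same search on another day, so the unique count is a lower bound on the number of distinct search sessions.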

[Figure: users and their number of unique queries]

Interestingly, the numbers of users above the 239 unique queries of user 4417749 turn out to be nearly identical to the previous results. The same is true for the other comparisons made above. Read more about this here.