Random Facts about the Dataset
Habits of AOL Users by Percentage
29% searched for locations in the United States
at least 13% pasted at least one link into the search field by mistake
at least 6% searched for child porn
at least 3.5% searched for other porn
928 users, 0.14% of all recorded users, searched for alcoholics anonymous
Constraints
657426 unique users recorded
maximum number of recorded queries of a single user 279430
minimum number of recorded queries of a single user 1
average number of queries per user 55
median number of queries per user 17 (about half of the users were recorded with more than 17 queries)
number of users with only one recorded query 56959 (8.7%)
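Figures like the ones listed above can be derived from per-user query counts. The following is a minimal sketch, not the method actually used for this page: the `records` list is a made-up stand-in for the log (one `(user_id, query)` pair per recorded query), and `summarize` is a hypothetical helper name.

```python
from collections import Counter
from statistics import mean, median

# Made-up (user_id, query) records; assumed shape: one entry per recorded query.
records = [
    (1, "georgia map"), (1, "georgia map"), (1, "atlanta hotels"),
    (2, "weather"),
    (3, "news"), (3, "news"), (3, "sports"), (3, "recipes"), (3, "cars"),
]

def summarize(records):
    """Return per-user query-count statistics for (user_id, query) records."""
    # Count how many queries were recorded for each user.
    per_user = Counter(user_id for user_id, _ in records)
    counts = list(per_user.values())
    single = sum(1 for c in counts if c == 1)
    return {
        "users": len(counts),
        "max": max(counts),
        "min": min(counts),
        "mean": mean(counts),
        "median": median(counts),
        "single_query_users": single,
        "single_query_share": single / len(counts),
    }

stats = summarize(records)
```

Repeated queries are counted every time they occur here, which matches the "recorded queries" figures rather than the unique ones.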
What about the Skewness?
Why is the median so much lower than the mean and what does it mean? And what does mean mean? The mean is
the average number of queries, that is, all the queries summed up and divided by the number of users.
As shown here, there is one user who accounts for
roughly 280,000 recorded queries, which is clearly a lot more than 55. This figure is also two orders of magnitude larger than
the number of queries of the user with the second-highest count.
Many of the other AOL accounts appear to be shared accounts, too. Records like these
skew the distribution of query counts. The following illustration shows how many users had more than
a given number of queries recorded. Note two things: First, the counts are cumulative, so users with more than two queries were also counted
as users with more than one query, and second, the y scale is logarithmic.
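The pull of a single heavy account on the mean can be sketched with a toy example. The `typical` counts below are invented for illustration; only the value 279430 comes from the figures above.

```python
from statistics import mean, median

# Invented per-user query counts, not taken from the dataset.
typical = [1, 5, 12, 17, 17, 20, 30, 60]
# Same users plus one account with the maximum count reported above.
with_outlier = typical + [279430]

def center(counts):
    """Return (mean, median) of a list of per-user query counts."""
    return mean(counts), median(counts)

mean_before, median_before = center(typical)
mean_after, median_after = center(with_outlier)

print(f"without outlier: mean={mean_before:.2f}, median={median_before}")
print(f"with outlier:    mean={mean_after:.2f}, median={median_after}")
```

In this toy sample the median stays at 17 while the mean jumps by three orders of magnitude, which is the same effect, in exaggerated form, that separates the mean of 55 from the median of 17.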
The median, on the other hand, is the number of queries which has roughly an equal number of users above it
and below it. In a sense, the median represents the typical number of queries that were recorded for any of the
roughly 657,000 users. The following graph shows the same information, on a linear scale, for numbers of queries around the median, up to 30.
Unsurprisingly, all users had more than no queries recorded for them. The green line is at 50% of the total number of
users. The blue line shows that only 77% of all users had at least 5 queries recorded for them.
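The curve in both graphs is a cumulative count of users per threshold. A minimal sketch of how such a curve could be computed follows, assuming a list of per-user query counts; the sample values and the function names are made up, and "at least n" is used as the threshold convention.

```python
def users_with_at_least(counts, n):
    """Number of users with at least n recorded queries."""
    return sum(1 for c in counts if c >= n)

def share_with_at_least(counts, n):
    """Fraction of all users with at least n recorded queries."""
    return users_with_at_least(counts, n) / len(counts)

# Invented per-user query counts, for illustration only.
counts = [1, 1, 3, 5, 8, 17, 17, 40, 454, 279430]

# One point of the curve per threshold.
curve = {n: users_with_at_least(counts, n) for n in (1, 5, 17, 454)}
```

Because every user with at least 5 queries also has at least 1, the curve can only fall or stay level as the threshold grows.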
Implications
One of the implications is that any research that requires more than 17 non-unique searches per user excludes more than
half of the users (roughly 9000 of whom were recorded with exactly 17 searches). Suppose you want
to build profiles. The famous user 4417749 had 454 of her queries
recorded. The plot above shows that only about 10,000 users had at least this many queries recorded. Even going down
to a minimum of 200 queries only results in a four-fold increase of users to evaluate. Additional constraints
further limit this figure. For example, only about 800 users had at least 454 queries recorded and at the same time entered
queries that are as relevant to US states as those of user 4417749, who searched a lot for Georgia. Lowering these limits
again results in only about four times as many users.
These numbers may still sound like a lot, until you realize that they amount to, with optimistic limits, only about 6 percent
of the total number of unique users whose search tracks were made available by AOL.
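The 6 percent figure follows from the numbers quoted above. As a quick check, using the approximate values from the text (about 10,000 users at the strict limit, a four-fold increase at the relaxed one); the constant and function names are only illustrative:

```python
TOTAL_USERS = 657426         # unique users recorded
USERS_AT_LEAST_454 = 10_000  # approximate, read off the plot
RELAXATION_FACTOR = 4        # "four-fold increase" at a minimum of 200 queries

def share_of_users(n_users, total=TOTAL_USERS):
    """Percentage of all recorded users that n_users represents."""
    return 100 * n_users / total

strict = share_of_users(USERS_AT_LEAST_454)
relaxed = share_of_users(USERS_AT_LEAST_454 * RELAXATION_FACTOR)

print(f"strict limit:  {strict:.1f}% of all users")
print(f"relaxed limit: {relaxed:.1f}% of all users")
```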
Looking at Unique Queries
maximum number of unique queries of a single user 216117
average number of unique queries per user 32
median number of unique queries per user 12
number of users with only one unique query 73770 (11.2%)
In the AOL dataset, a query is recorded again when the user clicked the "next page" button or clicked a link (because
exit links were recorded). This means that many repeated queries actually stem from a single search, which must be taken
into account in some analyses.
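Counting unique queries per user amounts to collapsing repeated query strings. The sketch below assumes `(user_id, query)` rows; the rows, the function name, and the optional normalization (trimming whitespace and lowercasing) are illustrative assumptions, not a description of how the figures above were produced.

```python
from collections import defaultdict

# Invented (user_id, query) rows; the repeats mimic "next page" requests and clicks.
rows = [
    (7, "cheap flights"), (7, "cheap flights"), (7, "cheap flights"),
    (7, "hotels in georgia"),
    (8, "Weather"), (8, "weather "),
]

def unique_query_counts(rows, normalize=True):
    """Map each user_id to the number of distinct query strings recorded."""
    seen = defaultdict(set)
    for user_id, query in rows:
        # Normalization is an assumption: it treats "Weather" and "weather " as one query.
        key = query.strip().lower() if normalize else query
        seen[user_id].add(key)
    return {user_id: len(queries) for user_id, queries in seen.items()}

normalized = unique_query_counts(rows)
exact = unique_query_counts(rows, normalize=False)
```

Whether to normalize depends on the analysis; exact string matching is the more conservative choice and yields higher unique counts.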
Interestingly, the number of users with more than the 239 unique queries of user 4417749 turns out to be
nearly identical to the corresponding number from the previous results. This is also true for the other comparisons made above. Read more about this
here.