We had a similar challenge and we dealt with it by sampling based on
the user id.
We have a unique id which is a random hexadecimal format- for instance
A12890900.
The query is running only on users that have an id that ends with 00.
At the end we multiply by 256 and get a pretty close number to the real
number.
Itai
On 01/14/2011 01:14 PM, Radek Maciaszek wrote:
Hi,
I am working on some large scale unique users analysis (think hundreds
of millions of records per day). Since number of all records per month
goes into many billions I am hoping that there may be some alternative
to running "SELECT DISTINCT user_unique_id..." such as sampling data
or perhaps deriving monthly numbers based on daily numbers of unique
users with a use of statistical linear regression or a similar technique.
I tried to use sampling but that does not seem to give me the results
I would expect. That is, for example if I sample every 1/100 record I
would expect to see about 1% of the total number of unique users from
a given period of time but instead I am seeing bigger numbers, for
example something like 5%. I am not sure if the law of large number
applies to unique users analysis.
Moreover it seems that "select distinct" does not scale linearly
(which is understandable) and so doing monthly unique users analysis
takes way longer than 30 x time to perform daily unique users analysis
and requires a lot of memory.
I was wondering if anyone had a similar challenge?
Many thanks,
Radek