We had a similar challenge and we dealt with it by sampling based on the user id.

We have a unique id which is a random hexadecimal format- for instance A12890900.

The query is running  only on users that have an id that ends with 00.

At the end we multiply by 256 and get a pretty close number to the real number.



Itai


On 01/14/2011 01:14 PM, Radek Maciaszek wrote:

Hi,

I am working on some large scale unique users analysis (think hundreds of millions of records per day). Since number of all records per month goes into many billions I am hoping that there may be some alternative to running "SELECT DISTINCT user_unique_id..." such as sampling data or perhaps deriving monthly numbers based on daily numbers of unique users with a use of statistical linear regression or a similar technique.

I tried to use sampling but that does not seem to give me the results I would expect. That is, for example if I sample every 1/100 record I would expect to see about 1% of the total number of unique users from a given period of time but instead I am seeing bigger numbers, for example something like 5%. I am not sure if the law of large number applies to unique users analysis.

Moreover it seems that "select distinct" does not scale linearly (which is understandable) and so doing monthly unique users analysis takes way longer than 30 x time to perform daily unique users analysis and requires a lot of memory.

I was wondering if anyone had a similar challenge?

Many thanks,
Radek

Reply via email to