Re: Unique users analysis

Itai Hochman Fri, 14 Jan 2011 03:33:19 -0800

We had a similar challenge and we dealt with it by sampling based onthe user id.

We have a unique id which is a random hexadecimal format- for instanceA12890900.


The query is running  only on users that have an id that ends with 00.

At the end we multiply by 256 and get a pretty close number to the realnumber.




Itai


On 01/14/2011 01:14 PM, Radek Maciaszek wrote:

Hi,
I am working on some large scale unique users analysis (think hundredsof millions of records per day). Since number of all records per monthgoes into many billions I am hoping that there may be some alternativeto running "SELECT DISTINCT user_unique_id..." such as sampling dataor perhaps deriving monthly numbers based on daily numbers of uniqueusers with a use of statistical linear regression or a similar technique.
I tried to use sampling but that does not seem to give me the resultsI would expect. That is, for example if I sample every 1/100 record Iwould expect to see about 1% of the total number of unique users froma given period of time but instead I am seeing bigger numbers, forexample something like 5%. I am not sure if the law of large numberapplies to unique users analysis.
Moreover it seems that "select distinct" does not scale linearly(which is understandable) and so doing monthly unique users analysistakes way longer than 30 x time to perform daily unique users analysisand requires a lot of memory.
I was wondering if anyone had a similar challenge?

Many thanks,
Radek

Re: Unique users analysis

Reply via email to