Hi Itai, I did not think about sampling users instead of sampling records, but it makes a much more sense indeed. As it happens my ID is also hexadecimal and so I did exactly what you suggested. In the results my error is less than 1% comparing to observed values!
Many thanks!! Radek On 14 January 2011 11:32, Itai Hochman <[email protected]> wrote: > We had a similar challenge and we dealt with it by sampling based on the > user id. > > We have a unique id which is a random hexadecimal format- for instance > A12890900. > > The query is running only on users that have an id that ends with 00. > > At the end we multiply by 256 and get a pretty close number to the real > number. > > > > Itai > > > > On 01/14/2011 01:14 PM, Radek Maciaszek wrote: > > Hi, >> >> I am working on some large scale unique users analysis (think hundreds of >> millions of records per day). Since number of all records per month goes >> into many billions I am hoping that there may be some alternative to running >> "SELECT DISTINCT user_unique_id..." such as sampling data or perhaps >> deriving monthly numbers based on daily numbers of unique users with a use >> of statistical linear regression or a similar technique. >> >> I tried to use sampling but that does not seem to give me the results I >> would expect. That is, for example if I sample every 1/100 record I would >> expect to see about 1% of the total number of unique users from a given >> period of time but instead I am seeing bigger numbers, for example something >> like 5%. I am not sure if the law of large number applies to unique users >> analysis. >> >> Moreover it seems that "select distinct" does not scale linearly (which is >> understandable) and so doing monthly unique users analysis takes way longer >> than 30 x time to perform daily unique users analysis and requires a lot of >> memory. >> >> I was wondering if anyone had a similar challenge? >> >> Many thanks, >> Radek >> > >
