We are evaluating Drill for making interactive SQL queries against customer 
sales transaction data. Many of our queries involve computing "penetration" 
numbers: count of unique customers, count of unique baskets, count of unique 
stores, etc. So far, using Drill for aggregations involving COUNT, SUM, ... 
gives acceptable query execution times. But when we include COUNT(DISTINCT <column>) 
in our queries, execution times go from about 1 second to many minutes!

Has anyone written a user-defined aggregate function for approximate 
distinct counting? We think a Bloom filter will serve our needs best.
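
To make the request concrete, here is a minimal Python sketch of the kind of 
Bloom-filter distinct-count estimate we have in mind. It is illustrative only: 
the class name, the bit-array size and the number of hashes are assumptions 
picked for the example, and this is not a Drill UDF (a real Drill aggregate 
function would have to be written in Java). It estimates the number of 
distinct values from the count of set bits X as n ~= -(m / k) * ln(1 - X / m).

```python
import hashlib
import math


class BloomCounter:
    """Hypothetical helper: a Bloom filter used only to estimate distinct counts."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.m = num_bits        # total bits in the filter (assumed size)
        self.k = num_hashes      # bit positions set per value (assumed)
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, value):
        # Derive k bit positions from a single digest via double hashing.
        digest = hashlib.sha256(str(value).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, value):
        # Adding the same value again sets the same bits, so duplicates are free.
        for pos in self._positions(value):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def estimate(self):
        # n ~= -(m / k) * ln(1 - X / m), where X is the number of set bits.
        set_bits = sum(bin(b).count("1") for b in self.bits)
        if set_bits >= self.m:
            return float("inf")  # filter saturated, no usable estimate
        return -(self.m / self.k) * math.log(1.0 - set_bits / self.m)


if __name__ == "__main__":
    counter = BloomCounter()
    # Made-up customer ids, each seen three times.
    for _ in range(3):
        for cid in range(1000):
            counter.add(f"customer-{cid}")
    print(round(counter.estimate()))
```

The estimate degrades as the filter fills up, so the bit-array size would need 
to be chosen with the expected number of distinct customers in mind.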


- Mike