We are evaluating Drill for running interactive SQL queries against customer sales transaction data. Many of our queries compute "penetration" numbers: count of unique customers, count of unique baskets, count of unique stores, etc. So far, aggregations involving COUNT, SUM, and similar functions have given acceptable query execution times. But when we include COUNT(DISTINCT <column>) in our queries, execution times go from about 1 second to many minutes!
Has anyone written a user-defined aggregate function to do approximate distinct counting? We think a Bloom filter will serve our needs best.
- Mike
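
P.S. To make the idea concrete, here is a rough Python sketch of the kind of approximate distinct count we have in mind. It is only an illustration and does not use Drill's UDF interfaces: the class name BloomDistinctCounter, its parameters (num_bits, num_hashes), and the sample customer IDs are all made up for this post. The estimate uses the standard fill-ratio formula for a Bloom filter, n ≈ -(m/k) * ln(1 - X/m), where m is the number of bits, k the number of hash functions, and X the number of bits set.

```python
import hashlib
import math


class BloomDistinctCounter:
    """Hypothetical approximate distinct counter backed by a Bloom filter."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.m = num_bits          # size of the bit array
        self.k = num_hashes        # bit positions set per value
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value):
        # Double hashing: derive k positions from two halves of one digest.
        digest = hashlib.sha256(str(value).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, value):
        # Duplicates set the same bits again, so they do not change the state.
        for pos in self._positions(value):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def merge(self, other):
        # Partial aggregates combine with a bitwise OR of their bit arrays.
        if (self.m, self.k) != (other.m, other.k):
            raise ValueError("filters must share num_bits and num_hashes")
        for i, byte in enumerate(other.bits):
            self.bits[i] |= byte

    def estimate(self):
        # n ~= -(m / k) * ln(1 - X / m), with X = number of set bits.
        set_bits = sum(bin(byte).count("1") for byte in self.bits)
        if set_bits >= self.m:
            return float("inf")  # filter is saturated; no usable estimate
        return -(self.m / self.k) * math.log(1.0 - set_bits / self.m)


# Made-up data: 10,000 distinct customer IDs, each seen three times.
counter = BloomDistinctCounter()
for _ in range(3):
    for customer_id in range(10_000):
        counter.add(f"cust-{customer_id}")
print(round(counter.estimate()))
```

The merge step is what makes this attractive as an aggregate: each fragment could build its own filter and the results could be OR-ed together at the end. The bit array has to be sized for the expected number of distinct values, since accuracy degrades as the filter fills up.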
