Drill already performs most of this kind of transformation.  If you run
'EXPLAIN PLAN FOR <your count(distinct) query>'
you will see that it first groups on the column and then applies
COUNT(column).  The first-level grouping can be done by either sorting or
hashing, and the choice is configurable through a system option.
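
To make the two grouping strategies concrete, here is a minimal Python
sketch.  It is an illustration only, not Drill's code: the function names
are hypothetical, and it assumes the input to the sort-based variant is
already sorted on the column.  The sort-based variant counts group
boundaries in a single pass, which is essentially the "count the number of
times the ID changes" idea from the original question.

```python
from itertools import groupby


def count_distinct_hash(values):
    """Hash-based grouping: keep one entry per distinct value, then count.

    Memory grows with the number of distinct values.
    """
    groups = set()
    for v in values:
        if v is not None:  # COUNT(DISTINCT col) ignores NULLs
            groups.add(v)
    return len(groups)


def count_distinct_sorted(sorted_values):
    """Sort-based (streaming) grouping over pre-sorted input.

    Each run of equal adjacent values is one group, so counting runs
    counts distinct values using constant extra memory.
    """
    count = 0
    for key, _run in groupby(sorted_values):
        if key is not None:  # skip the NULL group
            count += 1
    return count


if __name__ == "__main__":
    ids = [3, 1, 3, 2, 1, 1, 2]
    print(count_distinct_hash(ids))            # unsorted input is fine
    print(count_distinct_sorted(sorted(ids)))  # input must be sorted first
```

Both functions return the same count for the same data; the difference is
that the hash variant holds every distinct value in memory, while the
sorted variant pays for the sort (or relies on pre-sorted data) and then
streams.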

Aman

On Tue, Apr 7, 2015 at 3:30 AM, Marcin Karpinski <[email protected]>
wrote:

> Hi Guys,
>
> I have a specific use case for Drill in which I'd like to count unique
> values in columns with tens of millions of distinct values. Unfortunately,
> COUNT DISTINCT does not scale well in either time or memory. The idea is to
> sort the data beforehand by the values of that column (let's call it ID),
> to have the row groups split at a new ID boundary, and to extend Drill with
> an alternative version of COUNT that would simply count the number of times
> the ID changes throughout the entire table. This way, counting unique
> values of pre-sorted columns should have complexity comparable to that of
> the regular COUNT operator (a full scan). So, to sum up, I have three
> questions:
>
> 1. Can such a scenario be realized in Drill?
> 2. Can it be done in a modular way (e.g., a dedicated UDAF or an operator),
> without heavy hacking throughout all of Drill?
> 3. How to do it?
>
> Our initial experience with Drill was very good - it's an excellent tool.
> But in order to be able to adopt it, we need to sort out this one central
> issue.
>
> Cheers,
> Marcin
>
