Re: Counting large numbers of unique values

Marcin Karpinski Sat, 11 Apr 2015 03:20:08 -0700

Thanks guys for all the answers. I was suspecting that parallelization
could wreck my approach but that was something I wanted to find out. In all
cases, I'm going to experiment more with the sort-based aggregation -
perhaps with enough computing power it will simply by sufficient. If not,
we can experiment with approximate counting - my understanding is that this
should fit the Drill's UDAF architecture. I'm also going to file a bug with
the IndexOutOfBoundsException I got.


Cheers,
Marcin

On Fri, Apr 10, 2015 at 6:06 PM, Jason Altekruse <[email protected]>
wrote:

> I'll open a doc ticket for this, but just a heads up, the UDF interfaces
> changed recently to remove the RecordBatch parameter from the setup method.
> This interface exposed too many system details to UDFs, to get the same
> system state to UDFs we are now exclusively using Injectable types.
>
> The available Injectable types are given and documented here (will also add
> this to the doc ticket):
>
> https://github.com/apache/drill/blob/af7a52beeaed565e243d2d54ba75337ab95b924d/exec/java-exec/src/main/java/org/apache/drill/exec/ops/UdfUtilities.java
>
>
>
> On Fri, Apr 10, 2015 at 7:51 AM, Abdel Hakim Deneche <
> [email protected]>
> wrote:
>
> > Drill Wiki is a good starting place:
> >
> >
> https://cwiki.apache.org/confluence/display/DRILL/Develop+Custom+Functions
> >
> > This page has links to 2 complete code examples of UDFs:
> >
> >
> >
> https://cwiki.apache.org/confluence/display/DRILL/Custom+Function+Interfaces
> >
> > On Thu, Apr 9, 2015 at 9:41 PM, Ted Dunning <[email protected]>
> wrote:
> >
> > > On Thu, Apr 9, 2015 at 1:38 AM, Adam Gilmore <[email protected]>
> > > wrote:
> > >
> > > > Ted - I'd be really interested in doing something like that
> > (approximate
> > > > aggregation results).  This would be very interesting in terms of
> > > standard
> > > > deviation, median, etc.
> > > >
> > >
> > > Yes.  This is a very common need.
> > >
> > > Writing a UDAF is a matter (mostly) of filling in a code generation
> > > template that accounts for the types you want to handle.  As is common,
> > > there is a state update call and a finalize call that gets the final
> > > result.
> > >
> > > Can anybody point Adam to an example?  Last time I tried this, I
> started
> > by
> > > finding the code for sum and worked from that.  Some pointers would
> save
> > > some dead ends relative to the process I went through.
> > >
> > >
> > >
> > >
> > > >
> > > > I know there is another project out there that trades off speed vs
> > > accuracy
> > > > (the name of which escapes me).  If we could easily add something
> onto
> > > > Drill like this, it'd be hugely beneficial.
> > > >
> > > > On Wed, Apr 8, 2015 at 8:25 AM, Ted Dunning <[email protected]>
> > > wrote:
> > > >
> > > > > Marcin,
> > > > >
> > > > > They did comment.  The answer is that the default is to use hashed
> > > > > aggregation (which will be faster when there is lots of memory)
> with
> > > the
> > > > > option to use sort aggregation (which is basically what you were
> > > > > suggesting).
> > > > >
> > > > > Did you mean to suggest that your data is already known to be
> sorted
> > > and
> > > > > thus the sort step should be omitted?
> > > > >
> > > > >
> > > > > On Tue, Apr 7, 2015 at 3:21 PM, Marcin Karpinski <
> > [email protected]
> > > >
> > > > > wrote:
> > > > >
> > > > > > @Jacques, thanks for the information - I'm definitely going to
> > check
> > > > out
> > > > > > that option.
> > > > > >
> > > > > > I'm also curious that none of you guys commented on my original
> > idea
> > > of
> > > > > > counting distinct values by a simple aggregation of pre-sorted
> > data -
> > > > is
> > > > > it
> > > > > > because it doesn't make sense to you guys, or because you think
> > your
> > > > > > suggestions are easier to implement?
> > > > > >
> > > > > > On Tue, Apr 7, 2015 at 5:55 PM, Jacques Nadeau <
> [email protected]
> > >
> > > > > wrote:
> > > > > >
> > > > > > > Two additional notes here:
> > > > > > >
> > > > > > > Drill can actually do an aggregation using either a hash table
> > > based
> > > > > > > aggregation or a sort based aggregation.  By default, generally
> > the
> > > > > hash
> > > > > > > aggregation will be selected first.  However, you can disable
> > hash
> > > > > based
> > > > > > > aggregation if you specifically think that a sort based
> > aggregation
> > > > > will
> > > > > > > perform better for use case.  You can do this by running the
> > > command
> > > > > > ALTER
> > > > > > > SESSION SET `planner.enable_hashagg` = FALSE;
> > > > > > >
> > > > > > > We have always had it on our roadmap to implement an
> approximate
> > > > count
> > > > > > > distinct function but haven't gotten to it yet.  As Ted
> mentions,
> > > > using
> > > > > > > this technique would substantially reduce data shuffling and
> > could
> > > be
> > > > > > done
> > > > > > > with a moderate level of effort since our UDAF interface is
> > > > pluggable.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Tue, Apr 7, 2015 at 8:20 AM, Ted Dunning <
> > [email protected]
> > > >
> > > > > > wrote:
> > > > > > >
> > > > > > > > How precise do your counts need to be?  Can you accept a
> > fraction
> > > > of
> > > > > a
> > > > > > > > percent statistical error?
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Tue, Apr 7, 2015 at 8:11 AM, Aman Sinha <
> > [email protected]>
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > Drill already does most of this type of transformation.  If
> > you
> > > > do
> > > > > an
> > > > > > > > > 'EXPLAIN PLAN FOR <your count(distinct) query>'
> > > > > > > > > you will see that it first does a grouping on the column
> and
> > > then
> > > > > > > applies
> > > > > > > > > the COUNT(column).  The first level grouping can be done
> > either
> > > > > based
> > > > > > > on
> > > > > > > > > sorting or hashing and this is configurable through a
> system
> > > > > option.
> > > > > > > > >
> > > > > > > > > Aman
> > > > > > > > >
> > > > > > > > > On Tue, Apr 7, 2015 at 3:30 AM, Marcin Karpinski <
> > > > > > [email protected]
> > > > > > > >
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi Guys,
> > > > > > > > > >
> > > > > > > > > > I have a specific use case for Drill, in which I'd like
> to
> > be
> > > > > able
> > > > > > to
> > > > > > > > > count
> > > > > > > > > > unique values in columns with tens millions of distinct
> > > values.
> > > > > The
> > > > > > > > COUNT
> > > > > > > > > > DISTINCT method, unfortunately, does not scale both time-
> > and
> > > > > > > > memory-wise
> > > > > > > > > > and the idea is to sort the data beforehand by the values
> > of
> > > > that
> > > > > > > > column
> > > > > > > > > > (let's call it ID), to have the row groups split at new a
> > new
> > > > ID
> > > > > > > > boundary
> > > > > > > > > > and to extend Drill with an alternative version of COUNT
> > that
> > > > > would
> > > > > > > > > simply
> > > > > > > > > > count the number of times the ID changes through out the
> > > entire
> > > > > > > table.
> > > > > > > > > This
> > > > > > > > > > way, we could expect that counting unique values of
> > > pre-sorted
> > > > > > > columns
> > > > > > > > > > could have complexity comparable to that of the regular
> > COUNT
> > > > > > > operator
> > > > > > > > (a
> > > > > > > > > > full scan). So, to sum up, I have three questions:
> > > > > > > > > >
> > > > > > > > > > 1. Can such a scenario be realized in Drill?
> > > > > > > > > > 2. Can it be done in a modular way (eg, a dedicated UDAF
> or
> > > an
> > > > > > > > operator),
> > > > > > > > > > so without heavy hacking throughout entire Drill?
> > > > > > > > > > 3. How to do it?
> > > > > > > > > >
> > > > > > > > > > Our initial experience with Drill was very good - it's an
> > > > > excellent
> > > > > > > > tool.
> > > > > > > > > > But in order to be able to adopt it, we need to sort out
> > this
> > > > one
> > > > > > > > central
> > > > > > > > > > issue.
> > > > > > > > > >
> > > > > > > > > > Cheers,
> > > > > > > > > > Marcin
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> >
> >
> > --
> >
> > Abdelhakim Deneche
> >
> > Software Engineer
> >
> >   <http://www.mapr.com/>
> >
> >
> > Now Available - Free Hadoop On-Demand Training
> > <
> >
> http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available
> > >
> >
>

Re: Counting large numbers of unique values

Reply via email to