Thanks guys for all the answers. I was suspecting that parallelization could wreck my approach but that was something I wanted to find out. In all cases, I'm going to experiment more with the sort-based aggregation - perhaps with enough computing power it will simply by sufficient. If not, we can experiment with approximate counting - my understanding is that this should fit the Drill's UDAF architecture. I'm also going to file a bug with the IndexOutOfBoundsException I got.
Cheers, Marcin On Fri, Apr 10, 2015 at 6:06 PM, Jason Altekruse <[email protected]> wrote: > I'll open a doc ticket for this, but just a heads up, the UDF interfaces > changed recently to remove the RecordBatch parameter from the setup method. > This interface exposed too many system details to UDFs, to get the same > system state to UDFs we are now exclusively using Injectable types. > > The available Injectable types are given and documented here (will also add > this to the doc ticket): > > https://github.com/apache/drill/blob/af7a52beeaed565e243d2d54ba75337ab95b924d/exec/java-exec/src/main/java/org/apache/drill/exec/ops/UdfUtilities.java > > > > On Fri, Apr 10, 2015 at 7:51 AM, Abdel Hakim Deneche < > [email protected]> > wrote: > > > Drill Wiki is a good starting place: > > > > > https://cwiki.apache.org/confluence/display/DRILL/Develop+Custom+Functions > > > > This page has links to 2 complete code examples of UDFs: > > > > > > > https://cwiki.apache.org/confluence/display/DRILL/Custom+Function+Interfaces > > > > On Thu, Apr 9, 2015 at 9:41 PM, Ted Dunning <[email protected]> > wrote: > > > > > On Thu, Apr 9, 2015 at 1:38 AM, Adam Gilmore <[email protected]> > > > wrote: > > > > > > > Ted - I'd be really interested in doing something like that > > (approximate > > > > aggregation results). This would be very interesting in terms of > > > standard > > > > deviation, median, etc. > > > > > > > > > > Yes. This is a very common need. > > > > > > Writing a UDAF is a matter (mostly) of filling in a code generation > > > template that accounts for the types you want to handle. As is common, > > > there is a state update call and a finalize call that gets the final > > > result. > > > > > > Can anybody point Adam to an example? Last time I tried this, I > started > > by > > > finding the code for sum and worked from that. Some pointers would > save > > > some dead ends relative to the process I went through. > > > > > > > > > > > > > > > > > > > > I know there is another project out there that trades off speed vs > > > accuracy > > > > (the name of which escapes me). If we could easily add something > onto > > > > Drill like this, it'd be hugely beneficial. > > > > > > > > On Wed, Apr 8, 2015 at 8:25 AM, Ted Dunning <[email protected]> > > > wrote: > > > > > > > > > Marcin, > > > > > > > > > > They did comment. The answer is that the default is to use hashed > > > > > aggregation (which will be faster when there is lots of memory) > with > > > the > > > > > option to use sort aggregation (which is basically what you were > > > > > suggesting). > > > > > > > > > > Did you mean to suggest that your data is already known to be > sorted > > > and > > > > > thus the sort step should be omitted? > > > > > > > > > > > > > > > On Tue, Apr 7, 2015 at 3:21 PM, Marcin Karpinski < > > [email protected] > > > > > > > > > wrote: > > > > > > > > > > > @Jacques, thanks for the information - I'm definitely going to > > check > > > > out > > > > > > that option. > > > > > > > > > > > > I'm also curious that none of you guys commented on my original > > idea > > > of > > > > > > counting distinct values by a simple aggregation of pre-sorted > > data - > > > > is > > > > > it > > > > > > because it doesn't make sense to you guys, or because you think > > your > > > > > > suggestions are easier to implement? > > > > > > > > > > > > On Tue, Apr 7, 2015 at 5:55 PM, Jacques Nadeau < > [email protected] > > > > > > > > wrote: > > > > > > > > > > > > > Two additional notes here: > > > > > > > > > > > > > > Drill can actually do an aggregation using either a hash table > > > based > > > > > > > aggregation or a sort based aggregation. By default, generally > > the > > > > > hash > > > > > > > aggregation will be selected first. However, you can disable > > hash > > > > > based > > > > > > > aggregation if you specifically think that a sort based > > aggregation > > > > > will > > > > > > > perform better for use case. You can do this by running the > > > command > > > > > > ALTER > > > > > > > SESSION SET `planner.enable_hashagg` = FALSE; > > > > > > > > > > > > > > We have always had it on our roadmap to implement an > approximate > > > > count > > > > > > > distinct function but haven't gotten to it yet. As Ted > mentions, > > > > using > > > > > > > this technique would substantially reduce data shuffling and > > could > > > be > > > > > > done > > > > > > > with a moderate level of effort since our UDAF interface is > > > > pluggable. > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Apr 7, 2015 at 8:20 AM, Ted Dunning < > > [email protected] > > > > > > > > > > wrote: > > > > > > > > > > > > > > > How precise do your counts need to be? Can you accept a > > fraction > > > > of > > > > > a > > > > > > > > percent statistical error? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Apr 7, 2015 at 8:11 AM, Aman Sinha < > > [email protected]> > > > > > > wrote: > > > > > > > > > > > > > > > > > Drill already does most of this type of transformation. If > > you > > > > do > > > > > an > > > > > > > > > 'EXPLAIN PLAN FOR <your count(distinct) query>' > > > > > > > > > you will see that it first does a grouping on the column > and > > > then > > > > > > > applies > > > > > > > > > the COUNT(column). The first level grouping can be done > > either > > > > > based > > > > > > > on > > > > > > > > > sorting or hashing and this is configurable through a > system > > > > > option. > > > > > > > > > > > > > > > > > > Aman > > > > > > > > > > > > > > > > > > On Tue, Apr 7, 2015 at 3:30 AM, Marcin Karpinski < > > > > > > [email protected] > > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > Hi Guys, > > > > > > > > > > > > > > > > > > > > I have a specific use case for Drill, in which I'd like > to > > be > > > > > able > > > > > > to > > > > > > > > > count > > > > > > > > > > unique values in columns with tens millions of distinct > > > values. > > > > > The > > > > > > > > COUNT > > > > > > > > > > DISTINCT method, unfortunately, does not scale both time- > > and > > > > > > > > memory-wise > > > > > > > > > > and the idea is to sort the data beforehand by the values > > of > > > > that > > > > > > > > column > > > > > > > > > > (let's call it ID), to have the row groups split at new a > > new > > > > ID > > > > > > > > boundary > > > > > > > > > > and to extend Drill with an alternative version of COUNT > > that > > > > > would > > > > > > > > > simply > > > > > > > > > > count the number of times the ID changes through out the > > > entire > > > > > > > table. > > > > > > > > > This > > > > > > > > > > way, we could expect that counting unique values of > > > pre-sorted > > > > > > > columns > > > > > > > > > > could have complexity comparable to that of the regular > > COUNT > > > > > > > operator > > > > > > > > (a > > > > > > > > > > full scan). So, to sum up, I have three questions: > > > > > > > > > > > > > > > > > > > > 1. Can such a scenario be realized in Drill? > > > > > > > > > > 2. Can it be done in a modular way (eg, a dedicated UDAF > or > > > an > > > > > > > > operator), > > > > > > > > > > so without heavy hacking throughout entire Drill? > > > > > > > > > > 3. How to do it? > > > > > > > > > > > > > > > > > > > > Our initial experience with Drill was very good - it's an > > > > > excellent > > > > > > > > tool. > > > > > > > > > > But in order to be able to adopt it, we need to sort out > > this > > > > one > > > > > > > > central > > > > > > > > > > issue. > > > > > > > > > > > > > > > > > > > > Cheers, > > > > > > > > > > Marcin > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > Abdelhakim Deneche > > > > Software Engineer > > > > <http://www.mapr.com/> > > > > > > Now Available - Free Hadoop On-Demand Training > > < > > > http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available > > > > > >
