@Ted, my point is exactly to add a dedicated aggregator (for example, a UDAF) that would operate on a column that is already known to be sorted. The point is to avoid having to sort a potentially large set of values in each and every query. Could you point me to how I could proceed with an implementation?
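To make it a bit more concrete, here is a rough sketch of the kind of UDAF I have in mind, written against Drill's DrillAggFunc interface. The function name count_sorted_distinct, the package, and the BIGINT input type are placeholders for the sake of the example (my real column is a fixed-length binary), and it ignores nulls as well as the distributed case, where values spanning fragment boundaries would need extra care to avoid double counting:

package org.example.drill.udf;  // hypothetical package

import org.apache.drill.exec.expr.DrillAggFunc;
import org.apache.drill.exec.expr.annotations.FunctionTemplate;
import org.apache.drill.exec.expr.annotations.Output;
import org.apache.drill.exec.expr.annotations.Param;
import org.apache.drill.exec.expr.annotations.Workspace;
import org.apache.drill.exec.expr.holders.BigIntHolder;
import org.apache.drill.exec.expr.holders.BitHolder;

// Counts the number of value changes in a column that is assumed to be
// pre-sorted (or at least clustered), so no sort or hash table is needed.
@FunctionTemplate(name = "count_sorted_distinct",
    scope = FunctionTemplate.FunctionScope.POINT_AGGREGATE)
public class CountSortedDistinct implements DrillAggFunc {

  @Param BigIntHolder in;        // current input value
  @Workspace BigIntHolder last;  // last value seen
  @Workspace BitHolder seenAny;  // 0 until the first row arrives
  @Workspace BigIntHolder count; // number of distinct runs so far
  @Output BigIntHolder out;

  public void setup() {
    last = new BigIntHolder();
    seenAny = new BitHolder();
    count = new BigIntHolder();
    seenAny.value = 0;
    count.value = 0;
  }

  public void add() {
    // A new distinct value starts on the first row, or whenever the current
    // value differs from the previous one.
    if (seenAny.value == 0 || in.value != last.value) {
      count.value++;
      last.value = in.value;
      seenAny.value = 1;
    }
  }

  public void output() {
    out.value = count.value;
  }

  public void reset() {
    count.value = 0;
    seenAny.value = 0;
  }
}

As far as I understand, the class would also have to be packaged, together with its sources, into a jar that Drill scans for functions, and a real version would need to handle nulls and the varbinary type. Is that roughly the right direction, or is there a better extension point?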
@Jinfeng, I intend the feature to be used explicitly, i.e., selected by the
user, and not expect Drill to figure this out automatically at runtime.

@Jacques, I tried the setting, but unfortunately the count distinct query I
used before now fails with an IndexOutOfBoundsException:

java.lang.IndexOutOfBoundsException: index: 0, length: 4 (expected: range(0, 0))
    at io.netty.buffer.DrillBuf.checkIndexD(DrillBuf.java:187)
    at io.netty.buffer.DrillBuf.chk(DrillBuf.java:209)
    at io.netty.buffer.DrillBuf.getInt(DrillBuf.java:487)
    at org.apache.drill.exec.vector.VarBinaryVector$Mutator.setSafe(VarBinaryVector.java:502)
    at org.apache.drill.exec.test.generated.StreamingAggregatorGen2877.outputRecordKeys(StreamingAggTemplate.java:227)
    at org.apache.drill.exec.test.generated.StreamingAggregatorGen2877.outputToBatch(StreamingAggTemplate.java:300)
    at org.apache.drill.exec.test.generated.StreamingAggregatorGen2877.doWork(StreamingAggTemplate.java:142)
    at org.apache.drill.exec.physical.impl.aggregate.StreamingAggBatch.innerNext(StreamingAggBatch.java:127)
    at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:142)
    at org.apache.drill.exec.physical.impl.validate.IteratorValidatorBatchIterator.next(IteratorValidatorBatchIterator.java:118)
    at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:68)
    at org.apache.drill.exec.physical.impl.partitionsender.PartitionSenderRootExec.innerNext(PartitionSenderRootExec.java:152)
    at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:58)
    at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:163)
    ... 4 more

The setup I'm using is a 10GB data set in the Parquet format, and the column
used for unique value counting is a delta-encoded 32-byte array (Parquet's
FIXED_LEN_BYTE_ARRAY). Should I file a bug?

Cheers,
Marcin

On Wed, Apr 8, 2015 at 12:45 AM, Jinfeng Ni <[email protected]> wrote:

> AFAIK, Drill's planner currently does not expose the sortedness of the
> underlying data; if the data is pre-sorted, the Drill planner would not
> recognize that and would still require a sort operator for sort-based
> aggregation. Part of the reason is that Drill does not have a centralized
> meta-store to keep track of the metadata of the data/tables/files.
> Therefore, if the underlying data does not expose its sortedness, the
> planner does not have that knowledge.
>
> We do plan to enhance the planner / storage plugin interface, such that
> the planner would be able to utilize the sortedness / partition column
> information.
>
>
> On Tue, Apr 7, 2015 at 3:25 PM, Ted Dunning <[email protected]> wrote:
>
> > Marcin,
> >
> > They did comment. The answer is that the default is to use hashed
> > aggregation (which will be faster when there is lots of memory), with
> > the option to use sort aggregation (which is basically what you were
> > suggesting).
> >
> > Did you mean to suggest that your data is already known to be sorted
> > and thus the sort step should be omitted?
> >
> >
> > On Tue, Apr 7, 2015 at 3:21 PM, Marcin Karpinski <[email protected]>
> > wrote:
> >
> > > @Jacques, thanks for the information - I'm definitely going to check
> > > out that option.
> > >
> > > I'm also curious that none of you guys commented on my original idea
> > > of counting distinct values by a simple aggregation of pre-sorted
> > > data - is it because it doesn't make sense to you guys, or because
> > > you think your suggestions are easier to implement?
> > >
> > > On Tue, Apr 7, 2015 at 5:55 PM, Jacques Nadeau <[email protected]>
> > > wrote:
> > >
> > > > Two additional notes here:
> > > >
> > > > Drill can actually do an aggregation using either a hash table based
> > > > aggregation or a sort based aggregation. By default, the hash
> > > > aggregation will generally be selected first. However, you can
> > > > disable hash based aggregation if you specifically think that a sort
> > > > based aggregation will perform better for your use case. You can do
> > > > this by running the command
> > > > ALTER SESSION SET `planner.enable_hashagg` = FALSE;
> > > >
> > > > We have always had it on our roadmap to implement an approximate
> > > > count distinct function but haven't gotten to it yet. As Ted
> > > > mentions, using this technique would substantially reduce data
> > > > shuffling and could be done with a moderate level of effort since
> > > > our UDAF interface is pluggable.
> > > >
> > > >
> > > > On Tue, Apr 7, 2015 at 8:20 AM, Ted Dunning <[email protected]>
> > > > wrote:
> > > >
> > > > > How precise do your counts need to be? Can you accept a fraction
> > > > > of a percent statistical error?
> > > > >
> > > > >
> > > > > On Tue, Apr 7, 2015 at 8:11 AM, Aman Sinha <[email protected]>
> > > > > wrote:
> > > > >
> > > > > > Drill already does most of this type of transformation. If you
> > > > > > do an 'EXPLAIN PLAN FOR <your count(distinct) query>', you will
> > > > > > see that it first does a grouping on the column and then applies
> > > > > > the COUNT(column). The first level grouping can be done either
> > > > > > based on sorting or hashing, and this is configurable through a
> > > > > > system option.
> > > > > >
> > > > > > Aman
> > > > > >
> > > > > > On Tue, Apr 7, 2015 at 3:30 AM, Marcin Karpinski <[email protected]>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Guys,
> > > > > > >
> > > > > > > I have a specific use case for Drill, in which I'd like to be
> > > > > > > able to count unique values in columns with tens of millions
> > > > > > > of distinct values. The COUNT DISTINCT method, unfortunately,
> > > > > > > does not scale, either time- or memory-wise. The idea is to
> > > > > > > sort the data beforehand by the values of that column (let's
> > > > > > > call it ID), to have the row groups split at a new ID boundary,
> > > > > > > and to extend Drill with an alternative version of COUNT that
> > > > > > > would simply count the number of times the ID changes
> > > > > > > throughout the entire table. This way, we could expect that
> > > > > > > counting unique values of pre-sorted columns could have
> > > > > > > complexity comparable to that of the regular COUNT operator
> > > > > > > (a full scan). So, to sum up, I have three questions:
> > > > > > >
> > > > > > > 1. Can such a scenario be realized in Drill?
> > > > > > > 2. Can it be done in a modular way (e.g., a dedicated UDAF or
> > > > > > > an operator), i.e., without heavy hacking throughout the
> > > > > > > entire Drill codebase?
> > > > > > > 3. How would we go about doing it?
> > > > > > >
> > > > > > > Our initial experience with Drill was very good - it's an
> > > > > > > excellent tool. But in order to be able to adopt it, we need
> > > > > > > to sort out this one central issue.
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Marcin
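
PS: For completeness, this is essentially the sequence I ran to hit the exception above, sketched here via Drill's JDBC driver; the connection URL, table path, and column name are placeholders rather than my actual setup:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class StreamingAggRepro {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.drill.jdbc.Driver");  // Drill's JDBC driver
    try (Connection conn =
             DriverManager.getConnection("jdbc:drill:drillbit=localhost");
         Statement stmt = conn.createStatement()) {

      // Force the planner to pick the sort-based (streaming) aggregator.
      stmt.execute("ALTER SESSION SET `planner.enable_hashagg` = false");

      // Placeholder table and column; my real data is ~10GB of Parquet with
      // a FIXED_LEN_BYTE_ARRAY id column.
      try (ResultSet rs = stmt.executeQuery(
          "SELECT COUNT(DISTINCT id) FROM dfs.`/data/events`")) {
        while (rs.next()) {
          System.out.println("distinct ids: " + rs.getLong(1));
        }
      }
    }
  }
}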
