@Ted, my point is exactly to add a dedicated aggregator (for example, a UDAF) that would operate on a column that is already known to be sorted. The point is to avoid having to sort a potentially large set of values in each and every query. Could you point me to how I could proceed with an implementation?
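To make it a bit more concrete, here is a rough sketch of the kind of UDAF I have in mind, written against Drill's DrillAggFunc interface. The function name count_sorted_distinct, the package, and the BIGINT input type are placeholders for the sake of the example (my real column is a fixed-length binary), and it ignores nulls as well as the distributed case, where values spanning fragment boundaries would need extra care to avoid double counting:

package org.example.drill.udf;  // hypothetical package

import org.apache.drill.exec.expr.DrillAggFunc;
import org.apache.drill.exec.expr.annotations.FunctionTemplate;
import org.apache.drill.exec.expr.annotations.Output;
import org.apache.drill.exec.expr.annotations.Param;
import org.apache.drill.exec.expr.annotations.Workspace;
import org.apache.drill.exec.expr.holders.BigIntHolder;
import org.apache.drill.exec.expr.holders.BitHolder;

// Counts the number of value changes in a column that is assumed to be
// pre-sorted (or at least clustered), so no sort or hash table is needed.
@FunctionTemplate(name = "count_sorted_distinct",
    scope = FunctionTemplate.FunctionScope.POINT_AGGREGATE)
public class CountSortedDistinct implements DrillAggFunc {

  @Param BigIntHolder in;        // current input value
  @Workspace BigIntHolder last;  // last value seen
  @Workspace BitHolder seenAny;  // 0 until the first row arrives
  @Workspace BigIntHolder count; // number of distinct runs so far
  @Output BigIntHolder out;

  public void setup() {
    last = new BigIntHolder();
    seenAny = new BitHolder();
    count = new BigIntHolder();
    seenAny.value = 0;
    count.value = 0;
  }

  public void add() {
    // A new distinct value starts on the first row, or whenever the current
    // value differs from the previous one.
    if (seenAny.value == 0 || in.value != last.value) {
      count.value++;
      last.value = in.value;
      seenAny.value = 1;
    }
  }

  public void output() {
    out.value = count.value;
  }

  public void reset() {
    count.value = 0;
    seenAny.value = 0;
  }
}

As far as I understand, the class would also have to be packaged, together with its sources, into a jar that Drill scans for functions, and a real version would need to handle nulls and the varbinary type. Is that roughly the right direction, or is there a better extension point?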
@Jinfeng, I intend the feature to be used explicitly, i.e., selected by the
user, and not expect Drill to figure this out automatically at runtime.

@Jacques, I tried the setting, but unfortunately the count distinct query I
used before now fails with an IndexOutOfBoundsException:

java.lang.IndexOutOfBoundsException: index: 0, length: 4 (expected: range(0, 0))
    at io.netty.buffer.DrillBuf.checkIndexD(DrillBuf.java:187)
    at io.netty.buffer.DrillBuf.chk(DrillBuf.java:209)
    at io.netty.buffer.DrillBuf.getInt(DrillBuf.java:487)
    at org.apache.drill.exec.vector.VarBinaryVector$Mutator.setSafe(VarBinaryVector.java:502)
    at org.apache.drill.exec.test.generated.StreamingAggregatorGen2877.outputRecordKeys(StreamingAggTemplate.java:227)
    at org.apache.drill.exec.test.generated.StreamingAggregatorGen2877.outputToBatch(StreamingAggTemplate.java:300)
    at org.apache.drill.exec.test.generated.StreamingAggregatorGen2877.doWork(StreamingAggTemplate.java:142)
    at org.apache.drill.exec.physical.impl.aggregate.StreamingAggBatch.innerNext(StreamingAggBatch.java:127)
    at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:142)
    at org.apache.drill.exec.physical.impl.validate.IteratorValidatorBatchIterator.next(IteratorValidatorBatchIterator.java:118)
    at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:68)
    at org.apache.drill.exec.physical.impl.partitionsender.PartitionSenderRootExec.innerNext(PartitionSenderRootExec.java:152)
    at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:58)
    at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:163)
    ... 4 more

The setup I'm using is a 10GB data set in the Parquet format, and the column
used for unique value counting is a delta-encoded 32-byte array (Parquet's
FIXED_LEN_BYTE_ARRAY). Should I file a bug?

Cheers,
Marcin

On Wed, Apr 8, 2015 at 12:45 AM, Jinfeng Ni <[email protected]> wrote:

> AFAIK, Drill's planner currently does not expose the sortedness of the
> underlying data; if the data is pre-sorted, the Drill planner would not
> recognize that and would still require a sort operator for sort-based
> aggregation. Part of the reason is that Drill does not have a centralized
> meta-store to keep track of the metadata of the data/tables/files.
> Therefore, if the underlying data does not expose its sortedness, the
> planner does not have that knowledge.
>
> We do plan to enhance the planner / storage plugin interface, such that
> the planner would be able to utilize the sortedness / partition column
> information.
>
>
> On Tue, Apr 7, 2015 at 3:25 PM, Ted Dunning <[email protected]> wrote:
>
> > Marcin,
> >
> > They did comment. The answer is that the default is to use hashed
> > aggregation (which will be faster when there is lots of memory), with
> > the option to use sort aggregation (which is basically what you were
> > suggesting).
> >
> > Did you mean to suggest that your data is already known to be sorted
> > and thus the sort step should be omitted?
> >
> >
> > On Tue, Apr 7, 2015 at 3:21 PM, Marcin Karpinski <[email protected]>
> > wrote:
> >
> > > @Jacques, thanks for the information - I'm definitely going to check
> > > out that option.
> > >
> > > I'm also curious that none of you guys commented on my original idea
> > > of counting distinct values by a simple aggregation of pre-sorted
> > > data - is it because it doesn't make sense to you guys, or because
> > > you think your suggestions are easier to implement?
> > >
> > > On Tue, Apr 7, 2015 at 5:55 PM, Jacques Nadeau <[email protected]>
> > > wrote:
> > >
> > > > Two additional notes here:
> > > >
> > > > Drill can actually do an aggregation using either a hash table based
> > > > aggregation or a sort based aggregation. By default, the hash
> > > > aggregation will generally be selected first. However, you can
> > > > disable hash based aggregation if you specifically think that a sort
> > > > based aggregation will perform better for your use case. You can do
> > > > this by running the command
> > > > ALTER SESSION SET `planner.enable_hashagg` = FALSE;
> > > >
> > > > We have always had it on our roadmap to implement an approximate
> > > > count distinct function but haven't gotten to it yet. As Ted
> > > > mentions, using this technique would substantially reduce data
> > > > shuffling and could be done with a moderate level of effort since
> > > > our UDAF interface is pluggable.
> > > >
> > > >
> > > > On Tue, Apr 7, 2015 at 8:20 AM, Ted Dunning <[email protected]>
> > > > wrote:
> > > >
> > > > > How precise do your counts need to be? Can you accept a fraction
> > > > > of a percent statistical error?
> > > > >
> > > > >
> > > > > On Tue, Apr 7, 2015 at 8:11 AM, Aman Sinha <[email protected]>
> > > > > wrote:
> > > > >
> > > > > > Drill already does most of this type of transformation. If you
> > > > > > do an 'EXPLAIN PLAN FOR <your count(distinct) query>', you will
> > > > > > see that it first does a grouping on the column and then applies
> > > > > > the COUNT(column). The first level grouping can be done either
> > > > > > based on sorting or hashing, and this is configurable through a
> > > > > > system option.
> > > > > >
> > > > > > Aman
> > > > > >
> > > > > > On Tue, Apr 7, 2015 at 3:30 AM, Marcin Karpinski <[email protected]>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Guys,
> > > > > > >
> > > > > > > I have a specific use case for Drill, in which I'd like to be
> > > > > > > able to count unique values in columns with tens of millions
> > > > > > > of distinct values. The COUNT DISTINCT method, unfortunately,
> > > > > > > does not scale, either time- or memory-wise. The idea is to
> > > > > > > sort the data beforehand by the values of that column (let's
> > > > > > > call it ID), to have the row groups split at a new ID boundary,
> > > > > > > and to extend Drill with an alternative version of COUNT that
> > > > > > > would simply count the number of times the ID changes
> > > > > > > throughout the entire table. This way, we could expect that
> > > > > > > counting unique values of pre-sorted columns could have
> > > > > > > complexity comparable to that of the regular COUNT operator
> > > > > > > (a full scan). So, to sum up, I have three questions:
> > > > > > >
> > > > > > > 1. Can such a scenario be realized in Drill?
> > > > > > > 2. Can it be done in a modular way (e.g., a dedicated UDAF or
> > > > > > > an operator), i.e., without heavy hacking throughout the
> > > > > > > entire Drill codebase?
> > > > > > > 3. How would we go about doing it?
> > > > > > >
> > > > > > > Our initial experience with Drill was very good - it's an
> > > > > > > excellent tool. But in order to be able to adopt it, we need
> > > > > > > to sort out this one central issue.
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Marcin
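
PS: For completeness, this is essentially the sequence I ran to hit the exception above, sketched here via Drill's JDBC driver; the connection URL, table path, and column name are placeholders rather than my actual setup:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class StreamingAggRepro {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.drill.jdbc.Driver");  // Drill's JDBC driver
    try (Connection conn =
             DriverManager.getConnection("jdbc:drill:drillbit=localhost");
         Statement stmt = conn.createStatement()) {

      // Force the planner to pick the sort-based (streaming) aggregator.
      stmt.execute("ALTER SESSION SET `planner.enable_hashagg` = false");

      // Placeholder table and column; my real data is ~10GB of Parquet with
      // a FIXED_LEN_BYTE_ARRAY id column.
      try (ResultSet rs = stmt.executeQuery(
          "SELECT COUNT(DISTINCT id) FROM dfs.`/data/events`")) {
        while (rs.next()) {
          System.out.println("distinct ids: " + rs.getLong(1));
        }
      }
    }
  }
}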
