Hi, Jeszy

Thank you for your reply.

My understanding is that you are referring to sampling.

Although both topN and sampling are approximate techniques for making
queries run faster, I think they are different concepts.

With topN, each node returns only its top N aggregated items, so we can
eliminate the expensive shuffle operation, whereas sampling reduces the
amount of input data.
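
To make the idea concrete, here is a minimal Python sketch of what I mean.
It is only an illustration under my own assumptions, not how Druid or
Impala actually implements it. The function names (local_top_n,
approximate_top_n, exact_top_n) and the sample data are hypothetical.

```python
from collections import Counter
import heapq


def local_top_n(rows, n):
    """Aggregate pageviews per url on one node and keep only the n largest."""
    counts = Counter()
    for url, pageviews in rows:
        counts[url] += pageviews
    return heapq.nlargest(n, counts.items(), key=lambda kv: kv[1])


def approximate_top_n(partitions, n):
    """Merge the per-node top-n lists; no full shuffle of every group."""
    merged = Counter()
    for rows in partitions:
        for url, count in local_top_n(rows, n):
            merged[url] += count
    return heapq.nlargest(n, merged.items(), key=lambda kv: kv[1])


def exact_top_n(partitions, n):
    """Reference result: aggregate every group from every node first."""
    merged = Counter()
    for rows in partitions:
        for url, pageviews in rows:
            merged[url] += pageviews
    return heapq.nlargest(n, merged.items(), key=lambda kv: kv[1])


# Hypothetical data: (url, pageviews) rows held by two nodes.
node_a = [("/a", 10), ("/b", 6)]
node_b = [("/c", 7), ("/b", 6)]

print(approximate_top_n([node_a, node_b], 2))
print(approximate_top_n([node_a, node_b], 1))
print(exact_top_n([node_a, node_b], 1))
```

In this sketch each node sends at most n items to the merge step, which is
where the saving comes from. The result is approximate: with n = 1 the
merge step never sees "/b", although it has the largest total. In this
example, n = 2 is enough for each node to report "/b", so the approximate
result matches the exact one.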

topN can be used without sampling, and sampling can be used without topN,
and they can be used at the same time.

My experiment on Druid 0.10.0 over my dataset shows that "topN without
sampling" is 100 times faster than GroupBy & OrderBy, and "topN with
sampling" is 200 times faster than GroupBy & OrderBy.

Currently, not many distributed SQL engines support topN. By implementing
topN, Impala could be adopted by many types of analytic systems.

Thanks.

Regards,

Jason


2017-11-28 23:19 GMT+09:00 Jeszy <[email protected]>:

> Hello Jason,
>
> IMPALA-5300 (https://issues.apache.org/jira/browse/IMPALA-5300) is in
> the works, and I think it fits your use case. Can you take a look?
>
> Thanks!
>
> On 28 November 2017 at 15:11, Jason Heo <[email protected]> wrote:
> > Hi,
> >
> > I'm wondering whether the Impala team has any plans for approximate
> > topN for a single dimension.
> >
> > My web analytics system mostly serves the top n URLs. A query such as
> > "GROUP BY url ORDER BY pageview LIMIT n" is slow, especially for a
> > high-cardinality field. Approximate topN can be used instead of GroupBy
> > for a single dimension with far lower latency.
> >
> > Elasticsearch, Druid, and ClickHouse already provide this feature.
> >
> > It would be great if I can use it on Druid.
> >
> > Thanks.
> >
> > Regards,
> >
> > Jason
>
