Hi Jeszy,

Thank you for your reply.

My understanding is that you are referring to sampling. Although both topN
and sampling are approximate techniques for making queries run faster, I
think they are different concepts. With topN, each node returns only its N
aggregated items, which eliminates the expensive shuffle operation, whereas
sampling reduces the amount of input data.

topN can be used without sampling, sampling can be used without topN, and
the two can be used at the same time.

My experiment on Druid 0.10.0 over my dataset shows that "topN without
sampling" is 100 times faster than GroupBy & OrderBy, and "topN with
sampling" is 200 times faster than GroupBy & OrderBy.

Currently, not many distributed SQL engines support topN. By implementing
topN, Impala could be adopted by many types of analytic systems.

Thanks.

Regards,

Jason

2017-11-28 23:19 GMT+09:00 Jeszy <[email protected]>:

> Hello Jason,
>
> IMPALA-5300 (https://issues.apache.org/jira/browse/IMPALA-5300) is in
> the works, and I think it fits your use case. Can you take a look?
>
> Thanks!
>
> On 28 November 2017 at 15:11, Jason Heo <[email protected]> wrote:
> > Hi,
> >
> > I'm wondering impala team has any plans for approximate topN for single
> > dimension.
> >
> > My Web analytic system mostly serves top n urls. Such a "GROUP BY url ORDER
> > BY pageview LIMIT n" is slow especially for high-cardinality field.
> > Approximate topN can be used instead of GroupBy for single dimension with
> > extremely lower latency.
> >
> > Elastisearch, Druid, and Clickhouse already provide this feature.
> >
> > It would be great if I can use it on Druid.
> >
> > Thanks.
> >
> > Regards,
> >
> > Jason
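
To make the distinction in the reply above concrete, here is a minimal Python sketch of the per-node topN idea next to an exact GROUP BY / ORDER BY / LIMIT. It is only an illustration under stated assumptions: the function names, the list-of-partitions layout, and the sample rows are hypothetical, and it does not reflect how Druid or Impala actually implement anything. Each inner list stands in for the rows held by one node.

```python
from collections import Counter
import heapq


def _top(counter, n):
    # Highest totals first; ties are broken by url so results are deterministic.
    return heapq.nlargest(n, counter.items(), key=lambda kv: (kv[1], kv[0]))


def exact_top_n(partitions, n):
    """Exact answer: every (url, views) row contributes to a global total,
    which is what a full GROUP BY url ORDER BY pageview LIMIT n computes."""
    totals = Counter()
    for rows in partitions:
        for url, views in rows:
            totals[url] += views
    return _top(totals, n)


def approx_top_n(partitions, n):
    """Approximate answer: each node aggregates locally and forwards only its
    own top n entries, so the merge step sees at most n items per node."""
    merged = Counter()
    for rows in partitions:
        local = Counter()
        for url, views in rows:
            local[url] += views
        for url, views in _top(local, n):
            merged[url] += views
    return _top(merged, n)


# Hypothetical data: two nodes, three urls.
partitions = [
    [("/a", 50), ("/b", 30), ("/c", 29)],
    [("/a", 40), ("/c", 29), ("/b", 5)],
]

print(exact_top_n(partitions, 2))   # [('/a', 90), ('/c', 58)]
print(approx_top_n(partitions, 2))  # [('/a', 90), ('/b', 30)]
```

In this made-up data the approximate result ranks "/b" second because "/c" narrowly misses the local top 2 on the first node, which shows the accuracy that is traded away for sending less data between nodes. Sampling would be a separate step that reduces the rows each node reads before this aggregation.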
