Agree that the techniques (approximation and sampling) are different and complementary.
Our current user base tends to require exact query responses, so this is a direction we have not seriously explored. You are certainly welcome to flesh out your ideas in more detail and propose/make a contribution! Perhaps other members of the community agree with you and are willing to help.

On Tue, Nov 28, 2017 at 6:24 PM, Jason Heo <[email protected]> wrote:

> Hi, Jeszy
>
> Thank you for your reply.
>
> My understanding is that you're mentioning sampling.
>
> Although both topN and sampling are approximate techniques for making
> queries run faster, I think they are different concepts.
>
> Using topN, by returning only N aggregated items from each node, we can
> eliminate the expensive shuffle operation, whereas sampling reduces the
> amount of input data.
>
> topN can be used without sampling, and sampling can be used without topN,
> and they can be used at the same time.
>
> My experiment on Druid 0.10.0 over my dataset shows that "topN without
> sampling" is 100 times faster than GroupBy & OrderBy, and "topN with
> sampling" is 200 times faster than GroupBy & OrderBy.
>
> Currently not many distributed SQL engines support topN; by implementing
> topN, Impala could be adopted by many types of analytic systems.
>
> Thanks.
>
> Regards,
>
> Jason
>
>
> 2017-11-28 23:19 GMT+09:00 Jeszy <[email protected]>:
>
>> Hello Jason,
>>
>> IMPALA-5300 (https://issues.apache.org/jira/browse/IMPALA-5300) is in
>> the works, and I think it fits your use case. Can you take a look?
>>
>> Thanks!
>>
>> On 28 November 2017 at 15:11, Jason Heo <[email protected]> wrote:
>> > Hi,
>> >
>> > I'm wondering whether the Impala team has any plans for approximate
>> > topN for a single dimension.
>> >
>> > My web analytics system mostly serves top n urls. Such a "GROUP BY url
>> > ORDER BY pageview LIMIT n" is slow, especially for a high-cardinality
>> > field. Approximate topN can be used instead of GroupBy for a single
>> > dimension with much lower latency.
>> >
>> > Elasticsearch, Druid, and Clickhouse already provide this feature.
>> >
>> > It would be great if I can use it on Druid.
>> >
>> > Thanks.
>> >
>> > Regards,
>> >
>> > Jason
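
For anyone following the thread, here is a minimal Python sketch of the per-node topN idea described in the quoted message: each node aggregates its own rows, returns only its N largest groups, and a coordinator merges those short lists instead of shuffling every group. This is an illustration only, not Druid's or Impala's implementation. The function names (`local_top_n`, `approx_top_n`, `exact_top_n`) and the sample data are hypothetical.

```python
import heapq
from collections import Counter


def local_top_n(rows, n):
    """Aggregate one node's (url, pageviews) rows and keep its n largest groups."""
    counts = Counter()
    for url, pageviews in rows:
        counts[url] += pageviews
    return heapq.nlargest(n, counts.items(), key=lambda kv: kv[1])


def approx_top_n(partitions, n):
    """Merge only the per-node top-n lists; groups cut off locally are lost."""
    merged = Counter()
    for rows in partitions:
        for url, total in local_top_n(rows, n):
            merged[url] += total
    return heapq.nlargest(n, merged.items(), key=lambda kv: kv[1])


def exact_top_n(partitions, n):
    """Reference answer: aggregate every group across all nodes, then sort."""
    counts = Counter()
    for rows in partitions:
        for url, pageviews in rows:
            counts[url] += pageviews
    return heapq.nlargest(n, counts.items(), key=lambda kv: kv[1])


# Hypothetical data: two nodes, each holding part of the pageview table.
partitions = [
    [("/a", 10), ("/b", 9), ("/c", 1)],
    [("/a", 1), ("/b", 9), ("/c", 8)],
]

print(approx_top_n(partitions, 2))  # [('/b', 18), ('/a', 10)]
print(exact_top_n(partitions, 2))   # [('/b', 18), ('/a', 11)]
```

In this example the ranking matches the exact answer, but the total for "/a" is undercounted because node 2 dropped it from its local top 2. That is the accuracy trade-off of skipping the full shuffle. Sampling would be a separate step that shrinks `rows` before `local_top_n` runs, which is why the two techniques can be combined.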
