I brought this up in the Dev Sync but thought I would write up a couple of findings here.
We see some large result sets come back from TopN queries, and have been looking at optimizations in the TopN (or GroupBy) query path to accommodate these larger result sets returning from many hundreds of nodes. Looking at the TopN binary apply function io.druid.query.topn.TopNBinaryFn#apply <https://github.com/apache/incubator-druid/blob/druid-0.12.1/processing/src/main/java/io/druid/query/topn/TopNBinaryFn.java#L75-L135>, which does the result folding: it is essentially a hash join of the two results, which translates into a lot of hash map creation and insertion. I tried some basic optimizations to reduce the number of hash map operations in this function, but they did not yield any measurable improvement in a real environment; you can see some CPU-time flame graphs in https://github.com/apache/incubator-druid/pull/5913 . Ideally the time should be spent in the aggregator combining functions rather than in a whole bunch of hash map state manipulation.

One potential improvement would be to move from a hash join to a merge join. That would require changing the ordering of the items so that they can be merge joined; the current ordering is based on the aggregation specification order. With sorted inputs, each fold could iterate through the TopN result values only once, with a simple way to merge new values from either stream into the combined result. This means the query path would sort the results at query time, and shuffle the result back into "specification order" only on the last stage out (during a "finalize" kind of step).

In such a scenario, a potential future optimization would be to allow results to be streamed back per TopN result value. I *think* the current implementation only considers the timestamp level, meaning that if you do an ALL granularity query, there is only one "chunk" of results that can be streamed back. I haven't dug deeply into that aspect though.
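To make the merge-join idea concrete, here is a minimal, hypothetical sketch (these are not Druid's actual classes, and the real code would combine aggregator state via AggregatorFactory rather than summing longs): if both per-node result lists arrived sorted by dimension value, the fold becomes a single-pass merge with no hash map at all.

```java
import java.util.ArrayList;
import java.util.List;

public class MergeFoldSketch {
  // Hypothetical stand-in for a TopN result entry: dimension value plus
  // a single aggregate (the real code carries a map of aggregator states).
  static final class Entry {
    final String dim;
    final long sum;
    Entry(String dim, long sum) { this.dim = dim; this.sum = sum; }
  }

  // Merge two lists sorted lexicographically by dim, combining entries
  // that share a key. Each input is scanned exactly once.
  static List<Entry> mergeFold(List<Entry> a, List<Entry> b) {
    List<Entry> out = new ArrayList<>(a.size() + b.size());
    int i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
      int cmp = a.get(i).dim.compareTo(b.get(j).dim);
      if (cmp < 0) {
        out.add(a.get(i++));
      } else if (cmp > 0) {
        out.add(b.get(j++));
      } else {
        // Key present in both streams: combine the aggregates and
        // advance both cursors. This is where the real implementation
        // would call the aggregator's combining function.
        out.add(new Entry(a.get(i).dim, a.get(i).sum + b.get(j).sum));
        i++;
        j++;
      }
    }
    // Drain whichever stream still has entries.
    while (i < a.size()) out.add(a.get(i++));
    while (j < b.size()) out.add(b.get(j++));
    return out;
  }
}
```

The trade-off, as noted above, is that this only works if the intermediate ordering is lexicographic by dimension value, with the shuffle back to specification order deferred to the finalize step.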
Such an optimization could also be applied to GroupBy queries, so I don't know whether the folks working heavily on GroupBy have considered this or alternatives. My question is as follows: does anyone see issues with either using the Finalize flag of a query to determine the sort order, or adding a new query context key to determine whether the sort order should be specification order (default) or lexicographic order (internal override)?

Thanks,
Charles Allen