Gabor is right on the last point - the Top-N node is used when there is an ORDER BY with a LIMIT <n> clause in the same SELECT block. Each thread executing the Top-N node only needs to keep n rows from its input during processing, regardless of input size. This is very effective and explains the results you're seeing.
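
To make the "keeps only n rows" point concrete, here is a toy sketch of the bounded-heap idea. It's just an illustration, not Impala's code (the topn-node.h link below is the real implementation); the Row struct and TopN function are made up, and it assumes an ascending ORDER BY on a single key like time_sequence in the query further down:

// Keep at most `limit` rows in a max-heap keyed on the ORDER BY column,
// so memory stays O(limit) no matter how many input rows stream through.
#include <algorithm>
#include <cstdint>
#include <queue>
#include <vector>

struct Row {
  int64_t time_sequence;  // the ORDER BY key
  // ... other columns would live here ...
};

std::vector<Row> TopN(const std::vector<Row>& input, size_t limit) {
  auto cmp = [](const Row& a, const Row& b) {
    return a.time_sequence < b.time_sequence;  // max-heap on the key
  };
  std::priority_queue<Row, std::vector<Row>, decltype(cmp)> heap(cmp);

  for (const Row& row : input) {
    if (heap.size() < limit) {
      heap.push(row);
    } else if (row.time_sequence < heap.top().time_sequence) {
      heap.pop();      // evict the current worst of the kept rows ...
      heap.push(row);  // ... and keep the better one instead
    }
  }

  // Drain the heap and reverse to get ascending ORDER BY output.
  std::vector<Row> result;
  while (!heap.empty()) {
    result.push_back(heap.top());
    heap.pop();
  }
  std::reverse(result.begin(), result.end());
  return result;
}
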
I don't think a design doc was ever written for the Top-N operator. The comments in the implementation explain the approach at a high level and are probably the best place to understand how it works: https://github.com/apache/impala/blob/master/be/src/exec/topn-node.h#L120

"External Merge Sort for Top-K queries" by Chronis et al - https://dl.acm.org/doi/pdf/10.1145/3318464.3389729 - has a good overview of techniques for this problem in sections 1 and 2. Impala implements the technique from section 2.3 and falls back to 2.4 for larger limits, at least last time I looked. I think Gabor is suggesting something like 2.5, which is a nice optimisation that Impala doesn't have, afaik. (A rough sketch of the run-and-merge fallback Gabor describes is at the bottom of this mail, below the quoted thread.)

On Tue, 19 Jul 2022 at 02:59, Gabor Kaszab <gaborkas...@cloudera.com> wrote:

> Hi Sameera,
>
> If a sorter (or a Top-N node in your case) doesn't fit into memory, it
> does a partial sort of the data that fits into memory (called a "run" if
> I'm not mistaken) and writes it to disk. After this it loads another
> subset of the data into memory, does a partial sort again, and spills to
> disk. It does this until the node runs out of rows to process. Once that
> happens, a merging sort is applied to the partially sorted "runs".
> Specifically, in your query the 6 executor nodes each do these partial
> sorts and then the merging sort to get the 2 rows per executor node that
> the LIMIT clause asks for. A finishing step is a MERGING-EXCHANGE that
> receives 2 rows from each of the 6 executors and produces the final
> result of the "ORDER BY ... LIMIT 2".
> As a result, the more memory you allocate to Impala, the bigger these
> partially sorted "runs" can be and the easier the merging sort becomes.
> Additionally, I haven't checked, but I assume that if you have a LIMIT
> clause Impala doesn't have to spill the whole run to disk, just the part
> that survives the LIMIT clause (2 rows per run in this case), so the
> fewer runs you have, the less disk IO you have to deal with.
>
> Update: I gave this a second thought and I assume that a Top-N node (when
> you have a LIMIT clause) doesn't have to spill to disk at all, as in your
> case it only has to maintain the highest 2 values due to the LIMIT
> clause.
>
> Hope this helps.
>
> On Mon, Jul 18, 2022 at 8:26 PM sameera <sameeranadis...@gmail.com> wrote:
>
>> Hi Team,
>>
>> I'm seeing this behaviour regarding memory utilization in Impala
>> executors. I have a table with 298GB of Parquet data in S3, and I'm
>> trying to scan all partitions and do a Top-N operation to understand
>> the memory usage in executors.
>>
>> When I set the executor startup option mem_limit to 1GB, the query
>> takes 600s. With a 10GB mem_limit it completes within 200s while
>> utilizing up to 10GB of memory. This could be because Impala started
>> only 1 scanner thread (NumScannerThreadsStarted = 1) with the 1GB
>> mem_limit, versus 16 with the 10GB mem_limit setting.
>>
>> During both tests I didn't see any intermediate files created in the
>> scratch dirs. My question is: how does Impala manage to complete the
>> sort over the entire table when memory is limited to 1GB? Please help
>> me understand how it works internally. Any design document would be
>> really helpful.
>>
>> select * from execution_report_s3_lse order by time_sequence limit 1;
>>
>> Impala version 3.4
>> 6 executor node cluster with dedicated coordinators.
>> Node spec - 16 Core, 32 GB memory
>>
>> Thank you,
>> Sameera.
>>
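
As promised above, here is an equally rough sketch of the spilling fallback Gabor describes: build partially sorted runs, truncate each run to the LIMIT (rows beyond it can never survive), and merge the runs at the end. Again, this is only an illustration with made-up names (Row, SortWithLimit, rows_per_run), not Impala's implementation - the real sorter writes runs to scratch files on disk and does a proper k-way merge rather than the naive concatenate-and-sort below.

// Phase 1 builds partially sorted runs that fit in a fixed buffer,
// truncated to `limit` rows each; phase 2 merges the runs and applies
// the LIMIT again. The in-memory `runs` vector stands in for spilled
// scratch files.
#include <algorithm>
#include <cstdint>
#include <vector>

struct Row {
  int64_t time_sequence;  // ORDER BY key
};

static bool KeyLess(const Row& a, const Row& b) {
  return a.time_sequence < b.time_sequence;
}

std::vector<Row> SortWithLimit(const std::vector<Row>& input,
                               size_t rows_per_run,  // stands in for the memory budget
                               size_t limit) {
  std::vector<std::vector<Row>> runs;  // stands in for spilled scratch files

  // Phase 1: sort whatever fits in memory into a run, keep only the first
  // `limit` rows of it, and "spill" it.
  std::vector<Row> buffer;
  auto flush = [&]() {
    if (buffer.empty()) return;
    std::sort(buffer.begin(), buffer.end(), KeyLess);
    if (buffer.size() > limit) buffer.resize(limit);
    runs.push_back(std::move(buffer));
    buffer.clear();
  };
  for (const Row& row : input) {
    buffer.push_back(row);
    if (buffer.size() == rows_per_run) flush();
  }
  flush();

  // Phase 2: merge the runs (naively here) and apply the LIMIT once more.
  std::vector<Row> merged;
  for (const auto& run : runs) {
    merged.insert(merged.end(), run.begin(), run.end());
  }
  std::sort(merged.begin(), merged.end(), KeyLess);
  if (merged.size() > limit) merged.resize(limit);
  return merged;
}

With limit 2 each "spilled" run shrinks to just 2 rows, and for a small limit Impala stays on the in-memory Top-N path from the first sketch anyway, which is consistent with no scratch files showing up in either of Sameera's tests.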