Tim Armstrong created IMPALA-10347:
--------------------------------------

             Summary: Explore approaches to optimizing queries that will likely be short-circuited by limits
                 Key: IMPALA-10347
                 URL: https://issues.apache.org/jira/browse/IMPALA-10347
             Project: IMPALA
          Issue Type: Improvement
          Components: Distributed Exec
            Reporter: Tim Armstrong


Based on discussion with [~amansinha], there are opportunities beyond 
IMPALA-10314 to optimize queries where there is a limit and the query is 
*unlikely* to scan many files.

The problem is that we do all the work to generate scan ranges and schedule 
them upfront, which adds a lot of overhead if only a small number of files 
actually need to be processed.

A few ideas we had:
* Parallelize and/or otherwise optimize the scan range generation
* Speculatively execute the query on a subset of files, then cancel and 
retry on the full set of files if that subset cannot satisfy the limit
* Incrementally generate scan ranges and assign them to executors so that scan 
range generation and execution can be overlapped. This is the most general 
solution but also has a lot of knock-on implications for other subsystems, like 
cardinality/memory estimation, scheduling, query execution, query coordination, 
etc.
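
To make the third idea concrete, here is a minimal sketch of lazy scan range generation, written in Python purely for illustration. It is not Impala code: ScanRange, generate_scan_ranges, scan, run_with_limit, and the range_size and rows_per_range parameters are all hypothetical names and values invented for this example. The sketch is single-threaded, so it only shows that ranges for later files are never generated once the limit is met. It does not show true overlap of generation and execution, which would need concurrency and coordination with the scheduler.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical stand-in for a scan range: (file path, offset, length).
ScanRange = Tuple[str, int, int]


def generate_scan_ranges(
    files: Iterable[Tuple[str, int]], range_size: int, stats: Dict[str, int]
) -> Iterator[ScanRange]:
    """Lazily yield scan ranges one file at a time instead of all upfront."""
    for path, size in files:
        stats["files_expanded"] += 1  # count files we actually did work for
        for offset in range(0, size, range_size):
            yield (path, offset, min(range_size, size - offset))


def scan(scan_range: ScanRange, rows_per_range: int) -> List[str]:
    """Pretend to read a fixed number of rows from one scan range."""
    path, offset, _length = scan_range
    return [f"{path}@{offset}#{i}" for i in range(rows_per_range)]


def run_with_limit(
    files: Iterable[Tuple[str, int]],
    limit: int,
    range_size: int = 128,
    rows_per_range: int = 10,
) -> Tuple[List[str], int]:
    """Consume ranges until `limit` rows are collected, then stop generating."""
    stats = {"files_expanded": 0}
    rows: List[str] = []
    for scan_range in generate_scan_ranges(files, range_size, stats):
        rows.extend(scan(scan_range, rows_per_range))
        if len(rows) >= limit:
            break  # the generator is abandoned; remaining files are untouched
    return rows[:limit], stats["files_expanded"]
```

For example, with 1000 files of 256 bytes each and a limit of 25, run_with_limit([(f"f{i}", 256) for i in range(1000)], limit=25) returns 25 rows after generating ranges for only 2 files under these made-up parameters. An upfront approach would have generated ranges for all 1000 files first.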



