Tim Armstrong created IMPALA-10347:
--------------------------------------

             Summary: Explore approaches to optimizing queries that will likely be short-circuited by limits
                 Key: IMPALA-10347
                 URL: https://issues.apache.org/jira/browse/IMPALA-10347
             Project: IMPALA
          Issue Type: Improvement
          Components: Distributed Exec
            Reporter: Tim Armstrong

Based on discussion with [~amansinha], there are opportunities beyond IMPALA-10314 to optimize queries where there is a limit and the query is *unlikely* to scan many files.

The problem is that we do all the work to generate scan ranges and schedule them upfront, which adds a lot of overhead if only a small number of files actually need to be processed.

A couple of ideas we had:
* Parallelize and/or otherwise optimize the scan range generation
* Speculatively execute the query on a subset of files and then cancel and retry if we hit the limit
* Incrementally generate scan ranges and assign them to executors so that scan range generation and execution can be overlapped. This is the most general solution but also has a lot of knock-on implications for other subsystems, like cardinality/memory estimation, scheduling, query execution, query coordination, etc.
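
To make the incremental idea concrete, here is a rough Python sketch of lazy scan range generation feeding a limit-bounded scan. It is only an illustration, not Impala code: ScanRange, generate_scan_ranges, scan_with_limit, the split_size parameter and the stats counter are all hypothetical names invented for this example. It is also single-threaded, so it shows only that planning work stops once the limit is satisfied, not the overlap of generation and execution across executors.

```python
from typing import Callable, Dict, Iterable, Iterator, List, NamedTuple, Tuple


class ScanRange(NamedTuple):
    """Hypothetical stand-in for a scan range: a byte span within one file."""
    path: str
    offset: int
    length: int


def generate_scan_ranges(files: Iterable[Tuple[str, int]],
                         split_size: int,
                         stats: Dict[str, int]) -> Iterator[ScanRange]:
    """Lazily yield scan ranges one file at a time.

    `files` is an iterable of (path, size_in_bytes). Because this is a
    generator, a file is only planned when the consumer asks for more ranges.
    `stats["files_planned"]` counts how many files were actually touched.
    """
    for path, size in files:
        stats["files_planned"] = stats.get("files_planned", 0) + 1
        for offset in range(0, size, split_size):
            yield ScanRange(path, offset, min(split_size, size - offset))


def scan_with_limit(ranges: Iterable[ScanRange],
                    read_rows: Callable[[ScanRange], Iterable[object]],
                    limit: int) -> List[object]:
    """Consume ranges until `limit` rows have been produced, then stop.

    `read_rows` is an assumed callback that returns the rows of one range.
    Returning early means the generator is never advanced past the ranges
    that were needed, so no further files are planned.
    """
    rows: List[object] = []
    if limit <= 0:
        return rows
    for scan_range in ranges:
        for row in read_rows(scan_range):
            rows.append(row)
            if len(rows) >= limit:
                return rows
    return rows


if __name__ == "__main__":
    # 1000 files of 256 bytes each, split into 128-byte ranges,
    # with an assumed 10 rows per range.
    files = [("file_%d" % i, 256) for i in range(1000)]
    stats: Dict[str, int] = {}
    ranges = generate_scan_ranges(files, split_size=128, stats=stats)
    rows = scan_with_limit(ranges, lambda r: [(r.path, r.offset, i) for i in range(10)], limit=25)
    print(len(rows), "rows returned;", stats["files_planned"], "of", len(files), "files planned")
```

In a real system the scheduler would presumably also need cardinality and memory estimates before all ranges are known, which is where the knock-on effects mentioned above come in; the sketch does not attempt to model that.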