Lantao Jin created SPARK-30114:
----------------------------------

             Summary: Optimize LIMIT only query by partial listing files
                 Key: SPARK-30114
                 URL: https://issues.apache.org/jira/browse/SPARK-30114
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Lantao Jin


We use Spark as ad-hoc query engine. Most of users' SELECT queries with LIMIT 
operation. When we execute some queries like
1) SELECT * FROM TABLE_A LIMIT N
2) SELECT colA FROM TABLE_A LIMIT N
3) CREATE TAB_B as SELECT * FROM TABLE_A LIMIT N
If the TABLE_A is a large table (a RDD with thousands and thousands of 
partitions), the execution time would be very big since it has to list all 
files to build a RDD before execution. But almost time, the N is just like 10, 
100, 1000, not very big. We don't need to scan all files. This optimization 
will create a *SinglePartitionReadRDD* to address it.

In our production result, this optimization benefits a lot. The duration time 
of simple query with LIMIT could reduce 5~10 times. For example, before this 
optimization, a query on a table which has about one hundred thousands files 
would run over 30 seconds, after applying this optimization, the time decreased 
to 5 seconds.


Should support both Spark DataSource Table and Hive Table which can be 
converted to DataSource table.
Should support bucket table, partition table, normal table.
Should support different file formats like parquet, orc.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to