[jira] [Updated] (SPARK-16186) Support partition batch pruning with `IN` predicate in InMemoryTableScanExec
[ https://issues.apache.org/jira/browse/SPARK-16186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-16186:
------------------------------
    Assignee: Dongjoon Hyun

> Support partition batch pruning with `IN` predicate in InMemoryTableScanExec
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-16186
>                 URL: https://issues.apache.org/jira/browse/SPARK-16186
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Dongjoon Hyun
>            Assignee: Dongjoon Hyun
>             Fix For: 2.1.0
>
> One of the most frequent usage patterns for Spark SQL is using **cached tables**.
> This issue improves `InMemoryTableScanExec` to handle the `IN` predicate efficiently by pruning partition batches.
> Of course, the performance improvement varies across queries and datasets. For the following simple query, the query duration shown in the Spark UI drops from 9 seconds to 50~90 ms. It's over 100 times faster.
> {code}
> $ bin/spark-shell --driver-memory 6G
> scala> val df = spark.range(20)
> scala> df.createOrReplaceTempView("t")
> scala> spark.catalog.cacheTable("t")
> scala> sql("select id from t where id = 1").collect()        // About 2 mins
> scala> sql("select id from t where id = 1").collect()        // less than 90ms
> scala> sql("select id from t where id in (1,2,3)").collect() // 9 seconds (Before)
> scala> sql("select id from t where id in (1,2,3)").collect() // less than 90ms (After)
> {code}
> This issue impacts over 35 TPC-DS queries when the tables are cached.
> Note that this optimization applies to `IN`. To apply it to an `IN` predicate with more than 10 items, the *spark.sql.optimizer.inSetConversionThreshold* option should be increased.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
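[Editor's note] As a rough illustration of the pruning described above (not Spark's actual implementation): each cached partition batch keeps min/max statistics per column, and a batch can be skipped entirely when no value in the `IN` list falls inside its [min, max] range. The names `Batch` and `pruneBatches` below are illustrative only.

```scala
// Sketch of partition-batch pruning with an IN predicate.
// A batch records the min/max of its column values; pruning keeps only
// batches whose [min, max] range could contain some IN-list value.
case class Batch(min: Long, max: Long, rows: Seq[Long])

def pruneBatches(batches: Seq[Batch], inValues: Set[Long]): Seq[Batch] =
  batches.filter(b => inValues.exists(v => v >= b.min && v <= b.max))

// Three batches covering ids 0..29, as a cached `spark.range(...)` might produce.
val batches = Seq(
  Batch(0L, 9L, 0L to 9L),
  Batch(10L, 19L, 10L to 19L),
  Batch(20L, 29L, 20L to 29L)
)

// Only the first batch can contain 1, 2, or 3; the other two are skipped.
val kept = pruneBatches(batches, Set(1L, 2L, 3L))
println(kept.size) // prints 1
```

Scanning only the surviving batches is what turns the 9-second full scan into the sub-100 ms one in the example.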
[jira] [Updated] (SPARK-16186) Support partition batch pruning with `IN` predicate in InMemoryTableScanExec
[ https://issues.apache.org/jira/browse/SPARK-16186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-16186:
----------------------------------
    Description:
One of the most frequent usage patterns for Spark SQL is using **cached tables**.
This issue improves `InMemoryTableScanExec` to handle the `IN` predicate efficiently by pruning partition batches.
Of course, the performance improvement varies across queries and datasets. For the following simple query, the query duration shown in the Spark UI drops from 9 seconds to 50~90 ms. It's over 100 times faster.
{code}
$ bin/spark-shell --driver-memory 6G
scala> val df = spark.range(20)
scala> df.createOrReplaceTempView("t")
scala> spark.catalog.cacheTable("t")
scala> sql("select id from t where id = 1").collect()        // About 2 mins
scala> sql("select id from t where id = 1").collect()        // less than 90ms
scala> sql("select id from t where id in (1,2,3)").collect() // 9 seconds (Before)
scala> sql("select id from t where id in (1,2,3)").collect() // less than 90ms (After)
{code}
This issue impacts over 35 TPC-DS queries when the tables are cached.
Note that this optimization applies to `IN`. To apply it to an `IN` predicate with more than 10 items, the *spark.sql.optimizer.inSetConversionThreshold* option should be increased.

  was:
(The previous description showed the same speedup by toggling a *spark.sql.inMemoryColumnarStorage.partitionPruningMaxInSize* option between 10, the default, and 0; it did not mention *spark.sql.optimizer.inSetConversionThreshold*.)
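[Editor's note] The note above about *spark.sql.optimizer.inSetConversionThreshold* refers to Catalyst rewriting an `In` expression into an `InSet` once the value list grows past the threshold (default 10), and the batch pruning added here matches `In`. A minimal, Spark-free Scala sketch of that conversion decision; the case classes below are stand-ins, not Catalyst's real expression classes:

```scala
// Stand-ins for Catalyst's In / InSet expressions.
sealed trait IdPredicate
case class In(values: Seq[Long]) extends IdPredicate     // batch pruning can apply
case class InSet(values: Set[Long]) extends IdPredicate  // converted form; this issue's pruning does not match it

// Mimics the inSetConversionThreshold decision (default 10):
// lists longer than the threshold are converted to a set-based predicate.
def convert(values: Seq[Long], threshold: Int = 10): IdPredicate =
  if (values.length > threshold) InSet(values.toSet) else In(values)

println(convert(Seq(1L, 2L, 3L)))    // stays an In(...), so pruning applies
println(convert((1L to 20L).toSeq))  // becomes an InSet(...) unless the threshold is raised
```

Raising the threshold keeps long value lists as `In`, at the cost of slower per-row membership tests, which is the trade-off the note is pointing at.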
[jira] [Updated] (SPARK-16186) Support partition batch pruning with `IN` predicate in InMemoryTableScanExec
[ https://issues.apache.org/jira/browse/SPARK-16186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-16186:
----------------------------------
    Description:
One of the most frequent usage patterns for Spark SQL is using **cached tables**.
This issue improves `InMemoryTableScanExec` to handle the `IN` predicate efficiently by pruning partition batches.
Of course, the performance improvement varies across queries and datasets. For the following simple query, the query duration shown in the Spark UI drops from 9 seconds to 50~90 ms. It's over 100 times faster.
{code}
$ bin/spark-shell --driver-memory 6G
scala> val df = spark.range(20)
scala> df.createOrReplaceTempView("t")
scala> spark.catalog.cacheTable("t")
scala> sql("select id from t where id = 1").collect()        // About 2 mins
scala> sql("select id from t where id = 1").collect()        // less than 90ms
scala> sql("select id from t where id in (1,2,3)").collect() // 9 seconds
scala> spark.conf.set("spark.sql.inMemoryColumnarStorage.partitionPruningMaxInSize", 10) // Enable. (Just to show this example; currently the default value is 10.)
scala> sql("select id from t where id in (1,2,3)").collect() // less than 90ms
scala> spark.conf.set("spark.sql.inMemoryColumnarStorage.partitionPruningMaxInSize", 0)  // Disable
scala> sql("select id from t where id in (1,2,3)").collect() // 9 seconds
{code}
This issue impacts over 35 TPC-DS queries when the tables are cached.

  was:
(Same description, except that the second `spark.conf.set(...)` line was missing its `scala>` prompt.)
[jira] [Updated] (SPARK-16186) Support partition batch pruning with `IN` predicate in InMemoryTableScanExec
[ https://issues.apache.org/jira/browse/SPARK-16186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-16186:
----------------------------------
    Description:
One of the most frequent usage patterns for Spark SQL is using **cached tables**.
This issue improves `InMemoryTableScanExec` to handle the `IN` predicate efficiently by pruning partition batches.
Of course, the performance improvement varies across queries and datasets. For the following simple query, the query duration shown in the Spark UI drops from 9 seconds to 50~90 ms. It's about 100 times faster.
{code}
$ bin/spark-shell --driver-memory 6G
scala> val df = spark.range(20)
scala> df.createOrReplaceTempView("t")
scala> spark.catalog.cacheTable("t")
scala> sql("select id from t where id = 1").collect()        // About 2 mins
scala> sql("select id from t where id = 1").collect()        // less than 90ms
scala> sql("select id from t where id in (1,2,3)").collect() // 9 seconds
scala> spark.conf.set("spark.sql.inMemoryColumnarStorage.partitionPruningMaxInSize", 10) // Enable. (Just to show this example; currently the default value is 10.)
scala> sql("select id from t where id in (1,2,3)").collect() // less than 90ms
scala> spark.conf.set("spark.sql.inMemoryColumnarStorage.partitionPruningMaxInSize", 0)  // Disable
scala> sql("select id from t where id in (1,2,3)").collect() // 9 seconds
{code}
This issue impacts over 35 TPC-DS queries when the tables are cached.

  was:
(Same description, except that the last query's comment incorrectly read "// less than 90ms" instead of "// 9 seconds".)
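[Editor's note] For the *spark.sql.inMemoryColumnarStorage.partitionPruningMaxInSize* toggle used in the example, the behavior the shell session demonstrates can be sketched as a simple gate: pruning is attempted only while the `IN` list is no larger than the configured maximum, and 0 turns it off. This is an illustrative reading of the example, not Spark's internal code; `shouldPrune` is a hypothetical name.

```scala
// Gate suggested by the example: prune only when the knob is positive
// and the IN list does not exceed it.
def shouldPrune(inListSize: Int, maxInSize: Int): Boolean =
  maxInSize > 0 && inListSize <= maxInSize

println(shouldPrune(3, 10)) // the "Enable" step: list of 3, max 10 -> pruned
println(shouldPrune(3, 0))  // the "Disable" step: max 0 -> full scan
```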