[jira] [Commented] (SPARK-23309) Spark 2.3 cached query performance 20-30% worse then spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-23309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16528379#comment-16528379 ]

Dongjoon Hyun commented on SPARK-23309:
---

Thank you for updating, [~tgraves].

> Spark 2.3 cached query performance 20-30% worse then spark 2.2
> --
>
> Key: SPARK-23309
> URL: https://issues.apache.org/jira/browse/SPARK-23309
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Thomas Graves
> Priority: Major
>
> I was testing spark 2.3 rc2 and I am seeing a performance regression in sql
> queries on cached data.
> The size of the data: 10.4GB input from hive orc files /18.8 GB cached/5592
> partitions
> Here is the example query:
> val dailycached = spark.sql("select something from table where dt =
> '20170301' AND something IS NOT NULL")
> dailycached.createOrReplaceTempView("dailycached")
> spark.catalog.cacheTable("dailyCached")
> spark.sql("SELECT COUNT(DISTINCT(something)) from dailycached").show()
>
> On spark 2.2 I see queries times average 13 seconds
> On the same nodes I see spark 2.3 queries times average 17 seconds
> Note these are times of queries after the initial caching. so just running
> the last line again:
> spark.sql("SELECT COUNT(DISTINCT(something)) from dailycached").show()
> multiple times.
>
> I also ran a query over more data (335GB input/587.5 GB cached) and saw a
> similar discrepancy in the performance of querying cached data between spark
> 2.3 and spark 2.2, where 2.2 was better by like 20%.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
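The issue description compares average times over repeated runs of the cached query. A minimal sketch of how such a comparison could be measured is shown below; the object name `CachedQueryTiming` and both helper names are hypothetical and do not come from the thread or from Spark.

```scala
object CachedQueryTiming {
  /** Runs `body` `runs` times and returns the mean wall-clock time in seconds. */
  def averageSeconds(runs: Int)(body: => Unit): Double = {
    require(runs > 0, "runs must be positive")
    val totalNanos = (1 to runs).map { _ =>
      val start = System.nanoTime()
      body
      System.nanoTime() - start
    }.sum
    totalNanos / 1e9 / runs
  }

  /** Fractional slowdown of the candidate relative to the baseline (0.25 means 25% slower). */
  def relativeSlowdown(baselineSeconds: Double, candidateSeconds: Double): Double =
    (candidateSeconds - baselineSeconds) / baselineSeconds
}
```

In spark-shell, after the initial caching, this could be used as `CachedQueryTiming.averageSeconds(5) { spark.sql("SELECT COUNT(DISTINCT(something)) from dailycached").show() }`. With the averages reported above, `relativeSlowdown(13.0, 17.0)` is 4/13, so 2.3 is roughly 31% slower than 2.2 on the smaller dataset.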
[jira] [Commented] (SPARK-23309) Spark 2.3 cached query performance 20-30% worse then spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-23309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16527865#comment-16527865 ]

Thomas Graves commented on SPARK-23309:
---

We tried this on the newest 2.3.1 and haven't been able to reproduce it, so closing.
[jira] [Commented] (SPARK-23309) Spark 2.3 cached query performance 20-30% worse then spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-23309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16491233#comment-16491233 ]

Xiao Li commented on SPARK-23309:
---

[~vanzin] https://issues.apache.org/jira/browse/SPARK-24373 is not related to this JIRA. This JIRA uses pure SQL and thus will not hit the problem caused by AnalysisBarrier.
[jira] [Commented] (SPARK-23309) Spark 2.3 cached query performance 20-30% worse then spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-23309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489391#comment-16489391 ]

Marcelo Vanzin commented on SPARK-23309:
---

[~kiszk] SPARK-24373 has some code.
[jira] [Commented] (SPARK-23309) Spark 2.3 cached query performance 20-30% worse then spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-23309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482476#comment-16482476 ]

Truong Duc Kien commented on SPARK-23309:
---

We're also having a performance problem with cached queries on Spark 2.3. Once in a while, a query takes an abnormally long time. We looked at the thread dump and saw the executor waiting to fetch remote cached blocks, which progresses very slowly. It seems to be a run-time bug, because if we run the same query again, the slowdown might go away. This did not happen with 2.2.1.
[jira] [Commented] (SPARK-23309) Spark 2.3 cached query performance 20-30% worse then spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-23309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357441#comment-16357441 ]

Kazuaki Ishizaki commented on SPARK-23309:
---

When there is a repro, I am happy to investigate the reason.
[jira] [Commented] (SPARK-23309) Spark 2.3 cached query performance 20-30% worse then spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-23309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357363#comment-16357363 ]

Sameer Agarwal commented on SPARK-23309:
---

Thanks, I'll then go ahead and downgrade the priority for now to unblock RC3. Please feel free to -1 the RC if there's a repro.
[jira] [Commented] (SPARK-23309) Spark 2.3 cached query performance 20-30% worse then spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-23309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357346#comment-16357346 ]

Thomas Graves commented on SPARK-23309:
---

Sorry, I haven't had time to make a query/dataset to reproduce that. I'm ok with this not being a blocker for 2.3.
[jira] [Commented] (SPARK-23309) Spark 2.3 cached query performance 20-30% worse then spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-23309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357344#comment-16357344 ]

Sameer Agarwal commented on SPARK-23309:
---

[~tgraves] [~smilegator] [~cloud_fan], any advice here? If we'd like this to continue to block the release, it'd be good to have a repro.
[jira] [Commented] (SPARK-23309) Spark 2.3 cached query performance 20-30% worse then spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-23309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16354967#comment-16354967 ]

Wenchen Fan commented on SPARK-23309:
---

Is it possible to provide a concrete query (with table schema) to demonstrate the performance regression? Looking at the code, I can't find any place that could contribute to this regression. We need to do some profiling, and this issue may be caused by something else (e.g. aggregate), not the cache.
[jira] [Commented] (SPARK-23309) Spark 2.3 cached query performance 20-30% worse then spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-23309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16354336#comment-16354336 ]

Thomas Graves commented on SPARK-23309:
---

I pulled in that patch ([https://github.com/apache/spark/pull/20513]) and the numbers got better, but I am still seeing 2.3 about 10% slower (down from 15%).

This is using the configs:
--conf spark.sql.orc.impl=hive
--conf spark.sql.orc.filterPushdown=true
--conf spark.sql.hive.convertMetastoreOrc=false
--conf spark.sql.inMemoryColumnarStorage.enableVectorizedReader=false

Has anyone else reproduced this, or is it only me seeing it?
[jira] [Commented] (SPARK-23309) Spark 2.3 cached query performance 20-30% worse then spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-23309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16353334#comment-16353334 ]

Wenchen Fan commented on SPARK-23309:
---

Looking at the code, the only difference between 2.3 and 2.2 when the columnar cache reader is disabled is whole-stage codegen. I've sent [https://github.com/apache/spark/pull/20513] to keep the behavior exactly the same as 2.2 when the columnar cache reader is disabled.
[jira] [Commented] (SPARK-23309) Spark 2.3 cached query performance 20-30% worse then spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-23309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350901#comment-16350901 ]

Thomas Graves commented on SPARK-23309:
---

I should ask: is there a log statement or query plan I can dump out just to make sure spark.sql.inMemoryColumnarStorage.enableVectorizedReader=false was applied properly?
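One way to approach the question above, sketched under assumptions: read the setting back through the session's runtime configuration and print the plans for the cached query. `spark.conf.get` and `Dataset.explain` are standard Spark SQL calls; the object and method names below are hypothetical, and the thread does not say what the printed plan should look like in either mode.

```scala
import org.apache.spark.sql.SparkSession

object VerifyCacheReaderConf {
  // Configuration key quoted in the comment above.
  val VectorizedReaderKey =
    "spark.sql.inMemoryColumnarStorage.enableVectorizedReader"

  /** Returns the value the session resolved for the cache reader flag. */
  def effectiveSetting(spark: SparkSession): String =
    spark.conf.get(VectorizedReaderKey)

  /** Prints the effective setting, then the logical and physical plans of the query. */
  def dump(spark: SparkSession, query: String): Unit = {
    println(s"$VectorizedReaderKey = ${effectiveSetting(spark)}")
    spark.sql(query).explain(true)
  }
}
```

In spark-shell this could be called as `VerifyCacheReaderConf.dump(spark, "SELECT COUNT(DISTINCT(something)) from dailycached")`. On a build that does not know the key and where it was never set, `spark.conf.get` may throw instead of returning a value.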
[jira] [Commented] (SPARK-23309) Spark 2.3 cached query performance 20-30% worse then spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-23309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350900#comment-16350900 ]

Thomas Graves commented on SPARK-23309:
---

So the last test I did was spark 2.3 with the old hive path and spark 2.2. Spark 2.3 is slower than spark 2.2 reading the cached data.

[~smilegator] I already tried the patch; see the last config I tested with, where
--conf spark.sql.inMemoryColumnarStorage.enableVectorizedReader=false
[jira] [Commented] (SPARK-23309) Spark 2.3 cached query performance 20-30% worse then spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-23309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350857#comment-16350857 ]

Xiao Li commented on SPARK-23309:
---

Based on my understanding of what [~tgraves] said above, the number of partitions is different between our ORC reader and the Hive-serde reader because we do not respect Hive confs. Now the performance regression is observed when we read cached data. This should not be related to Hive.

https://issues.apache.org/jira/browse/SPARK-23312 has been merged. Thus, maybe [~tgraves] can try that patch and see whether the performance regression is gone after setting {{spark.sql.inMemoryColumnarStorage.enableVectorizedReader}} to {{false}}?
[jira] [Commented] (SPARK-23309) Spark 2.3 cached query performance 20-30% worse then spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-23309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350846#comment-16350846 ]

Dongjoon Hyun commented on SPARK-23309:
---

To sum up, the same Hive code (old Hive path) is used in Spark 2.3 and Spark 2.2, and Spark 2.3 is slower than Spark 2.2.
[jira] [Commented] (SPARK-23309) Spark 2.3 cached query performance 20-30% worse then spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-23309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350813#comment-16350813 ]

Thomas Graves commented on SPARK-23309:

I'm still seeing Spark 2.3 slower by about 15% for the larger dataset. I tried:

  --conf spark.sql.orc.impl=hive --conf spark.sql.orc.filterPushdown=true --conf spark.sql.hive.convertMetastoreOrc=false

and then also tried setting the vectorized reader to false:

  --conf spark.sql.orc.impl=hive --conf spark.sql.orc.filterPushdown=true --conf spark.sql.hive.convertMetastoreOrc=false --conf spark.sql.inMemoryColumnarStorage.enableVectorizedReader=false

Note that the number of partitions it's processing is now the same since turning off the native ORC impl.
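The settings passed via --conf above could also be applied when the session is built. The following is only a sketch, assuming a Spark 2.3 build with Hive support available; the application name is a placeholder and not something from this thread.

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: the config keys and values are the ones listed in the comment above;
// "spark-23309-repro" is a hypothetical application name.
val spark = SparkSession.builder()
  .appName("spark-23309-repro")
  .enableHiveSupport()
  // Use the old Hive ORC path rather than the native ORC implementation.
  .config("spark.sql.orc.impl", "hive")
  .config("spark.sql.orc.filterPushdown", "true")
  .config("spark.sql.hive.convertMetastoreOrc", "false")
  // Turn off the vectorized reader for cached (in-memory columnar) data.
  .config("spark.sql.inMemoryColumnarStorage.enableVectorizedReader", "false")
  .getOrCreate()
```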
[jira] [Commented] (SPARK-23309) Spark 2.3 cached query performance 20-30% worse then spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-23309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350533#comment-16350533 ]

Thomas Graves commented on SPARK-23309:

Note the schema of "something" here is a "string".

I'll try with the changes in SPARK-23312 and turn off the vectorized cache reader.

I'm also running 2.3 with the configs --conf spark.sql.orc.impl=hive --conf spark.sql.orc.filterPushdown=true --conf spark.sql.hive.convertMetastoreOrc=false, which should be the same as 2.2, and it gives me the same number of partitions.
[jira] [Commented] (SPARK-23309) Spark 2.3 cached query performance 20-30% worse then spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-23309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349885#comment-16349885 ]

Dongjoon Hyun commented on SPARK-23309:

We are still investigating this, but is this a regression due to SPARK-22392 (data source v2 columnar batch reader)?
[jira] [Commented] (SPARK-23309) Spark 2.3 cached query performance 20-30% worse then spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-23309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349827#comment-16349827 ]

Wenchen Fan commented on SPARK-23309:

I propose adding a config to turn off the vectorized cache reader: https://issues.apache.org/jira/browse/SPARK-23312

This is a workaround for the performance regression, so that 2.3 doesn't need to be blocked by this. We should continue to investigate the real problem.
[jira] [Commented] (SPARK-23309) Spark 2.3 cached query performance 20-30% worse then spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-23309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349770#comment-16349770 ]

Wenchen Fan commented on SPARK-23309:

We need to know the schema of your cached data to figure out what's going on.
[jira] [Commented] (SPARK-23309) Spark 2.3 cached query performance 20-30% worse then spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-23309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349618#comment-16349618 ]

Dongjoon Hyun commented on SPARK-23309:

Yep. I'll make a PR for the migration guide. For the conf (hive.exec.orc.split.strategy), I think the same as you. Since I haven't tried it before, I want to make sure.
[jira] [Commented] (SPARK-23309) Spark 2.3 cached query performance 20-30% worse then spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-23309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349606#comment-16349606 ]

Xiao Li commented on SPARK-23309:

[~dongjoon] We need to document it in the migration guide. Basically, we ignore the conf specified by users, right?
[jira] [Commented] (SPARK-23309) Spark 2.3 cached query performance 20-30% worse then spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-23309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349605#comment-16349605 ]

Xiao Li commented on SPARK-23309:

What is the data type of `something`?
[jira] [Commented] (SPARK-23309) Spark 2.3 cached query performance 20-30% worse then spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-23309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349588#comment-16349588 ]

Dongjoon Hyun commented on SPARK-23309:

Thank you for confirming the non-cache case, [~tgraves]. For `hive.exec.orc.split.strategy`, I'll check and reply on SPARK-23304.
[jira] [Commented] (SPARK-23309) Spark 2.3 cached query performance 20-30% worse then spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-23309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349576#comment-16349576 ]

Xiao Li commented on SPARK-23309:

Just to confirm: the cached data has only one column, whose type is bigint. How many rows do you have? Could you try some simpler queries, such as {{SELECT COUNT(something) from dailycached}}?
[jira] [Commented] (SPARK-23309) Spark 2.3 cached query performance 20-30% worse then spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-23309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349553#comment-16349553 ]

Thomas Graves commented on SPARK-23309:

[~dongjoon] Is there any native way with the native hive to control the number of partitions (like hive.exec.orc.split.strategy)? Or do you have to do the coalesce?
[jira] [Commented] (SPARK-23309) Spark 2.3 cached query performance 20-30% worse then spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-23309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349551#comment-16349551 ]

Thomas Graves commented on SPARK-23309:

I'm seeing the same time difference after adding in the spark.table("dailyCached").count().

[~dongjoon] Correct, this is only when reading from cached data. Without caching, Spark 2.3 is quite a bit faster (1.5-2x+) than Spark 2.2 when reading from Hive using ORC (which is awesome, thanks for all the work!).

I'm running now with --conf spark.sql.orc.impl=hive --conf spark.sql.hive.convertMetastoreOrc=false. For the smaller data set it did get closer: only a 1 second difference on average between Spark 2.2 and Spark 2.3. I'm trying to run on the larger dataset now.

I'm wondering if much of the difference is the larger number of partitions you get with hive native in Spark 2.3.
[jira] [Commented] (SPARK-23309) Spark 2.3 cached query performance 20-30% worse then spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-23309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349532#comment-16349532 ]

Dongjoon Hyun commented on SPARK-23309:

According to the issue title, there is no regression without caching, is there?
[jira] [Commented] (SPARK-23309) Spark 2.3 cached query performance 20-30% worse then spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-23309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349516#comment-16349516 ]

Xiao Li commented on SPARK-23309:

[~tgraves] We are just hoping the new Hive reader does not introduce a regression. Otherwise, we might need to change the default value of the ORC reader to hive.

I expect the cache reader to be faster after the PR https://github.com/apache/spark/pull/18747. If not, we might need to change the code to avoid the performance regression.
[jira] [Commented] (SPARK-23309) Spark 2.3 cached query performance 20-30% worse then spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-23309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349510#comment-16349510 ]

Dongjoon Hyun commented on SPARK-23309:

Thank you for reporting this, [~tgraves]. In addition to `spark.sql.orc.impl=hive`, could you try this with the Parquet file format, too?
[jira] [Commented] (SPARK-23309) Spark 2.3 cached query performance 20-30% worse then spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-23309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349483#comment-16349483 ]

Thomas Graves commented on SPARK-23309:

Sure, I can also run with the --conf spark.sql.orc.impl=hive --conf spark.sql.orc.filterPushdown=false --conf spark.sql.hive.convertMetastoreOrc=false configs to make sure it doesn't just have to do with the number of partitions.
[jira] [Commented] (SPARK-23309) Spark 2.3 cached query performance 20-30% worse then spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-23309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349470#comment-16349470 ]

Xiao Li commented on SPARK-23309:

[~tgraves] Could you first run count before you run the show?

{noformat}
val dailycached = spark.sql("select something from table where dt = '20170301' AND something IS NOT NULL")
dailycached.createOrReplaceTempView("dailycached")
spark.catalog.cacheTable("dailyCached")
spark.table("dailyCached").count()
spark.sql("SELECT COUNT(DISTINCT(something)) from dailycached").show()
{noformat}

As you know, our cache is lazy: the query that fills the cache runs when the cache is first read. I just want to see whether this regression comes from the caching or from the query reading.
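To compare only the warm reads, after the count() above has filled the cache, the repeated query could be timed with a small helper. This is a sketch under stated assumptions: `WarmQueryTimer` and `averageMillis` are hypothetical names that are not part of Spark, and the usage shown in the comments assumes a spark-shell session with the `dailycached` view already cached.

```scala
object WarmQueryTimer {
  /** Runs `body` `runs` times and returns the mean wall-clock time in milliseconds. */
  def averageMillis(runs: Int)(body: => Unit): Double = {
    require(runs > 0, "runs must be positive")
    val timings = (1 to runs).map { _ =>
      val start = System.nanoTime()
      body // evaluate the by-name block once per run
      (System.nanoTime() - start) / 1e6
    }
    timings.sum / runs
  }
}

// Hypothetical usage in spark-shell, once the cache has been materialized:
// val avg = WarmQueryTimer.averageMillis(5) {
//   spark.sql("SELECT COUNT(DISTINCT(something)) from dailycached").show()
// }
// println(f"average warm query time: $avg%.0f ms")
```

Running the same helper on Spark 2.2 and Spark 2.3 on the same nodes would give averages comparable to the figures quoted in the issue description.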
[jira] [Commented] (SPARK-23309) Spark 2.3 cached query performance 20-30% worse then spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-23309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349431#comment-16349431 ]

Thomas Graves commented on SPARK-23309:

I'm curious if anyone else is seeing the same behavior?