[jira] [Commented] (SPARK-24486) Slow performance reading ArrayType columns
[ https://issues.apache.org/jira/browse/SPARK-24486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16790409#comment-16790409 ]

Luca Canali commented on SPARK-24486:
-------------------------------------

Thanks [~yumwang] for looking at this. Indeed, I confirm that using collect instead of show is faster. In addition, testing on Spark master (March 12, 2019) I see that show runs fast there too (I have not yet looked at which PR fixed this).

> Slow performance reading ArrayType columns
> ------------------------------------------
>
>                 Key: SPARK-24486
>                 URL: https://issues.apache.org/jira/browse/SPARK-24486
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 2.3.0
>            Reporter: Luca Canali
>            Priority: Minor
>
> We have found a slow-performance issue in one of our applications when running on Spark 2.3.0 (the same workload does not have a performance issue on Spark 2.2.1). We suspect a regression in the handling of ArrayType columns. I have built a simplified test case showing a manifestation of the issue to help with troubleshooting:
>
> {code:java}
> // prepare test data
> val stringListValues = Range(1, 3).mkString(",")
> sql(s"select 1 as myid, Array($stringListValues) as myarray from range(2)").repartition(1).write.parquet("file:///tmp/deleteme1")
>
> // run test
> spark.read.parquet("file:///tmp/deleteme1").limit(1).show()
> {code}
>
> Performance measurements: on a desktop-size test system, the test runs in about 2 sec using Spark 2.2.1 (runtime goes down to subsecond in subsequent runs) and takes close to 20 sec on Spark 2.3.0.
>
> Additional drill-down using Spark task metrics data shows that in Spark 2.2.1 only 2 records are read by this workload, while on Spark 2.3.0 all rows in the file are read, which appears anomalous.
> Example:
>
> {code:java}
> bin/spark-shell --master local[*] --driver-memory 2g --packages ch.cern.sparkmeasure:spark-measure_2.11:0.11
>
> val stageMetrics = ch.cern.sparkmeasure.StageMetrics(spark)
> stageMetrics.runAndMeasure(spark.read.parquet("file:///tmp/deleteme1").limit(1).show())
> {code}
>
> Selected metrics from the Spark 2.3.0 run:
>
> {noformat}
> elapsedTime => 17849 (18 s)
> sum(numTasks) => 11
> sum(recordsRead) => 2
> sum(bytesRead) => 1136448171 (1083.0 MB)
> {noformat}
>
> From the Spark 2.2.1 run:
>
> {noformat}
> elapsedTime => 1329 (1 s)
> sum(numTasks) => 2
> sum(recordsRead) => 2
> sum(bytesRead) => 269162610 (256.0 MB)
> {noformat}
>
> Note: Spark built from master (as I write this, June 7th 2018) shows the same behavior as Spark 2.3.0.
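For reference, a minimal spark-shell sketch of the show-versus-collect comparison mentioned above (it assumes the test file from the issue description already exists at file:///tmp/deleteme1; the timed helper is illustrative, not part of the report):

{code:java}
// Illustrative helper to time an action in the spark-shell
def timed[T](label: String)(f: => T): T = {
  val t0 = System.nanoTime()
  val result = f
  println(s"$label: ${(System.nanoTime() - t0) / 1e9} s")
  result
}

val df = spark.read.parquet("file:///tmp/deleteme1").limit(1)

// On Spark 2.3.0, show() was reported slow while collect() was fast
timed("show")    { df.show() }
timed("collect") { df.collect() }
{code}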
[jira] [Commented] (SPARK-24486) Slow performance reading ArrayType columns
[ https://issues.apache.org/jira/browse/SPARK-24486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16624723#comment-16624723 ]

Yuming Wang commented on SPARK-24486:
-------------------------------------

I can reproduce this issue. I'm working on it.
[jira] [Commented] (SPARK-24486) Slow performance reading ArrayType columns
[ https://issues.apache.org/jira/browse/SPARK-24486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16505192#comment-16505192 ]

Luca Canali commented on SPARK-24486:
-------------------------------------

Thanks for your comment. In all 3 cases (Spark 2.2.1, 2.3.0 and the latest version built from master) I am using a simple test workload to investigate the issue, that is:

spark.read.parquet("file:///tmp/deleteme1").limit(1).show()

The output is simply the first row of the test table, that is an int value "1" and an array of 3 int elements. I'll be happy to provide additional info on the tests and workload. BTW, it should be straightforward to reproduce this issue in a test environment if you can spare the time.
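A quick spark-shell sketch for inspecting what the test workload returns (the session below is illustrative; it assumes the test file written in the issue description):

{code:java}
val df = spark.read.parquet("file:///tmp/deleteme1")

// Schema of the test table: an int column plus an array column
df.printSchema()

// Fetch only the first row to the driver, as in the reported workload
df.limit(1).collect().foreach(println)
{code}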
[jira] [Commented] (SPARK-24486) Slow performance reading ArrayType columns
[ https://issues.apache.org/jira/browse/SPARK-24486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16504845#comment-16504845 ]

Kazuaki Ishizaki commented on SPARK-24486:
------------------------------------------

Thank you for reporting this problem. Could you please let us know which value is shown for each of the three `sum(...)` results?
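For context on where those sums come from: they aggregate Spark task metrics across all tasks of the measured action. A rough sketch of an equivalent measurement using only Spark's listener API (an alternative to the sparkMeasure tool used in the report, not the tool itself; the accumulator names are illustrative):

{code:java}
import java.util.concurrent.atomic.LongAdder
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Per-task input metrics, summed across tasks; these correspond to
// sum(numTasks), sum(recordsRead) and sum(bytesRead) in the report
val numTasks = new LongAdder
val recordsRead = new LongAdder
val bytesRead = new LongAdder

spark.sparkContext.addSparkListener(new SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    numTasks.increment()
    // taskMetrics can be null for failed tasks, hence the Option guard
    Option(taskEnd.taskMetrics).foreach { m =>
      recordsRead.add(m.inputMetrics.recordsRead)
      bytesRead.add(m.inputMetrics.bytesRead)
    }
  }
})

spark.read.parquet("file:///tmp/deleteme1").limit(1).show()

// Listener events are delivered asynchronously, so the sums may lag
// slightly behind the end of the action
println(s"sum(numTasks)    => ${numTasks.sum}")
println(s"sum(recordsRead) => ${recordsRead.sum}")
println(s"sum(bytesRead)   => ${bytesRead.sum}")
{code}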