[jira] [Commented] (SPARK-24486) Slow performance reading ArrayType columns

2019-03-12 Thread Luca Canali (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-24486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16790409#comment-16790409 ]

Luca Canali commented on SPARK-24486:
-

Thanks [~yumwang] for looking at this. I confirm that using collect instead of
show is faster. In addition, testing on Spark master (March 12, 2019), I see
that show is fast there too (I have not yet looked at which PR fixed this).
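
For reference, a minimal sketch of the kind of comparison behind this observation, using spark.time for rough wall-clock timing (the parquet path is the one from the test case quoted below):
{code:scala}
// Rough wall-clock comparison of show() vs collect() on the test data
// (sketch only; assumes the parquet file written by the reproduction step below)
val df = spark.read.parquet("file:///tmp/deleteme1").limit(1)

spark.time(df.show())     // slow on Spark 2.3.0, fast on 2.2.1 and on current master
spark.time(df.collect())  // faster than show() on Spark 2.3.0, per the observation above
{code}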

> Slow performance reading ArrayType columns
> --
>
> Key: SPARK-24486
> URL: https://issues.apache.org/jira/browse/SPARK-24486
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: Luca Canali
>Priority: Minor
>
> We have found a slow-performance issue in one of our applications when running
> on Spark 2.3.0 (the same workload shows no performance issue on Spark 2.2.1).
> We suspect a regression in the handling of ArrayType columns. I have built a
> simplified test case that reproduces the issue, to help with troubleshooting:
>  
>  
> {code:java}
> // prepare test data
> val stringListValues=Range(1,3).mkString(",")
> sql(s"select 1 as myid, Array($stringListValues) as myarray from 
> range(2)").repartition(1).write.parquet("file:///tmp/deleteme1")
> // run test
> spark.read.parquet("file:///tmp/deleteme1").limit(1).show(){code}
> Performance measurements:
>  
> On a desktop-size test system, the test runs in about 2 seconds on Spark 2.2.1
> (dropping to subsecond in subsequent runs) and takes close to 20 seconds on
> Spark 2.3.0.
>  
> Additional drill-down using Spark task metrics data shows that on Spark 2.2.1
> only 2 records are read by this workload, while on Spark 2.3.0 all rows in 
> the file are read, which appears anomalous.
> Example:
> {code:java}
> bin/spark-shell --master local[*] --driver-memory 2g --packages ch.cern.sparkmeasure:spark-measure_2.11:0.11
> val stageMetrics = ch.cern.sparkmeasure.StageMetrics(spark) 
> stageMetrics.runAndMeasure(spark.read.parquet("file:///tmp/deleteme1").limit(1).show())
> {code}
>  
>  
> Selected metrics from the Spark 2.3.0 run:
>  
> {noformat}
> elapsedTime => 17849 (18 s)
> sum(numTasks) => 11
> sum(recordsRead) => 2
> sum(bytesRead) => 1136448171 (1083.0 MB){noformat}
>  
>  
> From the Spark 2.2.1 run:
>  
> {noformat}
> elapsedTime => 1329 (1 s)
> sum(numTasks) => 2
> sum(recordsRead) => 2
> sum(bytesRead) => 269162610 (256.0 MB)
> {noformat}
>  
> Note: Spark built from master (as of this writing, June 7, 2018) shows the
> same behavior as Spark 2.3.0.
>  






[jira] [Commented] (SPARK-24486) Slow performance reading ArrayType columns

2018-09-22 Thread Yuming Wang (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-24486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16624723#comment-16624723 ]

Yuming Wang commented on SPARK-24486:
-

I can reproduce this issue. I'm working on it.



[jira] [Commented] (SPARK-24486) Slow performance reading ArrayType columns

2018-06-07 Thread Luca Canali (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-24486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16505192#comment-16505192 ]

Luca Canali commented on SPARK-24486:
-

Thanks for your comment.

In all three cases (Spark 2.2.1, 2.3.0, and the latest build from master) I am
using the same simple test workload to investigate the issue, namely:
spark.read.parquet("file:///tmp/deleteme1").limit(1).show()
The output is simply the first row of the test table, that is, an int value "1"
and an array of 3 int elements.

I'll be happy to provide additional info on the tests and workload. BTW, it
should be straightforward to reproduce this issue in a test environment if you
can spare the time.
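
For completeness, a small sketch of how the returned row can be inspected (column names come from the SQL in the issue description; nothing else is assumed):
{code:scala}
// Sketch: the workload under test and a quick check of the single row it returns
// (array length depends on the Range(...) used when writing the test data)
val df = spark.read.parquet("file:///tmp/deleteme1").limit(1)

df.show()                               // prints the one row: myid plus myarray
val row = df.collect().head
println(row.getAs[Int]("myid"))         // expected: 1
println(row.getAs[Seq[Int]]("myarray")) // the int array written in the test-data step
{code}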

 



[jira] [Commented] (SPARK-24486) Slow performance reading ArrayType columns

2018-06-07 Thread Kazuaki Ishizaki (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-24486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16504845#comment-16504845 ]

Kazuaki Ishizaki commented on SPARK-24486:
--

Thank you for reporting the problem.
Could you please let us know which value is shown for each of the three
`sum(...)` results?
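
If it helps in collecting them, here is a sketch of how the aggregated values could be printed with sparkMeasure (the printReport() call is an assumption about the sparkMeasure API; adjust if your version differs):
{code:scala}
// Sketch: measure the test workload and print the aggregated stage metrics
// (printReport() is assumed to be available in spark-measure_2.11:0.11)
val stageMetrics = ch.cern.sparkmeasure.StageMetrics(spark)
stageMetrics.runAndMeasure(spark.read.parquet("file:///tmp/deleteme1").limit(1).show())
stageMetrics.printReport()  // includes elapsedTime, sum(numTasks), sum(recordsRead), sum(bytesRead)
{code}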




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org