[jira] [Resolved] (SPARK-26140) Enable custom shuffle metrics implementation in shuffle reader

2018-11-23 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-26140.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

> Enable custom shuffle metrics implementation in shuffle reader
> --
>
> Key: SPARK-26140
> URL: https://issues.apache.org/jira/browse/SPARK-26140
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
> Fix For: 3.0.0
>
>
> The first step is to pull the creation of TempShuffleReadMetrics out of the 
> shuffle layer so that it can be driven by an external caller. Then, in SQL 
> execution, we can pass in a special metrics reporter that allows updating 
> ShuffleExchangeExec's metrics.
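For illustration only, a minimal Scala sketch of what a caller-supplied shuffle read metrics reporter could look like; the trait, class, and metric names here are hypothetical and not the interface that actually landed.

{code:scala}
// Hypothetical sketch of a caller-supplied shuffle read metrics reporter;
// the names are illustrative, not Spark's actual API.
trait ShuffleReadMetricsReporter {
  def incRemoteBlocksFetched(v: Long): Unit
  def incLocalBlocksFetched(v: Long): Unit
  def incRecordsRead(v: Long): Unit
}

// A SQL-side implementation could forward updates into the metrics of an
// exchange operator instead of the core TempShuffleReadMetrics.
class SQLShuffleReadMetricsReporter(update: (String, Long) => Unit)
    extends ShuffleReadMetricsReporter {
  override def incRemoteBlocksFetched(v: Long): Unit = update("remoteBlocksFetched", v)
  override def incLocalBlocksFetched(v: Long): Unit = update("localBlocksFetched", v)
  override def incRecordsRead(v: Long): Unit = update("recordsRead", v)
}
{code}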



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26146) CSV wouldn't be ingested in Spark 2.4.0 with Scala 2.12

2018-11-23 Thread Anders Eriksson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697533#comment-16697533
 ] 

Anders Eriksson commented on SPARK-26146:
-

I also ran into this bug. I too could avoid it by reverting from Scala version 
2.12 to 2.11.

> CSV wouldn't be ingested in Spark 2.4.0 with Scala 2.12
> --
>
> Key: SPARK-26146
> URL: https://issues.apache.org/jira/browse/SPARK-26146
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.0
>Reporter: Jean Georges Perrin
>Priority: Major
>
> Ingestion of a CSV file seems to fail with Spark v2.4.0 and Scala v2.12, 
> whereas it works ok with Scala v2.11.
> When running a simple CSV ingestion like:
> {code:java}
>     // Creates a session on a local master
>     SparkSession spark = SparkSession.builder()
>         .appName("CSV to Dataset")
>         .master("local")
>         .getOrCreate();
>     // Reads a CSV file with header, called books.csv, stores it in a 
> dataframe
>     Dataset<Row> df = spark.read().format("csv")
>         .option("header", "true")
>         .load("data/books.csv");
> {code}
>   With Scala 2.12, I get: 
> {code:java}
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 10582
> at 
> com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.accept(BytecodeReadingParanamer.java:563)
> at 
> com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.access$200(BytecodeReadingParanamer.java:338)
> at 
> com.thoughtworks.paranamer.BytecodeReadingParanamer.lookupParameterNames(BytecodeReadingParanamer.java:103)
> at 
> com.thoughtworks.paranamer.CachingParanamer.lookupParameterNames(CachingParanamer.java:90)
> at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.getCtorParams(BeanIntrospector.scala:44)
> at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1(BeanIntrospector.scala:58)
> at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1$adapted(BeanIntrospector.scala:58)
> at 
> scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:240)
> ...
> at 
> net.jgp.books.sparkWithJava.ch01.CsvToDataframeApp.start(CsvToDataframeApp.java:37)
> at 
> net.jgp.books.sparkWithJava.ch01.CsvToDataframeApp.main(CsvToDataframeApp.java:21)
> {code}
> Whereas it works pretty smoothly if I switch back to 2.11.
> Full example available at 
> [https://github.com/jgperrin/net.jgp.books.sparkWithJava.ch01]. You can 
> modify pom.xml to easily change the Scala version in the properties section:
> {code:xml}
> <properties>
>   <!-- The XML tags were stripped from the original message; the property names
>        below are reconstructed to match the values shown. -->
>   <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
>   <java.version>1.8</java.version>
>   <scala.version>2.11</scala.version>
>   <spark.version>2.4.0</spark.version>
> </properties>
> {code}
>  
> (P.S. It's my first bug submission, so I hope I did not mess up too much; be 
> tolerant if I did.)
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26142) Implement shuffle read metrics in SQL

2018-11-23 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26142:


Assignee: (was: Apache Spark)

> Implement shuffle read metrics in SQL
> -
>
> Key: SPARK-26142
> URL: https://issues.apache.org/jira/browse/SPARK-26142
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26142) Implement shuffle read metrics in SQL

2018-11-23 Thread Reynold Xin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin reassigned SPARK-26142:
---

Assignee: Yuanjian Li

> Implement shuffle read metrics in SQL
> -
>
> Key: SPARK-26142
> URL: https://issues.apache.org/jira/browse/SPARK-26142
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Assignee: Yuanjian Li
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26142) Implement shuffle read metrics in SQL

2018-11-23 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697525#comment-16697525
 ] 

Apache Spark commented on SPARK-26142:
--

User 'xuanyuanking' has created a pull request for this issue:
https://github.com/apache/spark/pull/23128

> Implement shuffle read metrics in SQL
> -
>
> Key: SPARK-26142
> URL: https://issues.apache.org/jira/browse/SPARK-26142
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26142) Implement shuffle read metrics in SQL

2018-11-23 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26142:


Assignee: Apache Spark

> Implement shuffle read metrics in SQL
> -
>
> Key: SPARK-26142
> URL: https://issues.apache.org/jira/browse/SPARK-26142
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26139) Support passing shuffle metrics to exchange operator

2018-11-23 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26139:


Assignee: Reynold Xin  (was: Apache Spark)

> Support passing shuffle metrics to exchange operator
> 
>
> Key: SPARK-26139
> URL: https://issues.apache.org/jira/browse/SPARK-26139
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
>
> Due to the way Spark is architected (SQL is defined on top of the RDD API), 
> there are two separate metrics systems used in core vs. SQL. Ideally, we'd want 
> to be able to get the shuffle metrics for each exchange operator 
> independently, e.g. blocks read and number of records.
>  
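As a rough way to see the gap the ticket describes, the sketch below (assuming an active SparkSession named `spark`; it is illustrative, not part of the proposed change) prints the per-operator SQL metrics of an executed plan, which is where per-exchange shuffle metrics would ideally show up.

{code:scala}
// Illustrative only, assuming an active SparkSession named `spark`.
val df = spark.range(0, 1000).repartition(4).groupBy("id").count()
df.collect()

// Walk the executed physical plan and print each operator's SQL metrics; the
// ticket is about exchange operators reporting their own shuffle reads here.
df.queryExecution.executedPlan.foreach { op =>
  op.metrics.foreach { case (name, metric) =>
    println(s"${op.nodeName}: $name = ${metric.value}")
  }
}
{code}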



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26139) Support passing shuffle metrics to exchange operator

2018-11-23 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697513#comment-16697513
 ] 

Apache Spark commented on SPARK-26139:
--

User 'xuanyuanking' has created a pull request for this issue:
https://github.com/apache/spark/pull/23128

> Support passing shuffle metrics to exchange operator
> 
>
> Key: SPARK-26139
> URL: https://issues.apache.org/jira/browse/SPARK-26139
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
>
> Due to the way Spark is architected (SQL is defined on top of the RDD API), 
> there are two separate metrics systems used in core vs. SQL. Ideally, we'd want 
> to be able to get the shuffle metrics for each exchange operator 
> independently, e.g. blocks read and number of records.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26139) Support passing shuffle metrics to exchange operator

2018-11-23 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26139:


Assignee: Apache Spark  (was: Reynold Xin)

> Support passing shuffle metrics to exchange operator
> 
>
> Key: SPARK-26139
> URL: https://issues.apache.org/jira/browse/SPARK-26139
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Assignee: Apache Spark
>Priority: Major
>
> Due to the way Spark is architected (SQL is defined on top of the RDD API), 
> there are two separate metrics systems used in core vs. SQL. Ideally, we'd want 
> to be able to get the shuffle metrics for each exchange operator 
> independently, e.g. blocks read and number of records.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26159) Codegen for LocalTableScanExec

2018-11-23 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697486#comment-16697486
 ] 

Apache Spark commented on SPARK-26159:
--

User 'juliuszsompolski' has created a pull request for this issue:
https://github.com/apache/spark/pull/23127

> Codegen for LocalTableScanExec
> --
>
> Key: SPARK-26159
> URL: https://issues.apache.org/jira/browse/SPARK-26159
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Juliusz Sompolski
>Priority: Major
>
> Do codegen for LocalTableScanExec.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26038) Decimal toScalaBigInt/toJavaBigInteger do not work for decimals not fitting in a long

2018-11-23 Thread Herman van Hovell (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell reassigned SPARK-26038:
-

Assignee: Juliusz Sompolski

> Decimal toScalaBigInt/toJavaBigInteger do not work for decimals not fitting in 
> a long
> 
>
> Key: SPARK-26038
> URL: https://issues.apache.org/jira/browse/SPARK-26038
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.0, 2.4.0
>Reporter: Juliusz Sompolski
>Assignee: Juliusz Sompolski
>Priority: Major
>
> Decimal toScalaBigInt/toJavaBigInteger just call toLong, which overflows for 
> decimals that do not fit in a long.
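For context, a plain-Scala illustration (not Spark's Decimal class itself) of why routing the conversion through toLong breaks once the value no longer fits in a Long:

{code:scala}
// Plain Scala, not Spark's Decimal: converting through toLong silently yields a
// wrong value for numbers outside the Long range, while converting through the
// exact BigDecimal representation does not.
val big = BigDecimal("123456789012345678901234567890")

val viaLong: BigInt = BigInt(big.toLong) // wrong: Long cannot hold the value
val exact: BigInt   = big.toBigInt       // correct

println(viaLong)
println(exact) // 123456789012345678901234567890
{code}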



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26038) Decimal toScalaBigInt/toJavaBigInteger do not work for decimals not fitting in a long

2018-11-23 Thread Herman van Hovell (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-26038.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

> Decimal toScalaBigInt/toJavaBigInteger do not work for decimals not fitting in 
> a long
> 
>
> Key: SPARK-26038
> URL: https://issues.apache.org/jira/browse/SPARK-26038
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.0, 2.4.0
>Reporter: Juliusz Sompolski
>Assignee: Juliusz Sompolski
>Priority: Major
> Fix For: 3.0.0
>
>
> Decimal toScalaBigInt/toJavaBigInteger just call toLong, which overflows for 
> decimals that do not fit in a long.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26159) Codegen for LocalTableScanExec

2018-11-23 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697485#comment-16697485
 ] 

Apache Spark commented on SPARK-26159:
--

User 'juliuszsompolski' has created a pull request for this issue:
https://github.com/apache/spark/pull/23127

> Codegen for LocalTableScanExec
> --
>
> Key: SPARK-26159
> URL: https://issues.apache.org/jira/browse/SPARK-26159
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Juliusz Sompolski
>Priority: Major
>
> Do codegen for LocalTableScanExec.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26159) Codegen for LocalTableScanExec

2018-11-23 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26159:


Assignee: (was: Apache Spark)

> Codegen for LocalTableScanExec
> --
>
> Key: SPARK-26159
> URL: https://issues.apache.org/jira/browse/SPARK-26159
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Juliusz Sompolski
>Priority: Major
>
> Do codegen for LocalTableScanExec.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26159) Codegen for LocalTableScanExec

2018-11-23 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26159:


Assignee: Apache Spark

> Codegen for LocalTableScanExec
> --
>
> Key: SPARK-26159
> URL: https://issues.apache.org/jira/browse/SPARK-26159
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Juliusz Sompolski
>Assignee: Apache Spark
>Priority: Major
>
> Do codegen for LocalTableScanExec.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20144) spark.read.parquet no longer maintains ordering of the data

2018-11-23 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697410#comment-16697410
 ] 

Dongjoon Hyun edited comment on SPARK-20144 at 11/23/18 5:56 PM:
-

Sorry, [~darabos]. IMHO, the proposed way is not consistent with the existing 
Apache Spark design choices. Also, it's not robust enough to be a part of Apache 
Spark because it misleads the user without always guaranteeing sortedness. 
Lastly, it causes performance degradation because it may try to open many small 
files first. I think you had better add your patch to your own Spark build if you 
need it.


was (Author: dongjoon):
Sorry, [~darabos]. IMHO, the proposed way is not consistent with the existing 
Apache Spark design choice. Also, it's not robust enough to be a part of Apache 
Spark because it misleads the user without the guarantee on sort-ness always. 
Lastly, it causes performance degradation because it may try open many small 
files first. I think you had better add your patch into your Spark build if you 
have.

> spark.read.parquet no longer maintains ordering of the data
> -
>
> Key: SPARK-20144
> URL: https://issues.apache.org/jira/browse/SPARK-20144
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Li Jin
>Priority: Major
>
> Hi, we are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is 
> that when we read parquet files in 2.0.2, the ordering of rows in the resulting 
> dataframe is not the same as the ordering of rows in the dataframe that the 
> parquet file was produced with. 
> This is because FileSourceStrategy.scala combines the parquet files into 
> fewer partitions and also reorders them. This breaks our workflows because 
> they assume the ordering of the data. 
> Is this considered a bug? Also, FileSourceStrategy and FileSourceScanExec 
> changed quite a bit from 2.0.2 to 2.1, so we are not sure whether this is an 
> issue with 2.1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20144) spark.read.parquet no longer maintains ordering of the data

2018-11-23 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697410#comment-16697410
 ] 

Dongjoon Hyun commented on SPARK-20144:
---

Sorry, [~darabos]. IMHO, the proposed way is not consistent with the existing 
Apache Spark design choices. Also, it's not robust enough to be a part of Apache 
Spark because it misleads the user without always guaranteeing sortedness. 
Lastly, it causes performance degradation because it may try to open many small 
files first. I think you had better add your patch to your own Spark build if you 
need it.

> spark.read.parquet no longer maintains ordering of the data
> -
>
> Key: SPARK-20144
> URL: https://issues.apache.org/jira/browse/SPARK-20144
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Li Jin
>Priority: Major
>
> Hi, we are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is 
> that when we read parquet files in 2.0.2, the ordering of rows in the resulting 
> dataframe is not the same as the ordering of rows in the dataframe that the 
> parquet file was produced with. 
> This is because FileSourceStrategy.scala combines the parquet files into 
> fewer partitions and also reorders them. This breaks our workflows because 
> they assume the ordering of the data. 
> Is this considered a bug? Also, FileSourceStrategy and FileSourceScanExec 
> changed quite a bit from 2.0.2 to 2.1, so we are not sure whether this is an 
> issue with 2.1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20144) spark.read.parquet no longer maintains ordering of the data

2018-11-23 Thread Daniel Darabos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697319#comment-16697319
 ] 

Daniel Darabos commented on SPARK-20144:


So where do we go from here? Should I try to find a reviewer?

> spark.read.parquet no longer maintains ordering of the data
> -
>
> Key: SPARK-20144
> URL: https://issues.apache.org/jira/browse/SPARK-20144
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Li Jin
>Priority: Major
>
> Hi, we are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is 
> that when we read parquet files in 2.0.2, the ordering of rows in the resulting 
> dataframe is not the same as the ordering of rows in the dataframe that the 
> parquet file was produced with. 
> This is because FileSourceStrategy.scala combines the parquet files into 
> fewer partitions and also reorders them. This breaks our workflows because 
> they assume the ordering of the data. 
> Is this considered a bug? Also, FileSourceStrategy and FileSourceScanExec 
> changed quite a bit from 2.0.2 to 2.1, so we are not sure whether this is an 
> issue with 2.1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26159) Codegen for LocalTableScanExec

2018-11-23 Thread Juliusz Sompolski (JIRA)
Juliusz Sompolski created SPARK-26159:
-

 Summary: Codegen for LocalTableScanExec
 Key: SPARK-26159
 URL: https://issues.apache.org/jira/browse/SPARK-26159
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Juliusz Sompolski


Do codegen for LocalTableScanExec.
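For context, a small snippet (assuming a spark-shell session where `spark` and its implicits are in scope) that builds a plan containing a LocalTableScan and prints it, so one can check whether the scan ends up inside a whole-stage-codegen stage:

{code:scala}
import spark.implicits._

// A local Seq becomes a LocalTableScan in the physical plan; explain() shows
// whether it is wrapped in a WholeStageCodegen stage (marked with "*").
val local = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "name")
local.filter($"id" > 1).explain()
{code}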



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26108) Support custom lineSep in CSV datasource

2018-11-23 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26108:
-
Fix Version/s: 3.0.0

> Support custom lineSep in CSV datasource
> 
>
> Key: SPARK-26108
> URL: https://issues.apache.org/jira/browse/SPARK-26108
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently the CSV datasource can detect and parse CSV text with '\n', '\r' and 
> '\r\n' as line separators. This ticket aims to support a custom lineSep with a 
> maximum length of 2 characters, due to a current restriction of the uniVocity 
> parser.
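A hedged usage sketch of the option as it is meant to be used once this lands (assuming an active SparkSession named `spark`; the paths and the "\u0001" separator are made up):

{code:scala}
// Illustrative only; lineSep is limited to 1-2 characters by the uniVocity
// parser, as noted in the ticket.
val df = spark.read
  .option("header", "true")
  .option("lineSep", "\u0001")
  .csv("/path/to/input")

df.write
  .option("lineSep", "\n")
  .csv("/path/to/output")
{code}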



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21098) Set lineseparator csv multiline and csv write to \n

2018-11-23 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-21098.
--
Resolution: Not A Problem

> Set lineseparator csv multiline and csv write to \n
> ---
>
> Key: SPARK-21098
> URL: https://issues.apache.org/jira/browse/SPARK-21098
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output, SQL
>Affects Versions: 2.2.2
>Reporter: Daniel van der Ende
>Priority: Minor
>
> The Univocity-parser library uses the system line ending character as the 
> default line ending character. Rather than remain dependent on the setting in 
> this lib, we could set the default to \n.  We cannot make this configurable 
> for reading as it depends on LineReader from Hadoop, which has a hardcoded \n 
> as line ending.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21289) Text based formats do not support custom end-of-line delimiters

2018-11-23 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-21289.
--
Resolution: Done

CSV, Text and JSON support this option now. Should be resolvable.

> Text based formats do not support custom end-of-line delimiters
> ---
>
> Key: SPARK-21289
> URL: https://issues.apache.org/jira/browse/SPARK-21289
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.1.1, 2.3.0
>Reporter: Yevgen Galchenko
>Priority: Minor
>
> Spark csv and text readers always use default CR, LF or CRLF line terminators 
> without an option to configure a custom delimiter.
> Option "textinputformat.record.delimiter" is not being used to set the delimiter 
> in HadoopFileLinesReader and can only be set for a Hadoop RDD when textFile() 
> is used to read the file.
> A possible solution would be to change HadoopFileLinesReader and create 
> LineRecordReader with delimiters specified in configuration. LineRecordReader 
> already supports passing recordDelimiter in its constructor.
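For reference, a hedged sketch (assuming a SparkContext named `sc`; the path and delimiter are hypothetical) of the RDD-level route the description alludes to, where the Hadoop-level record delimiter setting is honored:

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Set a custom record delimiter on a copy of the Hadoop configuration and read
// the file through newAPIHadoopFile, which does honor the setting.
val conf = new Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "|")

val records = sc
  .newAPIHadoopFile("/path/to/data", classOf[TextInputFormat],
    classOf[LongWritable], classOf[Text], conf)
  .map { case (_, line) => line.toString }
{code}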



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26108) Support custom lineSep in CSV datasource

2018-11-23 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26108:
-
Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-21289

> Support custom lineSep in CSV datasource
> 
>
> Key: SPARK-26108
> URL: https://issues.apache.org/jira/browse/SPARK-26108
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Currently the CSV datasource can detect and parse CSV text with '\n', '\r' and 
> '\r\n' as line separators. This ticket aims to support a custom lineSep with a 
> maximum length of 2 characters, due to a current restriction of the uniVocity 
> parser.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26108) Support custom lineSep in CSV datasource

2018-11-23 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26108.
--
Resolution: Fixed
  Assignee: Maxim Gekk

fixed in https://github.com/apache/spark/pull/23080

> Support custom lineSep in CSV datasource
> 
>
> Key: SPARK-26108
> URL: https://issues.apache.org/jira/browse/SPARK-26108
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>
> Currently the CSV datasource can detect and parse CSV text with '\n', '\r' and 
> '\r\n' as line separators. This ticket aims to support a custom lineSep with a 
> maximum length of 2 characters, due to a current restriction of the uniVocity 
> parser.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26158) Enhance the accuracy of covariance in RowMatrix for DenseVector

2018-11-23 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26158:


Assignee: (was: Apache Spark)

> Enhance the accuracy of covariance in RowMatrix for DenseVector
> ---
>
> Key: SPARK-26158
> URL: https://issues.apache.org/jira/browse/SPARK-26158
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.4.0
>Reporter: Liang Li
>Priority: Minor
>
> Comparing the Spark computeCovariance function in RowMatrix for DenseVector 
> with NumPy's cov function,
> *we found two problems; the results are below:*
> *1) The Spark computeCovariance function in RowMatrix is not accurate.*
> Input data:
> 1.0,2.0,3.0,4.0,5.0
> 2.0,3.0,1.0,2.0,6.0
> NumPy cov result:
> [[ 2.5   1.75]
>  [ 1.75  3.7 ]]
> RowMatrix computeCovariance result:
> 2.5   1.75
> 1.75  3.701
>  
> 2) For some input cases, the result is not good.
> Input data generated by:
> data1 = np.random.normal(loc=10, scale=0.09, size=1000)
> data2 = np.random.normal(loc=20, scale=0.02, size=1000)
>  
> NumPy cov result:
> [[  8.10536442e-11  -4.35439574e-15]
>  [ -4.35439574e-15   3.99928264e-12]]
>  
> RowMatrix computeCovariance result:
> -0.0027484893798828125  0.001491546630859375
> 0.001491546630859375    8.087158203125E-4



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26158) Enhance the accuracy of covariance in RowMatrix for DenseVector

2018-11-23 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697242#comment-16697242
 ] 

Apache Spark commented on SPARK-26158:
--

User 'KyleLi1985' has created a pull request for this issue:
https://github.com/apache/spark/pull/23126

> Enhance the accuracy of covariance in RowMatrix for DenseVector
> ---
>
> Key: SPARK-26158
> URL: https://issues.apache.org/jira/browse/SPARK-26158
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.4.0
>Reporter: Liang Li
>Priority: Minor
>
> Comparing the Spark computeCovariance function in RowMatrix for DenseVector 
> with NumPy's cov function,
> *we found two problems; the results are below:*
> *1) The Spark computeCovariance function in RowMatrix is not accurate.*
> Input data:
> 1.0,2.0,3.0,4.0,5.0
> 2.0,3.0,1.0,2.0,6.0
> NumPy cov result:
> [[ 2.5   1.75]
>  [ 1.75  3.7 ]]
> RowMatrix computeCovariance result:
> 2.5   1.75
> 1.75  3.701
>  
> 2) For some input cases, the result is not good.
> Input data generated by:
> data1 = np.random.normal(loc=10, scale=0.09, size=1000)
> data2 = np.random.normal(loc=20, scale=0.02, size=1000)
>  
> NumPy cov result:
> [[  8.10536442e-11  -4.35439574e-15]
>  [ -4.35439574e-15   3.99928264e-12]]
>  
> RowMatrix computeCovariance result:
> -0.0027484893798828125  0.001491546630859375
> 0.001491546630859375    8.087158203125E-4



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26158) Enhance the accuracy of covariance in RowMatrix for DenseVector

2018-11-23 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26158:


Assignee: Apache Spark

> Enhance the accuracy of covariance in RowMatrix for DenseVector
> ---
>
> Key: SPARK-26158
> URL: https://issues.apache.org/jira/browse/SPARK-26158
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.4.0
>Reporter: Liang Li
>Assignee: Apache Spark
>Priority: Minor
>
> Comparing the Spark computeCovariance function in RowMatrix for DenseVector 
> with NumPy's cov function,
> *we found two problems; the results are below:*
> *1) The Spark computeCovariance function in RowMatrix is not accurate.*
> Input data:
> 1.0,2.0,3.0,4.0,5.0
> 2.0,3.0,1.0,2.0,6.0
> NumPy cov result:
> [[ 2.5   1.75]
>  [ 1.75  3.7 ]]
> RowMatrix computeCovariance result:
> 2.5   1.75
> 1.75  3.701
>  
> 2) For some input cases, the result is not good.
> Input data generated by:
> data1 = np.random.normal(loc=10, scale=0.09, size=1000)
> data2 = np.random.normal(loc=20, scale=0.02, size=1000)
>  
> NumPy cov result:
> [[  8.10536442e-11  -4.35439574e-15]
>  [ -4.35439574e-15   3.99928264e-12]]
>  
> RowMatrix computeCovariance result:
> -0.0027484893798828125  0.001491546630859375
> 0.001491546630859375    8.087158203125E-4



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26158) Enhance the accuracy of covariance in RowMatrix for DenseVector

2018-11-23 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697241#comment-16697241
 ] 

Apache Spark commented on SPARK-26158:
--

User 'KyleLi1985' has created a pull request for this issue:
https://github.com/apache/spark/pull/23126

> Enhance the accuracy of covariance in RowMatrix for DenseVector
> ---
>
> Key: SPARK-26158
> URL: https://issues.apache.org/jira/browse/SPARK-26158
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.4.0
>Reporter: Liang Li
>Priority: Minor
>
> Comparing the Spark computeCovariance function in RowMatrix for DenseVector 
> with NumPy's cov function,
> *we found two problems; the results are below:*
> *1) The Spark computeCovariance function in RowMatrix is not accurate.*
> Input data:
> 1.0,2.0,3.0,4.0,5.0
> 2.0,3.0,1.0,2.0,6.0
> NumPy cov result:
> [[ 2.5   1.75]
>  [ 1.75  3.7 ]]
> RowMatrix computeCovariance result:
> 2.5   1.75
> 1.75  3.701
>  
> 2) For some input cases, the result is not good.
> Input data generated by:
> data1 = np.random.normal(loc=10, scale=0.09, size=1000)
> data2 = np.random.normal(loc=20, scale=0.02, size=1000)
>  
> NumPy cov result:
> [[  8.10536442e-11  -4.35439574e-15]
>  [ -4.35439574e-15   3.99928264e-12]]
>  
> RowMatrix computeCovariance result:
> -0.0027484893798828125  0.001491546630859375
> 0.001491546630859375    8.087158203125E-4



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26158) Enhance the accuracy of covariance in RowMatrix for DenseVector

2018-11-23 Thread Liang Li (JIRA)
Liang Li created SPARK-26158:


 Summary: Enhance the accuracy of covariance in RowMatrix for 
DenseVector
 Key: SPARK-26158
 URL: https://issues.apache.org/jira/browse/SPARK-26158
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 2.4.0
Reporter: Liang Li


Comparing the Spark computeCovariance function in RowMatrix for DenseVector 
with NumPy's cov function,

*we found two problems; the results are below:*

*1) The Spark computeCovariance function in RowMatrix is not accurate.*

Input data:

1.0,2.0,3.0,4.0,5.0
2.0,3.0,1.0,2.0,6.0

NumPy cov result:

[[ 2.5   1.75]
 [ 1.75  3.7 ]]

RowMatrix computeCovariance result:

2.5   1.75
1.75  3.701

2) For some input cases, the result is not good.

Input data generated by:

data1 = np.random.normal(loc=10, scale=0.09, size=1000)
data2 = np.random.normal(loc=20, scale=0.02, size=1000)

NumPy cov result:

[[  8.10536442e-11  -4.35439574e-15]
 [ -4.35439574e-15   3.99928264e-12]]

RowMatrix computeCovariance result:

-0.0027484893798828125  0.001491546630859375
0.001491546630859375    8.087158203125E-4
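A hedged reconstruction of how the first comparison above can be run (assuming a SparkContext named `sc`, and assuming each row is one observation of the two variables, i.e. the 5x2 transpose of the two input lines):

{code:scala}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Five observations of two variables, taken from the input data above.
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0),
  Vectors.dense(2.0, 3.0),
  Vectors.dense(3.0, 1.0),
  Vectors.dense(4.0, 2.0),
  Vectors.dense(5.0, 6.0)))

val cov = new RowMatrix(rows).computeCovariance()
println(cov) // compare element-wise against numpy.cov of the same data
{code}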



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25433) Add support for PEX in PySpark

2018-11-23 Thread Anderson de Andrade (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697213#comment-16697213
 ] 

Anderson de Andrade commented on SPARK-25433:
-

[~fhoering] Care to share your example?

> Add support for PEX in PySpark
> --
>
> Key: SPARK-25433
> URL: https://issues.apache.org/jira/browse/SPARK-25433
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.2.2
>Reporter: Fabian Höring
>Priority: Minor
>
> The goal of this ticket is to ship and use custom code inside the spark 
> executors using [PEX|https://github.com/pantsbuild/pex] 
> This currently works fine with 
> [conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html]
>  (disadvantages are that you have a separate conda package repo and ship the 
> python interpreter all the time)
> Basically the workflow is
>  * to zip the local conda environment ([conda 
> pack|https://github.com/conda/conda-pack] also works)
>  * ship it to each executor as an archive
>  * modify PYSPARK_PYTHON to the local conda environment
> I think it can work the same way with virtual env. There is the SPARK-13587 
> ticket to provide nice entry points to spark-submit and SparkContext but 
> zipping your local virtual env and then just changing PYSPARK_PYTHON env 
> variable should already work.
> I also have seen this 
> [blogpost|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html].
>  But recreating the virtual env each time doesn't seem to be a very scalable 
> solution. If you have hundreds of executors it will retrieve the packages on 
> each executor and recreate your virtual environment each time. Same problem 
> with this proposal SPARK-16367 from what I understood.
> Another problem with virtual env is that your local environment is not easily 
> shippable to another machine. In particular there is the relocatable option 
> (see 
> [https://virtualenv.pypa.io/en/stable/userguide/#making-environments-relocatable],
>  
> [https://stackoverflow.com/questions/7153113/virtualenv-relocatable-does-it-really-work)]
>  which makes it very complicated for the user to ship the virtual env and be 
> sure it works.
> And here is where pex comes in. It is a nice way to create a single 
> executable zip file with all dependencies included. You have the pex command 
> line tool to build your package and when it is built you are sure it works. 
> This is in my opinion the most elegant way to ship python code (better than 
> virtual env and conda)
> The reason it doesn't work out of the box is that there can be only a 
> single entry point. So just shipping the pex files and setting PYSPARK_PYTHON 
> to the pex files doesn't work. You can nevertheless tune the env variable 
> [PEX_MODULE|https://pex.readthedocs.io/en/stable/api/index.html#module-pex.variables]
>  and runtime to provide different entry points.
> PR: [https://github.com/apache/spark/pull/22422/files]
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26117) use SparkOutOfMemoryError instead of OutOfMemoryError when catching an exception

2018-11-23 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-26117:
---

Assignee: caoxuewen

> use SparkOutOfMemoryError instead of OutOfMemoryError when catching an exception
> --
>
> Key: SPARK-26117
> URL: https://issues.apache.org/jira/browse/SPARK-26117
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.5.0
>Reporter: caoxuewen
>Assignee: caoxuewen
>Priority: Major
>
> PR #20014 introduced SparkOutOfMemoryError to avoid killing the 
> entire executor when an OutOfMemoryError is thrown.
> So, when allocating memory with MemoryConsumer.allocatePage and catching the 
> exception, use SparkOutOfMemoryError instead of OutOfMemoryError.
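As a rough sketch of the pattern the description asks for (the allocation helper below is invented for illustration; only the exception class and its String constructor are taken from Spark):

{code:scala}
import org.apache.spark.memory.SparkOutOfMemoryError

// Hypothetical helper: throwing the task-scoped SparkOutOfMemoryError instead
// of the JVM OutOfMemoryError lets Spark fail only the task, not the executor.
def allocatePageOrFail(requested: Long, available: Long): Long = {
  if (requested > available) {
    throw new SparkOutOfMemoryError(
      s"Unable to acquire $requested bytes of memory, got $available")
  }
  requested
}
{code}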



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26117) use SparkOutOfMemoryError instead of OutOfMemoryError when catching an exception

2018-11-23 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-26117.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23084
[https://github.com/apache/spark/pull/23084]

> use SparkOutOfMemoryError instead of OutOfMemoryError when catching an exception
> --
>
> Key: SPARK-26117
> URL: https://issues.apache.org/jira/browse/SPARK-26117
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.5.0
>Reporter: caoxuewen
>Assignee: caoxuewen
>Priority: Major
> Fix For: 3.0.0
>
>
> PR #20014 introduced SparkOutOfMemoryError to avoid killing the 
> entire executor when an OutOfMemoryError is thrown.
> So, when allocating memory with MemoryConsumer.allocatePage and catching the 
> exception, use SparkOutOfMemoryError instead of OutOfMemoryError.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26157) Asynchronous execution of stored procedure

2018-11-23 Thread Jaime de Roque Martínez (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jaime de Roque Martínez updated SPARK-26157:

Description: 
I am executing a jar file with spark-submit.

This jar file is a Scala program, which combines Spark-related and 
non-Spark-related operations.

The issue comes when I execute a stored procedure from Scala using JDBC. This 
SP is in a Microsoft SQL database and, basically, performs some operations and 
populates a table with about 500 rows, one by one.

Then, the next step in the program is to read that table and perform some 
additional calculations. This step always grabs fewer rows than were created by 
the stored procedure, but this is because this step is not properly synchronized 
with the previous one, starting its execution without waiting for the previous 
step to be done.

I have tried:
 * Insert a Thread.sleep(1) between both instructions and{color:#14892c} it 
seems to work{color}.

 * Execute the program just with one Executor => {color:#d04437}it doesn't 
work{color}.

I would like to know why this is happening and how I can solve it without the 
sleep, because that's not an admissible solution.

Thank you very much!!

  was:
I am executing a jar file with spark-submit.

This jar file is a scala program, which combines operations spark-related and 
non-spark-related.

The issue comes when I execute a stored procedure from scala using jdbc. This 
SP is in a Microsoft SQL database and, basically, performs some operations and 
populates a table with about 500 rows, one by one.

Then, the next step in the program is read that table and perform some 
additional calculations. This step is taking always less rows than created by 
stored procedure, but this is because this step is not properly sync with the 
previous one, starting its execution without waiting the previous step to be 
done.

I have tried:
 * Insert a Thread.sleep(1) between both instructions and{color:#14892c} it 
seems to work{color}.

 * Execute the program just with one Executor => {color:#d04437}it doesn't 
work{color}.

I would like to know why is it happening and how can I solve it without the 
sleep, because that's not a admissible solution.

Thank you very much!!


> Asynchronous execution of stored procedure
> --
>
> Key: SPARK-26157
> URL: https://issues.apache.org/jira/browse/SPARK-26157
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.3.0
>Reporter: Jaime de Roque Martínez
>Priority: Major
>
> I am executing a jar file with spark-submit.
> This jar file is a Scala program, which combines Spark-related and 
> non-Spark-related operations.
> The issue comes when I execute a stored procedure from Scala using JDBC. This 
> SP is in a Microsoft SQL database and, basically, performs some operations 
> and populates a table with about 500 rows, one by one.
> Then, the next step in the program is to read that table and perform some 
> additional calculations. This step always grabs fewer rows than were created 
> by the stored procedure, but this is because this step is not properly 
> synchronized with the previous one, starting its execution without waiting 
> for the previous step to be done.
> I have tried:
>  * Insert a Thread.sleep(1) between both instructions and{color:#14892c} 
> it seems to work{color}.
>  * Execute the program just with one Executor => {color:#d04437}it doesn't 
> work{color}.
> I would like to know why this is happening and how I can solve it without the 
> sleep, because that's not an admissible solution.
> Thank you very much!!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26157) Asynchronous execution of stored procedure

2018-11-23 Thread Jaime de Roque Martínez (JIRA)
Jaime de Roque Martínez created SPARK-26157:
---

 Summary: Asynchronous execution of stored procedure
 Key: SPARK-26157
 URL: https://issues.apache.org/jira/browse/SPARK-26157
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 2.3.0
Reporter: Jaime de Roque Martínez


I am executing a jar file with spark-submit.

This jar file is a Scala program, which combines Spark-related and 
non-Spark-related operations.

The issue comes when I execute a stored procedure from Scala using JDBC. This 
SP is in a Microsoft SQL database and, basically, performs some operations and 
populates a table with about 500 rows, one by one.

Then, the next step in the program is to read that table and perform some 
additional calculations. This step always takes fewer rows than were created by 
the stored procedure, but this is because this step is not properly synchronized 
with the previous one, starting its execution without waiting for the previous 
step to be done.

I have tried:
 * Insert a Thread.sleep(1) between both instructions and{color:#14892c} it 
seems to work{color}.

 * Execute the program just with one Executor => {color:#d04437}it doesn't 
work{color}.

I would like to know why this is happening and how I can solve it without the 
sleep, because that's not an admissible solution.

Thank you very much!!
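Without claiming this is the root cause, one thing worth checking is whether the JDBC call returns before the procedure has finished because its result sets and update counts are never drained. A hedged sketch (connection details and procedure name are hypothetical):

{code:scala}
import java.sql.DriverManager

// SQL Server reports a stored procedure's progress as a mix of result sets and
// update counts; draining them all makes the call effectively synchronous, so
// the follow-up read only starts once the table has been populated.
val conn = DriverManager.getConnection(
  "jdbc:sqlserver://host:1433;databaseName=mydb", "user", "password")
try {
  val stmt = conn.prepareCall("{call dbo.populate_table()}")
  var isResultSet = stmt.execute()
  while (isResultSet || stmt.getUpdateCount != -1) {
    isResultSet = stmt.getMoreResults()
  }
} finally {
  conn.close()
}
{code}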



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23410) Unable to read jsons in charset different from UTF-8

2018-11-23 Thread xuqianjin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16696638#comment-16696638
 ] 

xuqianjin commented on SPARK-23410:
---

hi [~hyukjin.kwon] At present, most issues of Flink are SQL Table API, and few 
people review other modules. The people who review other modules are too busy 
for me, otherwise I could merge them.

> Unable to read jsons in charset different from UTF-8
> 
>
> Key: SPARK-23410
> URL: https://issues.apache.org/jira/browse/SPARK-23410
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Major
> Attachments: utf16WithBOM.json
>
>
> Currently the JSON parser is forced to read JSON files in UTF-8. Such 
> behavior breaks backward compatibility with Spark 2.2.1 and previous versions, 
> which can read JSON files in UTF-16, UTF-32 and other encodings thanks to the 
> auto-detection mechanism of the Jackson library. We need to give users back the 
> possibility to read JSON files in a specified charset and/or to detect the 
> charset automatically, as it was before.
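For reference, a hedged sketch of reading such a file explicitly (assuming an active SparkSession named `spark` and the `encoding` option added to the JSON source after this report; the path is hypothetical):

{code:scala}
// multiLine avoids ambiguity about the line delimiter for multi-byte charsets.
val df = spark.read
  .option("multiLine", "true")
  .option("encoding", "UTF-16")
  .json("/path/to/utf16WithBOM.json")
df.show()
{code}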



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23410) Unable to read jsons in charset different from UTF-8

2018-11-23 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16696593#comment-16696593
 ] 

Hyukjin Kwon commented on SPARK-23410:
--

That's not even merged yet.

> Unable to read jsons in charset different from UTF-8
> 
>
> Key: SPARK-23410
> URL: https://issues.apache.org/jira/browse/SPARK-23410
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Major
> Attachments: utf16WithBOM.json
>
>
> Currently the JSON parser is forced to read JSON files in UTF-8. Such 
> behavior breaks backward compatibility with Spark 2.2.1 and previous versions, 
> which can read JSON files in UTF-16, UTF-32 and other encodings thanks to the 
> auto-detection mechanism of the Jackson library. We need to give users back the 
> possibility to read JSON files in a specified charset and/or to detect the 
> charset automatically, as it was before.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26156) Revise summary section of stage page

2018-11-23 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16696551#comment-16696551
 ] 

Apache Spark commented on SPARK-26156:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/23125

> Revise summary section of stage page
> 
>
> Key: SPARK-26156
> URL: https://issues.apache.org/jira/browse/SPARK-26156
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Minor
>
> 1. In the summary section of stage page, the following metrics names can be 
> revised:
> Output => Output Size / Records
> Shuffle Read: => Shuffle Read Size / Records
> Shuffle Write => Shuffle Write Size / Records
> After the changes, the names are clearer and consistent with the other names 
> on the same page.
> 2. The associated job id URL should not contain the 3 trailing spaces. Reduce 
> the number of spaces to one, and exclude the space from the link.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26156) Revise summary section of stage page

2018-11-23 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16696549#comment-16696549
 ] 

Apache Spark commented on SPARK-26156:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/23125

> Revise summary section of stage page
> 
>
> Key: SPARK-26156
> URL: https://issues.apache.org/jira/browse/SPARK-26156
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Minor
>
> 1. In the summary section of stage page, the following metrics names can be 
> revised:
> Output => Output Size / Records
> Shuffle Read: => Shuffle Read Size / Records
> Shuffle Write => Shuffle Write Size / Records
> After the changes, the names are clearer and consistent with the other names 
> on the same page.
> 2. The associated job id URL should not contain the 3 trailing spaces. Reduce 
> the number of spaces to one, and exclude the space from the link.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26156) Revise summary section of stage page

2018-11-23 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26156:


Assignee: (was: Apache Spark)

> Revise summary section of stage page
> 
>
> Key: SPARK-26156
> URL: https://issues.apache.org/jira/browse/SPARK-26156
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Minor
>
> 1. In the summary section of stage page, the following metrics names can be 
> revised:
> Output => Output Size / Records
> Shuffle Read: => Shuffle Read Size / Records
> Shuffle Write => Shuffle Write Size / Records
> After the changes, the names are clearer and consistent with the other names 
> on the same page.
> 2. The associated job id URL should not contain the 3 trailing spaces. Reduce 
> the number of spaces to one, and exclude the space from the link.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26156) Revise summary section of stage page

2018-11-23 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26156:


Assignee: Apache Spark

> Revise summary section of stage page
> 
>
> Key: SPARK-26156
> URL: https://issues.apache.org/jira/browse/SPARK-26156
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Minor
>
> 1. In the summary section of stage page, the following metrics names can be 
> revised:
> Output => Output Size / Records
> Shuffle Read: => Shuffle Read Size / Records
> Shuffle Write => Shuffle Write Size / Records
> After the changes, the names are clearer and consistent with the other names 
> on the same page.
> 2. The associated job id URL should not contain the 3 trailing spaces. Reduce 
> the number of spaces to one, and exclude the space from the link.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26155) Spark SQL performance degradation after applying SPARK-21052 with Q19 of TPC-DS at 3TB scale

2018-11-23 Thread Adrian Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16696541#comment-16696541
 ] 

Adrian Wang commented on SPARK-26155:
-

It seems the performance degradation is related to the CPU cache; the metrics 
collection happens to break that...

> Spark SQL performance degradation after applying SPARK-21052 with Q19 of TPC-DS 
> at 3TB scale
> --
>
> Key: SPARK-26155
> URL: https://issues.apache.org/jira/browse/SPARK-26155
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Ke Jia
>Priority: Major
> Attachments: Q19 analysis in Spark2.3 with L486&487.pdf, Q19 analysis 
> in Spark2.3 without L486 & 487.pdf, q19.sql
>
>
> In our test environment, we found a serious performance degradation issue in 
> Spark 2.3 when running TPC-DS on SKX 8180. Several queries have serious 
> performance degradation. For example, TPC-DS Q19 needs 126 seconds with Spark 
> 2.3 while it needs only 29 seconds with Spark 2.1 on 3TB data. We investigated 
> this problem and figured out that the root cause is in community patch 
> SPARK-21052, which adds metrics to the hash join process. The impacted code is 
> [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486]
>  and 
> [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487].
> Q19 costs about 30 seconds without these two lines of code and 126 seconds 
> with them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26156) Revise summary section of stage page

2018-11-23 Thread Gengliang Wang (JIRA)
Gengliang Wang created SPARK-26156:
--

 Summary: Revise summary section of stage page
 Key: SPARK-26156
 URL: https://issues.apache.org/jira/browse/SPARK-26156
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 3.0.0
Reporter: Gengliang Wang


1. In the summary section of the stage page, the following metric names can be 
revised:
Output => Output Size / Records
Shuffle Read: => Shuffle Read Size / Records
Shuffle Write => Shuffle Write Size / Records

After the changes, the names are clearer and consistent with the other names on 
the same page.

2. The associated job id URL should not contain the 3 trailing spaces. Reduce 
the number of spaces to one, and exclude the space from the link.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23410) Unable to read jsons in charset different from UTF-8

2018-11-23 Thread xuqianjin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16696542#comment-16696542
 ] 

xuqianjin commented on SPARK-23410:
---

Hi [~hyukjin.kwon], this is the PR: [https://github.com/apache/flink/pull/7157]

It has been reviewed twice before.

> Unable to read jsons in charset different from UTF-8
> 
>
> Key: SPARK-23410
> URL: https://issues.apache.org/jira/browse/SPARK-23410
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Major
> Attachments: utf16WithBOM.json
>
>
> Currently the JSON parser is forced to read JSON files in UTF-8. Such 
> behavior breaks backward compatibility with Spark 2.2.1 and previous 
> versions, which could read JSON files in UTF-16, UTF-32 and other encodings 
> thanks to the auto-detection mechanism of the Jackson library. We need to 
> give users back the ability to read JSON files in a specified charset and/or 
> detect the charset automatically, as it was before.
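
For reference, a minimal sketch of reading such a file by passing the charset 
explicitly. This assumes the {{encoding}} option that later Spark releases expose 
on the JSON reader (an assumption, not part of this ticket); the file name matches 
the attached utf16WithBOM.json, everything else is illustrative.

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch only: the `encoding` option is assumed to be available on the JSON reader.
val spark = SparkSession.builder().appName("json-encoding").getOrCreate()

val df = spark.read
  .option("multiLine", true)      // parse whole files, not line-delimited JSON
  .option("encoding", "UTF-16")   // decode the input as UTF-16 instead of UTF-8
  .json("utf16WithBOM.json")

df.show()
{code}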



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26155) Spark SQL performance degradation after applying SPARK-21052 with Q19 of TPC-DS in 3TB scale

2018-11-23 Thread Ke Jia (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-26155:
---
Attachment: q19.sql
Q19 analysis in Spark2.3 without L486 & 487.pdf
Q19 analysis in Spark2.3 with L486&487.pdf

> Spark SQL performance degradation after applying SPARK-21052 with Q19 of 
> TPC-DS in 3TB scale
> --
>
> Key: SPARK-26155
> URL: https://issues.apache.org/jira/browse/SPARK-26155
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Ke Jia
>Priority: Major
> Attachments: Q19 analysis in Spark2.3 with L486&487.pdf, Q19 analysis 
> in Spark2.3 without L486 & 487.pdf, q19.sql
>
>
> In our test environment, we found a serious performance degradation issue in 
> Spark 2.3 when running TPC-DS on SKX 8180. Several queries show serious 
> performance degradation. For example, TPC-DS Q19 needs 126 seconds with Spark 
> 2.3 while it needs only 29 seconds with Spark 2.1 on 3TB data. We investigated 
> this problem and figured out that the root cause is in community patch 
> SPARK-21052, which adds metrics to the hash join process. The affected code is 
> [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486]
>  and 
> [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487]
> . Q19 costs about 30 seconds without these two lines of code and 126 seconds 
> with them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26116) Spark SQL - Sort when writing partitioned parquet leads to OOM errors

2018-11-23 Thread Pierre Lienhart (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16696526#comment-16696526
 ] 

Pierre Lienhart commented on SPARK-26116:
-

I just enhanced the ticket description.

> Spark SQL - Sort when writing partitioned parquet leads to OOM errors
> -
>
> Key: SPARK-26116
> URL: https://issues.apache.org/jira/browse/SPARK-26116
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Pierre Lienhart
>Priority: Major
>
> When writing partitioned parquet using {{partitionBy}}, it looks like Spark 
> sorts each partition before writing, but this sort consumes a huge amount of 
> memory compared to the size of the data. The executors can then go OOM and 
> get killed by YARN. As a consequence, it also forces provisioning huge 
> amounts of memory compared to the data to be written.
> Error messages found in the Spark UI are like the following:
> {code:java}
> Spark UI description of failure : Job aborted due to stage failure: Task 169 
> in stage 2.0 failed 1 times, most recent failure: Lost task 169.0 in stage 
> 2.0 (TID 98, x.xx.x.xx, executor 1): ExecutorLostFailure 
> (executor 1 exited caused by one of the running tasks) Reason: Container 
> killed by YARN for exceeding memory limits. 8.1 GB of 8 GB physical memory 
> used. Consider boosting spark.yarn.executor.memoryOverhead.
> {code}
>  
> {code:java}
> Job aborted due to stage failure: Task 66 in stage 4.0 failed 1 times, most 
> recent failure: Lost task 66.0 in stage 4.0 (TID 56, xxx.x.x.xx, 
> executor 1): org.apache.spark.SparkException: Task failed while writing rows
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>  at org.apache.spark.scheduler.Task.run(Task.scala:99)
>  at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.OutOfMemoryError: error while calling spill() on 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@75194804 : 
> /app/hadoop/yarn/local/usercache/at053351/appcache/application_1537536072724_17039/blockmgr-a4ba7d59-e780-4385-99b4-a4c4fe95a1ec/25/temp_local_a542a412-5845-45d2-9302-bbf5ee4113ad
>  (No such file or directory)
>  at 
> org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:188)
>  at 
> org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:254)
>  at 
> org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:92)
>  at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:347)
>  at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:425)
>  at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:160)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.execute(FileFormatWriter.scala:364)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)
>  at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1353)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)
>  ... 8 more{code}
>  
> In the stderr logs, we can see that a huge amount of sort data (the partition 
> being sorted here is 250 MB when persisted into memory, deserialized) is 
> being spilled to the disk ({{INFO UnsafeExternalSorter: Thread 155 spilling 
> sort data of 3.6 GB to disk}}). Sometimes the 

[jira] [Updated] (SPARK-26116) Spark SQL - Sort when writing partitioned parquet leads to OOM errors

2018-11-23 Thread Pierre Lienhart (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pierre Lienhart updated SPARK-26116:

Description: 
When writing partitioned parquet using {{partitionBy}}, it looks like Spark 
sorts each partition before writing, but this sort consumes a huge amount of 
memory compared to the size of the data. The executors can then go OOM and get 
killed by YARN. As a consequence, it also forces provisioning huge amounts of 
memory compared to the data to be written.
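
For illustration, a hedged sketch of this write pattern together with a common 
mitigation: repartitioning by the partition columns first, so each write task 
handles only a few output partitions and the per-task sort stays small. The 
column and path names below are hypothetical; the other usual knob is raising 
{{spark.yarn.executor.memoryOverhead}}, as the first error below suggests.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Sketch only: hypothetical input/output paths and partition column.
val spark = SparkSession.builder().appName("partitioned-write").getOrCreate()
val df = spark.read.parquet("/data/input/events")

df.repartition(col("event_date"))   // co-locate rows of each output partition
  .write
  .partitionBy("event_date")        // Spark still sorts within each task first
  .parquet("/data/output/events")
{code}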

Error messages found in the Spark UI are like the following:

{code:java}
Spark UI description of failure : Job aborted due to stage failure: Task 169 in 
stage 2.0 failed 1 times, most recent failure: Lost task 169.0 in stage 2.0 
(TID 98, x.xx.x.xx, executor 1): ExecutorLostFailure (executor 
1 exited caused by one of the running tasks) Reason: Container killed by YARN 
for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider 
boosting spark.yarn.executor.memoryOverhead.
{code}
 
{code:java}
Job aborted due to stage failure: Task 66 in stage 4.0 failed 1 times, most 
recent failure: Lost task 66.0 in stage 4.0 (TID 56, xxx.x.x.xx, 
executor 1): org.apache.spark.SparkException: Task failed while writing rows

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)

 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)

 at org.apache.spark.scheduler.Task.run(Task.scala:99)

 at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)

 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

 at java.lang.Thread.run(Thread.java:745)

Caused by: java.lang.OutOfMemoryError: error while calling spill() on 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@75194804 : 
/app/hadoop/yarn/local/usercache/at053351/appcache/application_1537536072724_17039/blockmgr-a4ba7d59-e780-4385-99b4-a4c4fe95a1ec/25/temp_local_a542a412-5845-45d2-9302-bbf5ee4113ad
 (No such file or directory)

 at 
org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:188)

 at 
org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:254)

 at 
org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:92)

 at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:347)

 at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:425)

 at 
org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:160)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.execute(FileFormatWriter.scala:364)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)

 at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1353)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)

 ... 8 more{code}
 
In the stderr logs, we can see that a huge amount of sort data (the partition 
being sorted here is 250 MB when persisted into memory, deserialized) is being 
spilled to the disk ({{INFO UnsafeExternalSorter: Thread 155 spilling sort data 
of 3.6 GB to disk}}). Sometimes the data is spilled to the disk in time and the 
sort completes ({{INFO FileFormatWriter: Sorting complete. Writing out 
partition files one at a time.}}), but sometimes it does not and we see multiple 
{{TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.}} 
messages until the application finally runs OOM with logs such as {{ERROR 
UnsafeExternalSorter: Unable to grow the pointer array}}. Even when it works, 
GC time is pretty high (~20% of the total write task duration) and I guess that 
these disk spills further impair performance.

Contrary to what the above error message suggests, 

[jira] [Comment Edited] (SPARK-26155) Spark SQL performance degradation after applying SPARK-21052 with Q19 of TPC-DS in 3TB scale

2018-11-23 Thread Ke Jia (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16696476#comment-16696476
 ] 

Ke Jia edited comment on SPARK-26155 at 11/23/18 7:58 AM:
--

*Cluster info:*
| |*Master Node*|*Worker Nodes*|
|*Node*|1x |7x|
|*Processor*|Intel(R) Xeon(R) Platinum 8170 CPU @ 2.10GHz|Intel(R) Xeon(R) 
Platinum 8180 CPU @ 2.50GHz|
|*Memory*|192 GB|384 GB|
|*Storage Main*|8 x 960G SSD|8 x 960G SSD|
|*Network*|10Gbe|
|*Role*|CM Management 
 NameNode
 Secondary NameNode
 Resource Manager
 Hive Metastore Server|DataNode
 NodeManager|
|*OS Version*|CentOS 7.2| CentOS 7.2|
|*Hadoop*|Apache Hadoop 2.7.5| Apache Hadoop 2.7.5|
|*Hive*|Apache Hive 2.2.0| |
|*Spark*|Apache Spark 2.1.0 & Apache Spark 2.3.0| |
|*JDK  version*|1.8.0_112| 1.8.0_112|

*Related parameters setting:*
|*Component*|*Parameter*|*Value*|
|*Yarn Resource Manager*|yarn.scheduler.maximum-allocation-mb|40GB|
| |yarn.scheduler.minimum-allocation-mb|1GB|
| |yarn.scheduler.maximum-allocation-vcores|121|
| |Yarn.resourcemanager.scheduler.class|Fair Scheduler|
|*Yarn Node Manager*|yarn.nodemanager.resource.memory-mb|40GB|
| |yarn.nodemanager.resource.cpu-vcores|121|
|*Spark*|spark.executor.memory|34GB|
| |spark.executor.cores|40|
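
For reference, a minimal sketch of how the Spark-side values from the table above 
could be applied when building the session used for these runs; only the executor 
memory and cores come from the table, everything else is illustrative.

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch only: the two config values are taken from the table above.
val spark = SparkSession.builder()
  .appName("tpcds-q19")
  .config("spark.executor.memory", "34g")
  .config("spark.executor.cores", "40")
  .getOrCreate()
{code}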


was (Author: jk_self):
*Cluster info:*
| |*Master Node*|*Worker Nodes* |
|*Node*|1x |7x|
|*Processor*|Intel(R) Xeon(R) Platinum 8170 CPU @ 2.10GHz|Intel(R) Xeon(R) 
Platinum 8180 CPU @ 2.50GHz|
|*Memory*|192 GB|384 GB|
|*Storage Main*|8 x 960G SSD|8 x 960G SSD|
|*Network*|10Gbe|
|*Role*|CM Management 
 NameNode
 Secondary NameNode
 Resource Manager
 Hive Metastore Server|DataNode
 NodeManager|
|*OS Version*|CentOS 7.2|
|*Hadoop*|Apache Hadoop 2.7.5|
|*Hive*|Apache Hive 2.2.0|
|*Spark*|Apache Spark 2.1.0  VS Apache Spark2.3.0|
|*JDK  version*|1.8.0_112|

*Related parameters setting:*
|*Component*|*Parameter*|*Value*|
|*Yarn Resource Manager*|yarn.scheduler.maximum-allocation-mb|40GB|
|yarn.scheduler.minimum-allocation-mb|1GB|
|yarn.scheduler.maximum-allocation-vcores|121|
|Yarn.resourcemanager.scheduler.class|Fair Scheduler|
|*Yarn Node Manager*|yarn.nodemanager.resource.memory-mb|40GB|
|yarn.nodemanager.resource.cpu-vcores|121|
|*Spark*|spark.executor.memory|34GB|
|spark.executor.cores|40|

> Spark SQL performance degradation after applying SPARK-21052 with Q19 of 
> TPC-DS in 3TB scale
> --
>
> Key: SPARK-26155
> URL: https://issues.apache.org/jira/browse/SPARK-26155
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Ke Jia
>Priority: Major
>
> In our test environment, we found a serious performance degradation issue in 
> Spark 2.3 when running TPC-DS on SKX 8180. Several queries show serious 
> performance degradation. For example, TPC-DS Q19 needs 126 seconds with Spark 
> 2.3 while it needs only 29 seconds with Spark 2.1 on 3TB data. We investigated 
> this problem and figured out that the root cause is in community patch 
> SPARK-21052, which adds metrics to the hash join process. The affected code is 
> [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486]
>  and 
> [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487]
> . Q19 costs about 30 seconds without these two lines of code and 126 seconds 
> with them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26059) Spark standalone mode does not correctly record a failed Spark Job.

2018-11-23 Thread Prashant Sharma (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma updated SPARK-26059:

Description: 
In order to reproduce, submit a failing job to a Spark standalone master. The 
status for the failed job is shown as FINISHED, irrespective of whether it 
failed or succeeded.

EDIT: This happens only when deploy-mode is client; when deploy-mode is 
cluster it works as expected.

  was:In order to reproduce submit a failing job to spark standalone master. 
The status for the failed job is shown as FINISHED, irrespective of the fact it 
failed or succeeded. 


> Spark standalone mode does not correctly record a failed Spark Job.
> 
>
> Key: SPARK-26059
> URL: https://issues.apache.org/jira/browse/SPARK-26059
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 3.0.0
>Reporter: Prashant Sharma
>Priority: Major
>
> In order to reproduce, submit a failing job to a Spark standalone master. The 
> status for the failed job is shown as FINISHED, irrespective of whether it 
> failed or succeeded.
> EDIT: This happens only when deploy-mode is client; when deploy-mode is 
> cluster it works as expected.
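
As a reproduction sketch (all names hypothetical): an application like the one 
below, submitted with --deploy-mode client against a standalone master, exits 
with a failure yet is recorded as FINISHED per the description above, while 
--deploy-mode cluster reports it correctly.

{code:scala}
import org.apache.spark.sql.SparkSession

// Hypothetical job that fails on purpose, used only to trigger the behaviour above.
object FailingJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("failing-job").getOrCreate()
    try {
      // Force every task to throw so the job, and hence the application, fails.
      spark.sparkContext.parallelize(1 to 10).foreach(_ => sys.error("boom"))
    } finally {
      spark.stop()
    }
  }
}
{code}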



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org