[jira] [Commented] (SPARK-32965) pyspark reading csv files with utf_16le encoding

2020-09-22 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200578#comment-17200578
 ] 

Takeshi Yamamuro commented on SPARK-32965:
--

Is this issue almost the same as SPARK-32961?

> pyspark reading csv files with utf_16le encoding
> 
>
> Key: SPARK-32965
> URL: https://issues.apache.org/jira/browse/SPARK-32965
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.7, 3.0.0, 3.0.1
>Reporter: Punit Shah
>Priority: Major
>
> If you have a file encoded in utf_16le or utf_16be and try to use 
> spark.read.csv("", encoding="utf_16le"), the dataframe isn't 
> rendered properly.
> If you instead use Python decoding, like:
> prdd = spark_session._sc.binaryFiles(path_url).values().flatMap(lambda x : 
> x.decode("utf_16le").splitlines())
> and then do spark.read.csv(prdd), it works.
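
Below is a minimal, self-contained PySpark sketch of the workaround described above. It is only an illustration: the file path and header=True are assumptions, since the report elides the actual path and schema.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
path_url = "/tmp/data_utf16le.csv"  # placeholder path, not taken from the report

# Direct read with an explicit encoding, as attempted in the report:
df_direct = spark.read.csv(path_url, encoding="utf_16le", header=True)

# Workaround from the report: read the raw bytes, decode them in Python,
# split into lines, and let the CSV reader parse the decoded RDD of strings.
lines = (
    spark.sparkContext.binaryFiles(path_url)
    .values()
    .flatMap(lambda raw: raw.decode("utf_16le").splitlines())
)
df_workaround = spark.read.csv(lines, header=True)
df_workaround.show(truncate=False)
{code}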



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32959) Fix the "Relation: view text" test in DataSourceV2SQLSuite

2020-09-22 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-32959.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29811
[https://github.com/apache/spark/pull/29811]

> Fix the "Relation: view text" test in DataSourceV2SQLSuite
> --
>
> Key: SPARK-32959
> URL: https://issues.apache.org/jira/browse/SPARK-32959
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Minor
> Fix For: 3.1.0
>
>
> The existing code just defines a function literal and doesn't execute it:
> {code:java}
> test("Relation: view text") {
>   val t1 = "testcat.ns1.ns2.tbl"
>   withTable(t1) {
> withView("view1") { v1: String =>
>   sql(s"CREATE TABLE $t1 USING foo AS SELECT id, data FROM source")
>   sql(s"CREATE VIEW $v1 AS SELECT * from $t1")
>   checkAnswer(sql(s"TABLE $v1"), spark.table("source"))
> }
>   }
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32959) Fix the "Relation: view text" test in DataSourceV2SQLSuite

2020-09-22 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-32959:
---

Assignee: Terry Kim

> Fix the "Relation: view text" test in DataSourceV2SQLSuite
> --
>
> Key: SPARK-32959
> URL: https://issues.apache.org/jira/browse/SPARK-32959
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Minor
>
> The existing code just defines a function literal and doesn't execute it:
> {code:java}
> test("Relation: view text") {
>   val t1 = "testcat.ns1.ns2.tbl"
>   withTable(t1) {
> withView("view1") { v1: String =>
>   sql(s"CREATE TABLE $t1 USING foo AS SELECT id, data FROM source")
>   sql(s"CREATE VIEW $v1 AS SELECT * from $t1")
>   checkAnswer(sql(s"TABLE $v1"), spark.table("source"))
> }
>   }
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32966) Spark| PartitionBy is taking long time to process

2020-09-22 Thread Sujit Das (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sujit Das updated SPARK-32966:
--
Environment: EMR - 5.30.0; Hadoop - 2.8.5; Spark - 2.4.5  (was: EMR - 
5.30.0; Hadoop -2.8.5; Spark- 2.4.5)

> Spark| PartitionBy is taking long time to process
> -
>
> Key: SPARK-32966
> URL: https://issues.apache.org/jira/browse/SPARK-32966
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.5
> Environment: EMR - 5.30.0; Hadoop - 2.8.5; Spark - 2.4.5
>Reporter: Sujit Das
>Priority: Major
>  Labels: AWS, pyspark, spark-conf
>
> 1. When I do a write without any partitioning, it takes 8 min:
> df2_merge.write.mode('overwrite').parquet(dest_path)
> 2. I added the conf spark.sql.sources.partitionOverwriteMode=dynamic; it took 
> longer (more than 50 min before I force-terminated the EMR cluster). I observed 
> that the partitions had been created and the data files were present, but the 
> EMR cluster still showed the process as running, whereas the Spark history 
> server showed no running or pending process.
> df2_merge.write.mode('overwrite').partitionBy("posted_on").parquet(dest_path_latest)
> 3. I then set a new conf, spark.sql.shuffle.partitions=3; it took 24 min:
> df2_merge.coalesce(3).write.mode('overwrite').partitionBy("posted_on").parquet(dest_path_latest)
> 4. Finally, I disabled that conf and ran a plain write with partitioning; it 
> took 30 min:
> df2_merge.coalesce(3).write.mode('overwrite').partitionBy("posted_on").parquet(dest_path_latest)
> The only conf common to all of the above scenarios is 
> spark.sql.adaptive.coalescePartitions.initialPartitionNum=100.
> My goal is to reduce the time of writing with partitionBy. Is there anything 
> I am missing?
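
For reference, here is a hedged, self-contained PySpark sketch of the write pattern discussed above; the DataFrame, column, and path are small stand-ins rather than the reporter's data, and the final repartition-by-partition-column variant is a common alternative, not something taken from this report.

{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Small stand-in for df2_merge; "posted_on" plays the role of the partition column.
df2_merge = spark.range(1000).withColumn("posted_on", (F.col("id") % 30).cast("string"))
dest_path_latest = "/tmp/spark_32966_demo"  # placeholder path

# Plain partitioned write, as in scenarios 2-4 above (those differ only in
# coalesce() and in session confs such as partitionOverwriteMode=dynamic).
(df2_merge
    .coalesce(3)
    .write.mode("overwrite")
    .partitionBy("posted_on")
    .parquet(dest_path_latest))

# A common variation (not from the report): repartition by the partition column
# first, so each task writes only a small number of partition directories.
(df2_merge
    .repartition("posted_on")
    .write.mode("overwrite")
    .partitionBy("posted_on")
    .parquet(dest_path_latest))
{code}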



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32966) Spark| PartitionBy is taking long time to process

2020-09-22 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200571#comment-17200571
 ] 

Takeshi Yamamuro commented on SPARK-32966:
--

Is this a question? At least, I think you need to provide more info (e.g., a 
complete query to reproduce the issue).

> Spark| PartitionBy is taking long time to process
> -
>
> Key: SPARK-32966
> URL: https://issues.apache.org/jira/browse/SPARK-32966
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.5
> Environment: EMR - 5.30.0; Hadoop -2.8.5; Spark- 2.4.5
>Reporter: Sujit Das
>Priority: Major
>  Labels: AWS, pyspark, spark-conf
>
> 1. When I do a write without any partitioning, it takes 8 min:
> df2_merge.write.mode('overwrite').parquet(dest_path)
> 2. I added the conf spark.sql.sources.partitionOverwriteMode=dynamic; it took 
> longer (more than 50 min before I force-terminated the EMR cluster). I observed 
> that the partitions had been created and the data files were present, but the 
> EMR cluster still showed the process as running, whereas the Spark history 
> server showed no running or pending process.
> df2_merge.write.mode('overwrite').partitionBy("posted_on").parquet(dest_path_latest)
> 3. I then set a new conf, spark.sql.shuffle.partitions=3; it took 24 min:
> df2_merge.coalesce(3).write.mode('overwrite').partitionBy("posted_on").parquet(dest_path_latest)
> 4. Finally, I disabled that conf and ran a plain write with partitioning; it 
> took 30 min:
> df2_merge.coalesce(3).write.mode('overwrite').partitionBy("posted_on").parquet(dest_path_latest)
> The only conf common to all of the above scenarios is 
> spark.sql.adaptive.coalescePartitions.initialPartitionNum=100.
> My goal is to reduce the time of writing with partitionBy. Is there anything 
> I am missing?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32966) Spark| PartitionBy is taking long time to process

2020-09-22 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-32966.
--
Resolution: Invalid

> Spark| PartitionBy is taking long time to process
> -
>
> Key: SPARK-32966
> URL: https://issues.apache.org/jira/browse/SPARK-32966
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.5
> Environment: EMR - 5.30.0; Hadoop -2.8.5; Spark- 2.4.5
>Reporter: Sujit Das
>Priority: Major
>  Labels: AWS, pyspark, spark-conf
>
> 1. When I do a write without any partitioning, it takes 8 min:
> df2_merge.write.mode('overwrite').parquet(dest_path)
> 2. I added the conf spark.sql.sources.partitionOverwriteMode=dynamic; it took 
> longer (more than 50 min before I force-terminated the EMR cluster). I observed 
> that the partitions had been created and the data files were present, but the 
> EMR cluster still showed the process as running, whereas the Spark history 
> server showed no running or pending process.
> df2_merge.write.mode('overwrite').partitionBy("posted_on").parquet(dest_path_latest)
> 3. I then set a new conf, spark.sql.shuffle.partitions=3; it took 24 min:
> df2_merge.coalesce(3).write.mode('overwrite').partitionBy("posted_on").parquet(dest_path_latest)
> 4. Finally, I disabled that conf and ran a plain write with partitioning; it 
> took 30 min:
> df2_merge.coalesce(3).write.mode('overwrite').partitionBy("posted_on").parquet(dest_path_latest)
> The only conf common to all of the above scenarios is 
> spark.sql.adaptive.coalescePartitions.initialPartitionNum=100.
> My goal is to reduce the time of writing with partitionBy. Is there anything 
> I am missing?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32306) `approx_percentile` in Spark SQL gives incorrect results

2020-09-22 Thread Sean Malory (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200564#comment-17200564
 ] 

Sean Malory commented on SPARK-32306:
-

Thank you.

> `approx_percentile` in Spark SQL gives incorrect results
> 
>
> Key: SPARK-32306
> URL: https://issues.apache.org/jira/browse/SPARK-32306
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark, SQL
>Affects Versions: 2.4.4, 3.0.0, 3.1.0
>Reporter: Sean Malory
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.1.0
>
>
> The `approx_percentile` function in Spark SQL does not give the correct 
> result. I'm not sure how incorrect it is; it may just be a boundary issue. 
> From the docs:
> {quote}The accuracy parameter (default: 1) is a positive numeric literal 
> which controls approximation accuracy at the cost of memory. Higher value of 
> accuracy yields better accuracy, 1.0/accuracy is the relative error of the 
> approximation.
> {quote}
> This is not true. Here is a minimal example in `pyspark` where, essentially, 
> the median of 5 and 8 is being calculated as 5:
> {code:python}
> import pyspark.sql.functions as psf
> df = spark.createDataFrame(
>     [('bar', 5), ('bar', 8)], ['name', 'val']
> )
> median = psf.expr('percentile_approx(val, 0.5, 2147483647)')
> df.groupBy('name').agg(median.alias('median'))  # gives the median as 5
> {code}
> I've tested this with Spark v2.4.4 and pyspark v2.4.5, although I suspect this 
> is an issue with the underlying algorithm.
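
As a hedged follow-up sketch (assuming a local SparkSession named spark), the approximate result can be compared with the exact percentile aggregate: percentile_approx returns an element of the input (here 5), while the exact percentile interpolates (here 6.5), which is consistent with this issue being re-typed as a documentation fix.

{code:python}
from pyspark.sql import SparkSession
import pyspark.sql.functions as psf

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('bar', 5), ('bar', 8)], ['name', 'val'])

approx_median = psf.expr('percentile_approx(val, 0.5, 2147483647)')
exact_median = psf.expr('percentile(val, 0.5)')

# percentile_approx picks an actual input value (5 here), while the exact
# percentile interpolates between 5 and 8 and returns 6.5.
df.groupBy('name').agg(
    approx_median.alias('approx_median'),
    exact_median.alias('exact_median'),
).show()
{code}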



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32961) PySpark CSV read with UTF-16 encoding is not working correctly

2020-09-22 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-32961:
-
Component/s: SQL  (was: Spark Core)

> PySpark CSV read with UTF-16 encoding is not working correctly
> --
>
> Key: SPARK-32961
> URL: https://issues.apache.org/jira/browse/SPARK-32961
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.0.1
> Environment: both spark local and cluster mode
>Reporter: Bui Bao Anh
>Priority: Major
>  Labels: Correctness
> Attachments: pandas df.png, pyspark df.png, sendo_sample.csv
>
>
> There are weird characters in the output when printing to the console or 
> writing to files.
> See the attached files for how it looks in a Spark DataFrame and a Pandas 
> DataFrame.
>  
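
A hedged sketch of the comparison implied by the attached screenshots (pandas vs. PySpark reads of the UTF-16 file); the path and header=True are assumptions, since only the attachment name is known.

{code:python}
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
path = "sendo_sample.csv"  # placeholder for the attached file

# pandas decodes the UTF-16 file as expected:
pdf = pd.read_csv(path, encoding="utf-16")
print(pdf.head())

# The equivalent PySpark read; per the report, the resulting DataFrame shows
# garbled characters when printed to the console or written back out.
sdf = spark.read.csv(path, header=True, encoding="UTF-16")
sdf.show(truncate=False)
{code}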



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32961) PySpark CSV read with UTF-16 encoding is not working correctly

2020-09-22 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200551#comment-17200551
 ] 

Takeshi Yamamuro commented on SPARK-32961:
--

cc: [~yumwang]

> PySpark CSV read with UTF-16 encoding is not working correctly
> --
>
> Key: SPARK-32961
> URL: https://issues.apache.org/jira/browse/SPARK-32961
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.0.1
> Environment: both spark local and cluster mode
>Reporter: Bui Bao Anh
>Priority: Major
>  Labels: Correctness
> Attachments: pandas df.png, pyspark df.png, sendo_sample.csv
>
>
> There are weird characters in the output when printing to the console or 
> writing to files.
> See the attached files for how it looks in a Spark DataFrame and a Pandas 
> DataFrame.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32778) Accidental Data Deletion on calling saveAsTable

2020-09-22 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-32778:
-
Issue Type: Improvement  (was: Bug)

> Accidental Data Deletion on calling saveAsTable
> ---
>
> Key: SPARK-32778
> URL: https://issues.apache.org/jira/browse/SPARK-32778
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Aman Rastogi
>Priority: Major
>
> {code:java}
> df.write.option("path", 
> "/already/existing/path").mode(SaveMode.Append).format("json").saveAsTable(db.table)
> {code}
> The above code deleted the data present at the path "/already/existing/path". 
> This happened because the table was not yet present in the Hive metastore, but 
> the given path already had data. If the table is not present in the Hive 
> metastore, the SaveMode gets modified internally to SaveMode.Overwrite 
> irrespective of what the user has provided, which leads to data deletion. This 
> change was introduced as part of 
> https://issues.apache.org/jira/browse/SPARK-19583. 
> Now suppose the user is not using an external Hive metastore (the metastore is 
> associated with a cluster), and the cluster goes down or for some reason the 
> user has to migrate to a new cluster. Once the user tries to save data using 
> the above code on the new cluster, it will first delete the data. It could be 
> production data and the user is completely unaware of this, as they have 
> provided SaveMode.Append or ErrorIfExists. This would be an accidental data 
> deletion.
> Repro Steps:
> 1. Save data through a Hive table as in the code above.
> 2. Create another cluster and save data to a new table in the new cluster, 
> giving the same path.
> Proposed Fix:
> Instead of modifying SaveMode to Overwrite, we should modify it to 
> ErrorIfExists in class CreateDataSourceTableAsSelectCommand.
> Change (line 154)
>  
> {code:java}
> val result = saveDataIntoTable(
>  sparkSession, table, tableLocation, child, SaveMode.Overwrite, tableExists = 
> false)
>  
> {code}
> to
>  
> {code:java}
> val result = saveDataIntoTable(
>  sparkSession, table, tableLocation, child, SaveMode.ErrorIfExists, 
> tableExists = false){code}
> This should not break CTAS. Even in the case of CTAS, the user may not want to 
> delete data that already exists, as the overwrite could be accidental.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31618) Pushdown Distinct through Join in IntersectDistinct based on stats

2020-09-22 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-31618.
--
Resolution: Won't Fix

I'll close this because the corresponding PR has been closed.

> Pushdown Distinct through Join in IntersectDistinct based on stats
> --
>
> Key: SPARK-31618
> URL: https://issues.apache.org/jira/browse/SPARK-31618
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer, SQL
>Affects Versions: 3.0.0
>Reporter: Prakhar Jain
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32870) Make sure that all expressions have their ExpressionDescription properly filled

2020-09-22 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-32870.
--
Fix Version/s: 3.1.0
 Assignee: Tanel Kiis
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/29743

> Make sure that all expressions have their ExpressionDescription properly 
> filled
> ---
>
> Key: SPARK-32870
> URL: https://issues.apache.org/jira/browse/SPARK-32870
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 3.1.0
>Reporter: Tanel Kiis
>Assignee: Tanel Kiis
>Priority: Major
> Fix For: 3.1.0
>
>
> Make sure that all SQL expressions have their usage, examples, and since 
> fields filled in.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32961) PySpark CSV read with UTF-16 encoding is not working correctly

2020-09-22 Thread Bui Bao Anh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bui Bao Anh updated SPARK-32961:

Attachment: sendo_sample.csv

> PySpark CSV read with UTF-16 encoding is not working correctly
> --
>
> Key: SPARK-32961
> URL: https://issues.apache.org/jira/browse/SPARK-32961
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4, 3.0.1
> Environment: both spark local and cluster mode
>Reporter: Bui Bao Anh
>Priority: Major
>  Labels: Correctness
> Attachments: pandas df.png, pyspark df.png, sendo_sample.csv
>
>
> There are weird characters in the output when printing to the console or 
> writing to files.
> See the attached files for how it looks in a Spark DataFrame and a Pandas 
> DataFrame.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27872) Driver and executors use a different service account breaking pull secrets

2020-09-22 Thread Neelesh Srinivas Salian (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17194557#comment-17194557
 ] 

Neelesh Srinivas Salian edited comment on SPARK-27872 at 9/23/20, 12:36 AM:


I have a patch to add this fix to the 2.4.x (currently 2.4.6) release. 
 Should I add it here or a new cloned issue? [~eje]

Have this PR: https://github.com/apache/spark/pull/29844


was (Author: nssalian):
I have a patch to add this fix to the 2.4.x (currently 2.4.6) release. 
Should I add it here or a new cloned issue? [~eje]

> Driver and executors use a different service account breaking pull secrets
> --
>
> Key: SPARK-27872
> URL: https://issues.apache.org/jira/browse/SPARK-27872
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 2.4.3, 3.0.0
>Reporter: Stavros Kontopoulos
>Assignee: Stavros Kontopoulos
>Priority: Major
> Fix For: 3.0.0
>
>
> The driver and executors use different service accounts when the driver has 
> one set up that is different from the default: 
> [https://gist.github.com/skonto/9beb5afa2ec4659ba563cbb0a8b9c4dd]
> This makes the executor pods fail when the user links the driver service 
> account with a pull secret: 
> [https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/#add-imagepullsecrets-to-a-service-account].
> Executors will not use the driver's service account and so will not be able to 
> get the secret needed to pull the related image. 
> I am not sure what the assumption is here for using the default account for 
> executors; probably the fact that this account is limited (executors don't 
> create resources)? This is an inconsistency that could be worked around with 
> the pod template feature in Spark 3.0.0, but it breaks pull secrets and in 
> general I think it's a bug to have it.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27872) Driver and executors use a different service account breaking pull secrets

2020-09-22 Thread Neelesh Srinivas Salian (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17194557#comment-17194557
 ] 

Neelesh Srinivas Salian edited comment on SPARK-27872 at 9/23/20, 12:36 AM:


I have a patch to add this fix to the 2.4.x (currently 2.4.6) release. 
 Should I add it here or a new cloned issue? [~eje]

[|https://github.com/apache/spark/pull/29844]


was (Author: nssalian):
I have a patch to add this fix to the 2.4.x (currently 2.4.6) release. 
 Should I add it here or a new cloned issue? [~eje]

Have this PR: https://github.com/apache/spark/pull/29844

> Driver and executors use a different service account breaking pull secrets
> --
>
> Key: SPARK-27872
> URL: https://issues.apache.org/jira/browse/SPARK-27872
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 2.4.3, 3.0.0
>Reporter: Stavros Kontopoulos
>Assignee: Stavros Kontopoulos
>Priority: Major
> Fix For: 3.0.0
>
>
> The driver and executors use different service accounts when the driver has 
> one set up that is different from the default: 
> [https://gist.github.com/skonto/9beb5afa2ec4659ba563cbb0a8b9c4dd]
> This makes the executor pods fail when the user links the driver service 
> account with a pull secret: 
> [https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/#add-imagepullsecrets-to-a-service-account].
> Executors will not use the driver's service account and so will not be able to 
> get the secret needed to pull the related image. 
> I am not sure what the assumption is here for using the default account for 
> executors; probably the fact that this account is limited (executors don't 
> create resources)? This is an inconsistency that could be worked around with 
> the pod template feature in Spark 3.0.0, but it breaks pull secrets and in 
> general I think it's a bug to have it.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-27872) Driver and executors use a different service account breaking pull secrets

2020-09-22 Thread Neelesh Srinivas Salian (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neelesh Srinivas Salian updated SPARK-27872:

Comment: was deleted

(was: I have a patch to add this fix to the 2.4.x (currently 2.4.6) release. 
 Should I add it here or a new cloned issue? [~eje]

[|https://github.com/apache/spark/pull/29844])

> Driver and executors use a different service account breaking pull secrets
> --
>
> Key: SPARK-27872
> URL: https://issues.apache.org/jira/browse/SPARK-27872
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 2.4.3, 3.0.0
>Reporter: Stavros Kontopoulos
>Assignee: Stavros Kontopoulos
>Priority: Major
> Fix For: 3.0.0
>
>
> The driver and executors use different service accounts when the driver has 
> one set up that is different from the default: 
> [https://gist.github.com/skonto/9beb5afa2ec4659ba563cbb0a8b9c4dd]
> This makes the executor pods fail when the user links the driver service 
> account with a pull secret: 
> [https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/#add-imagepullsecrets-to-a-service-account].
> Executors will not use the driver's service account and so will not be able to 
> get the secret needed to pull the related image. 
> I am not sure what the assumption is here for using the default account for 
> executors; probably the fact that this account is limited (executors don't 
> create resources)? This is an inconsistency that could be worked around with 
> the pod template feature in Spark 3.0.0, but it breaks pull secrets and in 
> general I think it's a bug to have it.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27872) Driver and executors use a different service account breaking pull secrets

2020-09-22 Thread Neelesh Srinivas Salian (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17194557#comment-17194557
 ] 

Neelesh Srinivas Salian edited comment on SPARK-27872 at 9/23/20, 12:36 AM:


I have a patch to add this fix to the 2.4.x (currently 2.4.6) release. 
 Should I add it here or a new cloned issue? [~eje]

[|https://github.com/apache/spark/pull/29844]


was (Author: nssalian):
I have a patch to add this fix to the 2.4.x (currently 2.4.6) release. 
 Should I add it here or a new cloned issue? [~eje]

[|https://github.com/apache/spark/pull/29844]

> Driver and executors use a different service account breaking pull secrets
> --
>
> Key: SPARK-27872
> URL: https://issues.apache.org/jira/browse/SPARK-27872
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 2.4.3, 3.0.0
>Reporter: Stavros Kontopoulos
>Assignee: Stavros Kontopoulos
>Priority: Major
> Fix For: 3.0.0
>
>
> The driver and executors use different service accounts when the driver has 
> one set up that is different from the default: 
> [https://gist.github.com/skonto/9beb5afa2ec4659ba563cbb0a8b9c4dd]
> This makes the executor pods fail when the user links the driver service 
> account with a pull secret: 
> [https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/#add-imagepullsecrets-to-a-service-account].
> Executors will not use the driver's service account and so will not be able to 
> get the secret needed to pull the related image. 
> I am not sure what the assumption is here for using the default account for 
> executors; probably the fact that this account is limited (executors don't 
> create resources)? This is an inconsistency that could be worked around with 
> the pod template feature in Spark 3.0.0, but it breaks pull secrets and in 
> general I think it's a bug to have it.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32017) Make Pyspark Hadoop 3.2+ Variant available in PyPI

2020-09-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32017.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29703
[https://github.com/apache/spark/pull/29703]

> Make Pyspark Hadoop 3.2+ Variant available in PyPI
> --
>
> Key: SPARK-32017
> URL: https://issues.apache.org/jira/browse/SPARK-32017
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: George Pongracz
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.1.0
>
>
> The version of PySpark 3.0.0 currently available in PyPI uses Hadoop 2.7.4.
> Could a variant (or the default) have its version of Hadoop aligned to 3.2.0, 
> as per the downloadable Spark binaries?
> This would enable the PyPI version to be compatible with session token 
> authorisations and assist in accessing data residing in object stores with 
> stronger encryption methods.
> If not PyPI, then please provide it at least as a tar file in the Apache 
> download archives.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32017) Make Pyspark Hadoop 3.2+ Variant available in PyPI

2020-09-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32017:


Assignee: Hyukjin Kwon

> Make Pyspark Hadoop 3.2+ Variant available in PyPI
> --
>
> Key: SPARK-32017
> URL: https://issues.apache.org/jira/browse/SPARK-32017
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: George Pongracz
>Assignee: Hyukjin Kwon
>Priority: Major
>
> The version of PySpark 3.0.0 currently available in PyPI uses Hadoop 2.7.4.
> Could a variant (or the default) have its version of Hadoop aligned to 3.2.0, 
> as per the downloadable Spark binaries?
> This would enable the PyPI version to be compatible with session token 
> authorisations and assist in accessing data residing in object stores with 
> stronger encryption methods.
> If not PyPI, then please provide it at least as a tar file in the Apache 
> download archives.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32933) Use keyword-only syntax for keyword_only methods

2020-09-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32933.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29799
[https://github.com/apache/spark/pull/29799]

> Use keyword-only syntax for keyword_only methods
> 
>
> Key: SPARK-32933
> URL: https://issues.apache.org/jira/browse/SPARK-32933
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Minor
> Fix For: 3.1.0
>
>
> Since 3.0, Python provides syntax for indicating keyword-only arguments ([PEP 
> 3102|https://www.python.org/dev/peps/pep-3102/]).
> It is not a full replacement for our current usage of {{keyword_only}}, but 
> it would allow us to make our expectations explicit:
> {code:python}
> @keyword_only
> def __init__(self, degree=2, inputCol=None, outputCol=None):
> {code}
> {code:python}
> @keyword_only
> def __init__(self, *, degree=2, inputCol=None, outputCol=None):
> {code}
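
A small, hedged illustration of the keyword-only syntax referenced above, using a hypothetical class rather than an actual PySpark ML class; with the bare *, Python itself rejects positional arguments, which is the expectation the existing {{keyword_only}} decorator currently enforces at runtime.

{code:python}
class PolyExpansionLike:
    """Hypothetical stand-in for an ML Params class (not an actual PySpark class)."""

    def __init__(self, *, degree=2, inputCol=None, outputCol=None):
        # Everything after the bare `*` is keyword-only (PEP 3102).
        self.degree = degree
        self.inputCol = inputCol
        self.outputCol = outputCol

ok = PolyExpansionLike(degree=3, inputCol="features")   # accepted

try:
    PolyExpansionLike(3, "features")                     # positional arguments
except TypeError as err:
    # e.g. "__init__() takes 1 positional argument but 3 were given"
    print(err)
{code}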



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32933) Use keyword-only syntax for keyword_only methods

2020-09-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32933:


Assignee: Maciej Szymkiewicz

> Use keyword-only syntax for keyword_only methods
> 
>
> Key: SPARK-32933
> URL: https://issues.apache.org/jira/browse/SPARK-32933
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Minor
>
> Since 3.0, Python provides syntax for indicating keyword-only arguments ([PEP 
> 3102|https://www.python.org/dev/peps/pep-3102/]).
> It is not a full replacement for our current usage of {{keyword_only}}, but 
> it would allow us to make our expectations explicit:
> {code:python}
> @keyword_only
> def __init__(self, degree=2, inputCol=None, outputCol=None):
> {code}
> {code:python}
> @keyword_only
> def __init__(self, *, degree=2, inputCol=None, outputCol=None):
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27872) Driver and executors use a different service account breaking pull secrets

2020-09-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200439#comment-17200439
 ] 

Apache Spark commented on SPARK-27872:
--

User 'nssalian' has created a pull request for this issue:
https://github.com/apache/spark/pull/29844

> Driver and executors use a different service account breaking pull secrets
> --
>
> Key: SPARK-27872
> URL: https://issues.apache.org/jira/browse/SPARK-27872
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 2.4.3, 3.0.0
>Reporter: Stavros Kontopoulos
>Assignee: Stavros Kontopoulos
>Priority: Major
> Fix For: 3.0.0
>
>
> The driver and executors use different service accounts when the driver has 
> one set up that is different from the default: 
> [https://gist.github.com/skonto/9beb5afa2ec4659ba563cbb0a8b9c4dd]
> This makes the executor pods fail when the user links the driver service 
> account with a pull secret: 
> [https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/#add-imagepullsecrets-to-a-service-account].
> Executors will not use the driver's service account and so will not be able to 
> get the secret needed to pull the related image. 
> I am not sure what the assumption is here for using the default account for 
> executors; probably the fact that this account is limited (executors don't 
> create resources)? This is an inconsistency that could be worked around with 
> the pod template feature in Spark 3.0.0, but it breaks pull secrets and in 
> general I think it's a bug to have it.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27872) Driver and executors use a different service account breaking pull secrets

2020-09-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200438#comment-17200438
 ] 

Apache Spark commented on SPARK-27872:
--

User 'nssalian' has created a pull request for this issue:
https://github.com/apache/spark/pull/29844

> Driver and executors use a different service account breaking pull secrets
> --
>
> Key: SPARK-27872
> URL: https://issues.apache.org/jira/browse/SPARK-27872
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 2.4.3, 3.0.0
>Reporter: Stavros Kontopoulos
>Assignee: Stavros Kontopoulos
>Priority: Major
> Fix For: 3.0.0
>
>
> The driver and executors use different service accounts when the driver has 
> one set up that is different from the default: 
> [https://gist.github.com/skonto/9beb5afa2ec4659ba563cbb0a8b9c4dd]
> This makes the executor pods fail when the user links the driver service 
> account with a pull secret: 
> [https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/#add-imagepullsecrets-to-a-service-account].
> Executors will not use the driver's service account and so will not be able to 
> get the secret needed to pull the related image. 
> I am not sure what the assumption is here for using the default account for 
> executors; probably the fact that this account is limited (executors don't 
> create resources)? This is an inconsistency that could be worked around with 
> the pod template feature in Spark 3.0.0, but it breaks pull secrets and in 
> general I think it's a bug to have it.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17556) Executor side broadcast for broadcast joins

2020-09-22 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh reassigned SPARK-17556:
---

Assignee: (was: L. C. Hsieh)

> Executor side broadcast for broadcast joins
> ---
>
> Key: SPARK-17556
> URL: https://issues.apache.org/jira/browse/SPARK-17556
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Reporter: Reynold Xin
>Priority: Major
> Attachments: executor broadcast.pdf, executor-side-broadcast.pdf
>
>
> Currently in Spark SQL, in order to perform a broadcast join, the driver must 
> collect the result of an RDD and then broadcast it. This introduces some 
> extra latency. It might be possible to broadcast directly from executors.
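
For context, a hedged PySpark sketch of how a broadcast join is expressed today (the driver-side broadcast this ticket proposes to avoid); the DataFrames are small stand-ins.

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

large = spark.range(1000000).withColumnRenamed("id", "key")
small = spark.createDataFrame([(i, "dim_%d" % i) for i in range(100)], ["key", "label"])

# With the broadcast hint, the driver first collects `small` and then broadcasts
# it to every executor; executor-side broadcast would avoid that driver round trip.
joined = large.join(broadcast(small), "key")
joined.explain()
{code}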



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-17556) Executor side broadcast for broadcast joins

2020-09-22 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh updated SPARK-17556:

Comment: was deleted

(was: We will recently try to pick this up again.)

> Executor side broadcast for broadcast joins
> ---
>
> Key: SPARK-17556
> URL: https://issues.apache.org/jira/browse/SPARK-17556
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Reporter: Reynold Xin
>Priority: Major
> Attachments: executor broadcast.pdf, executor-side-broadcast.pdf
>
>
> Currently in Spark SQL, in order to perform a broadcast join, the driver must 
> collect the result of an RDD and then broadcast it. This introduces some 
> extra latency. It might be possible to broadcast directly from executors.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32932) AQE local shuffle reader breaks repartitioning for dynamic partition overwrite

2020-09-22 Thread Manu Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manu Zhang updated SPARK-32932:
---
Description: 
With AQE, local shuffle reader breaks users' repartitioning for dynamic 
partition overwrite as in the following case.
{code:java}
test("repartition with local reader") {
  withSQLConf(SQLConf.PARTITION_OVERWRITE_MODE.key -> 
PartitionOverwriteMode.DYNAMIC.toString,
SQLConf.SHUFFLE_PARTITIONS.key -> "5",
SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true") {
withTable("t") {
  val data = for (
i <- 1 to 10;
j <- 1 to 3
  ) yield (i, j)
  data.toDF("a", "b")
.repartition($"b")
.write
.partitionBy("b")
.mode("overwrite")
.saveAsTable("t")
  assert(spark.read.table("t").inputFiles.length == 3)
}
  }
}{code}
-Coalescing shuffle partitions could also break it.-

  was:
With AQE, local shuffle reader breaks users' repartitioning for dynamic 
partition overwrite as in the following case.
{code:java}
test("repartition with local reader") {
  withSQLConf(SQLConf.PARTITION_OVERWRITE_MODE.key -> 
PartitionOverwriteMode.DYNAMIC.toString,
SQLConf.SHUFFLE_PARTITIONS.key -> "5",
SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true") {
withTable("t") {
  val data = for (
i <- 1 to 10;
j <- 1 to 3
  ) yield (i, j)
  data.toDF("a", "b")
.repartition($"b")
.write
.partitionBy("b")
.mode("overwrite")
.saveAsTable("t")
  assert(spark.read.table("t").inputFiles.length == 3)
}
  }
}{code}
Coalescing shuffle partitions could also break it.


> AQE local shuffle reader breaks repartitioning for dynamic partition overwrite
> --
>
> Key: SPARK-32932
> URL: https://issues.apache.org/jira/browse/SPARK-32932
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Priority: Major
>
> With AQE, local shuffle reader breaks users' repartitioning for dynamic 
> partition overwrite as in the following case.
> {code:java}
> test("repartition with local reader") {
>   withSQLConf(SQLConf.PARTITION_OVERWRITE_MODE.key -> 
> PartitionOverwriteMode.DYNAMIC.toString,
> SQLConf.SHUFFLE_PARTITIONS.key -> "5",
> SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true") {
> withTable("t") {
>   val data = for (
> i <- 1 to 10;
> j <- 1 to 3
>   ) yield (i, j)
>   data.toDF("a", "b")
> .repartition($"b")
> .write
> .partitionBy("b")
> .mode("overwrite")
> .saveAsTable("t")
>   assert(spark.read.table("t").inputFiles.length == 3)
> }
>   }
> }{code}
> -Coalescing shuffle partitions could also break it.-



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32956) Duplicate Columns in a csv file

2020-09-22 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200382#comment-17200382
 ] 

Chen Zhang commented on SPARK-32956:


Okay, I will submit a PR later.

> Duplicate Columns in a csv file
> ---
>
> Key: SPARK-32956
> URL: https://issues.apache.org/jira/browse/SPARK-32956
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3, 2.4.4, 2.4.5, 2.4.6, 2.4.7, 3.0.0, 3.0.1
>Reporter: Punit Shah
>Priority: Major
>
> Imagine a csv file shaped like:
> Id,Product,Sale_Amount,Sale_Units,Sale_Amount2,Sale_Amount,Sale_Price
> 1,P,"6,40,728","6,40,728","6,40,728","6,40,728","6,40,728"
> 2,P,"5,81,644","5,81,644","5,81,644","5,81,644","5,81,644"
> Reading this with header=True will result in a stacktrace.
>  
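
A hedged, self-contained sketch reproducing the report: the sample rows above are written to a temporary file and read back with header=True.

{code:python}
import os
import tempfile
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

csv_text = (
    'Id,Product,Sale_Amount,Sale_Units,Sale_Amount2,Sale_Amount,Sale_Price\n'
    '1,P,"6,40,728","6,40,728","6,40,728","6,40,728","6,40,728"\n'
    '2,P,"5,81,644","5,81,644","5,81,644","5,81,644","5,81,644"\n'
)
path = os.path.join(tempfile.mkdtemp(), "dup_cols.csv")
with open(path, "w") as f:
    f.write(csv_text)

# Per the report, reading with header=True fails with an error about the
# duplicate "Sale_Amount" column instead of returning a DataFrame.
df = spark.read.csv(path, header=True)
df.show()
{code}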



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29250) Upgrade to Hadoop 3.2.1

2020-09-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200366#comment-17200366
 ] 

Apache Spark commented on SPARK-29250:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/29843

> Upgrade to Hadoop 3.2.1
> ---
>
> Key: SPARK-29250
> URL: https://issues.apache.org/jira/browse/SPARK-29250
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29250) Upgrade to Hadoop 3.2.1

2020-09-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-29250:


Assignee: Apache Spark

> Upgrade to Hadoop 3.2.1
> ---
>
> Key: SPARK-29250
> URL: https://issues.apache.org/jira/browse/SPARK-29250
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29250) Upgrade to Hadoop 3.2.1

2020-09-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-29250:


Assignee: (was: Apache Spark)

> Upgrade to Hadoop 3.2.1
> ---
>
> Key: SPARK-29250
> URL: https://issues.apache.org/jira/browse/SPARK-29250
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29250) Upgrade to Hadoop 3.2.1

2020-09-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200365#comment-17200365
 ] 

Apache Spark commented on SPARK-29250:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/29843

> Upgrade to Hadoop 3.2.1
> ---
>
> Key: SPARK-29250
> URL: https://issues.apache.org/jira/browse/SPARK-29250
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32306) `approx_percentile` in Spark SQL gives incorrect results

2020-09-22 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh updated SPARK-32306:

Issue Type: Documentation  (was: Bug)

> `approx_percentile` in Spark SQL gives incorrect results
> 
>
> Key: SPARK-32306
> URL: https://issues.apache.org/jira/browse/SPARK-32306
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark, SQL
>Affects Versions: 2.4.4, 3.0.0, 3.1.0
>Reporter: Sean Malory
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.1.0
>
>
> The `approx_percentile` function in Spark SQL does not give the correct 
> result. I'm not sure how incorrect it is; it may just be a boundary issue. 
> From the docs:
> {quote}The accuracy parameter (default: 1) is a positive numeric literal 
> which controls approximation accuracy at the cost of memory. Higher value of 
> accuracy yields better accuracy, 1.0/accuracy is the relative error of the 
> approximation.
> {quote}
> This is not true. Here is a minimal example in `pyspark` where, essentially, 
> the median of 5 and 8 is being calculated as 5:
> {code:python}
> import pyspark.sql.functions as psf
> df = spark.createDataFrame(
>     [('bar', 5), ('bar', 8)], ['name', 'val']
> )
> median = psf.expr('percentile_approx(val, 0.5, 2147483647)')
> df.groupBy('name').agg(median.alias('median'))  # gives the median as 5
> {code}
> I've tested this with Spark v2.4.4 and pyspark v2.4.5, although I suspect this 
> is an issue with the underlying algorithm.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32306) `approx_percentile` in Spark SQL gives incorrect results

2020-09-22 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200337#comment-17200337
 ] 

L. C. Hsieh commented on SPARK-32306:
-

Resolved by https://github.com/apache/spark/pull/29835.

> `approx_percentile` in Spark SQL gives incorrect results
> 
>
> Key: SPARK-32306
> URL: https://issues.apache.org/jira/browse/SPARK-32306
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.4
>Reporter: Sean Malory
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.1.0
>
>
> The `approx_percentile` function in Spark SQL does not give the correct 
> result. I'm not sure how incorrect it is; it may just be a boundary issue. 
> From the docs:
> {quote}The accuracy parameter (default: 1) is a positive numeric literal 
> which controls approximation accuracy at the cost of memory. Higher value of 
> accuracy yields better accuracy, 1.0/accuracy is the relative error of the 
> approximation.
> {quote}
> This is not true. Here is a minimal example in `pyspark` where, essentially, 
> the median of 5 and 8 is being calculated as 5:
> {code:python}
> import pyspark.sql.functions as psf
> df = spark.createDataFrame(
>     [('bar', 5), ('bar', 8)], ['name', 'val']
> )
> median = psf.expr('percentile_approx(val, 0.5, 2147483647)')
> df.groupBy('name').agg(median.alias('median'))  # gives the median as 5
> {code}
> I've tested this with Spark v2.4.4 and pyspark v2.4.5, although I suspect this 
> is an issue with the underlying algorithm.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32306) `approx_percentile` in Spark SQL gives incorrect results

2020-09-22 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh updated SPARK-32306:

Affects Version/s: 3.1.0
   3.0.0

> `approx_percentile` in Spark SQL gives incorrect results
> 
>
> Key: SPARK-32306
> URL: https://issues.apache.org/jira/browse/SPARK-32306
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.4, 3.0.0, 3.1.0
>Reporter: Sean Malory
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.1.0
>
>
> The `approx_percentile` function in Spark SQL does not give the correct 
> result. I'm not sure how incorrect it is; it may just be a boundary issue. 
> From the docs:
> {quote}The accuracy parameter (default: 1) is a positive numeric literal 
> which controls approximation accuracy at the cost of memory. Higher value of 
> accuracy yields better accuracy, 1.0/accuracy is the relative error of the 
> approximation.
> {quote}
> This is not true. Here is a minimal example in `pyspark` where, essentially, 
> the median of 5 and 8 is being calculated as 5:
> {code:python}
> import pyspark.sql.functions as psf
> df = spark.createDataFrame(
>     [('bar', 5), ('bar', 8)], ['name', 'val']
> )
> median = psf.expr('percentile_approx(val, 0.5, 2147483647)')
> df.groupBy('name').agg(median.alias('median'))  # gives the median as 5
> {code}
> I've tested this with Spark v2.4.4 and pyspark v2.4.5, although I suspect this 
> is an issue with the underlying algorithm.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32306) `approx_percentile` in Spark SQL gives incorrect results

2020-09-22 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh resolved SPARK-32306.
-
Fix Version/s: 3.1.0
 Assignee: Maxim Gekk
   Resolution: Fixed

> `approx_percentile` in Spark SQL gives incorrect results
> 
>
> Key: SPARK-32306
> URL: https://issues.apache.org/jira/browse/SPARK-32306
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.4
>Reporter: Sean Malory
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.1.0
>
>
> The `approx_percentile` function in Spark SQL does not give the correct 
> result. I'm not sure how incorrect it is; it may just be a boundary issue. 
> From the docs:
> {quote}The accuracy parameter (default: 1) is a positive numeric literal 
> which controls approximation accuracy at the cost of memory. Higher value of 
> accuracy yields better accuracy, 1.0/accuracy is the relative error of the 
> approximation.
> {quote}
> This is not true. Here is a minimal example in `pyspark` where, essentially, 
> the median of 5 and 8 is being calculated as 5:
> {code:python}
> import pyspark.sql.functions as psf
> df = spark.createDataFrame(
>     [('bar', 5), ('bar', 8)], ['name', 'val']
> )
> median = psf.expr('percentile_approx(val, 0.5, 2147483647)')
> df.groupBy('name').agg(median.alias('median'))  # gives the median as 5
> {code}
> I've tested this with Spark v2.4.4 and pyspark v2.4.5, although I suspect this 
> is an issue with the underlying algorithm.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32019) Add spark.sql.files.minPartitionNum config

2020-09-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200327#comment-17200327
 ] 

Apache Spark commented on SPARK-32019:
--

User 'tanelk' has created a pull request for this issue:
https://github.com/apache/spark/pull/29842

> Add spark.sql.files.minPartitionNum config
> --
>
> Key: SPARK-32019
> URL: https://issues.apache.org/jira/browse/SPARK-32019
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Minor
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32019) Add spark.sql.files.minPartitionNum config

2020-09-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200325#comment-17200325
 ] 

Apache Spark commented on SPARK-32019:
--

User 'tanelk' has created a pull request for this issue:
https://github.com/apache/spark/pull/29842

> Add spark.sql.files.minPartitionNum config
> --
>
> Key: SPARK-32019
> URL: https://issues.apache.org/jira/browse/SPARK-32019
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Minor
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32019) Add spark.sql.files.minPartitionNum config

2020-09-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200326#comment-17200326
 ] 

Apache Spark commented on SPARK-32019:
--

User 'tanelk' has created a pull request for this issue:
https://github.com/apache/spark/pull/29842

> Add spark.sql.files.minPartitionNum config
> --
>
> Key: SPARK-32019
> URL: https://issues.apache.org/jira/browse/SPARK-32019
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Minor
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32970) Reduce the runtime of unit test for SPARK-32019

2020-09-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32970:


Assignee: (was: Apache Spark)

> Reduce the runtime of unit test for SPARK-32019
> ---
>
> Key: SPARK-32970
> URL: https://issues.apache.org/jira/browse/SPARK-32970
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Tanel Kiis
>Priority: Major
>  Labels: Test
>
> The UT for SPARK-32019 can take over 7 minutes on Jenkins.
> A simple UT like this should run in a few seconds - definitely in less than 
> a minute.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32970) Reduce the runtime of unit test for SPARK-32019

2020-09-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200323#comment-17200323
 ] 

Apache Spark commented on SPARK-32970:
--

User 'tanelk' has created a pull request for this issue:
https://github.com/apache/spark/pull/29842

> Reduce the runtime of unit test for SPARK-32019
> ---
>
> Key: SPARK-32970
> URL: https://issues.apache.org/jira/browse/SPARK-32970
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Tanel Kiis
>Priority: Major
>  Labels: Test
>
> The UT for SPARK-32019 can take over 7 minutes on Jenkins.
> A simple UT like this should run in a few seconds - definitely in less than 
> a minute.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32970) Reduce the runtime of unit test for SPARK-32019

2020-09-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32970:


Assignee: Apache Spark

> Reduce the runtime of unit test for SPARK-32019
> ---
>
> Key: SPARK-32970
> URL: https://issues.apache.org/jira/browse/SPARK-32970
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Tanel Kiis
>Assignee: Apache Spark
>Priority: Major
>  Labels: Test
>
> The UT for SPARK-32019 can take over 7 minutes on Jenkins.
> A simple UT like this should run in a few seconds - definitely in less than 
> a minute.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27733) Upgrade to Avro 1.10.0

2020-09-22 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200307#comment-17200307
 ] 

Xinli Shang commented on SPARK-27733:
-

We talked about the Parquet 1.11.0 adoption in Spark in today's Parquet 
community sync meeting. The Parquet community would like to help if there is 
any way to move faster. [~csun][~smilegator][~dongjoon][~iemejia] and others, 
are you interested in joining our next Parquet meeting to brainstorm solutions 
to move forward? 

> Upgrade to Avro 1.10.0
> --
>
> Key: SPARK-27733
> URL: https://issues.apache.org/jira/browse/SPARK-27733
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, SQL
>Affects Versions: 3.1.0
>Reporter: Ismaël Mejía
>Priority: Minor
>
> Avro 1.9.2 was released with many nice features, including reduced size (1 MB 
> less), removed dependencies (no paranamer, no shaded guava), and security 
> updates, so it is probably a worthwhile upgrade.
> Avro 1.10.0 has since been released and this upgrade is still not done.
> At the moment (2020/08) there is still a blocker: Hive-related transitive 
> dependencies bring in older versions of Avro, so this remains blocked until 
> HIVE-21737 is solved.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32970) Reduce the runtime of unit test for SPARK-32019

2020-09-22 Thread Tanel Kiis (Jira)
Tanel Kiis created SPARK-32970:
--

 Summary: Reduce the runtime of unit test for SPARK-32019
 Key: SPARK-32970
 URL: https://issues.apache.org/jira/browse/SPARK-32970
 Project: Spark
  Issue Type: Improvement
  Components: SQL, Tests
Affects Versions: 3.1.0
Reporter: Tanel Kiis


The UT for SPARK-32019 can take over 7 minutes on Jenkins.
A simple UT like this should run in a few seconds - definitely in less than a 
minute.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32969) Spark Submit process not exiting after session.stop()

2020-09-22 Thread El R (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

El R updated SPARK-32969:
-
Affects Version/s: (was: 3.0.1)

> Spark Submit process not exiting after session.stop()
> -
>
> Key: SPARK-32969
> URL: https://issues.apache.org/jira/browse/SPARK-32969
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Submit
>Affects Versions: 2.4.7
>Reporter: El R
>Priority: Critical
>
> Exactly three spark-submit processes are left hanging from the first three 
> jobs that were submitted to the standalone cluster in client mode. Example 
> from the client:
> {code:java}
> root 1517 0.3 4.7 8412728 1532876 ? Sl 18:49 0:38 
> /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp 
> /usr/local/spark/conf/:/usr/local/spark/jars/*:/usr/local/hadoop-2.7.7/etc/hadoop/:/usr/local/hadoop-2.7.7/share/hadoop/common/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/common/*:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/*:/usr/local/hadoop-2.7.7/share/hadoop/yarn/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/yarn/*:/usr/local/hadoop-2.7.7/share/hadoop/mapreduce/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/mapreduce/*:/usr/local/hadoop/contrib/capacity-scheduler/*.jar
>  -Xmx2g org.apache.spark.deploy.SparkSubmit --conf spark.driver.port=46101 
> --conf spark.master=spark://3c520b0c6d6e:7077 --conf 
> spark.scheduler.allocation.file=/home/jovyan/work/spark_scheduler_allocation.xml
>  --conf spark.app.name=REDACTED --conf spark.driver.bindAddress=3c520b0c6d6e 
> --conf spark.fileserver.port=46102 --conf 
> packages=org.apache.kudu:kudu-spark2_2.11:1.12.0 --conf 
> spark.broadcast.port=46103 --conf spark.driver.host=3c520b0c6d6e --conf 
> spark.replClassServer.port=46104 --conf 
> spark.executorEnv.AF_ALERTS_STREAM_KEY=ALERTS_STREAM_LIST --conf 
> spark.scheduler.mode=FAIR --conf spark.shuffle.service.enabled=True --conf 
> spark.blockManager.port=46105 --conf spark.dynamicAllocation.enabled=true 
> pyspark-shell 
> root 1746 0.4 3.5 8152640 1132420 ? Sl 18:59 0:36 
> /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp 
> /usr/local/spark/conf/:/usr/local/spark/jars/*:/usr/local/hadoop-2.7.7/etc/hadoop/:/usr/local/hadoop-2.7.7/share/hadoop/common/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/common/*:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/*:/usr/local/hadoop-2.7.7/share/hadoop/yarn/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/yarn/*:/usr/local/hadoop-2.7.7/share/hadoop/mapreduce/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/mapreduce/*:/usr/local/hadoop/contrib/capacity-scheduler/*.jar
>  -Xmx2g org.apache.spark.deploy.SparkSubmit --conf spark.driver.port=46101 
> --conf spark.master=spark://3c520b0c6d6e:7077 --conf 
> spark.scheduler.allocation.file=/home/jovyan/work/spark_scheduler_allocation.xml
>  --conf spark.app.name=REDACTED --conf spark.driver.bindAddress=3c520b0c6d6e 
> --conf spark.fileserver.port=46102 --conf 
> packages=org.apache.kudu:kudu-spark2_2.11:1.12.0 --conf 
> spark.broadcast.port=46103 --conf spark.driver.host=3c520b0c6d6e --conf 
> spark.replClassServer.port=46104 --conf 
> spark.executorEnv.AF_ALERTS_STREAM_KEY=ALERTS_STREAM_LIST --conf 
> spark.scheduler.mode=FAIR --conf spark.shuffle.service.enabled=True --conf 
> spark.blockManager.port=46105 --conf spark.dynamicAllocation.enabled=true 
> pyspark-shell 
> root 2239 65.3 7.8 9743456 2527236 ? Sl 19:10 91:30 
> /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp 
> /usr/local/spark/conf/:/usr/local/spark/jars/*:/usr/local/hadoop-2.7.7/etc/hadoop/:/usr/local/hadoop-2.7.7/share/hadoop/common/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/common/*:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/*:/usr/local/hadoop-2.7.7/share/hadoop/yarn/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/yarn/*:/usr/local/hadoop-2.7.7/share/hadoop/mapreduce/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/mapreduce/*:/usr/local/hadoop/contrib/capacity-scheduler/*.jar
>  -Xmx2g org.apache.spark.deploy.SparkSubmit --conf spark.driver.port=46101 
> --conf spark.master=spark://3c520b0c6d6e:7077 --conf 
> spark.scheduler.allocation.file=/home/jovyan/work/spark_scheduler_allocation.xml
>  --conf spark.app.name=REDACTED --conf spark.driver.bindAddress=3c520b0c6d6e 
> --conf spark.fileserver.port=46102 --conf 
> packages=org.apache.kudu:kudu-spark2_2.11:1.12.0 --conf 
> spark.broadcast.port=46103 --conf spark.driver.host=3c520b0c6d6e --conf 
> spark.replClassServer.port=46104 --conf 
> spark.executorEnv.AF_ALERTS_STREAM_KEY=ALERTS_STREAM_LIST --conf 
> spark.scheduler.mode=FAIR --conf spark.shuffle.service.enabled=True 

[jira] [Created] (SPARK-32969) Spark Submit process not exiting after session.stop()

2020-09-22 Thread El R (Jira)
El R created SPARK-32969:


 Summary: Spark Submit process not exiting after session.stop()
 Key: SPARK-32969
 URL: https://issues.apache.org/jira/browse/SPARK-32969
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Submit
Affects Versions: 3.0.1, 2.4.7
Reporter: El R


Exactly three spark-submit processes are left hanging from the first three jobs 
that were submitted to the standalone cluster in client mode. Example from the 
client:
{code:java}
root 1517 0.3 4.7 8412728 1532876 ? Sl 18:49 0:38 
/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp 
/usr/local/spark/conf/:/usr/local/spark/jars/*:/usr/local/hadoop-2.7.7/etc/hadoop/:/usr/local/hadoop-2.7.7/share/hadoop/common/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/common/*:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/*:/usr/local/hadoop-2.7.7/share/hadoop/yarn/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/yarn/*:/usr/local/hadoop-2.7.7/share/hadoop/mapreduce/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/mapreduce/*:/usr/local/hadoop/contrib/capacity-scheduler/*.jar
 -Xmx2g org.apache.spark.deploy.SparkSubmit --conf spark.driver.port=46101 
--conf spark.master=spark://3c520b0c6d6e:7077 --conf 
spark.scheduler.allocation.file=/home/jovyan/work/spark_scheduler_allocation.xml
 --conf spark.app.name=REDACTED --conf spark.driver.bindAddress=3c520b0c6d6e 
--conf spark.fileserver.port=46102 --conf 
packages=org.apache.kudu:kudu-spark2_2.11:1.12.0 --conf 
spark.broadcast.port=46103 --conf spark.driver.host=3c520b0c6d6e --conf 
spark.replClassServer.port=46104 --conf 
spark.executorEnv.AF_ALERTS_STREAM_KEY=ALERTS_STREAM_LIST --conf 
spark.scheduler.mode=FAIR --conf spark.shuffle.service.enabled=True --conf 
spark.blockManager.port=46105 --conf spark.dynamicAllocation.enabled=true 
pyspark-shell 
root 1746 0.4 3.5 8152640 1132420 ? Sl 18:59 0:36 
/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp 
/usr/local/spark/conf/:/usr/local/spark/jars/*:/usr/local/hadoop-2.7.7/etc/hadoop/:/usr/local/hadoop-2.7.7/share/hadoop/common/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/common/*:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/*:/usr/local/hadoop-2.7.7/share/hadoop/yarn/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/yarn/*:/usr/local/hadoop-2.7.7/share/hadoop/mapreduce/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/mapreduce/*:/usr/local/hadoop/contrib/capacity-scheduler/*.jar
 -Xmx2g org.apache.spark.deploy.SparkSubmit --conf spark.driver.port=46101 
--conf spark.master=spark://3c520b0c6d6e:7077 --conf 
spark.scheduler.allocation.file=/home/jovyan/work/spark_scheduler_allocation.xml
 --conf spark.app.name=REDACTED --conf spark.driver.bindAddress=3c520b0c6d6e 
--conf spark.fileserver.port=46102 --conf 
packages=org.apache.kudu:kudu-spark2_2.11:1.12.0 --conf 
spark.broadcast.port=46103 --conf spark.driver.host=3c520b0c6d6e --conf 
spark.replClassServer.port=46104 --conf 
spark.executorEnv.AF_ALERTS_STREAM_KEY=ALERTS_STREAM_LIST --conf 
spark.scheduler.mode=FAIR --conf spark.shuffle.service.enabled=True --conf 
spark.blockManager.port=46105 --conf spark.dynamicAllocation.enabled=true 
pyspark-shell 
root 2239 65.3 7.8 9743456 2527236 ? Sl 19:10 91:30 
/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp 
/usr/local/spark/conf/:/usr/local/spark/jars/*:/usr/local/hadoop-2.7.7/etc/hadoop/:/usr/local/hadoop-2.7.7/share/hadoop/common/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/common/*:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/*:/usr/local/hadoop-2.7.7/share/hadoop/yarn/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/yarn/*:/usr/local/hadoop-2.7.7/share/hadoop/mapreduce/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/mapreduce/*:/usr/local/hadoop/contrib/capacity-scheduler/*.jar
 -Xmx2g org.apache.spark.deploy.SparkSubmit --conf spark.driver.port=46101 
--conf spark.master=spark://3c520b0c6d6e:7077 --conf 
spark.scheduler.allocation.file=/home/jovyan/work/spark_scheduler_allocation.xml
 --conf spark.app.name=REDACTED --conf spark.driver.bindAddress=3c520b0c6d6e 
--conf spark.fileserver.port=46102 --conf 
packages=org.apache.kudu:kudu-spark2_2.11:1.12.0 --conf 
spark.broadcast.port=46103 --conf spark.driver.host=3c520b0c6d6e --conf 
spark.replClassServer.port=46104 --conf 
spark.executorEnv.AF_ALERTS_STREAM_KEY=ALERTS_STREAM_LIST --conf 
spark.scheduler.mode=FAIR --conf spark.shuffle.service.enabled=True --conf 
spark.blockManager.port=46105 --conf spark.dynamicAllocation.enabled=true 
pyspark-shell
 
{code}
The corresponding jobs are showing as 'completed' in the Spark UI and have 
closed their sessions & exited according to their logs. No worker resources are 
being consumed by these jobs anymore & subsequent jobs are able to receive 

[jira] [Commented] (SPARK-20525) ClassCast exception when interpreting UDFs from a String in spark-shell

2020-09-22 Thread Igor Kamyshnikov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-20525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200284#comment-17200284
 ] 

Igor Kamyshnikov commented on SPARK-20525:
--

I bet the issue is in JDK, but it could be solved in scala if they get rid of 
writeReplace/List$SerializationProxy. I've left some details 
[here|https://issues.apache.org/jira/browse/SPARK-19938?focusedCommentId=17200272=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17200272]
 in SPARK-19938.

> ClassCast exception when interpreting UDFs from a String in spark-shell
> ---
>
> Key: SPARK-20525
> URL: https://issues.apache.org/jira/browse/SPARK-20525
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Shell
>Affects Versions: 2.1.0
> Environment: OS X 10.11.6, spark-2.1.0-bin-hadoop2.7, Scala version 
> 2.11.8 (bundled w/ Spark), Java 1.8.0_121
>Reporter: Dave Knoester
>Priority: Major
>  Labels: bulk-closed
> Attachments: UdfTest.scala
>
>
> I'm trying to interpret a string containing Scala code from inside a Spark 
> session. Everything is working fine, except for User Defined Function-like 
> things (UDFs, map, flatMap, etc).  This is a blocker for production launch of 
> a large number of Spark jobs.
> I've been able to boil the problem down to a number of spark-shell examples, 
> shown below.  Because it's reproducible in the spark-shell, these related 
> issues **don't apply**:
> https://issues.apache.org/jira/browse/SPARK-9219
> https://issues.apache.org/jira/browse/SPARK-18075
> https://issues.apache.org/jira/browse/SPARK-19938
> http://apache-spark-developers-list.1001551.n3.nabble.com/This-Exception-has-been-really-hard-to-trace-td19362.html
> https://community.mapr.com/thread/21488-spark-error-scalacollectionseq-in-instance-of-orgapachesparkrddmappartitionsrdd
> https://github.com/scala/bug/issues/9237
> Any help is appreciated!
> 
> Repro: 
> Run each of the below from a spark-shell.  
> Preamble:
> import scala.tools.nsc.GenericRunnerSettings
> import scala.tools.nsc.interpreter.IMain
> val settings = new GenericRunnerSettings( println _ )
> settings.usejavacp.value = true
> val interpreter = new IMain(settings, new java.io.PrintWriter(System.out))
> interpreter.bind("spark", spark);
> These work:
> // works:
> interpreter.interpret("val x = 5")
> // works:
> interpreter.interpret("import spark.implicits._\nval df = 
> spark.sparkContext.parallelize(Seq(\"foo\",\"bar\")).toDF.show")
> These do not work:
> // doesn't work, fails with seq/RDD serialization error:
> interpreter.interpret("import org.apache.spark.sql.functions._\nimport 
> spark.implicits._\nval upper: String => String = _.toUpperCase\nval upperUDF 
> = 
> udf(upper)\nspark.sparkContext.parallelize(Seq(\"foo\",\"bar\")).toDF.withColumn(\"UPPER\",
>  upperUDF($\"value\")).show")
> // doesn't work, fails with seq/RDD serialization error:
> interpreter.interpret("import org.apache.spark.sql.functions._\nimport 
> spark.implicits._\nval upper: String => String = 
> _.toUpperCase\nspark.udf.register(\"myUpper\", 
> upper)\nspark.sparkContext.parallelize(Seq(\"foo\",\"bar\")).toDF.withColumn(\"UPPER\",
>  callUDF(\"myUpper\", ($\"value\"))).show")
> The not-working ones fail with this exception:
> Caused by: java.lang.ClassCastException: cannot assign instance of 
> scala.collection.immutable.List$SerializationProxy to field 
> org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type 
> scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
>   at 
> java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2133)
>   at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1305)
>   at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2237)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2155)
>   at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2013)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
>   at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2231)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2155)
>   at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2013)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:80)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   

[jira] [Resolved] (SPARK-32964) Pass all `streaming` module UTs in Scala 2.13

2020-09-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-32964.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29836
[https://github.com/apache/spark/pull/29836]

> Pass all `streaming` module UTs in Scala 2.13
> -
>
> Key: SPARK-32964
> URL: https://issues.apache.org/jira/browse/SPARK-32964
> Project: Spark
>  Issue Type: Sub-task
>  Components: DStreams, Spark Core
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.1.0
>
>
> There is only one failing case in the `streaming` module in Scala 2.13:
>  * `start with non-serializable DStream checkpoint ` in StreamingContextSuite
> A StackOverflowError is thrown when the SerializationDebugger#visit method is 
> called.
> The error stack is as follows:
> {code:java}
> Expected exception java.io.NotSerializableException to be thrown, but 
> java.lang.StackOverflowError was thrownExpected exception 
> java.io.NotSerializableException to be thrown, but 
> java.lang.StackOverflowError was thrownScalaTestFailureLocation: 
> org.apache.spark.streaming.StreamingContextSuite at 
> (StreamingContextSuite.scala:159)org.scalatest.exceptions.TestFailedException:
>  Expected exception java.io.NotSerializableException to be thrown, but 
> java.lang.StackOverflowError was thrown at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) 
> at 
> org.scalatest.funsuite.AnyFunSuite.newAssertionFailedException(AnyFunSuite.scala:1562)
>  at org.scalatest.Assertions.intercept(Assertions.scala:756) at 
> org.scalatest.Assertions.intercept$(Assertions.scala:746) at 
> org.scalatest.funsuite.AnyFunSuite.intercept(AnyFunSuite.scala:1562) at 
> org.apache.spark.streaming.StreamingContextSuite.$anonfun$new$13(StreamingContextSuite.scala:159)
>  ...Caused by: java.lang.StackOverflowError at 
> org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1397) at 
> org.apache.spark.streaming.dstream.DStream.writeObject(DStream.scala:513) at 
> sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source) at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1140) at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) 
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at 
> java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at 
> scala.collection.generic.DefaultSerializationProxy.$anonfun$writeObject$1(DefaultSerializationProxy.scala:38)
>  at 
> scala.collection.generic.DefaultSerializationProxy.$anonfun$writeObject$1$adapted(DefaultSerializationProxy.scala:37)
>  at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:553) at 
> scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:551) at 
> scala.collection.AbstractIterable.foreach(Iterable.scala:920) at 
> scala.collection.generic.DefaultSerializationProxy.writeObject(DefaultSerializationProxy.scala:37)
>  at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source) at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1140) at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) 
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at 
> java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializableWithWriteObjectMethod(SerializationDebugger.scala:230)
>  at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:189)
>  at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:108)
>  at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializableWithWriteObjectMethod(SerializationDebugger.scala:243)
>  at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:189)
>  at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:108)
>  at 
> 

[jira] [Assigned] (SPARK-32964) Pass all `streaming` module UTs in Scala 2.13

2020-09-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-32964:
-

Assignee: Yang Jie

> Pass all `streaming` module UTs in Scala 2.13
> -
>
> Key: SPARK-32964
> URL: https://issues.apache.org/jira/browse/SPARK-32964
> Project: Spark
>  Issue Type: Sub-task
>  Components: DStreams, Spark Core
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
> There is only one failing case in the `streaming` module in Scala 2.13:
>  * `start with non-serializable DStream checkpoint ` in StreamingContextSuite
> A StackOverflowError is thrown when the SerializationDebugger#visit method is 
> called.
> The error stack is as follows:
> {code:java}
> Expected exception java.io.NotSerializableException to be thrown, but 
> java.lang.StackOverflowError was thrownExpected exception 
> java.io.NotSerializableException to be thrown, but 
> java.lang.StackOverflowError was thrownScalaTestFailureLocation: 
> org.apache.spark.streaming.StreamingContextSuite at 
> (StreamingContextSuite.scala:159)org.scalatest.exceptions.TestFailedException:
>  Expected exception java.io.NotSerializableException to be thrown, but 
> java.lang.StackOverflowError was thrown at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) 
> at 
> org.scalatest.funsuite.AnyFunSuite.newAssertionFailedException(AnyFunSuite.scala:1562)
>  at org.scalatest.Assertions.intercept(Assertions.scala:756) at 
> org.scalatest.Assertions.intercept$(Assertions.scala:746) at 
> org.scalatest.funsuite.AnyFunSuite.intercept(AnyFunSuite.scala:1562) at 
> org.apache.spark.streaming.StreamingContextSuite.$anonfun$new$13(StreamingContextSuite.scala:159)
>  ...Caused by: java.lang.StackOverflowError at 
> org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1397) at 
> org.apache.spark.streaming.dstream.DStream.writeObject(DStream.scala:513) at 
> sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source) at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1140) at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) 
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at 
> java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at 
> scala.collection.generic.DefaultSerializationProxy.$anonfun$writeObject$1(DefaultSerializationProxy.scala:38)
>  at 
> scala.collection.generic.DefaultSerializationProxy.$anonfun$writeObject$1$adapted(DefaultSerializationProxy.scala:37)
>  at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:553) at 
> scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:551) at 
> scala.collection.AbstractIterable.foreach(Iterable.scala:920) at 
> scala.collection.generic.DefaultSerializationProxy.writeObject(DefaultSerializationProxy.scala:37)
>  at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source) at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1140) at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) 
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at 
> java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializableWithWriteObjectMethod(SerializationDebugger.scala:230)
>  at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:189)
>  at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:108)
>  at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializableWithWriteObjectMethod(SerializationDebugger.scala:243)
>  at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:189)
>  at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:108)
>  at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializableWithWriteObjectMethod(SerializationDebugger.scala:243)
>  at 
> 

[jira] [Comment Edited] (SPARK-19938) java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field

2020-09-22 Thread Igor Kamyshnikov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-19938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200272#comment-17200272
 ] 

Igor Kamyshnikov edited comment on SPARK-19938 at 9/22/20, 5:55 PM:


[~rdblue], my analysis shows the different root cause of the problem:
https://bugs.openjdk.java.net/browse/JDK-8024931 (never fixed)
https://github.com/scala/bug/issues/9777 (asking scala to solve on their side)

It's about circular references among the objects being serialized:

RDD1.dependencies_ = Seq1[RDD2]
RDD2.dependencies_ = Seq2[RDD3]
RDD3 with some Dataset/catalyst magic can refer back to the Seq1[RDD2]

The Seqs are instances of scala.collection.immutable.List, which uses 
writeReplace to emit a 'SerializationProxy'. Serializing RDD3 therefore writes 
a back-reference to Seq1's SerializationProxy. During deserialization, that 
back-reference is resolved before the proxy's 'readResolve' method is called 
(see the JDK bug reported above).


was (Author: kamyshnikov):
[~rdblue], my analysis shows the different root cause of the problem:
https://bugs.openjdk.java.net/browse/JDK-8024931
https://github.com/scala/bug/issues/9777

It's about circular references among the objects being serialized:

RDD1.dependencies_ = Seq1[RDD2]
RDD2.dependences_ = Seq2[RDD3]
RDD3 with some Dataset/catalyst magic can refer back to the Seq1[RDD2]

Seq are instances of scala.collection.immutable.List which uses writeReplace, 
giving an instance of 'SerializationProxy'. The serialization of RDD3 puts a 
reference to the Seq1's SerializationProxy. When the deserialization works, it 
reads that reference to SerializationProxy earlier than the 'readResolve' 
method is called (see the JDK bug reported).

> java.lang.ClassCastException: cannot assign instance of 
> scala.collection.immutable.List$SerializationProxy to field
> ---
>
> Key: SPARK-19938
> URL: https://issues.apache.org/jira/browse/SPARK-19938
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.0.2
>Reporter: srinivas thallam
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19938) java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field

2020-09-22 Thread Igor Kamyshnikov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-19938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200272#comment-17200272
 ] 

Igor Kamyshnikov commented on SPARK-19938:
--

[~rdblue], my analysis shows the different root cause of the problem:
https://bugs.openjdk.java.net/browse/JDK-8024931
https://github.com/scala/bug/issues/9777

It's about circular references among the objects being serialized:

RDD1.dependencies_ = Seq1[RDD2]
RDD2.dependencies_ = Seq2[RDD3]
RDD3 with some Dataset/catalyst magic can refer back to the Seq1[RDD2]

The Seqs are instances of scala.collection.immutable.List, which uses 
writeReplace to emit a 'SerializationProxy'. Serializing RDD3 therefore writes 
a back-reference to Seq1's SerializationProxy. During deserialization, that 
back-reference is resolved before the proxy's 'readResolve' method is called 
(see the JDK bug reported above).

> java.lang.ClassCastException: cannot assign instance of 
> scala.collection.immutable.List$SerializationProxy to field
> ---
>
> Key: SPARK-19938
> URL: https://issues.apache.org/jira/browse/SPARK-19938
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.0.2
>Reporter: srinivas thallam
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32968) Column pruning for CsvToStructs

2020-09-22 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh updated SPARK-32968:

Description: 
We could do column pruning for CsvToStructs expression if we only require some 
fields from it.


> Column pruning for CsvToStructs
> ---
>
> Key: SPARK-32968
> URL: https://issues.apache.org/jira/browse/SPARK-32968
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> We could do column pruning for CsvToStructs expression if we only require 
> some fields from it.
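
A sketch of the query shape where such pruning would pay off (assumes an 
active SparkSession named `spark`; the data and schema below are placeholders):
{code:python}
import pyspark.sql.functions as F

df = spark.createDataFrame([("1,abc,2020-09-22",)], ["csv"])
parsed = df.select(F.from_csv("csv", "a INT, b STRING, c DATE").alias("s"))
# Only field `a` is needed downstream, so `b` and `c` are pruning candidates.
parsed.select("s.a").explain(True)
{code}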



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32968) Column pruning for CsvToStructs

2020-09-22 Thread L. C. Hsieh (Jira)
L. C. Hsieh created SPARK-32968:
---

 Summary: Column pruning for CsvToStructs
 Key: SPARK-32968
 URL: https://issues.apache.org/jira/browse/SPARK-32968
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: L. C. Hsieh
Assignee: L. C. Hsieh






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32967) Optimize csv expression chain

2020-09-22 Thread L. C. Hsieh (Jira)
L. C. Hsieh created SPARK-32967:
---

 Summary: Optimize csv expression chain
 Key: SPARK-32967
 URL: https://issues.apache.org/jira/browse/SPARK-32967
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: L. C. Hsieh
Assignee: L. C. Hsieh


Like JSON, we could apply the same optimization to a CSV expression chain, 
e.g. from_csv + to_csv.
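
A sketch of the pattern this targets (assumes an active SparkSession named 
`spark`; the data and schema below are placeholders):
{code:python}
import pyspark.sql.functions as F

df = spark.createDataFrame([("1,abc",)], ["csv"])
# With the optimization, to_csv(from_csv(col, schema)) should simplify back
# to the original string column, as is already done for the JSON functions.
roundtrip = df.select(F.to_csv(F.from_csv("csv", "a INT, b STRING")).alias("rt"))
roundtrip.explain(True)
{code}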



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32966) Spark| PartitionBy is taking long time to process

2020-09-22 Thread Sujit Das (Jira)
Sujit Das created SPARK-32966:
-

 Summary: Spark| PartitionBy is taking long time to process
 Key: SPARK-32966
 URL: https://issues.apache.org/jira/browse/SPARK-32966
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.4.5
 Environment: EMR - 5.30.0; Hadoop -2.8.5; Spark- 2.4.5
Reporter: Sujit Das


1. When I do a write without any partition it takes 8 min:

df2_merge.write.mode('overwrite').parquet(dest_path)

2. I added the conf spark.sql.sources.partitionOverwriteMode=dynamic; it took 
longer (more than 50 min before I force-terminated the EMR cluster). I observed 
that the partitions had been created and the data files were present, but in 
the EMR cluster the process was still showing as running, whereas the Spark 
history server showed no running or pending process.

df2_merge.write.mode('overwrite').partitionBy("posted_on").parquet(dest_path_latest)

3. I then set a new conf, spark.sql.shuffle.partitions=3; it took 24 min:

df2_merge.coalesce(3).write.mode('overwrite').partitionBy("posted_on").parquet(dest_path_latest)

4. I disabled that conf again and ran a plain write with a partition. It took 
30 min:

df2_merge.coalesce(3).write.mode('overwrite').partitionBy("posted_on").parquet(dest_path_latest)

The only conf common to all the scenarios above is 
spark.sql.adaptive.coalescePartitions.initialPartitionNum=100.

My goal is to reduce the time of the write with partitionBy. Is there anything 
I am missing?
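
One commonly tried variant, as a sketch only (not a verified fix for this 
report): repartition by the partition column before writing, so that each task 
writes to few partition values.

{code:python}
# Sketch: reuses the report's df2_merge / dest_path_latest names.
(df2_merge
 .repartition("posted_on")
 .write.mode("overwrite")
 .partitionBy("posted_on")
 .parquet(dest_path_latest))
{code}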

 

   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32659) Fix the data issue of inserted DPP on non-atomic type

2020-09-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200176#comment-17200176
 ] 

Apache Spark commented on SPARK-32659:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/29840

> Fix the data issue of inserted DPP on non-atomic type
> -
>
> Key: SPARK-32659
> URL: https://issues.apache.org/jira/browse/SPARK-32659
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>  Labels: correctness
> Fix For: 3.0.1, 3.1.0
>
>
> DPP has a data issue when pruning on a non-atomic type. For example:
> {noformat}
>  spark.range(1000)
>  .select(col("id"), col("id").as("k"))
>  .write
>  .partitionBy("k")
>  .format("parquet")
>  .mode("overwrite")
>  .saveAsTable("df1");
> spark.range(100)
>  .select(col("id"), col("id").as("k"))
>  .write
>  .partitionBy("k")
>  .format("parquet")
>  .mode("overwrite")
>  .saveAsTable("df2")
> spark.sql("set 
> spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio=2")
> spark.sql("set 
> spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly=false")
> spark.sql("SELECT df1.id, df2.k FROM df1 JOIN df2 ON struct(df1.k) = 
> struct(df2.k) AND df2.id < 2").show
> {noformat}
>  It should return two records, but it returns an empty result.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32965) pyspark reading csv files with utf_16le encoding

2020-09-22 Thread Punit Shah (Jira)
Punit Shah created SPARK-32965:
--

 Summary: pyspark reading csv files with utf_16le encoding
 Key: SPARK-32965
 URL: https://issues.apache.org/jira/browse/SPARK-32965
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.1, 3.0.0, 2.4.7
Reporter: Punit Shah


If you have a file encoded in utf_16le or utf_16be and try to use 
spark.read.csv("", encoding="utf_16le"), the dataframe isn't rendered 
properly.

If you use Python decoding like:

prdd = spark_session._sc.binaryFiles(path_url).values().flatMap(lambda x : 
x.decode("utf_16le").splitlines())

and then do spark.read.csv(prdd), it works.
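
A runnable form of the workaround above, as a sketch (`path_url` is a 
placeholder, `spark` is an active SparkSession, and `spark.sparkContext` is 
the public equivalent of `spark_session._sc`):
{code:python}
prdd = (spark.sparkContext.binaryFiles(path_url)
        .values()
        .flatMap(lambda x: x.decode("utf_16le").splitlines()))
df = spark.read.csv(prdd)  # works, unlike encoding="utf_16le" on the file path
{code}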



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32956) Duplicate Columns in a csv file

2020-09-22 Thread Punit Shah (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200163#comment-17200163
 ] 

Punit Shah commented on SPARK-32956:


That may work

> Duplicate Columns in a csv file
> ---
>
> Key: SPARK-32956
> URL: https://issues.apache.org/jira/browse/SPARK-32956
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3, 2.4.4, 2.4.5, 2.4.6, 2.4.7, 3.0.0, 3.0.1
>Reporter: Punit Shah
>Priority: Major
>
> Imagine a csv file shaped like:
> 
> Id,Product,Sale_Amount,Sale_Units,Sale_Amount2,Sale_Amount,Sale_Price
> 1,P,"6,40,728","6,40,728","6,40,728","6,40,728","6,40,728"
> 2,P,"5,81,644","5,81,644","5,81,644","5,81,644","5,81,644"
> =
> Reading this with header=True will result in a stack trace.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-32153) .m2 repository corruption happens

2020-09-22 Thread Kousuke Saruta (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200112#comment-17200112
 ] 

Kousuke Saruta edited comment on SPARK-32153 at 9/22/20, 2:31 PM:
--

[~shaneknapp] This issue seems to happen again, especially for branch-2.4.

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128982/
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128981/
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128976/
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128966/
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128982/
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128966/

Could you help us?


was (Author: sarutak):
[~shaneknapp]This issue seems to happen again especially for branch-2.4.

[https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128982/
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128981/
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128976/
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128966/
|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128982/]
Could you help us?

> .m2 repository corruption happens
> -
>
> Key: SPARK-32153
> URL: https://issues.apache.org/jira/browse/SPARK-32153
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 2.4.8, 3.0.1, 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Shane Knapp
>Priority: Critical
>
> Build task on Jenkins-worker4 often fails with dependency problem.
> [https://github.com/apache/spark/pull/28971#issuecomment-652570066]
> [https://github.com/apache/spark/pull/28971#issuecomment-652611025]
> [https://github.com/apache/spark/pull/28971#issuecomment-652690849] 
> [https://github.com/apache/spark/pull/28971#issuecomment-652611025]
> [https://github.com/apache/spark/pull/28942#issuecomment-652842960]
> [https://github.com/apache/spark/pull/28942#issuecomment-652835679]
> These can be related to .m2 corruption.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32153) .m2 repository corruption happens

2020-09-22 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-32153:
---
Affects Version/s: 2.4.8

> .m2 repository corruption happens
> -
>
> Key: SPARK-32153
> URL: https://issues.apache.org/jira/browse/SPARK-32153
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 2.4.8, 3.0.1, 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Shane Knapp
>Priority: Critical
>
> Build task on Jenkins-worker4 often fails with dependency problem.
> [https://github.com/apache/spark/pull/28971#issuecomment-652570066]
> [https://github.com/apache/spark/pull/28971#issuecomment-652611025]
> [https://github.com/apache/spark/pull/28971#issuecomment-652690849] 
> [https://github.com/apache/spark/pull/28971#issuecomment-652611025]
> [https://github.com/apache/spark/pull/28942#issuecomment-652842960]
> [https://github.com/apache/spark/pull/28942#issuecomment-652835679]
> These can be related to .m2 corruption.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16190) Worker registration failed: Duplicate worker ID

2020-09-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-16190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-16190.
---
Fix Version/s: 3.0.0
   Resolution: Duplicate

This is fixed via SPARK-23191 .

Please see [~Ngone51]'s comment, 
https://github.com/apache/spark/pull/29809#issuecomment-696483018 .



> Worker registration failed: Duplicate worker ID
> ---
>
> Key: SPARK-16190
> URL: https://issues.apache.org/jira/browse/SPARK-16190
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 1.6.1
>Reporter: Thomas Huang
>Priority: Minor
> Fix For: 3.0.0
>
> Attachments: 
> spark-mqq-org.apache.spark.deploy.worker.Worker-1-slave19.out, 
> spark-mqq-org.apache.spark.deploy.worker.Worker-1-slave2.out, 
> spark-mqq-org.apache.spark.deploy.worker.Worker-1-slave7.out, 
> spark-mqq-org.apache.spark.deploy.worker.Worker-1-slave8.out
>
>
> Several worker crashed simultaneously due to this error: 
> Worker registration failed: Duplicate worker ID
> This is the worker log on one of those crashed workers:
> 16/06/24 16:28:53 INFO ExecutorRunner: Killing process!
> 16/06/24 16:28:53 INFO ExecutorRunner: Runner thread for executor 
> app-20160624003013-0442/26 interrupted
> 16/06/24 16:28:53 INFO ExecutorRunner: Killing process!
> 16/06/24 16:29:03 WARN ExecutorRunner: Failed to terminate process: 
> java.lang.UNIXProcess@31340137. This process will likely be orphaned.
> 16/06/24 16:29:03 WARN ExecutorRunner: Failed to terminate process: 
> java.lang.UNIXProcess@4d3bdb1d. This process will likely be orphaned.
> 16/06/24 16:29:03 INFO Worker: Executor app-20160624003013-0442/8 finished 
> with state KILLED
> 16/06/24 16:29:03 INFO Worker: Executor app-20160624003013-0442/26 finished 
> with state KILLED
> 16/06/24 16:29:03 INFO Worker: Cleaning up local directories for application 
> app-20160624003013-0442
> 16/06/24 16:31:18 INFO ExternalShuffleBlockResolver: Application 
> app-20160624003013-0442 removed, cleanupLocalDirs = true
> 16/06/24 16:31:18 INFO Worker: Asked to launch executor 
> app-20160624162905-0469/14 for SparkStreamingLRScala
> 16/06/24 16:31:18 INFO SecurityManager: Changing view acls to: mqq
> 16/06/24 16:31:18 INFO SecurityManager: Changing modify acls to: mqq
> 16/06/24 16:31:18 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(mqq); users with 
> modify permissions: Set(mqq)
> 16/06/24 16:31:18 INFO ExecutorRunner: Launch command: 
> "/data/jdk1.7.0_60/bin/java" "-cp" 
> "/data/spark-1.6.1-bin-cdh4/conf/:/data/spark-1.6.1-bin-cdh4/lib/spark-assembly-1.6.1-hadoop2.3.0.jar:/data/spark-1.6.1-bin-cdh4/lib/datanucleus-core-3.2.10.jar:/data/spark-1.6.1-bin-cdh4/lib/datanucleus-api-jdo-3.2.6.jar:/data/spark-1.6.1-bin-cdh4/lib/datanucleus-rdbms-3.2.9.jar"
>  "-Xms10240M" "-Xmx10240M" "-Dspark.driver.port=34792" "-XX:MaxPermSize=256m" 
> "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" 
> "spark://CoarseGrainedScheduler@100.65.21.199:34792" "--executor-id" "14" 
> "--hostname" "100.65.21.223" "--cores" "5" "--app-id" 
> "app-20160624162905-0469" "--worker-url" "spark://Worker@100.65.21.223:46581"
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
> requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
> requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Connecting to master 100.65.21.199:7077...
> 16/06/24 16:31:18 INFO Worker: Successfully registered with master 
> spark://100.65.21.199:7077
> 16/06/24 16:31:18 INFO Worker: Worker cleanup enabled; old application 
> directories will be deleted in: /data/spark-1.6.1-bin-cdh4/work
> 16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with 
> the master, since there is an attempt scheduled already.
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
> requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Connecting to master 100.65.21.199:7077...
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
> requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with 
> the master, since there is an attempt scheduled already.
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
> requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with 
> the master, since there is an attempt scheduled already.
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
> requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with 
> 

[jira] [Reopened] (SPARK-32153) .m2 repository corruption happens

2020-09-22 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta reopened SPARK-32153:


> .m2 repository corruption happens
> -
>
> Key: SPARK-32153
> URL: https://issues.apache.org/jira/browse/SPARK-32153
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Shane Knapp
>Priority: Critical
>
> Build task on Jenkins-worker4 often fails with dependency problem.
> [https://github.com/apache/spark/pull/28971#issuecomment-652570066]
> [https://github.com/apache/spark/pull/28971#issuecomment-652611025]
> [https://github.com/apache/spark/pull/28971#issuecomment-652690849] 
> [https://github.com/apache/spark/pull/28971#issuecomment-652611025]
> [https://github.com/apache/spark/pull/28942#issuecomment-652842960]
> [https://github.com/apache/spark/pull/28942#issuecomment-652835679]
> These can be related to .m2 corruption.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32153) .m2 repository corruption happens

2020-09-22 Thread Kousuke Saruta (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200112#comment-17200112
 ] 

Kousuke Saruta commented on SPARK-32153:


[~shaneknapp] This issue seems to happen again, especially for branch-2.4.

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128982/
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128981/
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128976/
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128966/

Could you help us?

> .m2 repository corruption happens
> -
>
> Key: SPARK-32153
> URL: https://issues.apache.org/jira/browse/SPARK-32153
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Shane Knapp
>Priority: Critical
>
> Build task on Jenkins-worker4 often fails with dependency problem.
> [https://github.com/apache/spark/pull/28971#issuecomment-652570066]
> [https://github.com/apache/spark/pull/28971#issuecomment-652611025]
> [https://github.com/apache/spark/pull/28971#issuecomment-652690849] 
> [https://github.com/apache/spark/pull/28971#issuecomment-652611025]
> [https://github.com/apache/spark/pull/28942#issuecomment-652842960]
> [https://github.com/apache/spark/pull/28942#issuecomment-652835679]
> These can be related to .m2 corruption.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32153) .m2 repository corruption happens

2020-09-22 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-32153:
---
Summary: .m2 repository corruption happens  (was: .m2 repository corruption 
can happen on Jenkins-worker4)

> .m2 repository corruption happens
> -
>
> Key: SPARK-32153
> URL: https://issues.apache.org/jira/browse/SPARK-32153
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Shane Knapp
>Priority: Critical
>
> Build task on Jenkins-worker4 often fails with dependency problem.
> [https://github.com/apache/spark/pull/28971#issuecomment-652570066]
> [https://github.com/apache/spark/pull/28971#issuecomment-652611025]
> [https://github.com/apache/spark/pull/28971#issuecomment-652690849] 
> [https://github.com/apache/spark/pull/28971#issuecomment-652611025]
> [https://github.com/apache/spark/pull/28942#issuecomment-652842960]
> [https://github.com/apache/spark/pull/28942#issuecomment-652835679]
> These can be related to .m2 corruption.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32956) Duplicate Columns in a csv file

2020-09-22 Thread Chen Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated SPARK-32956:
---
Component/s: (was: Spark Core)
 SQL

> Duplicate Columns in a csv file
> ---
>
> Key: SPARK-32956
> URL: https://issues.apache.org/jira/browse/SPARK-32956
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3, 2.4.4, 2.4.5, 2.4.6, 2.4.7, 3.0.0, 3.0.1
>Reporter: Punit Shah
>Priority: Major
>
> Imagine a csv file shaped like:
> 
> Id,Product,Sale_Amount,Sale_Units,Sale_Amount2,Sale_Amount,Sale_Price
> 1,P,"6,40,728","6,40,728","6,40,728","6,40,728","6,40,728"
> 2,P,"5,81,644","5,81,644","5,81,644","5,81,644","5,81,644"
> =
> Reading this with header=True will result in a stack trace.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32956) Duplicate Columns in a csv file

2020-09-22 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200092#comment-17200092
 ] 

Chen Zhang commented on SPARK-32956:


In SPARK-16896, if the CSV data has duplicate column headers, the column index 
is appended as a suffix.

 

In this case, _Sale_Amount_ is a duplicate column header. 
 Original column header:
{code:none}
Id, Product, Sale_Amount, Sale_Units, Sale_Amount2, Sale_Amount, 
Sale_Price{code}
Column header after adding index suffix:
{code:none}
Id, Product, Sale_Amount2, Sale_Units, Sale_Amount2, Sale_Amount5, 
Sale_Price{code}
After adding the suffix, _Sale_Amount2_ is still the same as another column 
header.

 

Maybe we can add the suffix again when we find a new duplicate column header:
{code:none}
Id, Product, Sale_Amount22, Sale_Units, Sale_Amount24, Sale_Amount5, 
Sale_Price{code}
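
A rough Python sketch of the suffixing idea above; this is illustrative only, not Spark's actual CSV header handling:
{code:python}
# Illustrative sketch of the proposed de-duplication, not Spark's code.
def dedup_headers(headers):
    result = list(headers)
    # Keep appending the column index to duplicated names until no
    # duplicates remain, which also resolves collisions introduced by
    # the suffixing itself (e.g. the second "Sale_Amount2" above).
    while len(set(result)) != len(result):
        counts = {}
        for name in result:
            counts[name] = counts.get(name, 0) + 1
        duplicated = {name for name, count in counts.items() if count > 1}
        result = [
            name + str(idx) if name in duplicated else name
            for idx, name in enumerate(result)
        ]
    return result

print(dedup_headers(
    ["Id", "Product", "Sale_Amount", "Sale_Units",
     "Sale_Amount2", "Sale_Amount", "Sale_Price"]))
# ['Id', 'Product', 'Sale_Amount22', 'Sale_Units',
#  'Sale_Amount24', 'Sale_Amount5', 'Sale_Price']
{code}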

> Duplicate Columns in a csv file
> ---
>
> Key: SPARK-32956
> URL: https://issues.apache.org/jira/browse/SPARK-32956
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.3, 2.4.4, 2.4.5, 2.4.6, 2.4.7, 3.0.0, 3.0.1
>Reporter: Punit Shah
>Priority: Major
>
> Imagine a csv file shaped like:
> 
> Id,Product,Sale_Amount,Sale_Units,Sale_Amount2,Sale_Amount,Sale_Price
> 1,P,"6,40,728","6,40,728","6,40,728","6,40,728","6,40,728"
> 2,P,"5,81,644","5,81,644","5,81,644","5,81,644","5,81,644"
> =
> Reading this with header=True results in a stack trace.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32757) Physical InSubqueryExec should be consistent with logical InSubquery

2020-09-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200079#comment-17200079
 ] 

Apache Spark commented on SPARK-32757:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/29839

> Physical InSubqueryExec should be consistent with logical InSubquery
> 
>
> Key: SPARK-32757
> URL: https://issues.apache.org/jira/browse/SPARK-32757
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32659) Fix the data issue of inserted DPP on non-atomic type

2020-09-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200063#comment-17200063
 ] 

Apache Spark commented on SPARK-32659:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/29838

> Fix the data issue of inserted DPP on non-atomic type
> -
>
> Key: SPARK-32659
> URL: https://issues.apache.org/jira/browse/SPARK-32659
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>  Labels: correctness
> Fix For: 3.0.1, 3.1.0
>
>
> DPP has a data issue when pruning on a non-atomic type. For example:
> {noformat}
>  spark.range(1000)
>  .select(col("id"), col("id").as("k"))
>  .write
>  .partitionBy("k")
>  .format("parquet")
>  .mode("overwrite")
>  .saveAsTable("df1");
> spark.range(100)
>  .select(col("id"), col("id").as("k"))
>  .write
>  .partitionBy("k")
>  .format("parquet")
>  .mode("overwrite")
>  .saveAsTable("df2")
> spark.sql("set 
> spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio=2")
> spark.sql("set 
> spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly=false")
> spark.sql("SELECT df1.id, df2.k FROM df1 JOIN df2 ON struct(df1.k) = 
> struct(df2.k) AND df2.id < 2").show
> {noformat}
>  It should return two records, but it returns an empty result.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31882) DAG-viz is not rendered correctly with pagination.

2020-09-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200035#comment-17200035
 ] 

Apache Spark commented on SPARK-31882:
--

User 'zhli1142015' has created a pull request for this issue:
https://github.com/apache/spark/pull/29833

> DAG-viz is not rendered correctly with pagination.
> --
>
> Key: SPARK-31882
> URL: https://issues.apache.org/jira/browse/SPARK-31882
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.4, 3.0.0, 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> Because DAG-viz for a job fetches link urls for each stage from the stage 
> table, rendering can fail with pagination.
> You can reproduce this issue with the following operation.
> {code:java}
>  sc.parallelize(1 to 10).map(value => (value 
> ,value)).repartition(1).repartition(1).repartition(1).reduceByKey(_ + 
> _).collect{code}
> And then, visit the corresponding job page.
> There are 5 stages, so set the paged table to show fewer than 5 stages per page.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31882) DAG-viz is not rendered correctly with pagination.

2020-09-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200034#comment-17200034
 ] 

Apache Spark commented on SPARK-31882:
--

User 'zhli1142015' has created a pull request for this issue:
https://github.com/apache/spark/pull/29833

> DAG-viz is not rendered correctly with pagination.
> --
>
> Key: SPARK-31882
> URL: https://issues.apache.org/jira/browse/SPARK-31882
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.4, 3.0.0, 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> Because DAG-viz for a job fetches link urls for each stage from the stage 
> table, rendering can fail with pagination.
> You can reproduce this issue with the following operation.
> {code:java}
>  sc.parallelize(1 to 10).map(value => (value 
> ,value)).repartition(1).repartition(1).repartition(1).reduceByKey(_ + 
> _).collect{code}
> And then, visit the corresponding job page.
> There are 5 stages, so set the paged table to show fewer than 5 stages per page.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32938) Spark can not cast long value from Kafka

2020-09-22 Thread Vinod KC (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200028#comment-17200028
 ] 

Vinod KC commented on SPARK-32938:
--

[~maseiler], 

Can you please test with this example? 
{code:java}
spark.readStream.format("kafka").option("kafka.bootstrap.servers", 
"127.0.0.1:9092").option("subscribe", "longtest").load().withColumn("key", 
conv(hex(col("key")), 16, 10).cast("bigint")).withColumn("value", 
conv(hex(col("value")), 16, 10).cast("bigint")).select("key", 
"value").writeStream.outputMode("update").format("console").start()
{code}
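If it helps, a rough PySpark equivalent of the same workaround (assuming the topic and broker from the example above, and that the binary key/value columns hold big-endian encoded longs):
{code:python}
# PySpark sketch of the hex/conv workaround above; assumes the binary
# key/value columns contain big-endian encoded longs.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, conv, hex as hex_

spark = SparkSession.builder.getOrCreate()

df = (spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "127.0.0.1:9092")
      .option("subscribe", "longtest")
      .load()
      .withColumn("key", conv(hex_(col("key")), 16, 10).cast("bigint"))
      .withColumn("value", conv(hex_(col("value")), 16, 10).cast("bigint"))
      .select("key", "value"))

query = (df.writeStream
         .outputMode("update")
         .format("console")
         .start())
{code}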

> Spark can not cast long value from Kafka
> 
>
> Key: SPARK-32938
> URL: https://issues.apache.org/jira/browse/SPARK-32938
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, SQL, Structured Streaming
>Affects Versions: 3.0.0
> Environment: Debian 10 (Buster), AMD64
> Spark 3.0.0
> Kafka 2.5.0
> spark-sql-kafka-0-10_2.12
>Reporter: Matthias Seiler
>Priority: Major
>
> Spark seems to be unable to cast the key (or value) part from Kafka to a 
> _{color:#172b4d}long{color}_ value and throws
> {code:java}
> org.apache.spark.sql.AnalysisException: cannot resolve 'CAST(`key` AS 
> BIGINT)' due to data type mismatch: cannot cast binary to bigint;;{code}
>  
> {color:#172b4d}See this repo for further investigation:{color} 
> [https://github.com/maseiler/spark-kafka-casting-bug]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32925) Support push-based shuffle in multiple deployment environments

2020-09-22 Thread qingwu.fu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200023#comment-17200023
 ] 

qingwu.fu commented on SPARK-32925:
---

Should we send the data to the remote shuffle service directly, bypassing the sort 
and spill on the local node? Gathering the data that belongs to the same partition 
onto the same node could take the place of the sort on the local node.

 

> Support push-based shuffle in multiple deployment environments
> --
>
> Key: SPARK-32925
> URL: https://issues.apache.org/jira/browse/SPARK-32925
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Priority: Major
>
> Create this ticket outside of SPARK-30602, since this is outside of the scope 
> of the immediate deliverables in that SPIP. Want to use this ticket to 
> discuss more about how to further improve push-based shuffle in different 
> environments.
> The tasks created under SPARK-30602 would enable push-based shuffle on YARN 
> in a compute/storage colocated cluster. However, there are other deployment 
> environments that are getting more popular these days. We have seen 2 as we 
> discussed with other community members on the idea of push-based shuffle:
>  * Spark on K8S in a compute/storage colocated cluster. Because of the 
> limitation of concurrency of read/write of a mounted volume in K8S, multiple 
> executor pods on the same node in a K8S cluster cannot concurrently access 
> the same mounted disk volume. This creates some different requirements for 
> supporting external shuffle service as well as push-based shuffle.
>  * Spark on a compute/storage disaggregate cluster. Such a setup is more 
> typical in cloud environments, where the compute cluster has little/no local 
> storage, and the shuffle intermediate data needs to be stored in remote 
> disaggregate storage cluster.
> Want to use this ticket to discuss ways to support push-based shuffle in 
> these different deployment environments.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32463) Document Data Type inference rule in SQL reference

2020-09-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1721#comment-1721
 ] 

Apache Spark commented on SPARK-32463:
--

User 'planga82' has created a pull request for this issue:
https://github.com/apache/spark/pull/29837

> Document Data Type inference rule in SQL reference
> --
>
> Key: SPARK-32463
> URL: https://issues.apache.org/jira/browse/SPARK-32463
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> Document Data Type inference rule in SQL reference, under Data Types section. 
> Please see this PR https://github.com/apache/spark/pull/28896



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32463) Document Data Type inference rule in SQL reference

2020-09-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32463:


Assignee: Apache Spark

> Document Data Type inference rule in SQL reference
> --
>
> Key: SPARK-32463
> URL: https://issues.apache.org/jira/browse/SPARK-32463
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Assignee: Apache Spark
>Priority: Minor
>
> Document Data Type inference rule in SQL reference, under Data Types section. 
> Please see this PR https://github.com/apache/spark/pull/28896



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32463) Document Data Type inference rule in SQL reference

2020-09-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32463:


Assignee: (was: Apache Spark)

> Document Data Type inference rule in SQL reference
> --
>
> Key: SPARK-32463
> URL: https://issues.apache.org/jira/browse/SPARK-32463
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> Document Data Type inference rule in SQL reference, under Data Types section. 
> Please see this PR https://github.com/apache/spark/pull/28896



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32964) Pass all `streaming` module UTs in Scala 2.13

2020-09-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1715#comment-1715
 ] 

Apache Spark commented on SPARK-32964:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/29836

> Pass all `streaming` module UTs in Scala 2.13
> -
>
> Key: SPARK-32964
> URL: https://issues.apache.org/jira/browse/SPARK-32964
> Project: Spark
>  Issue Type: Sub-task
>  Components: DStreams, Spark Core
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Priority: Minor
>
> There is only one failing test case in the `streaming` module with Scala 2.13:
>  * `start with non-serializable DStream checkpoint ` in StreamingContextSuite
> StackOverflowError is thrown here when SerializationDebugger#visit method is 
> called.
> The error stack is as follows:
> {code:java}
> Expected exception java.io.NotSerializableException to be thrown, but 
> java.lang.StackOverflowError was thrownExpected exception 
> java.io.NotSerializableException to be thrown, but 
> java.lang.StackOverflowError was thrownScalaTestFailureLocation: 
> org.apache.spark.streaming.StreamingContextSuite at 
> (StreamingContextSuite.scala:159)org.scalatest.exceptions.TestFailedException:
>  Expected exception java.io.NotSerializableException to be thrown, but 
> java.lang.StackOverflowError was thrown at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) 
> at 
> org.scalatest.funsuite.AnyFunSuite.newAssertionFailedException(AnyFunSuite.scala:1562)
>  at org.scalatest.Assertions.intercept(Assertions.scala:756) at 
> org.scalatest.Assertions.intercept$(Assertions.scala:746) at 
> org.scalatest.funsuite.AnyFunSuite.intercept(AnyFunSuite.scala:1562) at 
> org.apache.spark.streaming.StreamingContextSuite.$anonfun$new$13(StreamingContextSuite.scala:159)
>  ...Caused by: java.lang.StackOverflowError at 
> org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1397) at 
> org.apache.spark.streaming.dstream.DStream.writeObject(DStream.scala:513) at 
> sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source) at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1140) at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) 
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at 
> java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at 
> scala.collection.generic.DefaultSerializationProxy.$anonfun$writeObject$1(DefaultSerializationProxy.scala:38)
>  at 
> scala.collection.generic.DefaultSerializationProxy.$anonfun$writeObject$1$adapted(DefaultSerializationProxy.scala:37)
>  at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:553) at 
> scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:551) at 
> scala.collection.AbstractIterable.foreach(Iterable.scala:920) at 
> scala.collection.generic.DefaultSerializationProxy.writeObject(DefaultSerializationProxy.scala:37)
>  at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source) at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1140) at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) 
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at 
> java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializableWithWriteObjectMethod(SerializationDebugger.scala:230)
>  at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:189)
>  at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:108)
>  at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializableWithWriteObjectMethod(SerializationDebugger.scala:243)
>  at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:189)
>  at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:108)
>  at 
> 

[jira] [Commented] (SPARK-32964) Pass all `streaming` module UTs in Scala 2.13

2020-09-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1717#comment-1717
 ] 

Apache Spark commented on SPARK-32964:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/29836

> Pass all `streaming` module UTs in Scala 2.13
> -
>
> Key: SPARK-32964
> URL: https://issues.apache.org/jira/browse/SPARK-32964
> Project: Spark
>  Issue Type: Sub-task
>  Components: DStreams, Spark Core
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Priority: Minor
>
> There is only one failing test case in the `streaming` module with Scala 2.13:
>  * `start with non-serializable DStream checkpoint ` in StreamingContextSuite
> StackOverflowError is thrown here when SerializationDebugger#visit method is 
> called.
> The error stack is as follows:
> {code:java}
> Expected exception java.io.NotSerializableException to be thrown, but 
> java.lang.StackOverflowError was thrownExpected exception 
> java.io.NotSerializableException to be thrown, but 
> java.lang.StackOverflowError was thrownScalaTestFailureLocation: 
> org.apache.spark.streaming.StreamingContextSuite at 
> (StreamingContextSuite.scala:159)org.scalatest.exceptions.TestFailedException:
>  Expected exception java.io.NotSerializableException to be thrown, but 
> java.lang.StackOverflowError was thrown at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) 
> at 
> org.scalatest.funsuite.AnyFunSuite.newAssertionFailedException(AnyFunSuite.scala:1562)
>  at org.scalatest.Assertions.intercept(Assertions.scala:756) at 
> org.scalatest.Assertions.intercept$(Assertions.scala:746) at 
> org.scalatest.funsuite.AnyFunSuite.intercept(AnyFunSuite.scala:1562) at 
> org.apache.spark.streaming.StreamingContextSuite.$anonfun$new$13(StreamingContextSuite.scala:159)
>  ...Caused by: java.lang.StackOverflowError at 
> org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1397) at 
> org.apache.spark.streaming.dstream.DStream.writeObject(DStream.scala:513) at 
> sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source) at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1140) at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) 
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at 
> java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at 
> scala.collection.generic.DefaultSerializationProxy.$anonfun$writeObject$1(DefaultSerializationProxy.scala:38)
>  at 
> scala.collection.generic.DefaultSerializationProxy.$anonfun$writeObject$1$adapted(DefaultSerializationProxy.scala:37)
>  at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:553) at 
> scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:551) at 
> scala.collection.AbstractIterable.foreach(Iterable.scala:920) at 
> scala.collection.generic.DefaultSerializationProxy.writeObject(DefaultSerializationProxy.scala:37)
>  at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source) at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1140) at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) 
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at 
> java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializableWithWriteObjectMethod(SerializationDebugger.scala:230)
>  at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:189)
>  at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:108)
>  at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializableWithWriteObjectMethod(SerializationDebugger.scala:243)
>  at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:189)
>  at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:108)
>  at 
> 

[jira] [Assigned] (SPARK-32964) Pass all `streaming` module UTs in Scala 2.13

2020-09-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32964:


Assignee: (was: Apache Spark)

> Pass all `streaming` module UTs in Scala 2.13
> -
>
> Key: SPARK-32964
> URL: https://issues.apache.org/jira/browse/SPARK-32964
> Project: Spark
>  Issue Type: Sub-task
>  Components: DStreams, Spark Core
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Priority: Minor
>
> There is only one failing test case in the `streaming` module with Scala 2.13:
>  * `start with non-serializable DStream checkpoint ` in StreamingContextSuite
> StackOverflowError is thrown here when SerializationDebugger#visit method is 
> called.
> The error stack is as follows:
> {code:java}
> Expected exception java.io.NotSerializableException to be thrown, but 
> java.lang.StackOverflowError was thrownExpected exception 
> java.io.NotSerializableException to be thrown, but 
> java.lang.StackOverflowError was thrownScalaTestFailureLocation: 
> org.apache.spark.streaming.StreamingContextSuite at 
> (StreamingContextSuite.scala:159)org.scalatest.exceptions.TestFailedException:
>  Expected exception java.io.NotSerializableException to be thrown, but 
> java.lang.StackOverflowError was thrown at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) 
> at 
> org.scalatest.funsuite.AnyFunSuite.newAssertionFailedException(AnyFunSuite.scala:1562)
>  at org.scalatest.Assertions.intercept(Assertions.scala:756) at 
> org.scalatest.Assertions.intercept$(Assertions.scala:746) at 
> org.scalatest.funsuite.AnyFunSuite.intercept(AnyFunSuite.scala:1562) at 
> org.apache.spark.streaming.StreamingContextSuite.$anonfun$new$13(StreamingContextSuite.scala:159)
>  ...Caused by: java.lang.StackOverflowError at 
> org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1397) at 
> org.apache.spark.streaming.dstream.DStream.writeObject(DStream.scala:513) at 
> sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source) at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1140) at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) 
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at 
> java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at 
> scala.collection.generic.DefaultSerializationProxy.$anonfun$writeObject$1(DefaultSerializationProxy.scala:38)
>  at 
> scala.collection.generic.DefaultSerializationProxy.$anonfun$writeObject$1$adapted(DefaultSerializationProxy.scala:37)
>  at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:553) at 
> scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:551) at 
> scala.collection.AbstractIterable.foreach(Iterable.scala:920) at 
> scala.collection.generic.DefaultSerializationProxy.writeObject(DefaultSerializationProxy.scala:37)
>  at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source) at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1140) at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) 
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at 
> java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializableWithWriteObjectMethod(SerializationDebugger.scala:230)
>  at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:189)
>  at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:108)
>  at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializableWithWriteObjectMethod(SerializationDebugger.scala:243)
>  at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:189)
>  at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:108)
>  at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializableWithWriteObjectMethod(SerializationDebugger.scala:243)
>  at 
> 

[jira] [Assigned] (SPARK-32964) Pass all `streaming` module UTs in Scala 2.13

2020-09-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32964:


Assignee: Apache Spark

> Pass all `streaming` module UTs in Scala 2.13
> -
>
> Key: SPARK-32964
> URL: https://issues.apache.org/jira/browse/SPARK-32964
> Project: Spark
>  Issue Type: Sub-task
>  Components: DStreams, Spark Core
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>
> There is only one failing test case in the `streaming` module with Scala 2.13:
>  * `start with non-serializable DStream checkpoint ` in StreamingContextSuite
> StackOverflowError is thrown here when SerializationDebugger#visit method is 
> called.
> The error stack is as follows:
> {code:java}
> Expected exception java.io.NotSerializableException to be thrown, but 
> java.lang.StackOverflowError was thrownExpected exception 
> java.io.NotSerializableException to be thrown, but 
> java.lang.StackOverflowError was thrownScalaTestFailureLocation: 
> org.apache.spark.streaming.StreamingContextSuite at 
> (StreamingContextSuite.scala:159)org.scalatest.exceptions.TestFailedException:
>  Expected exception java.io.NotSerializableException to be thrown, but 
> java.lang.StackOverflowError was thrown at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) 
> at 
> org.scalatest.funsuite.AnyFunSuite.newAssertionFailedException(AnyFunSuite.scala:1562)
>  at org.scalatest.Assertions.intercept(Assertions.scala:756) at 
> org.scalatest.Assertions.intercept$(Assertions.scala:746) at 
> org.scalatest.funsuite.AnyFunSuite.intercept(AnyFunSuite.scala:1562) at 
> org.apache.spark.streaming.StreamingContextSuite.$anonfun$new$13(StreamingContextSuite.scala:159)
>  ...Caused by: java.lang.StackOverflowError at 
> org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1397) at 
> org.apache.spark.streaming.dstream.DStream.writeObject(DStream.scala:513) at 
> sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source) at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1140) at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) 
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at 
> java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at 
> scala.collection.generic.DefaultSerializationProxy.$anonfun$writeObject$1(DefaultSerializationProxy.scala:38)
>  at 
> scala.collection.generic.DefaultSerializationProxy.$anonfun$writeObject$1$adapted(DefaultSerializationProxy.scala:37)
>  at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:553) at 
> scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:551) at 
> scala.collection.AbstractIterable.foreach(Iterable.scala:920) at 
> scala.collection.generic.DefaultSerializationProxy.writeObject(DefaultSerializationProxy.scala:37)
>  at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source) at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1140) at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) 
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at 
> java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializableWithWriteObjectMethod(SerializationDebugger.scala:230)
>  at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:189)
>  at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:108)
>  at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializableWithWriteObjectMethod(SerializationDebugger.scala:243)
>  at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:189)
>  at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:108)
>  at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializableWithWriteObjectMethod(SerializationDebugger.scala:243)
>  at 
> 

[jira] [Updated] (SPARK-32964) Pass all `streaming` module UTs in Scala 2.13

2020-09-22 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-32964:
-
Description: 
There is only one failing test case in the `streaming` module with Scala 2.13:
 * `start with non-serializable DStream checkpoint ` in StreamingContextSuite

StackOverflowError is thrown here when SerializationDebugger#visit method is 
called.

The error stack is as follows:
{code:java}
Expected exception java.io.NotSerializableException to be thrown, but 
java.lang.StackOverflowError was thrownExpected exception 
java.io.NotSerializableException to be thrown, but java.lang.StackOverflowError 
was thrownScalaTestFailureLocation: 
org.apache.spark.streaming.StreamingContextSuite at 
(StreamingContextSuite.scala:159)org.scalatest.exceptions.TestFailedException: 
Expected exception java.io.NotSerializableException to be thrown, but 
java.lang.StackOverflowError was thrown at 
org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) at 
org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) at 
org.scalatest.funsuite.AnyFunSuite.newAssertionFailedException(AnyFunSuite.scala:1562)
 at org.scalatest.Assertions.intercept(Assertions.scala:756) at 
org.scalatest.Assertions.intercept$(Assertions.scala:746) at 
org.scalatest.funsuite.AnyFunSuite.intercept(AnyFunSuite.scala:1562) at 
org.apache.spark.streaming.StreamingContextSuite.$anonfun$new$13(StreamingContextSuite.scala:159)
 ...Caused by: java.lang.StackOverflowError at 
org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1397) at 
org.apache.spark.streaming.dstream.DStream.writeObject(DStream.scala:513) at 
sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source) at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498) at 
java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1140) at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) at 
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at 
java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at 
scala.collection.generic.DefaultSerializationProxy.$anonfun$writeObject$1(DefaultSerializationProxy.scala:38)
 at 
scala.collection.generic.DefaultSerializationProxy.$anonfun$writeObject$1$adapted(DefaultSerializationProxy.scala:37)
 at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:553) at 
scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:551) at 
scala.collection.AbstractIterable.foreach(Iterable.scala:920) at 
scala.collection.generic.DefaultSerializationProxy.writeObject(DefaultSerializationProxy.scala:37)
 at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source) at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498) at 
java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1140) at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) at 
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at 
java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at 
org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializableWithWriteObjectMethod(SerializationDebugger.scala:230)
 at 
org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:189)
 at 
org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:108)
 at 
org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializableWithWriteObjectMethod(SerializationDebugger.scala:243)
 at 
org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:189)
 at 
org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:108)
 at 
org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializableWithWriteObjectMethod(SerializationDebugger.scala:243)
 at 
org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:189)
 at 
org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:108)
{code}

  was:
There is only one failed case of `streaming` module in Scala 2.13:
 * `start with non-serializable DStream checkpoint ` in StreamingContextSuite

StackOverflowError is thrown here when SerializationDebugger#visit method is 
called.

The error msg as follow:
{code:java}
Expected exception java.io.NotSerializableException to be thrown, but 
java.lang.StackOverflowError was 

[jira] [Updated] (SPARK-32964) Pass all `streaming` module UTs in Scala 2.13

2020-09-22 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-32964:
-
Description: 
There is only one failing test case in the `streaming` module with Scala 2.13:
 * `start with non-serializable DStream checkpoint ` in StreamingContextSuite

StackOverflowError is thrown here when SerializationDebugger#visit method is 
called.

The error message is as follows:
{code:java}
Expected exception java.io.NotSerializableException to be thrown, but 
java.lang.StackOverflowError was thrownExpected exception 
java.io.NotSerializableException to be thrown, but java.lang.StackOverflowError 
was thrownScalaTestFailureLocation: 
org.apache.spark.streaming.StreamingContextSuite at 
(StreamingContextSuite.scala:159)org.scalatest.exceptions.TestFailedException: 
Expected exception java.io.NotSerializableException to be thrown, but 
java.lang.StackOverflowError was thrown at 
org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) at 
org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) at 
org.scalatest.funsuite.AnyFunSuite.newAssertionFailedException(AnyFunSuite.scala:1562)
 at org.scalatest.Assertions.intercept(Assertions.scala:756) at 
org.scalatest.Assertions.intercept$(Assertions.scala:746) at 
org.scalatest.funsuite.AnyFunSuite.intercept(AnyFunSuite.scala:1562) at 
org.apache.spark.streaming.StreamingContextSuite.$anonfun$new$13(StreamingContextSuite.scala:159)
 ...Caused by: java.lang.StackOverflowError at 
org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1397) at 
org.apache.spark.streaming.dstream.DStream.writeObject(DStream.scala:513) at 
sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source) at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498) at 
java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1140) at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) at 
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at 
java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at 
scala.collection.generic.DefaultSerializationProxy.$anonfun$writeObject$1(DefaultSerializationProxy.scala:38)
 at 
scala.collection.generic.DefaultSerializationProxy.$anonfun$writeObject$1$adapted(DefaultSerializationProxy.scala:37)
 at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:553) at 
scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:551) at 
scala.collection.AbstractIterable.foreach(Iterable.scala:920) at 
scala.collection.generic.DefaultSerializationProxy.writeObject(DefaultSerializationProxy.scala:37)
 at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source) at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498) at 
java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1140) at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) at 
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at 
java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at 
org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializableWithWriteObjectMethod(SerializationDebugger.scala:230)
 at 
org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:189)
 at 
org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:108)
 at 
org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializableWithWriteObjectMethod(SerializationDebugger.scala:243)
 at 
org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:189)
 at 
org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:108)
 at 
org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializableWithWriteObjectMethod(SerializationDebugger.scala:243)
 at 
org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:189)
 at 
org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:108)
 at 
org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializableWithWriteObjectMethod(SerializationDebugger.scala:243)
 at 
org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:189)
{code}

  was:
There is only one failed case of `streaming` module in Scala 2.13:
 * `start with non-serializable 

[jira] [Created] (SPARK-32964) Pass all `streaming` module UTs in Scala 2.13

2020-09-22 Thread Yang Jie (Jira)
Yang Jie created SPARK-32964:


 Summary: Pass all `streaming` module UTs in Scala 2.13
 Key: SPARK-32964
 URL: https://issues.apache.org/jira/browse/SPARK-32964
 Project: Spark
  Issue Type: Sub-task
  Components: DStreams, Spark Core
Affects Versions: 3.1.0
Reporter: Yang Jie


There is only one failing test case in the `streaming` module with Scala 2.13:
 * `start with non-serializable DStream checkpoint ` in StreamingContextSuite

StackOverflowError is thrown here when SerializationDebugger#visit method is 
called.

The error message is as follows:
{code:java}
Expected exception java.io.NotSerializableException to be thrown, but 
java.lang.StackOverflowError was thrown
ScalaTestFailureLocation: org.apache.spark.streaming.StreamingContextSuite at 
(StreamingContextSuite.scala:159)
org.scalatest.exceptions.TestFailedException: Expected exception 
java.io.NotSerializableException to be thrown, but java.lang.StackOverflowError 
was thrown

{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32306) `approx_percentile` in Spark SQL gives incorrect results

2020-09-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17199953#comment-17199953
 ] 

Apache Spark commented on SPARK-32306:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/29835

> `approx_percentile` in Spark SQL gives incorrect results
> 
>
> Key: SPARK-32306
> URL: https://issues.apache.org/jira/browse/SPARK-32306
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.4
>Reporter: Sean Malory
>Priority: Major
>
> The `approx_percentile` function in Spark SQL does not give the correct 
> result. I'm not sure how incorrect it is; it may just be a boundary issue. 
> From the docs:
> {quote}The accuracy parameter (default: 1) is a positive numeric literal 
> which controls approximation accuracy at the cost of memory. Higher value of 
> accuracy yields better accuracy, 1.0/accuracy is the relative error of the 
> approximation.
> {quote}
> This is not true. Here is a minimum example in `pyspark` where, essentially, 
> the median of 5 and 8 is being calculated as 5:
> {code:python}
> import pyspark.sql.functions as psf
> df = spark.createDataFrame(
> [('bar', 5), ('bar', 8)], ['name', 'val']
> )
> median = psf.expr('percentile_approx(val, 0.5, 2147483647)')
> df.groupBy('name').agg(median.alias('median'))# gives the median as 5
> {code}
> I've tested this with Spark v2.4.4, pyspark v2.4.5- although I suspect this 
> is an issue with the underlying algorithm.
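
For comparison, a small PySpark sketch contrasting the approximate and the exact percentile on the same data; the exact `percentile` aggregate should return 6.5 here under linear interpolation, while `percentile_approx` reports 5:
{code:python}
# Sketch comparing approximate vs. exact median on the data above.
import pyspark.sql.functions as psf
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([('bar', 5), ('bar', 8)], ['name', 'val'])

approx = psf.expr('percentile_approx(val, 0.5, 2147483647)').alias('approx_median')
exact = psf.expr('percentile(val, 0.5)').alias('exact_median')

# approx_median is reported as 5, while exact_median should be 6.5.
df.groupBy('name').agg(approx, exact).show()
{code}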



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32306) `approx_percentile` in Spark SQL gives incorrect results

2020-09-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32306:


Assignee: (was: Apache Spark)

> `approx_percentile` in Spark SQL gives incorrect results
> 
>
> Key: SPARK-32306
> URL: https://issues.apache.org/jira/browse/SPARK-32306
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.4
>Reporter: Sean Malory
>Priority: Major
>
> The `approx_percentile` function in Spark SQL does not give the correct 
> result. I'm not sure how incorrect it is; it may just be a boundary issue. 
> From the docs:
> {quote}The accuracy parameter (default: 1) is a positive numeric literal 
> which controls approximation accuracy at the cost of memory. Higher value of 
> accuracy yields better accuracy, 1.0/accuracy is the relative error of the 
> approximation.
> {quote}
> This is not true. Here is a minimum example in `pyspark` where, essentially, 
> the median of 5 and 8 is being calculated as 5:
> {code:python}
> import pyspark.sql.functions as psf
> df = spark.createDataFrame(
> [('bar', 5), ('bar', 8)], ['name', 'val']
> )
> median = psf.expr('percentile_approx(val, 0.5, 2147483647)')
> df.groupBy('name').agg(median.alias('median'))# gives the median as 5
> {code}
> I've tested this with Spark v2.4.4, pyspark v2.4.5- although I suspect this 
> is an issue with the underlying algorithm.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32306) `approx_percentile` in Spark SQL gives incorrect results

2020-09-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32306:


Assignee: Apache Spark

> `approx_percentile` in Spark SQL gives incorrect results
> 
>
> Key: SPARK-32306
> URL: https://issues.apache.org/jira/browse/SPARK-32306
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.4
>Reporter: Sean Malory
>Assignee: Apache Spark
>Priority: Major
>
> The `approx_percentile` function in Spark SQL does not give the correct 
> result. I'm not sure how incorrect it is; it may just be a boundary issue. 
> From the docs:
> {quote}The accuracy parameter (default: 1) is a positive numeric literal 
> which controls approximation accuracy at the cost of memory. Higher value of 
> accuracy yields better accuracy, 1.0/accuracy is the relative error of the 
> approximation.
> {quote}
> This is not true. Here is a minimum example in `pyspark` where, essentially, 
> the median of 5 and 8 is being calculated as 5:
> {code:python}
> import pyspark.sql.functions as psf
> df = spark.createDataFrame(
> [('bar', 5), ('bar', 8)], ['name', 'val']
> )
> median = psf.expr('percentile_approx(val, 0.5, 2147483647)')
> df.groupBy('name').agg(median.alias('median'))# gives the median as 5
> {code}
> I've tested this with Spark v2.4.4, pyspark v2.4.5- although I suspect this 
> is an issue with the underlying algorithm.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32306) `approx_percentile` in Spark SQL gives incorrect results

2020-09-22 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17199952#comment-17199952
 ] 

Maxim Gekk commented on SPARK-32306:


I opened PR https://github.com/apache/spark/pull/29835 with clarification.

> `approx_percentile` in Spark SQL gives incorrect results
> 
>
> Key: SPARK-32306
> URL: https://issues.apache.org/jira/browse/SPARK-32306
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.4
>Reporter: Sean Malory
>Priority: Major
>
> The `approx_percentile` function in Spark SQL does not give the correct 
> result. I'm not sure how incorrect it is; it may just be a boundary issue. 
> From the docs:
> {quote}The accuracy parameter (default: 1) is a positive numeric literal 
> which controls approximation accuracy at the cost of memory. Higher value of 
> accuracy yields better accuracy, 1.0/accuracy is the relative error of the 
> approximation.
> {quote}
> This is not true. Here is a minimum example in `pyspark` where, essentially, 
> the median of 5 and 8 is being calculated as 5:
> {code:python}
> import pyspark.sql.functions as psf
> df = spark.createDataFrame(
> [('bar', 5), ('bar', 8)], ['name', 'val']
> )
> median = psf.expr('percentile_approx(val, 0.5, 2147483647)')
> df.groupBy('name').agg(median.alias('median'))# gives the median as 5
> {code}
> I've tested this with Spark v2.4.4, pyspark v2.4.5- although I suspect this 
> is an issue with the underlying algorithm.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32963) empty string should be consistent for schema name in SparkGetSchemasOperation

2020-09-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17199933#comment-17199933
 ] 

Apache Spark commented on SPARK-32963:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/29834

> empty string should be consistent for schema name in SparkGetSchemasOperation
> -
>
> Key: SPARK-32963
> URL: https://issues.apache.org/jira/browse/SPARK-32963
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> When the schema name is an empty string, it is treated as ".*" and can match 
> all databases in the catalog.
> But it cannot match the global temp view database, because the empty string is 
> not converted to ".*" in that case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32963) empty string should be consistent for schema name in SparkGetSchemasOperation

2020-09-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32963:


Assignee: Apache Spark

> empty string should be consistent for schema name in SparkGetSchemasOperation
> -
>
> Key: SPARK-32963
> URL: https://issues.apache.org/jira/browse/SPARK-32963
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Kent Yao
>Assignee: Apache Spark
>Priority: Major
>
> When the schema name is an empty string, it is treated as ".*" and can match 
> all databases in the catalog.
> But it cannot match the global temp view database, because the empty string is 
> not converted to ".*" in that case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32963) empty string should be consistent for schema name in SparkGetSchemasOperation

2020-09-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32963:


Assignee: (was: Apache Spark)

> empty string should be consistent for schema name in SparkGetSchemasOperation
> -
>
> Key: SPARK-32963
> URL: https://issues.apache.org/jira/browse/SPARK-32963
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> When the schema name is an empty string, it is treated as ".*" and can match 
> all databases in the catalog.
> But it cannot match the global temp view database, because the empty string is 
> not converted to ".*" in that case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32963) empty string should be consistent for schema name in SparkGetSchemasOperation

2020-09-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17199936#comment-17199936
 ] 

Apache Spark commented on SPARK-32963:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/29834

> empty string should be consistent for schema name in SparkGetSchemasOperation
> -
>
> Key: SPARK-32963
> URL: https://issues.apache.org/jira/browse/SPARK-32963
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> When the schema name is an empty string, it is treated as ".*" and can match 
> all databases in the catalog.
> But it cannot match the global temp view database, because the empty string is 
> not converted to ".*" in that case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32963) empty string should be consistent for schema name in SparkGetSchemasOperation

2020-09-22 Thread Kent Yao (Jira)
Kent Yao created SPARK-32963:


 Summary: empty string should be consistent for schema name in 
SparkGetSchemasOperation
 Key: SPARK-32963
 URL: https://issues.apache.org/jira/browse/SPARK-32963
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.1, 3.1.0
Reporter: Kent Yao


When the schema name is an empty string, it is treated as ".*" and can match 
all databases in the catalog.
But it cannot match the global temp view database, because the empty string is 
not converted to ".*" in that case.
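
An illustrative Python sketch of the intended behaviour (not the actual SparkGetSchemasOperation code): an empty schema name should match every database, including the global temp one:
{code:python}
# Illustrative sketch only, not Spark's implementation.
import re

def to_schema_regex(schema_name):
    # An empty (or missing) schema name should behave like ".*" for every
    # database, including the global temp view database.
    # (Real implementations also translate JDBC wildcards such as '%',
    # which is not shown here.)
    return ".*" if not schema_name else schema_name

databases = ["default", "global_temp", "sales"]
pattern = to_schema_regex("")
print([db for db in databases if re.fullmatch(pattern, db)])
# ['default', 'global_temp', 'sales']
{code}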



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32962) Spark Streaming

2020-09-22 Thread Amit Menashe (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amit Menashe updated SPARK-32962:
-
Priority: Trivial  (was: Major)

> Spark Streaming
> ---
>
> Key: SPARK-32962
> URL: https://issues.apache.org/jira/browse/SPARK-32962
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.4.5
>Reporter: Amit Menashe
>Priority: Trivial
>
> Hey there,
> I'm running a Spark Streaming job that is integrated with Kafka and manages 
> its offset commits in Kafka itself.
> The problem: when a batch fails, I want to reprocess the affected offset 
> ranges, so I catch the exception and do NOT commit that range with 
> commitAsync.
> However, I notice the stream keeps proceeding even though no commit was made.
> Moreover, I later removed all the commitAsync calls and the stream still kept 
> proceeding!
> I guess there is some internal cache or state that lets the streaming job 
> keep consuming entries from Kafka.
>  
> Could you please advise?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32962) Spark Streaming

2020-09-22 Thread Amit Menashe (Jira)
Amit Menashe created SPARK-32962:


 Summary: Spark Streaming
 Key: SPARK-32962
 URL: https://issues.apache.org/jira/browse/SPARK-32962
 Project: Spark
  Issue Type: Bug
  Components: DStreams
Affects Versions: 2.4.5
Reporter: Amit Menashe


Hey there,

I'm running a Spark Streaming job that is integrated with Kafka and manages its 
offset commits in Kafka itself.

The problem: when a batch fails, I want to reprocess the affected offset ranges, 
so I catch the exception and do NOT commit that range with commitAsync.

However, I notice the stream keeps proceeding even though no commit was made.

Moreover, I later removed all the commitAsync calls and the stream still kept 
proceeding!

I guess there is some internal cache or state that lets the streaming job keep 
consuming entries from Kafka.

 

Could you please advise?
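
For reference, a minimal sketch of the manual offset-commit pattern for the Kafka 0-10 direct stream, assuming spark-streaming-kafka-0-10 on the classpath, a broker at localhost:9092 and a topic named "events" (both placeholders). Skipping commitAsync only changes which offsets are stored in Kafka for the next restart; a running direct stream keeps advancing from its in-memory position, which would explain the behaviour described above.

{code:scala}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

object OffsetCommitSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("offset-commit-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",           // placeholder broker
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "offset-commit-sketch",
      "enable.auto.commit" -> (false: java.lang.Boolean))

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams))

    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      try {
        rdd.foreachPartition(_.foreach(_ => ()))         // placeholder for real work
        // Commit to Kafka only after the batch succeeded.
        stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
      } catch {
        case _: Exception =>
          // Deliberately skip commitAsync. This does not rewind a running
          // stream: to reprocess this range you must seek back to these
          // offsets, or restart the job so it resumes from the last offsets
          // committed to Kafka.
          ()
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
{code}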



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32886) '.../jobs/undefined' link from "Event Timeline" in jobs page

2020-09-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17199898#comment-17199898
 ] 

Apache Spark commented on SPARK-32886:
--

User 'zhli1142015' has created a pull request for this issue:
https://github.com/apache/spark/pull/29833

> '.../jobs/undefined' link from "Event Timeline" in jobs page
> 
>
> Key: SPARK-32886
> URL: https://issues.apache.org/jira/browse/SPARK-32886
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.4, 3.0.0, 3.1.0
>Reporter: Zhen Li
>Assignee: Zhen Li
>Priority: Minor
> Fix For: 3.0.2, 3.1.0
>
> Attachments: undefinedlink.JPG
>
>
> In the event timeline view of the Jobs page, clicking a job item should redirect 
> you to the corresponding job page. When there are too many jobs, some job items' 
> links redirect to a wrong URL like '.../jobs/undefined'.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32886) '.../jobs/undefined' link from "Event Timeline" in jobs page

2020-09-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17199900#comment-17199900
 ] 

Apache Spark commented on SPARK-32886:
--

User 'zhli1142015' has created a pull request for this issue:
https://github.com/apache/spark/pull/29833

> '.../jobs/undefined' link from "Event Timeline" in jobs page
> 
>
> Key: SPARK-32886
> URL: https://issues.apache.org/jira/browse/SPARK-32886
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.4, 3.0.0, 3.1.0
>Reporter: Zhen Li
>Assignee: Zhen Li
>Priority: Minor
> Fix For: 3.0.2, 3.1.0
>
> Attachments: undefinedlink.JPG
>
>
> In the event timeline view of the Jobs page, clicking a job item should redirect 
> you to the corresponding job page. When there are too many jobs, some job items' 
> links redirect to a wrong URL like '.../jobs/undefined'.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32898) totalExecutorRunTimeMs is too big

2020-09-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32898:
--
Fix Version/s: 2.4.8

> totalExecutorRunTimeMs is too big
> -
>
> Key: SPARK-32898
> URL: https://issues.apache.org/jira/browse/SPARK-32898
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Linhong Liu
>Assignee: wuyi
>Priority: Major
> Fix For: 2.4.8, 3.0.2, 3.1.0
>
>
> This is likely caused by incorrectly calculating executorRunTimeMs in 
> Executor.scala.
> The function collectAccumulatorsAndResetStatusOnFailure(taskStartTimeNs) can 
> be called when taskStartTimeNs has not been set yet (it is still 0).
> As of now in the master branch, here is the problematic code: 
> [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L470]
>  
> An exception is thrown before this line, and the catch branch still updates 
> the metric. However, the query shows as SUCCESSful; maybe the task is 
> speculative. Not sure.
>  
> submissionTime in LiveExecutionData may have a similar problem:
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListener.scala#L449]
>  
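
A minimal sketch of the suspected failure mode (hypothetical code, not the actual Executor.scala): if the metric is computed before taskStartTimeNs has been assigned, the elapsed time degenerates to the absolute nanoTime value, which is enormous.

{code:scala}
object RunTimeMetricSketch {
  @volatile var taskStartTimeNs: Long = 0L   // not yet set when an early failure occurs

  // Hypothetical metric update: without the guard, a failure before the task
  // actually starts reports (nanoTime - 0) nanoseconds, i.e. a huge run time.
  def executorRunTimeMs(): Long =
    if (taskStartTimeNs > 0) (System.nanoTime() - taskStartTimeNs) / 1000000L
    else 0L

  def main(args: Array[String]): Unit = {
    println(executorRunTimeMs())                               // 0 with the guard
    println((System.nanoTime() - taskStartTimeNs) / 1000000L)  // inflated value without it
  }
}
{code}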



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32306) `approx_percentile` in Spark SQL gives incorrect results

2020-09-22 Thread Sean Malory (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17199846#comment-17199846
 ] 

Sean Malory commented on SPARK-32306:
-

[~maxgekk], thanks for the definition. Can we please update the docs to state 
that this is how it's calculated?

> `approx_percentile` in Spark SQL gives incorrect results
> 
>
> Key: SPARK-32306
> URL: https://issues.apache.org/jira/browse/SPARK-32306
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.4
>Reporter: Sean Malory
>Priority: Major
>
> The `approx_percentile` function in Spark SQL does not give the correct 
> result. I'm not sure how incorrect it is; it may just be a boundary issue. 
> From the docs:
> {quote}The accuracy parameter (default: 1) is a positive numeric literal 
> which controls approximation accuracy at the cost of memory. Higher value of 
> accuracy yields better accuracy, 1.0/accuracy is the relative error of the 
> approximation.
> {quote}
> This is not true. Here is a minimal example in `pyspark` where, essentially, 
> the median of 5 and 8 is being calculated as 5:
> {code:python}
> import pyspark.sql.functions as psf
> df = spark.createDataFrame(
>     [('bar', 5), ('bar', 8)], ['name', 'val']
> )
> median = psf.expr('percentile_approx(val, 0.5, 2147483647)')
> df.groupBy('name').agg(median.alias('median')).show()  # shows the median as 5
> {code}
> I've tested this with Spark v2.4.4 and PySpark v2.4.5, although I suspect this 
> is an issue with the underlying algorithm.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32306) `approx_percentile` in Spark SQL gives incorrect results

2020-09-22 Thread Sean Malory (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17199844#comment-17199844
 ] 

Sean Malory commented on SPARK-32306:
-

Exactly; you should get the median, which is defined, almost universally, as 
the average of the middle two numbers if there are an even number of elements 
in the list.

As you've hinted at, it doesn't really matter. If you decide that the 
percentile should always give you the lower of the two numbers (as it appears 
to do), that's fine, but I think it should be documented as such.

The way this actually came about was me creating a median function and then 
testing that the function was doing the right thing by comparing it with the 
`pandas` equivalent:


{code:python}
import numpy as np
import pandas as pd
import pyspark.sql.functions as psf

median = psf.expr('percentile_approx(val, 0.5, 2147483647)')

xs = np.random.rand(10)
ys = np.random.rand(10)
data = [('foo', float(x)) for x in xs] + [('bar', float(y)) for y in ys]

sparkdf = spark.createDataFrame(data, ['name', 'val'])
spark_meds = sparkdf.groupBy('name').agg(median.alias('median'))

pddf = pd.DataFrame(data, columns=['name', 'val'])
pd_meds = pddf.groupby('name')['val'].median()
{code}

> `approx_percentile` in Spark SQL gives incorrect results
> 
>
> Key: SPARK-32306
> URL: https://issues.apache.org/jira/browse/SPARK-32306
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.4
>Reporter: Sean Malory
>Priority: Major
>
> The `approx_percentile` function in Spark SQL does not give the correct 
> result. I'm not sure how incorrect it is; it may just be a boundary issue. 
> From the docs:
> {quote}The accuracy parameter (default: 1) is a positive numeric literal 
> which controls approximation accuracy at the cost of memory. Higher value of 
> accuracy yields better accuracy, 1.0/accuracy is the relative error of the 
> approximation.
> {quote}
> This is not true. Here is a minimal example in `pyspark` where, essentially, 
> the median of 5 and 8 is being calculated as 5:
> {code:python}
> import pyspark.sql.functions as psf
> df = spark.createDataFrame(
>     [('bar', 5), ('bar', 8)], ['name', 'val']
> )
> median = psf.expr('percentile_approx(val, 0.5, 2147483647)')
> df.groupBy('name').agg(median.alias('median')).show()  # shows the median as 5
> {code}
> I've tested this with Spark v2.4.4 and PySpark v2.4.5, although I suspect this 
> is an issue with the underlying algorithm.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32306) `approx_percentile` in Spark SQL gives incorrect results

2020-09-22 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17199839#comment-17199839
 ] 

Maxim Gekk commented on SPARK-32306:


The function returns an element of the input sequence; see 
https://en.wikipedia.org/wiki/Percentile#The_nearest-rank_method
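
In other words (a minimal sketch of the nearest-rank method, not Spark's actual implementation): the returned percentile is always one of the input values, so the 0.5-percentile of [5, 8] is 5 rather than the interpolated 6.5.

{code:scala}
object NearestRankSketch {
  // Nearest-rank percentile: take the value at ordinal rank ceil(p * n) (1-based)
  // in the sorted input; there is no interpolation between neighbouring values.
  def nearestRankPercentile(xs: Seq[Double], p: Double): Double = {
    require(xs.nonEmpty && p > 0.0 && p <= 1.0)
    val sorted = xs.sorted
    val rank = math.ceil(p * sorted.length).toInt.max(1)
    sorted(rank - 1)
  }

  def main(args: Array[String]): Unit = {
    println(nearestRankPercentile(Seq(5.0, 8.0), 0.5))  // 5.0, not 6.5
  }
}
{code}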

> `approx_percentile` in Spark SQL gives incorrect results
> 
>
> Key: SPARK-32306
> URL: https://issues.apache.org/jira/browse/SPARK-32306
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.4
>Reporter: Sean Malory
>Priority: Major
>
> The `approx_percentile` function in Spark SQL does not give the correct 
> result. I'm not sure how incorrect it is; it may just be a boundary issue. 
> From the docs:
> {quote}The accuracy parameter (default: 1) is a positive numeric literal 
> which controls approximation accuracy at the cost of memory. Higher value of 
> accuracy yields better accuracy, 1.0/accuracy is the relative error of the 
> approximation.
> {quote}
> This is not true. Here is a minimal example in `pyspark` where, essentially, 
> the median of 5 and 8 is being calculated as 5:
> {code:python}
> import pyspark.sql.functions as psf
> df = spark.createDataFrame(
>     [('bar', 5), ('bar', 8)], ['name', 'val']
> )
> median = psf.expr('percentile_approx(val, 0.5, 2147483647)')
> df.groupBy('name').agg(median.alias('median')).show()  # shows the median as 5
> {code}
> I've tested this with Spark v2.4.4 and PySpark v2.4.5, although I suspect this 
> is an issue with the underlying algorithm.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32898) totalExecutorRunTimeMs is too big

2020-09-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17199840#comment-17199840
 ] 

Apache Spark commented on SPARK-32898:
--

User 'Ngone51' has created a pull request for this issue:
https://github.com/apache/spark/pull/29832

> totalExecutorRunTimeMs is too big
> -
>
> Key: SPARK-32898
> URL: https://issues.apache.org/jira/browse/SPARK-32898
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Linhong Liu
>Assignee: wuyi
>Priority: Major
> Fix For: 3.0.2, 3.1.0
>
>
> This is likely caused by incorrectly calculating executorRunTimeMs in 
> Executor.scala.
> The function collectAccumulatorsAndResetStatusOnFailure(taskStartTimeNs) can 
> be called when taskStartTimeNs has not been set yet (it is still 0).
> As of now in the master branch, here is the problematic code: 
> [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L470]
>  
> An exception is thrown before this line, and the catch branch still updates 
> the metric. However, the query shows as SUCCESSful; maybe the task is 
> speculative. Not sure.
>  
> submissionTime in LiveExecutionData may have a similar problem:
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListener.scala#L449]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org