[jira] [Comment Edited] (SPARK-38061) security scan issue jackson-databinding HDFS dependency library

2022-02-02 Thread Sujit Biswas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17486268#comment-17486268
 ] 

Sujit Biswas edited comment on SPARK-38061 at 2/3/22, 7:49 AM:
---

Also, if some of these issues are already resolved, how do we get a build that 
has the fixes? Are the jackson-databind and log4j issues fixed in [Spark 
3.2.1|https://spark.apache.org/releases/spark-release-3-2-1.html]?


was (Author: JIRAUSER284395):
also if some of the issues are resolved, how to get the build that has the fixes

> security scan issue jackson-databinding HDFS dependency library
> ---
>
> Key: SPARK-38061
> URL: https://issues.apache.org/jira/browse/SPARK-38061
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Security
>Affects Versions: 3.2.0
>Reporter: Sujit Biswas
>Priority: Major
> Attachments: scan-security-report-spark-3.2.0-jre-11.csv
>
>
> Hi,
> We are running into security scan issues with a Docker image built on 
> spark-3.2.0-bin-hadoop3.2. Is there a way to resolve them?
>  
> Most of the issues relate to https://issues.apache.org/jira/browse/HDFS-15333; 
> attaching the CVE report.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37936) Use error classes in the parsing errors of intervals

2022-02-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17486269#comment-17486269
 ] 

Apache Spark commented on SPARK-37936:
--

User 'senthh' has created a pull request for this issue:
https://github.com/apache/spark/pull/35386

> Use error classes in the parsing errors of intervals
> 
>
> Key: SPARK-37936
> URL: https://issues.apache.org/jira/browse/SPARK-37936
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> Modify the following methods in QueryParsingErrors:
>  * moreThanOneFromToUnitInIntervalLiteralError
>  * invalidIntervalLiteralError
>  * invalidIntervalFormError
>  * invalidFromToUnitValueError
>  * fromToIntervalUnsupportedError
>  * mixedIntervalUnitsError
> to use error classes. Throw an implementation of SparkThrowable. Also write 
> a test for every error in QueryParsingErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38061) security scan issue jackson-databinding HDFS dependency library

2022-02-02 Thread Sujit Biswas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17486268#comment-17486268
 ] 

Sujit Biswas commented on SPARK-38061:
--

Also, if some of these issues are already resolved, how do we get a build that has the fixes?

> security scan issue jackson-databinding HDFS dependency library
> ---
>
> Key: SPARK-38061
> URL: https://issues.apache.org/jira/browse/SPARK-38061
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Security
>Affects Versions: 3.2.0
>Reporter: Sujit Biswas
>Priority: Major
> Attachments: scan-security-report-spark-3.2.0-jre-11.csv
>
>
> Hi,
> We are running into security scan issues with a Docker image built on 
> spark-3.2.0-bin-hadoop3.2. Is there a way to resolve them?
>  
> Most of the issues relate to https://issues.apache.org/jira/browse/HDFS-15333; 
> attaching the CVE report.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38061) security scan issue jackson-databinding HDFS dependency library

2022-02-02 Thread Sujit Biswas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17486264#comment-17486264
 ] 

Sujit Biswas commented on SPARK-38061:
--

That is not helpful at all; please point to a valid reason why something like 
this would not affect Spark:

stop,CRITICAL,false,"Vulnerability found in non-os package type (java) - 
/opt/spark/jars/log4j-1.2.17.jar (GHSA-2qrg-x229-3v8q - 
[https://github.com/advisories/GHSA-2qrg-x229-3v8q] 
)","GHSA-2qrg-x229-3v8q+log4j-1.2.17.jar",package,vulnerabilities

 

 

> security scan issue jackson-databinding HDFS dependency library
> ---
>
> Key: SPARK-38061
> URL: https://issues.apache.org/jira/browse/SPARK-38061
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Security
>Affects Versions: 3.2.0
>Reporter: Sujit Biswas
>Priority: Major
> Attachments: scan-security-report-spark-3.2.0-jre-11.csv
>
>
> Hi,
> We are running into security scan issues with a Docker image built on 
> spark-3.2.0-bin-hadoop3.2. Is there a way to resolve them?
>  
> Most of the issues relate to https://issues.apache.org/jira/browse/HDFS-15333; 
> attaching the CVE report.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38058) Writing a spark dataframe to Azure Sql Server is causing duplicate records intermittently

2022-02-02 Thread john (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17486263#comment-17486263
 ] 

john commented on SPARK-38058:
--

Since I am working in a production environment, I cannot disclose any documents 
here. This may be a bug in Spark. It happens in roughly 3 out of 5 runs: in 2 
runs all the records are inserted correctly, and in the others duplicates are 
inserted. We have tried all the workarounds and none of them work.

> Writing a spark dataframe to Azure Sql Server is causing duplicate records 
> intermittently
> -
>
> Key: SPARK-38058
> URL: https://issues.apache.org/jira/browse/SPARK-38058
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 3.1.0
>Reporter: john
>Priority: Major
>
> We are using the JDBC option to insert transformed data from a Spark DataFrame into a 
> table in Azure SQL Server. Below is the code snippet we are using for this 
> insert. However, we noticed on a few occasions that some records are being 
> duplicated in the destination table. This happens for large tables: e.g. 
> if a DataFrame has 600K records, after inserting the data into the table we get 
> around 620K records. We still want to understand why that's happening.
>  {{DataToLoad.write.jdbc(url = jdbcUrl, table = targetTable, mode = 
> "overwrite", properties = jdbcConnectionProperties)}}
>  
> The only reason we could think of is that, while inserts are happening in a 
> distributed fashion, if one of the executors fails in between, its tasks are 
> re-tried and could be inserting duplicate records. This could be totally 
> off the mark, but we raise it in case it could be the issue.
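
A minimal way to confirm the mismatch described above is to compare the DataFrame count with the row count read back over the same JDBC connection after the write. This is only an illustrative sketch; the connection values below are placeholders, not taken from the report.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-write-count-check").getOrCreate()

# Placeholders: substitute the real connection details used in the report.
jdbcUrl = "jdbc:sqlserver://<server>:1433;databaseName=<db>"
targetTable = "dbo.target_table"
jdbcConnectionProperties = {"user": "<user>", "password": "<password>"}

DataToLoad = spark.read.parquet("/tmp/transformed_data")  # placeholder source

expected = DataToLoad.count()
DataToLoad.write.jdbc(url=jdbcUrl, table=targetTable, mode="overwrite",
                      properties=jdbcConnectionProperties)

# Read the table back over JDBC and compare counts to detect duplication.
actual = spark.read.jdbc(url=jdbcUrl, table=targetTable,
                         properties=jdbcConnectionProperties).count()
if actual != expected:
    print("Row count mismatch: wrote {0}, table now has {1}".format(expected, actual))
{code}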



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38061) security scan issue jackson-databinding HDFS dependency library

2022-02-02 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17486254#comment-17486254
 ] 

Hyukjin Kwon commented on SPARK-38061:
--

No, the security report here simply lists issues in those libraries themselves. 
We don't know whether they actually affect Spark or not, and we should proceed 
with each upgrade separately, in its own ticket.

> security scan issue jackson-databinding HDFS dependency library
> ---
>
> Key: SPARK-38061
> URL: https://issues.apache.org/jira/browse/SPARK-38061
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Security
>Affects Versions: 3.2.0
>Reporter: Sujit Biswas
>Priority: Major
> Attachments: scan-security-report-spark-3.2.0-jre-11.csv
>
>
> Hi,
> We are running into security scan issues with a Docker image built on 
> spark-3.2.0-bin-hadoop3.2. Is there a way to resolve them?
>  
> Most of the issues relate to https://issues.apache.org/jira/browse/HDFS-15333; 
> attaching the CVE report.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38061) security scan issue jackson-databinding HDFS dependency library

2022-02-02 Thread Sujit Biswas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17486246#comment-17486246
 ] 

Sujit Biswas commented on SPARK-38061:
--

The info is in the attachment; you can do that.

> security scan issue jackson-databinding HDFS dependency library
> ---
>
> Key: SPARK-38061
> URL: https://issues.apache.org/jira/browse/SPARK-38061
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Security
>Affects Versions: 3.2.0
>Reporter: Sujit Biswas
>Priority: Major
> Attachments: scan-security-report-spark-3.2.0-jre-11.csv
>
>
> Hi,
> We are running into security scan issues with a Docker image built on 
> spark-3.2.0-bin-hadoop3.2. Is there a way to resolve them?
>  
> Most of the issues relate to https://issues.apache.org/jira/browse/HDFS-15333; 
> attaching the CVE report.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38094) Parquet: enable matching schema columns by field id

2022-02-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38094:


Assignee: (was: Apache Spark)

> Parquet: enable matching schema columns by field id
> ---
>
> Key: SPARK-38094
> URL: https://issues.apache.org/jira/browse/SPARK-38094
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.3
>Reporter: Jackie Zhang
>Priority: Major
>
> Field Id is a native field in the Parquet schema 
> ([https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L398])
> After this PR, when the requested schema has field IDs, Parquet readers will 
> first use the field ID to determine which Parquet columns to read, before 
> falling back to using column names as before. It enables matching columns by 
> field id for supported DWs like iceberg and Delta.
> This PR supports:
>  * OSS vectorized reader
> does not support:
>  * Parquet-mr reader due to lack of field id support (needs a follow up 
> ticket)
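
For context, a hedged sketch of how a writer can attach field ids to a Spark schema through column metadata, so that the field-id matching described above has something to key on. The "parquet.field.id" metadata key is an assumption based on the convention used by Delta/Iceberg-style writers; it is not stated in this ticket.

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.appName("parquet-field-id-sketch").getOrCreate()

# Assumed metadata key for per-column Parquet field ids (not confirmed here).
schema = StructType([
    StructField("id", IntegerType(), False, metadata={"parquet.field.id": 1}),
    StructField("name", StringType(), True, metadata={"parquet.field.id": 2}),
])

df = spark.createDataFrame([(1, "a"), (2, "b")], schema)
df.write.mode("overwrite").parquet("/tmp/field_id_demo")
{code}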



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38094) Parquet: enable matching schema columns by field id

2022-02-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38094:


Assignee: Apache Spark

> Parquet: enable matching schema columns by field id
> ---
>
> Key: SPARK-38094
> URL: https://issues.apache.org/jira/browse/SPARK-38094
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.3
>Reporter: Jackie Zhang
>Assignee: Apache Spark
>Priority: Major
>
> Field Id is a native field in the Parquet schema 
> ([https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L398])
> After this PR, when the requested schema has field IDs, Parquet readers will 
> first use the field ID to determine which Parquet columns to read, before 
> falling back to using column names as before. It enables matching columns by 
> field id for supported DWs like iceberg and Delta.
> This PR supports:
>  * OSS vectorized reader
> does not support:
>  * Parquet-mr reader due to lack of field id support (needs a follow up 
> ticket)



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38094) Parquet: enable matching schema columns by field id

2022-02-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17486237#comment-17486237
 ] 

Apache Spark commented on SPARK-38094:
--

User 'jackierwzhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/35385

> Parquet: enable matching schema columns by field id
> ---
>
> Key: SPARK-38094
> URL: https://issues.apache.org/jira/browse/SPARK-38094
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.3
>Reporter: Jackie Zhang
>Priority: Major
>
> Field Id is a native field in the Parquet schema 
> ([https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L398])
> After this PR, when the requested schema has field IDs, Parquet readers will 
> first use the field ID to determine which Parquet columns to read, before 
> falling back to using column names as before. It enables matching columns by 
> field id for supported DWs like iceberg and Delta.
> This PR supports:
>  * OSS vectorized reader
> does not support:
>  * Parquet-mr reader due to lack of field id support (needs a follow up 
> ticket)



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38061) security scan issue jackson-databinding HDFS dependency library

2022-02-02 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17486236#comment-17486236
 ] 

Hyukjin Kwon commented on SPARK-38061:
--

[~sujitbiswas] Let's file a separate ticket for each. We should identify which 
ones actually affect Spark, and upgrade dependencies one by one instead of doing 
it in a batch that pulls unrelated dependency upgrades together.

> security scan issue jackson-databinding HDFS dependency library
> ---
>
> Key: SPARK-38061
> URL: https://issues.apache.org/jira/browse/SPARK-38061
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Security
>Affects Versions: 3.2.0
>Reporter: Sujit Biswas
>Priority: Major
> Attachments: scan-security-report-spark-3.2.0-jre-11.csv
>
>
> Hi,
> We are running into security scan issues with a Docker image built on 
> spark-3.2.0-bin-hadoop3.2. Is there a way to resolve them?
>  
> Most of the issues relate to https://issues.apache.org/jira/browse/HDFS-15333; 
> attaching the CVE report.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38061) security scan issue jackson-databinding HDFS dependency library

2022-02-02 Thread Sujit Biswas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17486231#comment-17486231
 ] 

Sujit Biswas commented on SPARK-38061:
--

[~hyukjin.kwon] 

Note that upgrading jackson-databind solves only part of the problem; for 
example, log4j-1.2.17.jar is causing a critical CVE, and there are several other 
HIGH CVEs. Please see the CSV attached in the issue's attachment section.

> security scan issue jackson-databinding HDFS dependency library
> ---
>
> Key: SPARK-38061
> URL: https://issues.apache.org/jira/browse/SPARK-38061
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Security
>Affects Versions: 3.2.0
>Reporter: Sujit Biswas
>Priority: Major
> Attachments: scan-security-report-spark-3.2.0-jre-11.csv
>
>
> Hi,
> We are running into security scan issues with a Docker image built on 
> spark-3.2.0-bin-hadoop3.2. Is there a way to resolve them?
>  
> Most of the issues relate to https://issues.apache.org/jira/browse/HDFS-15333; 
> attaching the CVE report.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38073) NameError: name 'sc' is not defined when running driver with IPython and Python > 3.7

2022-02-02 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17486219#comment-17486219
 ] 

Hyukjin Kwon commented on SPARK-38073:
--

I think we should fix this .. 

> NameError: name 'sc' is not defined when running driver with IPython and Python 
> > 3.7
> ---
>
> Key: SPARK-38073
> URL: https://issues.apache.org/jira/browse/SPARK-38073
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Shell
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> When {{PYSPARK_DRIVER_PYTHON=$(which ipython) bin/pyspark}} is executed with 
> Python >= 3.8, the function registered with atexit seems to be executed in a 
> different scope than in Python 3.7.
> It results in {{NameError: name 'sc' is not defined}} on exit:
> {code:python}
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 3.3.0-SNAPSHOT
>       /_/
> Using Python version 3.8.12 (default, Oct 12 2021 21:57:06)
> Spark context Web UI available at http://192.168.0.198:4040
> Spark context available as 'sc' (master = local[*], app id = 
> local-1643555855409).
> SparkSession available as 'spark'.
> In [1]:   
>   
> 
> Do you really want to exit ([y]/n)? y
> Error in atexit._run_exitfuncs:
> Traceback (most recent call last):
>   File "/path/to/spark/python/pyspark/shell.py", line 49, in 
> atexit.register(lambda: sc.stop())
> NameError: name 'sc' is not defined
> {code}
> This could easily be fixed by capturing the `sc` instance:
> {code:none}
> diff --git a/python/pyspark/shell.py b/python/pyspark/shell.py
> index f0c487877a..4164e3ab0c 100644
> --- a/python/pyspark/shell.py
> +++ b/python/pyspark/shell.py
> @@ -46,7 +46,7 @@ except Exception:
>  
>  sc = spark.sparkContext
>  sql = spark.sql
> -atexit.register(lambda: sc.stop())
> +atexit.register((lambda sc: lambda: sc.stop())(sc))
>  
>  # for compatibility
>  sqlContext = spark._wrapped
> {code}
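
A standalone illustration of the capture pattern in the patch above, independent of Spark: the first registration looks up the global name only at shutdown, while the second binds the current object immediately.

{code:python}
import atexit


class Ctx:
    def stop(self):
        print("stopped")


sc = Ctx()

# Resolves the global name "sc" only when the exit hook runs; if module
# globals have already been cleared by then, this raises NameError.
atexit.register(lambda: sc.stop())

# Binds the current object now, so the hook no longer depends on the global.
atexit.register((lambda sc: lambda: sc.stop())(sc))
{code}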



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38074) RuntimeError: Java gateway process exited before sending its port number

2022-02-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-38074.
--
Resolution: Invalid

> RuntimeError: Java gateway process exited before sending its port number
> 
>
> Key: SPARK-38074
> URL: https://issues.apache.org/jira/browse/SPARK-38074
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Shell, Spark Submit
>Affects Versions: 3.2.1
>Reporter: Malla
>Priority: Major
>
> I am getting "RuntimeError: Java gateway process exited before sending its port 
> number" when running Python tests in Docker.
>  
> I am using *spark_home* as /usr/lib/python3.9/site-packages/pyspark
>  
> sc = SparkContext()
>   File "/usr/lib/python3.9/site-packages/pyspark/context.py", line 144, in __init__
>     SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
>   File "/usr/lib/python3.9/site-packages/pyspark/context.py", line 339, in _ensure_initialized
>     SparkContext._gateway = gateway or launch_gateway(conf)
>   File "/usr/lib/python3.9/site-packages/pyspark/java_gateway.py", line 136, in launch_gateway
>     raise RuntimeError("Java gateway process exited before sending its port number")
> RuntimeError: Java gateway process exited before sending its port number
>  
> I can provide additional details. Any help is appreciated.
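
Not a fix, but a small sanity-check sketch that can be run inside the Docker image before constructing SparkContext; a missing JVM or an unset SPARK_HOME is a common cause of this error. The path below is the one mentioned in the report and may need adjusting; the master and app name are placeholders.

{code:python}
import os
import shutil

from pyspark import SparkContext

# Assumed location from the report; adjust to the actual installation.
os.environ.setdefault("SPARK_HOME", "/usr/lib/python3.9/site-packages/pyspark")

# The Py4J gateway is a JVM process, so a java executable must be reachable.
if shutil.which("java") is None:
    raise RuntimeError("No 'java' executable on PATH; the Py4J gateway cannot start")

sc = SparkContext("local[*]", "gateway-check")
print(sc.version)
sc.stop()
{code}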



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38074) RuntimeError: Java gateway process exited before sending its port number

2022-02-02 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17486216#comment-17486216
 ] 

Hyukjin Kwon commented on SPARK-38074:
--

It's very likely a network configuration issue on your side. It would be great 
to interact with the mailing list first before filing it as an issue.

> RuntimeError: Java gateway process exited before sending its port number
> 
>
> Key: SPARK-38074
> URL: https://issues.apache.org/jira/browse/SPARK-38074
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Shell, Spark Submit
>Affects Versions: 3.2.1
>Reporter: Malla
>Priority: Major
>
> I am getting "RuntimeError: Java gateway process exited before sending its port 
> number" when running Python tests in Docker.
>  
> I am using *spark_home* as /usr/lib/python3.9/site-packages/pyspark
>  
> sc = SparkContext()
>   File "/usr/lib/python3.9/site-packages/pyspark/context.py", line 144, in __init__
>     SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
>   File "/usr/lib/python3.9/site-packages/pyspark/context.py", line 339, in _ensure_initialized
>     SparkContext._gateway = gateway or launch_gateway(conf)
>   File "/usr/lib/python3.9/site-packages/pyspark/java_gateway.py", line 136, in launch_gateway
>     raise RuntimeError("Java gateway process exited before sending its port number")
> RuntimeError: Java gateway process exited before sending its port number
>  
> I can provide additional details. Any help is appreciated.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38074) RuntimeError: Java gateway process exited before sending its port number

2022-02-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-38074:
-
Priority: Major  (was: Blocker)

> RuntimeError: Java gateway process exited before sending its port number
> 
>
> Key: SPARK-38074
> URL: https://issues.apache.org/jira/browse/SPARK-38074
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Shell, Spark Submit
>Affects Versions: 3.2.1
>Reporter: Malla
>Priority: Major
>
> I am getting "RuntimeError: Java gateway process exited before sending its port 
> number" when running Python tests in Docker.
>  
> I am using *spark_home* as /usr/lib/python3.9/site-packages/pyspark
>  
> sc = SparkContext()
>   File "/usr/lib/python3.9/site-packages/pyspark/context.py", line 144, in __init__
>     SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
>   File "/usr/lib/python3.9/site-packages/pyspark/context.py", line 339, in _ensure_initialized
>     SparkContext._gateway = gateway or launch_gateway(conf)
>   File "/usr/lib/python3.9/site-packages/pyspark/java_gateway.py", line 136, in launch_gateway
>     raise RuntimeError("Java gateway process exited before sending its port number")
> RuntimeError: Java gateway process exited before sending its port number
>  
> I can provide additional details. Any help is appreciated.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38087) select doesnt validate if the column already exists

2022-02-02 Thread Deepa Vasanthkumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17486215#comment-17486215
 ] 

Deepa Vasanthkumar commented on SPARK-38087:


[~dongjoon] Thank you; I am not sure whether this is an issue or not.

 

 

> select doesnt validate if the column already exists
> ---
>
> Key: SPARK-38087
> URL: https://issues.apache.org/jira/browse/SPARK-38087
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.1
> Environment: Version v3.2.1
> Master: local[*]
> (Reproducible in any environment)
>Reporter: Deepa Vasanthkumar
>Priority: Minor
> Attachments: select vs drop.png
>
>
>  
> Select doesn't validate whether the alias column is already present in the 
> dataframe. 
> After that, we cannot do anything with that column in the dataframe. 
> df4 = df2.select(df2.firstname, df2.lastname) --> throws AnalysisException
> df4.show()
>  
> However, drop will not let you drop the said column either. 
>  
> Scenario to reproduce:
> df2 = df1.select("*", (df1.firstname).alias("firstname"))   ---> this adds a 
> duplicate of the same column
> df2.show() 
> df2.drop(df2.firstname) --> this gives AnalysisException: Reference 
> 'firstname' is ambiguous, could be: firstname, firstname.
>  
>  
> Is this expected behavior?
>   !select vs drop.png!
> !image-2022-02-02-06-28-23-543.png!
>  
>  
>  
>  
>  
>  
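
A runnable restatement of the scenario above (the sample data is made up), which may make the select/drop behavior easier to reproduce:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("duplicate-column-repro").getOrCreate()

df1 = spark.createDataFrame([("John", "Doe"), ("Jane", "Roe")],
                            ["firstname", "lastname"])

# select() happily adds a second column named "firstname"...
df2 = df1.select("*", df1.firstname.alias("firstname"))
df2.show()

# ...but later references to that name are ambiguous.
try:
    df2.select(df2.firstname, df2.lastname).show()
except Exception as e:
    print(type(e).__name__, e)

try:
    df2.drop(df2.firstname).show()
except Exception as e:
    print(type(e).__name__, e)
{code}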



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38095) HistoryServerDiskManager.appStorePath should use backend-based extensions

2022-02-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-38095.
---
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35384
[https://github.com/apache/spark/pull/35384]

> HistoryServerDiskManager.appStorePath should use backend-based extensions
> -
>
> Key: SPARK-38095
> URL: https://issues.apache.org/jira/browse/SPARK-38095
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38082) Update minimum numpy version

2022-02-02 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17486203#comment-17486203
 ] 

Hyukjin Kwon commented on SPARK-38082:
--

Yeah .. we should probably upgrade the minimum version - it's too old.

> Update minimum numpy version
> 
>
> Key: SPARK-38082
> URL: https://issues.apache.org/jira/browse/SPARK-38082
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> Currently, we set the numpy version in {{extras_require}} to {{>=1.7}}.
> However, 1.7 was released almost 9 years ago; since then some methods 
> that we use have been deprecated in favor of new additions, and a new API 
> ({{numpy.typing}}) that is of some interest to us has been added.
> We should update the minimum version requirement to one of the following:
> - {{>=1.9.0}} ‒ the minimum reasonable bound, which will allow us to 
> replace deprecated {{tostring}} calls with {{tobytes}} (see the sketch below).
> - {{>=1.15.0}} (released 2018-07-23) ‒ a reasonable bound to match our 
> minimum supported pandas version.
> - {{>=1.20.0}} (released 2021-01-30) ‒ to fully utilize numpy typing.
> The last one might be somewhat controversial, but 1.15 shouldn't require much 
> discussion.
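
For the first bullet, a quick illustration of the deprecated call and its replacement (assuming a NumPy version where {{tostring}} still exists; {{tobytes}} has been available since 1.9):

{code:python}
import numpy as np

arr = np.arange(4, dtype=np.int32)

# Deprecated alias that newer NumPy releases warn about (and eventually remove).
legacy = arr.tostring()

# Replacement with identical output, available since NumPy 1.9.
modern = arr.tobytes()

assert legacy == modern
{code}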



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38061) security scan issue jackson-databinding HDFS dependency library

2022-02-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-38061.
--
Resolution: Duplicate

> security scan issue jackson-databinding HDFS dependency library
> ---
>
> Key: SPARK-38061
> URL: https://issues.apache.org/jira/browse/SPARK-38061
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Security
>Affects Versions: 3.2.0
>Reporter: Sujit Biswas
>Priority: Major
> Attachments: scan-security-report-spark-3.2.0-jre-11.csv
>
>
> Hi,
> We are running into security scan issues with a Docker image built on 
> spark-3.2.0-bin-hadoop3.2. Is there a way to resolve them?
>  
> Most of the issues relate to https://issues.apache.org/jira/browse/HDFS-15333; 
> attaching the CVE report.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38061) security scan issue jackson-databinding HDFS dependency library

2022-02-02 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17486202#comment-17486202
 ] 

Hyukjin Kwon commented on SPARK-38061:
--

That was already upgraded in SPARK-35550.

> security scan issue jackson-databinding HDFS dependency library
> ---
>
> Key: SPARK-38061
> URL: https://issues.apache.org/jira/browse/SPARK-38061
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Security
>Affects Versions: 3.2.0
>Reporter: Sujit Biswas
>Priority: Major
> Attachments: scan-security-report-spark-3.2.0-jre-11.csv
>
>
> Hi,
> We are running into security scan issues with a Docker image built on 
> spark-3.2.0-bin-hadoop3.2. Is there a way to resolve them?
>  
> Most of the issues relate to https://issues.apache.org/jira/browse/HDFS-15333; 
> attaching the CVE report.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38058) Writing a spark dataframe to Azure Sql Server is causing duplicate records intermittently

2022-02-02 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17486201#comment-17486201
 ] 

Hyukjin Kwon commented on SPARK-38058:
--

spark.speculation has been disabled by default for many years, so it should not 
be the cause. Did you enable it? It is difficult to debug further without 
details. Do you have more info, e.g. logs or a Spark UI screenshot? Or are you 
able to reproduce this with another DBMS?

> Writing a spark dataframe to Azure Sql Server is causing duplicate records 
> intermittently
> -
>
> Key: SPARK-38058
> URL: https://issues.apache.org/jira/browse/SPARK-38058
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 3.1.0
>Reporter: john
>Priority: Major
>
> We are using the JDBC option to insert transformed data from a Spark DataFrame into a 
> table in Azure SQL Server. Below is the code snippet we are using for this 
> insert. However, we noticed on a few occasions that some records are being 
> duplicated in the destination table. This happens for large tables: e.g. 
> if a DataFrame has 600K records, after inserting the data into the table we get 
> around 620K records. We still want to understand why that's happening.
>  {{DataToLoad.write.jdbc(url = jdbcUrl, table = targetTable, mode = 
> "overwrite", properties = jdbcConnectionProperties)}}
>  
> The only reason we could think of is that, while inserts are happening in a 
> distributed fashion, if one of the executors fails in between, its tasks are 
> re-tried and could be inserting duplicate records. This could be totally 
> off the mark, but we raise it in case it could be the issue.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38066) evaluateEquality should ignore attribute without min/max ColumnStat

2022-02-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17486200#comment-17486200
 ] 

Apache Spark commented on SPARK-38066:
--

User 'Stove-hust' has created a pull request for this issue:
https://github.com/apache/spark/pull/35363

> evaluateEquality should ignore attribute without min/max ColumnStat
> ---
>
> Key: SPARK-38066
> URL: https://issues.apache.org/jira/browse/SPARK-38066
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Fencheng Mei
>Priority: Minor
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
>   After enabling CBO, when the colStatsMap of an attribute does not have 
> min/max, the evaluateEquality method should return None, not 0.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38066) evaluateEquality should ignore attribute without min/max ColumnStat

2022-02-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38066:


Assignee: (was: Apache Spark)

> evaluateEquality should ignore attribute without min/max ColumnStat
> ---
>
> Key: SPARK-38066
> URL: https://issues.apache.org/jira/browse/SPARK-38066
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Fencheng Mei
>Priority: Minor
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
>   After enabling CBO, when the colStatsMap of an attribute does not have 
> min/max, the evaluateEquality method should return None, not 0.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38066) evaluateEquality should ignore attribute without min/max ColumnStat

2022-02-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38066:


Assignee: Apache Spark

> evaluateEquality should ignore attribute without min/max ColumnStat
> ---
>
> Key: SPARK-38066
> URL: https://issues.apache.org/jira/browse/SPARK-38066
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Fencheng Mei
>Assignee: Apache Spark
>Priority: Minor
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
>   After enabling CBO, when the colStatsMap of an attribute does not have 
> min/max, the evaluateEquality method should return None, not 0.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38095) HistoryServerDiskManager.appStorePath should use backend-based extensions

2022-02-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17486181#comment-17486181
 ] 

Apache Spark commented on SPARK-38095:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/35384

> HistoryServerDiskManager.appStorePath should use backend-based extensions
> -
>
> Key: SPARK-38095
> URL: https://issues.apache.org/jira/browse/SPARK-38095
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38095) HistoryServerDiskManager.appStorePath should use backend-based extensions

2022-02-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-38095:
-

Assignee: Dongjoon Hyun

> HistoryServerDiskManager.appStorePath should use backend-based extensions
> -
>
> Key: SPARK-38095
> URL: https://issues.apache.org/jira/browse/SPARK-38095
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38095) HistoryServerDiskManager.appStorePath should use backend-based extensions

2022-02-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-38095:
-

Assignee: Dongjoon Hyun  (was: Apache Spark)

> HistoryServerDiskManager.appStorePath should use backend-based extensions
> -
>
> Key: SPARK-38095
> URL: https://issues.apache.org/jira/browse/SPARK-38095
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38095) HistoryServerDiskManager.appStorePath should use backend-based extensions

2022-02-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38095:


Assignee: (was: Dongjoon Hyun)

> HistoryServerDiskManager.appStorePath should use backend-based extensions
> -
>
> Key: SPARK-38095
> URL: https://issues.apache.org/jira/browse/SPARK-38095
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38095) HistoryServerDiskManager.appStorePath should use backend-based extensions

2022-02-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17486180#comment-17486180
 ] 

Apache Spark commented on SPARK-38095:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/35384

> HistoryServerDiskManager.appStorePath should use backend-based extensions
> -
>
> Key: SPARK-38095
> URL: https://issues.apache.org/jira/browse/SPARK-38095
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38095) HistoryServerDiskManager.appStorePath should use backend-based extensions

2022-02-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38095:


Assignee: Apache Spark

> HistoryServerDiskManager.appStorePath should use backend-based extensions
> -
>
> Key: SPARK-38095
> URL: https://issues.apache.org/jira/browse/SPARK-38095
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38095) HistoryServerDiskManager.appStorePath should use backend-based extensions

2022-02-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38095:
--
Parent: SPARK-35781
Issue Type: Sub-task  (was: Bug)

> HistoryServerDiskManager.appStorePath should use backend-based extensions
> -
>
> Key: SPARK-38095
> URL: https://issues.apache.org/jira/browse/SPARK-38095
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38095) HistoryServerDiskManager.appStorePath should use backend-based extensions

2022-02-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38095:
--
Summary: HistoryServerDiskManager.appStorePath should use backend-based 
extensions  (was: HistoryServerDiskManager.appStorePath should use backend 
extensions)

> HistoryServerDiskManager.appStorePath should use backend-based extensions
> -
>
> Key: SPARK-38095
> URL: https://issues.apache.org/jira/browse/SPARK-38095
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38095) HistoryServerDiskManager.appStorePath should use backend extensions

2022-02-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38095:
--
Summary: HistoryServerDiskManager.appStorePath should use backend 
extensions  (was: HistoryServerDiskManager should use backend extensions for 
`apps` directory)

> HistoryServerDiskManager.appStorePath should use backend extensions
> ---
>
> Key: SPARK-38095
> URL: https://issues.apache.org/jira/browse/SPARK-38095
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38095) HistoryServerDiskManager should use backend extensions for `apps` directory

2022-02-02 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-38095:
-

 Summary: HistoryServerDiskManager should use backend extensions 
for `apps` directory
 Key: SPARK-38095
 URL: https://issues.apache.org/jira/browse/SPARK-38095
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.3.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37958) Pyspark SparkContext.AddFile() does not respect spark.files.overwrite

2022-02-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-37958.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35377
[https://github.com/apache/spark/pull/35377]

> Pyspark SparkContext.AddFile() does not respect spark.files.overwrite
> -
>
> Key: SPARK-37958
> URL: https://issues.apache.org/jira/browse/SPARK-37958
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Input/Output, Java API
>Affects Versions: 3.1.1
>Reporter: taylor schneider
>Assignee: Leona Yoda
>Priority: Major
> Fix For: 3.3.0
>
>
> I am currently running Apache Spark 3.1.1 on Kubernetes.
> When I try to re-add a file that has already been added, I see that the 
> updated file is not actually loaded into the cluster. I see the following 
> warning when calling the addFile() function.
> {code:java}
> 22/01/18 19:05:50 WARN SparkContext: The path 
> http://15.4.12.12:80/demo_data.csv has been added already. Overwriting of 
> added paths is not supported in the current version. {code}
> When I display the dataframe that was loaded, I see that the old data is 
> loaded. If I log into the worker pods and delete the file, the same results 
> are observed.
> My SparkConf has the following configurations
> {code:java}
> ('spark.master', 'k8s://https://15.4.7.11:6443')
> ('spark.app.name', 'spark-jupyter-mlib')
> ('spark.submit.deploy.mode', 'cluster')
> ('spark.kubernetes.container.image', 'tschneider/apache-spark-k8:v7')
> ('spark.kubernetes.namespace', 'spark')
> ('spark.kubernetes.pyspark.pythonVersion', '3')
> ('spark.kubernetes.authenticate.driver.serviceAccountName', 'spark-sa')
> ('spark.kubernetes.authenticate.serviceAccountName', 'spark-sa')
> ('spark.executor.instances', '3')
> ('spark.executor.cores', '2')
> ('spark.executor.memory', '4096m')
> ('spark.executor.memoryOverhead', '1024m')
> ('spark.driver.memory', '1024m')
> ('spark.driver.host', '15.4.12.12')
> ('spark.files.overwrite', 'true')
> ('spark.files.useFetchCache', 'false') {code}
> According to the documentation for 3.1.1, the spark.files.overwrite parameter 
> should in fact load the updated files. The documentation can be found here: 
> [https://spark.apache.org/docs/3.1.1/configuration.html]
> The only workaround is to use a python function to manually delete and 
> re-download the file. Calling addFile still shows the warning in this case. 
> My code for the delete and redownload is as follows:
> {code:python}
> def os_remove(file_path):
>     import os
>     import socket
>     hostname = socket.gethostname()
>     action = None
>     if os.path.exists(file_path):
>         action = "delete"
>         os.remove(file_path)
>     return (hostname, action)
> 
> worker_file_path = u"file:///{0}".format(csv_file_name)
> worker_count = int(spark_session.conf.get('spark.executor.instances'))
> rdd = sc.parallelize(range(worker_count)).map(lambda var: os_remove(worker_file_path))
> rdd.collect()
> 
> def download_updated_file(file_url):
>     import os
>     import urllib.parse as parse
>     import urllib.request as request
>     file_name = os.path.basename(parse.urlparse(file_url).path)
>     local_file_path = "/{0}".format(file_name)
>     request.urlretrieve(file_url, local_file_path)
> 
> rdd = sc.parallelize(range(worker_count)).map(lambda var: download_updated_file(csv_file_url))
> rdd.collect(){code}
> I believe this is either a bug or a documentation mistake. Perhaps the 
> configuration parameter has a misleading description?
>  
>  
>  
>  
>  
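
For reference, a minimal sketch of the addFile / SparkFiles flow being exercised here; the URL is the placeholder from the report, and as described above, re-adding the same path currently only logs the warning instead of refreshing the cached copy, even with spark.files.overwrite=true.

{code:python}
from pyspark import SparkContext, SparkFiles

sc = SparkContext(master="local[*]", appName="addfile-sketch")

url = "http://15.4.12.12:80/demo_data.csv"  # placeholder URL from the report
sc.addFile(url)

def read_head(_):
    # Resolve the distributed copy on the executor and read its first line.
    path = SparkFiles.get("demo_data.csv")
    with open(path) as f:
        return f.readline().strip()

print(sc.parallelize(range(1), 1).map(read_head).collect())

# A second addFile(url) only logs "has been added already"; the cached copy
# is not refreshed, which is the behavior questioned in this ticket.
sc.addFile(url)
sc.stop()
{code}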



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37958) Pyspark SparkContext.AddFile() does not respect spark.files.overwrite

2022-02-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-37958:


Assignee: Leona Yoda

> Pyspark SparkContext.AddFile() does not respect spark.files.overwrite
> -
>
> Key: SPARK-37958
> URL: https://issues.apache.org/jira/browse/SPARK-37958
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Input/Output, Java API
>Affects Versions: 3.1.1
>Reporter: taylor schneider
>Assignee: Leona Yoda
>Priority: Major
>
> I am currently running Apache Spark 3.1.1 on Kubernetes.
> When I try to re-add a file that has already been added, I see that the 
> updated file is not actually loaded into the cluster. I see the following 
> warning when calling the addFile() function.
> {code:java}
> 22/01/18 19:05:50 WARN SparkContext: The path 
> http://15.4.12.12:80/demo_data.csv has been added already. Overwriting of 
> added paths is not supported in the current version. {code}
> When I display the dataframe that was loaded, I see that the old data is 
> loaded. If I log into the worker pods and delete the file, the same results 
> are observed.
> My SparkConf has the following configurations
> {code:java}
> ('spark.master', 'k8s://https://15.4.7.11:6443')
> ('spark.app.name', 'spark-jupyter-mlib')
> ('spark.submit.deploy.mode', 'cluster')
> ('spark.kubernetes.container.image', 'tschneider/apache-spark-k8:v7')
> ('spark.kubernetes.namespace', 'spark')
> ('spark.kubernetes.pyspark.pythonVersion', '3')
> ('spark.kubernetes.authenticate.driver.serviceAccountName', 'spark-sa')
> ('spark.kubernetes.authenticate.serviceAccountName', 'spark-sa')
> ('spark.executor.instances', '3')
> ('spark.executor.cores', '2')
> ('spark.executor.memory', '4096m')
> ('spark.executor.memoryOverhead', '1024m')
> ('spark.driver.memory', '1024m')
> ('spark.driver.host', '15.4.12.12')
> ('spark.files.overwrite', 'true')
> ('spark.files.useFetchCache', 'false') {code}
> According to the documentation for 3.1.1, the spark.files.overwrite parameter 
> should in fact load the updated files. The documentation can be found here: 
> [https://spark.apache.org/docs/3.1.1/configuration.html]
> The only workaround is to use a python function to manually delete and 
> re-download the file. Calling addFile still shows the warning in this case. 
> My code for the delete and redownload is as follows:
> {code:python}
> def os_remove(file_path):
>     import os
>     import socket
>     hostname = socket.gethostname()
>     action = None
>     if os.path.exists(file_path):
>         action = "delete"
>         os.remove(file_path)
>     return (hostname, action)
> 
> worker_file_path = u"file:///{0}".format(csv_file_name)
> worker_count = int(spark_session.conf.get('spark.executor.instances'))
> rdd = sc.parallelize(range(worker_count)).map(lambda var: os_remove(worker_file_path))
> rdd.collect()
> 
> def download_updated_file(file_url):
>     import os
>     import urllib.parse as parse
>     import urllib.request as request
>     file_name = os.path.basename(parse.urlparse(file_url).path)
>     local_file_path = "/{0}".format(file_name)
>     request.urlretrieve(file_url, local_file_path)
> 
> rdd = sc.parallelize(range(worker_count)).map(lambda var: download_updated_file(csv_file_url))
> rdd.collect(){code}
> I believe this is either a bug or a documentation mistake. Perhaps the 
> configuration parameter has a misleading description?
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38094) Parquet: enable matching schema columns by field id

2022-02-02 Thread Jackie Zhang (Jira)
Jackie Zhang created SPARK-38094:


 Summary: Parquet: enable matching schema columns by field id
 Key: SPARK-38094
 URL: https://issues.apache.org/jira/browse/SPARK-38094
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 3.3
Reporter: Jackie Zhang


Field Id is a native field in the Parquet schema 
([https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L398])

After this PR, when the requested schema has field IDs, Parquet readers will 
first use the field ID to determine which Parquet columns to read, before 
falling back to using column names as before. It enables matching columns by 
field id for supported DWs like iceberg and Delta.

This PR supports:
 * OSS vectorized reader

does not support:
 * Parquet-mr reader due to lack of field id support (needs a follow up ticket)



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38087) select doesnt validate if the column already exists

2022-02-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38087:
--
Fix Version/s: (was: 3.3)

> select doesnt validate if the column already exists
> ---
>
> Key: SPARK-38087
> URL: https://issues.apache.org/jira/browse/SPARK-38087
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.1
> Environment: Version v3.2.1
> Master: local[*]
> (Reproducible in any environment)
>Reporter: Deepa Vasanthkumar
>Priority: Minor
> Attachments: select vs drop.png
>
>
>  
> Select doesn't validate whether the alias column is already present in the 
> dataframe. 
> After that, we cannot do anything with that column in the dataframe. 
> df4 = df2.select(df2.firstname, df2.lastname) --> throws AnalysisException
> df4.show()
>  
> However, drop will not let you drop the said column either. 
>  
> Scenario to reproduce:
> df2 = df1.select("*", (df1.firstname).alias("firstname"))   ---> this adds a 
> duplicate of the same column
> df2.show() 
> df2.drop(df2.firstname) --> this gives AnalysisException: Reference 
> 'firstname' is ambiguous, could be: firstname, firstname.
>  
>  
> Is this expected behavior?
>   !select vs drop.png!
> !image-2022-02-02-06-28-23-543.png!
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38087) select doesnt validate if the column already exists

2022-02-02 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17486167#comment-17486167
 ] 

Dongjoon Hyun commented on SPARK-38087:
---

I removed the fixed version field, [~deepa.vasanthkumar].

> select doesnt validate if the column already exists
> ---
>
> Key: SPARK-38087
> URL: https://issues.apache.org/jira/browse/SPARK-38087
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.1
> Environment: Version v3.2.1
> Master: local[*]
> (Reproducible in any environment)
>Reporter: Deepa Vasanthkumar
>Priority: Minor
> Attachments: select vs drop.png
>
>
>  
> Select doesn't validate whether the alias column is already present in the 
> dataframe. 
> After that, we cannot do anything with that column in the dataframe. 
> df4 = df2.select(df2.firstname, df2.lastname) --> throws AnalysisException
> df4.show()
>  
> However, drop will not let you drop the said column either. 
>  
> Scenario to reproduce:
> df2 = df1.select("*", (df1.firstname).alias("firstname"))   ---> this adds a 
> duplicate of the same column
> df2.show() 
> df2.drop(df2.firstname) --> this gives AnalysisException: Reference 
> 'firstname' is ambiguous, could be: firstname, firstname.
>  
>  
> Is this expected behavior?
>   !select vs drop.png!
> !image-2022-02-02-06-28-23-543.png!
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30062) bug with DB2Driver using mode("overwrite") option("truncate",True)

2022-02-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30062:
--
Fix Version/s: 3.3.0
   (was: 3.3)

> bug with DB2Driver using mode("overwrite") option("truncate",True)
> --
>
> Key: SPARK-30062
> URL: https://issues.apache.org/jira/browse/SPARK-30062
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.4
>Reporter: Guy Huinen
>Assignee: Ivan Karol
>Priority: Major
>  Labels: db2, pyspark
> Fix For: 3.3.0, 3.2.2
>
>
> Using DB2Driver with mode("overwrite") and option("truncate", True) gives a SQL 
> error.
>  
> {code:java}
> dfClient.write\
>  .format("jdbc")\
>  .mode("overwrite")\
>  .option('driver', 'com.ibm.db2.jcc.DB2Driver')\
>  .option("url","jdbc:db2://")\
>  .option("user","xxx")\
>  .option("password","")\
>  .option("dbtable","")\
>  .option("truncate",True)\{code}
>  
>  gives the error below
> In summary, I believe the semicolon is misplaced or the statement is malformed.
>  
> {code:java}
> EXPO.EXPO#CMR_STG;IMMEDIATE{code}
>  
>  
> full error
> {code:java}
> An error occurred while calling o47.save. : 
> com.ibm.db2.jcc.am.SqlSyntaxErrorException: DB2 SQL Error: SQLCODE=-104, 
> SQLSTATE=42601, SQLERRMC=END-OF-STATEMENT;LE EXPO.EXPO#CMR_STG;IMMEDIATE, 
> DRIVER=4.19.77 at com.ibm.db2.jcc.am.b4.a(b4.java:747) at 
> com.ibm.db2.jcc.am.b4.a(b4.java:66) at com.ibm.db2.jcc.am.b4.a(b4.java:135) 
> at com.ibm.db2.jcc.am.kh.c(kh.java:2788) at 
> com.ibm.db2.jcc.am.kh.d(kh.java:2776) at 
> com.ibm.db2.jcc.am.kh.b(kh.java:2143) at com.ibm.db2.jcc.t4.ab.i(ab.java:226) 
> at com.ibm.db2.jcc.t4.ab.c(ab.java:48) at com.ibm.db2.jcc.t4.p.b(p.java:38) 
> at com.ibm.db2.jcc.t4.av.h(av.java:124) at 
> com.ibm.db2.jcc.am.kh.ak(kh.java:2138) at 
> com.ibm.db2.jcc.am.kh.a(kh.java:3325) at com.ibm.db2.jcc.am.kh.c(kh.java:765) 
> at com.ibm.db2.jcc.am.kh.executeUpdate(kh.java:744) at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.truncateTable(JdbcUtils.scala:113)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:56)
>  at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) 
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
>  at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
>  at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>  at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676) at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
>  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271) at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at 
> py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at 
> py4j.Gateway.invoke(Gateway.java:282) at 
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at 
> py4j.commands.CallCommand.execute(CallCommand.java:79) at 
> py4j.GatewayConnection.run(GatewayConnection.java:238) at 
> 

[jira] [Updated] (SPARK-38089) Show the root cause exception in TestUtils.assertExceptionMsg

2022-02-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38089:
--
Summary: Show the root cause exception in TestUtils.assertExceptionMsg  
(was: Improve assertion failure message in TestUtils.assertExceptionMsg)

> Show the root cause exception in TestUtils.assertExceptionMsg
> -
>
> Key: SPARK-38089
> URL: https://issues.apache.org/jira/browse/SPARK-38089
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Tests
>Affects Versions: 3.2.1
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
> Fix For: 3.3.0
>
>
> {{TestUtils.assertExceptionMsg}} is great, but when the assertion _doesn't_ 
> match, it can be challenging to tell why, because the exception tree that was 
> searched isn't printed. Only way I could find to fix it up was to run things 
> in a debugger and check the exception tree.
> It would be very helpful if {{assertExceptionMsg}} printed out the exception 
> tree in which it was searching (upon failure).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38089) Improve assertion failure message in TestUtils.assertExceptionMsg

2022-02-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-38089.
---
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35383
[https://github.com/apache/spark/pull/35383]

> Improve assertion failure message in TestUtils.assertExceptionMsg
> -
>
> Key: SPARK-38089
> URL: https://issues.apache.org/jira/browse/SPARK-38089
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Tests
>Affects Versions: 3.2.1
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
> Fix For: 3.3.0
>
>
> {{TestUtils.assertExceptionMsg}} is great, but when the assertion _doesn't_ 
> match, it can be challenging to tell why, because the exception tree that was 
> searched isn't printed. Only way I could find to fix it up was to run things 
> in a debugger and check the exception tree.
> It would be very helpful if {{assertExceptionMsg}} printed out the exception 
> tree in which it was searching (upon failure).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38089) Show the root cause exception in TestUtils.assertExceptionMsg

2022-02-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38089:
--
Affects Version/s: 3.3.0
   (was: 3.2.1)

> Show the root cause exception in TestUtils.assertExceptionMsg
> -
>
> Key: SPARK-38089
> URL: https://issues.apache.org/jira/browse/SPARK-38089
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Tests
>Affects Versions: 3.3.0
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
> Fix For: 3.3.0
>
>
> {{TestUtils.assertExceptionMsg}} is great, but when the assertion _doesn't_ 
> match, it can be challenging to tell why, because the exception tree that was 
> searched isn't printed. Only way I could find to fix it up was to run things 
> in a debugger and check the exception tree.
> It would be very helpful if {{assertExceptionMsg}} printed out the exception 
> tree in which it was searching (upon failure).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38089) Improve assertion failure message in TestUtils.assertExceptionMsg

2022-02-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-38089:
-

Assignee: Erik Krogen

> Improve assertion failure message in TestUtils.assertExceptionMsg
> -
>
> Key: SPARK-38089
> URL: https://issues.apache.org/jira/browse/SPARK-38089
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Tests
>Affects Versions: 3.2.1
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
>
> {{TestUtils.assertExceptionMsg}} is great, but when the assertion _doesn't_ 
> match, it can be challenging to tell why, because the exception tree that was 
> searched isn't printed. Only way I could find to fix it up was to run things 
> in a debugger and check the exception tree.
> It would be very helpful if {{assertExceptionMsg}} printed out the exception 
> tree in which it was searching (upon failure).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37996) Contribution guide is stale

2022-02-02 Thread Khalid Mammadov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17486139#comment-17486139
 ] 

Khalid Mammadov commented on SPARK-37996:
-

Raised PR: https://github.com/apache/spark-website/pull/378

With the following changes:
 * It describes, in the Pull request section of the Contributing page, the actual 
procedure and takes a contributor through a step-by-step process.
 * It removes the optional "Running tests in your forked repository" section on the 
Developer Tools page, which is obsolete and no longer reflects reality: it says 
we can test by clicking the “Run workflow” button, which is no longer available 
because the workflow does not use the "workflow_dispatch" event trigger anymore; 
that trigger was removed in
 * [[SPARK-35048][INFRA] Distribute GitHub Actions workflows to fork 
repositories to share the resources 
spark#32092|https://github.com/apache/spark/pull/32092]
 * Instead, it documents the new procedure that the above PR introduced, i.e. 
contributors need to use their own free GitHub Actions workflow credits to test the 
changes they are proposing, and a Spark Actions workflow will expect that to be 
completed before marking the PR as ready for review.
 * Some general wording was copied from the "Running tests in your forked 
repository" section on the Developer Tools page, but the main content was rewritten to 
meet the objective.
 * Also fixed the URL to developer-tools.html so that it is resolved by the parser 
(which converts it into a relative URI) instead of using a hard-coded absolute URL.

> Contribution guide is stale
> ---
>
> Key: SPARK-37996
> URL: https://issues.apache.org/jira/browse/SPARK-37996
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.2.0
>Reporter: Khalid Mammadov
>Priority: Minor
>
> The contribution guide mentions the link below to use for testing on a local repo before 
> raising a PR, but the process has changed and the documentation does not reflect it.
> https://spark.apache.org/developer-tools.html#github-workflow-tests
> Only by digging into the git log of 
> [.github/workflows/build_and_test.yml|https://github.com/apache/spark/commit/2974b70d1efd4b1c5cfe7e2467766f0a9a1fec82#diff-48c0ee97c53013d18d6bbae44648f7fab9af2e0bf5b0dc1ca761e18ec5c478f2]
>  did I manage to find what the new process is. It was changed in 
> [https://github.com/apache/spark/pull/32092] but the documentation was not 
> updated.
> I am happy to contribute a fix, but apparently 
> [https://spark.apache.org/developer-tools.html] is hosted on the Apache website 
> rather than in the Spark source code.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37771) Race condition in withHiveState and limited logic in IsolatedClientLoader result in ClassNotFoundException

2022-02-02 Thread Ivan Sadikov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17486107#comment-17486107
 ] 

Ivan Sadikov commented on SPARK-37771:
--

I could not manage to work around the issue with the Hadoop 3.3.1 binaries; it 
still persists. The shared-prefixes config works; however, I found there are more 
issues with IsolatedClassLoader which might need to be fixed, e.g. the 
incorrect parent class loader is passed to IsolatedClassLoader in certain 
situations - I am debugging this now.

No updates on the fix yet; the workaround with the config works, and the issue is 
not blocking me at the moment.
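
For reference, a minimal PySpark sketch of the shared-prefixes workaround discussed here, based on the configuration in the issue description; the metastore version, table name, and S3 path are placeholders:

{code:python}
from pyspark.sql import SparkSession

# Sketch of the workaround: ensure the AWS SDK classes (com.amazonaws.*) are
# loaded by the shared class loader rather than the isolated Hive client
# loader, so S3AFileSystem can resolve its credentials provider class.
spark = (
    SparkSession.builder
    .config("spark.sql.hive.metastore.version", "0.13.0")   # placeholder
    .config("spark.sql.hive.metastore.jars", "maven")
    .config("spark.sql.hive.metastore.sharedPrefixes", "com.amazonaws")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .enableHiveSupport()
    .getOrCreate()
)

# The command from the description that previously failed with
# ClassNotFoundException; table and bucket path are placeholders.
spark.sql(
    "ALTER TABLE ivan_test_2 ADD PARTITION (part='b') "
    "LOCATION 's3://bucket/hive-test'"
)
{code}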

> Race condition in withHiveState and limited logic in IsolatedClientLoader 
> result in ClassNotFoundException
> --
>
> Key: SPARK-37771
> URL: https://issues.apache.org/jira/browse/SPARK-37771
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0, 3.1.2, 3.2.0
>Reporter: Ivan Sadikov
>Priority: Major
>
> There is a race condition between creating a Hive client and loading classes 
> that do not appear in shared prefixes config. For example, we confirmed that 
> the code fails for the following configuration:
> {code:java}
> spark.sql.hive.metastore.version 0.13.0
> spark.sql.hive.metastore.jars maven
> spark.sql.hive.metastore.sharedPrefixes  com.amazonaws prefix>
> spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem{code}
> And code: 
> {code:java}
> -- Prerequisite commands to set up the table
> -- drop table if exists ivan_test_2;
> -- create table ivan_test_2 (a int, part string) using csv location 
> 's3://bucket/hive-test' partitioned by (part);
> -- insert into ivan_test_2 values (1, 'a'); 
> -- Command that triggers failure
> ALTER TABLE ivan_test_2 ADD PARTITION (part='b') LOCATION 
> 's3://bucket/hive-test'{code}
>  
> Stacktrace (line numbers might differ):
> {code:java}
> 21/12/22 04:37:05 DEBUG IsolatedClientLoader: shared class: 
> org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider
> 21/12/22 04:37:05 DEBUG IsolatedClientLoader: shared class: 
> org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider
> 21/12/22 04:37:05 DEBUG IsolatedClientLoader: hive class: 
> com.amazonaws.auth.EnvironmentVariableCredentialsProvider - null
> 21/12/22 04:37:05 ERROR S3AFileSystem: Failed to initialize S3AFileSystem for 
> path s3://bucket/hive-test
> java.io.IOException: From option fs.s3a.aws.credentials.provider 
> java.lang.ClassNotFoundException: Class 
> com.amazonaws.auth.EnvironmentVariableCredentialsProvider not found
>     at 
> org.apache.hadoop.fs.s3a.S3AUtils.loadAWSProviderClasses(S3AUtils.java:725)
>     at 
> org.apache.hadoop.fs.s3a.S3AUtils.createAWSCredentialProviderSet(S3AUtils.java:688)
>     at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:411)
>     at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3469)
>     at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
>     at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
>     at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
>     at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
>     at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
>     at org.apache.hadoop.hive.metastore.Warehouse.getFs(Warehouse.java:112)
>     at 
> org.apache.hadoop.hive.metastore.Warehouse.getDnsPath(Warehouse.java:144)
>     at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createLocationForAddedPartition(HiveMetaStore.java:1993)
>     at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.add_partitions_core(HiveMetaStore.java:1865)
>     at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.add_partitions_req(HiveMetaStore.java:1910)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at 
> org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:105)
>     at com.sun.proxy.$Proxy58.add_partitions_req(Unknown Source)
>     at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.add_partitions(HiveMetaStoreClient.java:457)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at 
> 

[jira] [Commented] (SPARK-38091) AvroSerializer can cause java.lang.ClassCastException at run time

2022-02-02 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17486092#comment-17486092
 ] 

Erik Krogen commented on SPARK-38091:
-

[~Zhen-hao] for formatting you need to use the Atlassian markup: 
[https://jira.atlassian.com/secure/WikiRendererHelpAction.jspa?section=all]
Basically replace ` ... ` with \{{ ... }} and replace ``` ... ``` with \{code} 
... \{code}

> AvroSerializer can cause java.lang.ClassCastException at run time
> -
>
> Key: SPARK-38091
> URL: https://issues.apache.org/jira/browse/SPARK-38091
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, 
> 3.2.1
>Reporter: Zhenhao Li
>Priority: Major
>  Labels: Avro, serializers
>
> `AvroSerializer`'s implementation, at least in `newConverter`, was not 100% 
> based on the `InternalRow` and `SpecializedGetters` interface. It assumes 
> many implementation details of the interface. 
> For example, in 
> ```scala
>       case (TimestampType, LONG) => avroType.getLogicalType match {
>           // For backward compatibility, if the Avro type is Long and it is 
> not logical type
>           // (the `null` case), output the timestamp value as with 
> millisecond precision.
>           case null | _: TimestampMillis => (getter, ordinal) =>
>             
> DateTimeUtils.microsToMillis(timestampRebaseFunc(getter.getLong(ordinal)))
>           case _: TimestampMicros => (getter, ordinal) =>
>             timestampRebaseFunc(getter.getLong(ordinal))
>           case other => throw new IncompatibleSchemaException(errorPrefix +
>             s"SQL type ${TimestampType.sql} cannot be converted to Avro 
> logical type $other")
>         }
> ```
> it assumes the `InternalRow` instance encodes `TimestampType` as 
> `java.lang.Long`. That's true for `Unsaferow` but not for 
> `GenericInternalRow`. 
> Hence the above code will end up with runtime exceptions when used on an 
> instance of `GenericInternalRow`, which is the case for Python UDF. 
> I didn't get time to dig deeper than that. I got the impression that Spark's 
> optimizer(s) will turn a row into a `UnsafeRow` and Python UDF doesn't 
> involve the optimizer(s) and hence each row is a `GenericInternalRow`. 
> It would be great if someone can correct me or offer a better explanation. 
>  
> To reproduce the issue, 
> `git checkout master` and `git cherry-pick --no-commit` 
> [this-commit|https://github.com/Zhen-hao/spark/commit/1ffe8e8f35273b2f3529f6c2d004822f480e4c88]
> and run the test `org.apache.spark.sql.avro.AvroSerdeSuite`.
>  
> You will see runtime exceptions like the following one
> ```
> - Serialize DecimalType to Avro BYTES with logical type decimal *** FAILED ***
>   java.lang.ClassCastException: class java.math.BigDecimal cannot be cast to 
> class org.apache.spark.sql.types.Decimal (java.math.BigDecimal is in module 
> java.base of loader 'bootstrap'; org.apache.spark.sql.types.Decimal is in 
> unnamed module of loader 'app')
>   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getDecimal(rows.scala:45)
>   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getDecimal$(rows.scala:45)
>   at 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getDecimal(rows.scala:195)
>   at 
> org.apache.spark.sql.avro.AvroSerializer.$anonfun$newConverter$10(AvroSerializer.scala:136)
>   at 
> org.apache.spark.sql.avro.AvroSerializer.$anonfun$newConverter$10$adapted(AvroSerializer.scala:135)
>   at 
> org.apache.spark.sql.avro.AvroSerializer.$anonfun$newStructConverter$2(AvroSerializer.scala:283)
>   at 
> org.apache.spark.sql.avro.AvroSerializer.serialize(AvroSerializer.scala:60)
>   at 
> org.apache.spark.sql.avro.AvroSerdeSuite.$anonfun$new$5(AvroSerdeSuite.scala:82)
>   at 
> org.apache.spark.sql.avro.AvroSerdeSuite.$anonfun$new$5$adapted(AvroSerdeSuite.scala:67)
>   at 
> org.apache.spark.sql.avro.AvroSerdeSuite.$anonfun$withFieldMatchType$2(AvroSerdeSuite.scala:217)
> ```



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38093) Set shuffleMergeAllowed to false for a determinate stage after the stage is finalized

2022-02-02 Thread Venkata krishnan Sowrirajan (Jira)
Venkata krishnan Sowrirajan created SPARK-38093:
---

 Summary: Set shuffleMergeAllowed to false for a determinate stage 
after the stage is finalized
 Key: SPARK-38093
 URL: https://issues.apache.org/jira/browse/SPARK-38093
 Project: Spark
  Issue Type: Sub-task
  Components: Shuffle
Affects Versions: 3.2.1
Reporter: Venkata krishnan Sowrirajan


Currently we are setting shuffleMergeAllowed to false before 
prepareShuffleServicesForShuffleMapStage if the shuffle dependency is already 
finalized. Ideally, it would be better to do it right after shuffle dependency 
finalization for a determinate stage. cc [~mridulm80]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38092) Check if shuffleMergeId is the same as the current stage's shuffleMergeId before registering MergeStatus

2022-02-02 Thread Venkata krishnan Sowrirajan (Jira)
Venkata krishnan Sowrirajan created SPARK-38092:
---

 Summary: Check if shuffleMergeId is the same as the current 
stage's shuffleMergeId before registering MergeStatus
 Key: SPARK-38092
 URL: https://issues.apache.org/jira/browse/SPARK-38092
 Project: Spark
  Issue Type: Sub-task
  Components: Shuffle
Affects Versions: 3.2.1
Reporter: Venkata krishnan Sowrirajan


Currently we handle this in handleShuffleMergeFinalized during 
finalization, ensuring the finalize request is indeed for the current stage's 
shuffle dependency shuffleMergeId. The same check has to be done before 
registering merge statuses as well.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38047) Add OUTLIER_NO_FALLBACK executor roll policy

2022-02-02 Thread Alex Holmes (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Holmes updated SPARK-38047:

Description: 
Currently executor rolling will always kill one executor every 
{{{}spark.kubernetes.executor.rollInterval{}}}. This may not be optimal in 
cases where the executor metric isn't an outlier compared to other executors. 
There is a cost associated with killing executors (ramp-up time for new 
executors for example) which applications may not want to incur for non-outlier 
executors.

 

This ticket would add the ability to only kill executors if they are outliers, 
via the introduction of a new roll policy.

  was:
Currently executor rolling will always kill one executor every 
{{{}spark.kubernetes.executor.rollInterval{}}}. For some of the policies this 
may not be optimal in cases where the executor metric isn't an outlier compared 
to other executors. There is a cost associated with killing executors (ramp-up 
time for new executors for example) which applications may not want to incur 
for non-outlier executors.

 

This ticket would add the ability to only kill executors if they are outliers.


> Add OUTLIER_NO_FALLBACK executor roll policy
> 
>
> Key: SPARK-38047
> URL: https://issues.apache.org/jira/browse/SPARK-38047
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Alex Holmes
>Assignee: Alex Holmes
>Priority: Major
> Fix For: 3.3.0
>
>
> Currently executor rolling will always kill one executor every 
> {{{}spark.kubernetes.executor.rollInterval{}}}. This may not be optimal in 
> cases where the executor metric isn't an outlier compared to other executors. 
> There is a cost associated with killing executors (ramp-up time for new 
> executors for example) which applications may not want to incur for 
> non-outlier executors.
>  
> This ticket would add the ability to only kill executors if they are 
> outliers, via the introduction of a new roll policy.
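
A hedged sketch of how such a policy might be selected once implemented. The plugin class and the rollPolicy/rollInterval config keys below are assumptions based on the existing executor rolling feature, and OUTLIER_NO_FALLBACK is the value this ticket proposes; verify the exact names against the Spark docs:

{code:python}
from pyspark.sql import SparkSession

# Illustrative configuration only; key names are assumptions, not confirmed here.
spark = (
    SparkSession.builder
    .config("spark.plugins",
            "org.apache.spark.scheduler.cluster.k8s.ExecutorRollPlugin")
    .config("spark.kubernetes.executor.rollInterval", "300s")
    .config("spark.kubernetes.executor.rollPolicy", "OUTLIER_NO_FALLBACK")
    .getOrCreate()
)
{code}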



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38091) AvroSerializer can cause java.lang.ClassCastException at run time

2022-02-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17486041#comment-17486041
 ] 

Apache Spark commented on SPARK-38091:
--

User 'Zhen-hao' has created a pull request for this issue:
https://github.com/apache/spark/pull/35379

> AvroSerializer can cause java.lang.ClassCastException at run time
> -
>
> Key: SPARK-38091
> URL: https://issues.apache.org/jira/browse/SPARK-38091
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, 
> 3.2.1
>Reporter: Zhenhao Li
>Priority: Major
>  Labels: Avro, serializers
>
> `AvroSerializer`'s implementation, at least in `newConverter`, was not 100% 
> based on the `InternalRow` and `SpecializedGetters` interface. It assumes 
> many implementation details of the interface. 
> For example, in 
> ```scala
>       case (TimestampType, LONG) => avroType.getLogicalType match {
>           // For backward compatibility, if the Avro type is Long and it is 
> not logical type
>           // (the `null` case), output the timestamp value as with 
> millisecond precision.
>           case null | _: TimestampMillis => (getter, ordinal) =>
>             
> DateTimeUtils.microsToMillis(timestampRebaseFunc(getter.getLong(ordinal)))
>           case _: TimestampMicros => (getter, ordinal) =>
>             timestampRebaseFunc(getter.getLong(ordinal))
>           case other => throw new IncompatibleSchemaException(errorPrefix +
>             s"SQL type ${TimestampType.sql} cannot be converted to Avro 
> logical type $other")
>         }
> ```
> it assumes the `InternalRow` instance encodes `TimestampType` as 
> `java.lang.Long`. That's true for `Unsaferow` but not for 
> `GenericInternalRow`. 
> Hence the above code will end up with runtime exceptions when used on an 
> instance of `GenericInternalRow`, which is the case for Python UDF. 
> I didn't get time to dig deeper than that. I got the impression that Spark's 
> optimizer(s) will turn a row into a `UnsafeRow` and Python UDF doesn't 
> involve the optimizer(s) and hence each row is a `GenericInternalRow`. 
> It would be great if someone can correct me or offer a better explanation. 
>  
> To reproduce the issue, 
> `git checkout master` and `git cherry-pick --no-commit` 
> [this-commit|https://github.com/Zhen-hao/spark/commit/1ffe8e8f35273b2f3529f6c2d004822f480e4c88]
> and run the test `org.apache.spark.sql.avro.AvroSerdeSuite`.
>  
> You will see runtime exceptions like the following one
> ```
> - Serialize DecimalType to Avro BYTES with logical type decimal *** FAILED ***
>   java.lang.ClassCastException: class java.math.BigDecimal cannot be cast to 
> class org.apache.spark.sql.types.Decimal (java.math.BigDecimal is in module 
> java.base of loader 'bootstrap'; org.apache.spark.sql.types.Decimal is in 
> unnamed module of loader 'app')
>   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getDecimal(rows.scala:45)
>   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getDecimal$(rows.scala:45)
>   at 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getDecimal(rows.scala:195)
>   at 
> org.apache.spark.sql.avro.AvroSerializer.$anonfun$newConverter$10(AvroSerializer.scala:136)
>   at 
> org.apache.spark.sql.avro.AvroSerializer.$anonfun$newConverter$10$adapted(AvroSerializer.scala:135)
>   at 
> org.apache.spark.sql.avro.AvroSerializer.$anonfun$newStructConverter$2(AvroSerializer.scala:283)
>   at 
> org.apache.spark.sql.avro.AvroSerializer.serialize(AvroSerializer.scala:60)
>   at 
> org.apache.spark.sql.avro.AvroSerdeSuite.$anonfun$new$5(AvroSerdeSuite.scala:82)
>   at 
> org.apache.spark.sql.avro.AvroSerdeSuite.$anonfun$new$5$adapted(AvroSerdeSuite.scala:67)
>   at 
> org.apache.spark.sql.avro.AvroSerdeSuite.$anonfun$withFieldMatchType$2(AvroSerdeSuite.scala:217)
> ```



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38091) AvroSerializer can cause java.lang.ClassCastException at run time

2022-02-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38091:


Assignee: Apache Spark

> AvroSerializer can cause java.lang.ClassCastException at run time
> -
>
> Key: SPARK-38091
> URL: https://issues.apache.org/jira/browse/SPARK-38091
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, 
> 3.2.1
>Reporter: Zhenhao Li
>Assignee: Apache Spark
>Priority: Major
>  Labels: Avro, serializers
>
> `AvroSerializer`'s implementation, at least in `newConverter`, was not 100% 
> based on the `InternalRow` and `SpecializedGetters` interface. It assumes 
> many implementation details of the interface. 
> For example, in 
> ```scala
>       case (TimestampType, LONG) => avroType.getLogicalType match {
>           // For backward compatibility, if the Avro type is Long and it is 
> not logical type
>           // (the `null` case), output the timestamp value as with 
> millisecond precision.
>           case null | _: TimestampMillis => (getter, ordinal) =>
>             
> DateTimeUtils.microsToMillis(timestampRebaseFunc(getter.getLong(ordinal)))
>           case _: TimestampMicros => (getter, ordinal) =>
>             timestampRebaseFunc(getter.getLong(ordinal))
>           case other => throw new IncompatibleSchemaException(errorPrefix +
>             s"SQL type ${TimestampType.sql} cannot be converted to Avro 
> logical type $other")
>         }
> ```
> it assumes the `InternalRow` instance encodes `TimestampType` as 
> `java.lang.Long`. That's true for `Unsaferow` but not for 
> `GenericInternalRow`. 
> Hence the above code will end up with runtime exceptions when used on an 
> instance of `GenericInternalRow`, which is the case for Python UDF. 
> I didn't get time to dig deeper than that. I got the impression that Spark's 
> optimizer(s) will turn a row into a `UnsafeRow` and Python UDF doesn't 
> involve the optimizer(s) and hence each row is a `GenericInternalRow`. 
> It would be great if someone can correct me or offer a better explanation. 
>  
> To reproduce the issue, 
> `git checkout master` and `git cherry-pick --no-commit` 
> [this-commit|https://github.com/Zhen-hao/spark/commit/1ffe8e8f35273b2f3529f6c2d004822f480e4c88]
> and run the test `org.apache.spark.sql.avro.AvroSerdeSuite`.
>  
> You will see runtime exceptions like the following one
> ```
> - Serialize DecimalType to Avro BYTES with logical type decimal *** FAILED ***
>   java.lang.ClassCastException: class java.math.BigDecimal cannot be cast to 
> class org.apache.spark.sql.types.Decimal (java.math.BigDecimal is in module 
> java.base of loader 'bootstrap'; org.apache.spark.sql.types.Decimal is in 
> unnamed module of loader 'app')
>   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getDecimal(rows.scala:45)
>   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getDecimal$(rows.scala:45)
>   at 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getDecimal(rows.scala:195)
>   at 
> org.apache.spark.sql.avro.AvroSerializer.$anonfun$newConverter$10(AvroSerializer.scala:136)
>   at 
> org.apache.spark.sql.avro.AvroSerializer.$anonfun$newConverter$10$adapted(AvroSerializer.scala:135)
>   at 
> org.apache.spark.sql.avro.AvroSerializer.$anonfun$newStructConverter$2(AvroSerializer.scala:283)
>   at 
> org.apache.spark.sql.avro.AvroSerializer.serialize(AvroSerializer.scala:60)
>   at 
> org.apache.spark.sql.avro.AvroSerdeSuite.$anonfun$new$5(AvroSerdeSuite.scala:82)
>   at 
> org.apache.spark.sql.avro.AvroSerdeSuite.$anonfun$new$5$adapted(AvroSerdeSuite.scala:67)
>   at 
> org.apache.spark.sql.avro.AvroSerdeSuite.$anonfun$withFieldMatchType$2(AvroSerdeSuite.scala:217)
> ```



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38091) AvroSerializer can cause java.lang.ClassCastException at run time

2022-02-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38091:


Assignee: (was: Apache Spark)

> AvroSerializer can cause java.lang.ClassCastException at run time
> -
>
> Key: SPARK-38091
> URL: https://issues.apache.org/jira/browse/SPARK-38091
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, 
> 3.2.1
>Reporter: Zhenhao Li
>Priority: Major
>  Labels: Avro, serializers
>
> `AvroSerializer`'s implementation, at least in `newConverter`, was not 100% 
> based on the `InternalRow` and `SpecializedGetters` interface. It assumes 
> many implementation details of the interface. 
> For example, in 
> ```scala
>       case (TimestampType, LONG) => avroType.getLogicalType match {
>           // For backward compatibility, if the Avro type is Long and it is 
> not logical type
>           // (the `null` case), output the timestamp value as with 
> millisecond precision.
>           case null | _: TimestampMillis => (getter, ordinal) =>
>             
> DateTimeUtils.microsToMillis(timestampRebaseFunc(getter.getLong(ordinal)))
>           case _: TimestampMicros => (getter, ordinal) =>
>             timestampRebaseFunc(getter.getLong(ordinal))
>           case other => throw new IncompatibleSchemaException(errorPrefix +
>             s"SQL type ${TimestampType.sql} cannot be converted to Avro 
> logical type $other")
>         }
> ```
> it assumes the `InternalRow` instance encodes `TimestampType` as 
> `java.lang.Long`. That's true for `Unsaferow` but not for 
> `GenericInternalRow`. 
> Hence the above code will end up with runtime exceptions when used on an 
> instance of `GenericInternalRow`, which is the case for Python UDF. 
> I didn't get time to dig deeper than that. I got the impression that Spark's 
> optimizer(s) will turn a row into a `UnsafeRow` and Python UDF doesn't 
> involve the optimizer(s) and hence each row is a `GenericInternalRow`. 
> It would be great if someone can correct me or offer a better explanation. 
>  
> To reproduce the issue, 
> `git checkout master` and `git cherry-pick --no-commit` 
> [this-commit|https://github.com/Zhen-hao/spark/commit/1ffe8e8f35273b2f3529f6c2d004822f480e4c88]
> and run the test `org.apache.spark.sql.avro.AvroSerdeSuite`.
>  
> You will see runtime exceptions like the following one
> ```
> - Serialize DecimalType to Avro BYTES with logical type decimal *** FAILED ***
>   java.lang.ClassCastException: class java.math.BigDecimal cannot be cast to 
> class org.apache.spark.sql.types.Decimal (java.math.BigDecimal is in module 
> java.base of loader 'bootstrap'; org.apache.spark.sql.types.Decimal is in 
> unnamed module of loader 'app')
>   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getDecimal(rows.scala:45)
>   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getDecimal$(rows.scala:45)
>   at 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getDecimal(rows.scala:195)
>   at 
> org.apache.spark.sql.avro.AvroSerializer.$anonfun$newConverter$10(AvroSerializer.scala:136)
>   at 
> org.apache.spark.sql.avro.AvroSerializer.$anonfun$newConverter$10$adapted(AvroSerializer.scala:135)
>   at 
> org.apache.spark.sql.avro.AvroSerializer.$anonfun$newStructConverter$2(AvroSerializer.scala:283)
>   at 
> org.apache.spark.sql.avro.AvroSerializer.serialize(AvroSerializer.scala:60)
>   at 
> org.apache.spark.sql.avro.AvroSerdeSuite.$anonfun$new$5(AvroSerdeSuite.scala:82)
>   at 
> org.apache.spark.sql.avro.AvroSerdeSuite.$anonfun$new$5$adapted(AvroSerdeSuite.scala:67)
>   at 
> org.apache.spark.sql.avro.AvroSerdeSuite.$anonfun$withFieldMatchType$2(AvroSerdeSuite.scala:217)
> ```



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38091) AvroSerializer can cause java.lang.ClassCastException at run time

2022-02-02 Thread Zhenhao Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17486036#comment-17486036
 ] 

Zhenhao Li commented on SPARK-38091:


Can someone tell me how to let Jira render markdown? 

> AvroSerializer can cause java.lang.ClassCastException at run time
> -
>
> Key: SPARK-38091
> URL: https://issues.apache.org/jira/browse/SPARK-38091
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, 
> 3.2.1
>Reporter: Zhenhao Li
>Priority: Major
>  Labels: Avro, serializers
>
> `AvroSerializer`'s implementation, at least in `newConverter`, was not 100% 
> based on the `InternalRow` and `SpecializedGetters` interface. It assumes 
> many implementation details of the interface. 
> For example, in 
> ```scala
>       case (TimestampType, LONG) => avroType.getLogicalType match {
>           // For backward compatibility, if the Avro type is Long and it is 
> not logical type
>           // (the `null` case), output the timestamp value as with 
> millisecond precision.
>           case null | _: TimestampMillis => (getter, ordinal) =>
>             
> DateTimeUtils.microsToMillis(timestampRebaseFunc(getter.getLong(ordinal)))
>           case _: TimestampMicros => (getter, ordinal) =>
>             timestampRebaseFunc(getter.getLong(ordinal))
>           case other => throw new IncompatibleSchemaException(errorPrefix +
>             s"SQL type ${TimestampType.sql} cannot be converted to Avro 
> logical type $other")
>         }
> ```
> it assumes the `InternalRow` instance encodes `TimestampType` as 
> `java.lang.Long`. That's true for `Unsaferow` but not for 
> `GenericInternalRow`. 
> Hence the above code will end up with runtime exceptions when used on an 
> instance of `GenericInternalRow`, which is the case for Python UDF. 
> I didn't get time to dig deeper than that. I got the impression that Spark's 
> optimizer(s) will turn a row into a `UnsafeRow` and Python UDF doesn't 
> involve the optimizer(s) and hence each row is a `GenericInternalRow`. 
> It would be great if someone can correct me or offer a better explanation. 
>  
> To reproduce the issue, 
> `git checkout master` and `git cherry-pick --no-commit` 
> [this-commit|https://github.com/Zhen-hao/spark/commit/1ffe8e8f35273b2f3529f6c2d004822f480e4c88]
> and run the test `org.apache.spark.sql.avro.AvroSerdeSuite`.
>  
> You will see runtime exceptions like the following one
> ```
> - Serialize DecimalType to Avro BYTES with logical type decimal *** FAILED ***
>   java.lang.ClassCastException: class java.math.BigDecimal cannot be cast to 
> class org.apache.spark.sql.types.Decimal (java.math.BigDecimal is in module 
> java.base of loader 'bootstrap'; org.apache.spark.sql.types.Decimal is in 
> unnamed module of loader 'app')
>   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getDecimal(rows.scala:45)
>   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getDecimal$(rows.scala:45)
>   at 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getDecimal(rows.scala:195)
>   at 
> org.apache.spark.sql.avro.AvroSerializer.$anonfun$newConverter$10(AvroSerializer.scala:136)
>   at 
> org.apache.spark.sql.avro.AvroSerializer.$anonfun$newConverter$10$adapted(AvroSerializer.scala:135)
>   at 
> org.apache.spark.sql.avro.AvroSerializer.$anonfun$newStructConverter$2(AvroSerializer.scala:283)
>   at 
> org.apache.spark.sql.avro.AvroSerializer.serialize(AvroSerializer.scala:60)
>   at 
> org.apache.spark.sql.avro.AvroSerdeSuite.$anonfun$new$5(AvroSerdeSuite.scala:82)
>   at 
> org.apache.spark.sql.avro.AvroSerdeSuite.$anonfun$new$5$adapted(AvroSerdeSuite.scala:67)
>   at 
> org.apache.spark.sql.avro.AvroSerdeSuite.$anonfun$withFieldMatchType$2(AvroSerdeSuite.scala:217)
> ```



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38091) AvroSerializer can cause java.lang.ClassCastException at run time

2022-02-02 Thread Zhenhao Li (Jira)
Zhenhao Li created SPARK-38091:
--

 Summary: AvroSerializer can cause java.lang.ClassCastException at 
run time
 Key: SPARK-38091
 URL: https://issues.apache.org/jira/browse/SPARK-38091
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.1, 3.2.0, 3.1.2, 3.1.1, 3.1.0, 3.0.3, 3.0.2, 3.0.1, 
3.0.0
Reporter: Zhenhao Li


`AvroSerializer`'s implementation, at least in `newConverter`, was not 100% 
based on the `InternalRow` and `SpecializedGetters` interface. It assumes many 
implementation details of the interface. 

For example, in 

```scala
      case (TimestampType, LONG) => avroType.getLogicalType match {
          // For backward compatibility, if the Avro type is Long and it is not 
logical type
          // (the `null` case), output the timestamp value as with millisecond 
precision.
          case null | _: TimestampMillis => (getter, ordinal) =>
            
DateTimeUtils.microsToMillis(timestampRebaseFunc(getter.getLong(ordinal)))
          case _: TimestampMicros => (getter, ordinal) =>
            timestampRebaseFunc(getter.getLong(ordinal))
          case other => throw new IncompatibleSchemaException(errorPrefix +
            s"SQL type ${TimestampType.sql} cannot be converted to Avro logical 
type $other")
        }
```

it assumes the `InternalRow` instance encodes `TimestampType` as 
`java.lang.Long`. That's true for `UnsafeRow` but not for `GenericInternalRow`. 

Hence the above code will end up with runtime exceptions when used on an 
instance of `GenericInternalRow`, which is the case for Python UDF. 

I didn't get time to dig deeper than that. I got the impression that Spark's 
optimizer(s) will turn a row into an `UnsafeRow`, whereas the Python UDF path doesn't 
involve the optimizer(s), and hence each row is a `GenericInternalRow`. 

It would be great if someone can correct me or offer a better explanation. 

 

To reproduce the issue, 

`git checkout master` and `git cherry-pick --no-commit` 
[this-commit|https://github.com/Zhen-hao/spark/commit/1ffe8e8f35273b2f3529f6c2d004822f480e4c88]

and run the test `org.apache.spark.sql.avro.AvroSerdeSuite`.

 

You will see runtime exceptions like the following one

```

- Serialize DecimalType to Avro BYTES with logical type decimal *** FAILED ***
  java.lang.ClassCastException: class java.math.BigDecimal cannot be cast to 
class org.apache.spark.sql.types.Decimal (java.math.BigDecimal is in module 
java.base of loader 'bootstrap'; org.apache.spark.sql.types.Decimal is in 
unnamed module of loader 'app')
  at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getDecimal(rows.scala:45)
  at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getDecimal$(rows.scala:45)
  at 
org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getDecimal(rows.scala:195)
  at 
org.apache.spark.sql.avro.AvroSerializer.$anonfun$newConverter$10(AvroSerializer.scala:136)
  at 
org.apache.spark.sql.avro.AvroSerializer.$anonfun$newConverter$10$adapted(AvroSerializer.scala:135)
  at 
org.apache.spark.sql.avro.AvroSerializer.$anonfun$newStructConverter$2(AvroSerializer.scala:283)
  at org.apache.spark.sql.avro.AvroSerializer.serialize(AvroSerializer.scala:60)
  at 
org.apache.spark.sql.avro.AvroSerdeSuite.$anonfun$new$5(AvroSerdeSuite.scala:82)
  at 
org.apache.spark.sql.avro.AvroSerdeSuite.$anonfun$new$5$adapted(AvroSerdeSuite.scala:67)
  at 
org.apache.spark.sql.avro.AvroSerdeSuite.$anonfun$withFieldMatchType$2(AvroSerdeSuite.scala:217)
```
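
A hedged PySpark sketch of the user-facing scenario described above (a Python UDF feeding the Avro writer). It assumes the spark-avro package is on the classpath, and whether it actually fails depends on whether the plan materializes UnsafeRow before AvroSerializer runs; the authoritative reproduction remains the AvroSerdeSuite commit linked above:

{code:python}
from decimal import Decimal

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import DecimalType

# Requires the Avro data source, e.g.
#   spark-submit --packages org.apache.spark:spark-avro_2.12:3.2.1 ...
spark = SparkSession.builder.getOrCreate()

# A Python UDF producing DecimalType values, so rows may reach the writer as
# GenericInternalRow rather than UnsafeRow (the situation described above).
to_dec = udf(lambda x: Decimal(x) / 100, DecimalType(10, 2))

df = spark.range(5).withColumn("amount", to_dec("id"))

# Illustrative output path.
df.write.format("avro").mode("overwrite").save("/tmp/avro-serde-repro")
{code}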



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38089) Improve assertion failure message in TestUtils.assertExceptionMsg

2022-02-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38089:


Assignee: (was: Apache Spark)

> Improve assertion failure message in TestUtils.assertExceptionMsg
> -
>
> Key: SPARK-38089
> URL: https://issues.apache.org/jira/browse/SPARK-38089
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Tests
>Affects Versions: 3.2.1
>Reporter: Erik Krogen
>Priority: Major
>
> {{TestUtils.assertExceptionMsg}} is great, but when the assertion _doesn't_ 
> match, it can be challenging to tell why, because the exception tree that was 
> searched isn't printed. Only way I could find to fix it up was to run things 
> in a debugger and check the exception tree.
> It would be very helpful if {{assertExceptionMsg}} printed out the exception 
> tree in which it was searching (upon failure).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38089) Improve assertion failure message in TestUtils.assertExceptionMsg

2022-02-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17486025#comment-17486025
 ] 

Apache Spark commented on SPARK-38089:
--

User 'xkrogen' has created a pull request for this issue:
https://github.com/apache/spark/pull/35383

> Improve assertion failure message in TestUtils.assertExceptionMsg
> -
>
> Key: SPARK-38089
> URL: https://issues.apache.org/jira/browse/SPARK-38089
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Tests
>Affects Versions: 3.2.1
>Reporter: Erik Krogen
>Priority: Major
>
> {{TestUtils.assertExceptionMsg}} is great, but when the assertion _doesn't_ 
> match, it can be challenging to tell why, because the exception tree that was 
> searched isn't printed. Only way I could find to fix it up was to run things 
> in a debugger and check the exception tree.
> It would be very helpful if {{assertExceptionMsg}} printed out the exception 
> tree in which it was searching (upon failure).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38089) Improve assertion failure message in TestUtils.assertExceptionMsg

2022-02-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17486027#comment-17486027
 ] 

Apache Spark commented on SPARK-38089:
--

User 'xkrogen' has created a pull request for this issue:
https://github.com/apache/spark/pull/35383

> Improve assertion failure message in TestUtils.assertExceptionMsg
> -
>
> Key: SPARK-38089
> URL: https://issues.apache.org/jira/browse/SPARK-38089
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Tests
>Affects Versions: 3.2.1
>Reporter: Erik Krogen
>Priority: Major
>
> {{TestUtils.assertExceptionMsg}} is great, but when the assertion _doesn't_ 
> match, it can be challenging to tell why, because the exception tree that was 
> searched isn't printed. Only way I could find to fix it up was to run things 
> in a debugger and check the exception tree.
> It would be very helpful if {{assertExceptionMsg}} printed out the exception 
> tree in which it was searching (upon failure).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38089) Improve assertion failure message in TestUtils.assertExceptionMsg

2022-02-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38089:


Assignee: Apache Spark

> Improve assertion failure message in TestUtils.assertExceptionMsg
> -
>
> Key: SPARK-38089
> URL: https://issues.apache.org/jira/browse/SPARK-38089
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Tests
>Affects Versions: 3.2.1
>Reporter: Erik Krogen
>Assignee: Apache Spark
>Priority: Major
>
> {{TestUtils.assertExceptionMsg}} is great, but when the assertion _doesn't_ 
> match, it can be challenging to tell why, because the exception tree that was 
> searched isn't printed. Only way I could find to fix it up was to run things 
> in a debugger and check the exception tree.
> It would be very helpful if {{assertExceptionMsg}} printed out the exception 
> tree in which it was searching (upon failure).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38090) Make links for stderr/stdout on Spark on Kube configurable

2022-02-02 Thread Holden Karau (Jira)
Holden Karau created SPARK-38090:


 Summary: Make links for stderr/stdout on Spark on Kube configurable
 Key: SPARK-38090
 URL: https://issues.apache.org/jira/browse/SPARK-38090
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 3.3.0, 3.2.2
Reporter: Holden Karau
Assignee: Holden Karau


Unlike YARN, different Kubernetes clusters store pod logs in different locations. We should 
allow people to configure the links so that they can go to a web UI for their 
cluster's stderr/stdout, or print out the kubectl commands for users who don't 
have a link configured.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38089) Improve assertion failure message in TestUtils.assertExceptionMsg

2022-02-02 Thread Erik Krogen (Jira)
Erik Krogen created SPARK-38089:
---

 Summary: Improve assertion failure message in 
TestUtils.assertExceptionMsg
 Key: SPARK-38089
 URL: https://issues.apache.org/jira/browse/SPARK-38089
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, Tests
Affects Versions: 3.2.1
Reporter: Erik Krogen


{{TestUtils.assertExceptionMsg}} is great, but when the assertion _doesn't_ 
match, it can be challenging to tell why, because the exception tree that was 
searched isn't printed. Only way I could find to fix it up was to run things in 
a debugger and check the exception tree.
It would be very helpful if {{assertExceptionMsg}} printed out the exception 
tree in which it was searching (upon failure).
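
Not Spark's Scala TestUtils, but a small Python sketch of the idea being requested: search the cause chain for the expected message and, on failure, include the whole exception tree in the assertion error:

{code:python}
import traceback

def assert_exception_msg(exc: BaseException, expected: str) -> None:
    """Assert that `expected` occurs in `exc` or any exception in its cause chain."""
    cause = exc
    while cause is not None:
        if expected in str(cause):
            return
        cause = cause.__cause__ or cause.__context__
    # On failure, render the full exception tree so the caller can see what
    # was actually searched -- the improvement this ticket asks for.
    tree = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    raise AssertionError(
        f"Did not find {expected!r} in the exception tree:\n{tree}"
    )
{code}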



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37145) Improvement for extending pod feature steps with KubernetesConf

2022-02-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-37145.
---
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35345
[https://github.com/apache/spark/pull/35345]

> Improvement for extending pod feature steps with KubernetesConf
> ---
>
> Key: SPARK-37145
> URL: https://issues.apache.org/jira/browse/SPARK-37145
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.2.0
>Reporter: wangxin201492
>Assignee: Yikun Jiang
>Priority: Major
> Fix For: 3.3.0
>
>
> SPARK-33261 provides us with great convenience, but it only constructs a 
> `KubernetesFeatureConfigStep` with an empty constructor.
> It would be better to support a constructor that takes `KubernetesConf` (or 
> more specifically `KubernetesDriverConf` and `KubernetesExecutorConf`).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37145) Improvement for extending pod feature steps with KubernetesConf

2022-02-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-37145:
-

Assignee: Yikun Jiang

> Improvement for extending pod feature steps with KubernetesConf
> ---
>
> Key: SPARK-37145
> URL: https://issues.apache.org/jira/browse/SPARK-37145
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.2.0
>Reporter: wangxin201492
>Assignee: Yikun Jiang
>Priority: Major
>
> SPARK-33261 provides us with great convenience, but it only constructs a 
> `KubernetesFeatureConfigStep` with an empty constructor.
> It would be better to support a constructor that takes `KubernetesConf` (or 
> more specifically `KubernetesDriverConf` and `KubernetesExecutorConf`).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37145) Add KubernetesCustom[Driver/Executor]FeatureConfigStep developer API

2022-02-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-37145:
--
Summary: Add KubernetesCustom[Driver/Executor]FeatureConfigStep developer 
API  (was: Improvement for extending pod feature steps with KubernetesConf)

> Add KubernetesCustom[Driver/Executor]FeatureConfigStep developer API
> 
>
> Key: SPARK-37145
> URL: https://issues.apache.org/jira/browse/SPARK-37145
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.2.0
>Reporter: wangxin201492
>Assignee: Yikun Jiang
>Priority: Major
> Fix For: 3.3.0
>
>
> SPARK-33261 provides us with great convenience, but it only constructs a 
> `KubernetesFeatureConfigStep` with an empty constructor.
> It would be better to also support a constructor that takes a `KubernetesConf` 
> (or, more specifically, a `KubernetesDriverConf` or `KubernetesExecutorConf`).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37771) Race condition in withHiveState and limited logic in IsolatedClientLoader result in ClassNotFoundException

2022-02-02 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17485973#comment-17485973
 ] 

Steve Loughran commented on SPARK-37771:


[~ivan.sadikov] -any update here?

> Race condition in withHiveState and limited logic in IsolatedClientLoader 
> result in ClassNotFoundException
> --
>
> Key: SPARK-37771
> URL: https://issues.apache.org/jira/browse/SPARK-37771
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0, 3.1.2, 3.2.0
>Reporter: Ivan Sadikov
>Priority: Major
>
> There is a race condition between creating a Hive client and loading classes 
> that do not appear in the shared prefixes config. For example, we confirmed 
> that the code fails for the following configuration:
> {code:java}
> spark.sql.hive.metastore.version 0.13.0
> spark.sql.hive.metastore.jars maven
> spark.sql.hive.metastore.sharedPrefixes  com.amazonaws prefix>
> spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem{code}
> And code: 
> {code:java}
> -- Prerequisite commands to set up the table
> -- drop table if exists ivan_test_2;
> -- create table ivan_test_2 (a int, part string) using csv location 
> 's3://bucket/hive-test' partitioned by (part);
> -- insert into ivan_test_2 values (1, 'a'); 
> -- Command that triggers failure
> ALTER TABLE ivan_test_2 ADD PARTITION (part='b') LOCATION 
> 's3://bucket/hive-test'{code}
>  
> Stacktrace (line numbers might differ):
> {code:java}
> 21/12/22 04:37:05 DEBUG IsolatedClientLoader: shared class: 
> org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider
> 21/12/22 04:37:05 DEBUG IsolatedClientLoader: shared class: 
> org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider
> 21/12/22 04:37:05 DEBUG IsolatedClientLoader: hive class: 
> com.amazonaws.auth.EnvironmentVariableCredentialsProvider - null
> 21/12/22 04:37:05 ERROR S3AFileSystem: Failed to initialize S3AFileSystem for 
> path s3://bucket/hive-test
> java.io.IOException: From option fs.s3a.aws.credentials.provider 
> java.lang.ClassNotFoundException: Class 
> com.amazonaws.auth.EnvironmentVariableCredentialsProvider not found
>     at 
> org.apache.hadoop.fs.s3a.S3AUtils.loadAWSProviderClasses(S3AUtils.java:725)
>     at 
> org.apache.hadoop.fs.s3a.S3AUtils.createAWSCredentialProviderSet(S3AUtils.java:688)
>     at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:411)
>     at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3469)
>     at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
>     at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
>     at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
>     at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
>     at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
>     at org.apache.hadoop.hive.metastore.Warehouse.getFs(Warehouse.java:112)
>     at 
> org.apache.hadoop.hive.metastore.Warehouse.getDnsPath(Warehouse.java:144)
>     at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createLocationForAddedPartition(HiveMetaStore.java:1993)
>     at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.add_partitions_core(HiveMetaStore.java:1865)
>     at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.add_partitions_req(HiveMetaStore.java:1910)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at 
> org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:105)
>     at com.sun.proxy.$Proxy58.add_partitions_req(Unknown Source)
>     at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.add_partitions(HiveMetaStoreClient.java:457)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89)
>     at com.sun.proxy.$Proxy59.add_partitions(Unknown Source)
>     at 
> org.apache.hadoop.hive.ql.metadata.Hive.createPartitions(Hive.java:1514)
>     at 
> org.apache.spark.sql.hive.client.Shim_v0_13.createPartitions(HiveShim.scala:773)
>     at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$createPartitions$1(HiveClientImpl.scala:683)
>     at 

[jira] [Commented] (SPARK-28090) Spark hangs when an execution plan has many projections on nested structs

2022-02-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17485917#comment-17485917
 ] 

Apache Spark commented on SPARK-28090:
--

User 'peter-toth' has created a pull request for this issue:
https://github.com/apache/spark/pull/35382

> Spark hangs when an execution plan has many projections on nested structs
> -
>
> Key: SPARK-28090
> URL: https://issues.apache.org/jira/browse/SPARK-28090
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.4.3
> Environment: Tried in
>  * Spark 2.2.1, Spark 2.4.3 in local mode on Linux, MasOS and Windows
>  * Spark 2.4.3 / Yarn on a Linux cluster
>Reporter: Ruslan Yushchenko
>Priority: Major
>  Labels: bulk-closed
>
> This was already posted (#28016), but the provided example didn't always 
> reproduce the error. This example consistently reproduces the issue.
> Spark applications freeze at the execution plan optimization stage (Catalyst) 
> when a logical execution plan contains a lot of projections that operate on 
> nested struct fields.
> The code listed below demonstrates the issue.
> To reproduce, the Spark app does the following:
>  * A small dataframe is created from a JSON example.
>  * Several nested transformations (negation of a number) are applied to 
> struct fields, and each time a new struct field is created. 
>  * Once more than 9 such transformations are applied, the Catalyst optimizer 
> freezes while optimizing the execution plan.
>  * You can control the freezing by choosing a different upper bound for the 
> Range. E.g. it will work fine if the upper bound is 5, but will hang if the 
> bound is 10.
> {code:java}
> package com.example
> import org.apache.spark.sql._
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.types.{StructField, StructType}
> import scala.collection.mutable.ListBuffer
> object SparkApp1IssueSelfContained {
>   // A sample data for a dataframe with nested structs
>   val sample: List[String] =
> """ { "numerics": {"num1": 101, "num2": 102, "num3": 103, "num4": 104, 
> "num5": 105, "num6": 106, "num7": 107, "num8": 108, "num9": 109, "num10": 
> 110, "num11": 111, "num12": 112, "num13": 113, "num14": 114, "num15": 115} } 
> """ ::
> """ { "numerics": {"num1": 201, "num2": 202, "num3": 203, "num4": 204, 
> "num5": 205, "num6": 206, "num7": 207, "num8": 208, "num9": 209, "num10": 
> 210, "num11": 211, "num12": 212, "num13": 213, "num14": 214, "num15": 215} } 
> """ ::
> """ { "numerics": {"num1": 301, "num2": 302, "num3": 303, "num4": 304, 
> "num5": 305, "num6": 306, "num7": 307, "num8": 308, "num9": 309, "num10": 
> 310, "num11": 311, "num12": 312, "num13": 313, "num14": 314, "num15": 315} } 
> """ ::
> Nil
>   /**
> * Transforms a column inside a nested struct. The transformed value will 
> be put into a new field of that nested struct
> *
> * The output column name can omit the full path as the field will be 
> created at the same level of nesting as the input column.
> *
> * @param inputColumnName  A column name for which to apply the 
> transformation, e.g. `company.employee.firstName`.
> * @param outputColumnName The output column name. The path is optional, 
> e.g. you can use `transformedName` instead of 
> `company.employee.transformedName`.
> * @param expression   A function that applies a transformation to a 
> column as a Spark expression.
> * @return A dataframe with a new field that contains transformed values.
> */
>   def transformInsideNestedStruct(df: DataFrame,
>   inputColumnName: String,
>   outputColumnName: String,
>   expression: Column => Column): DataFrame = {
> def mapStruct(schema: StructType, path: Seq[String], parentColumn: 
> Option[Column] = None): Seq[Column] = {
>   val mappedFields = new ListBuffer[Column]()
>   def handleMatchedLeaf(field: StructField, curColumn: Column): 
> Seq[Column] = {
> val newColumn = expression(curColumn).as(outputColumnName)
> mappedFields += newColumn
> Seq(curColumn)
>   }
>   def handleMatchedNonLeaf(field: StructField, curColumn: Column): 
> Seq[Column] = {
> // Non-leaf columns need to be further processed recursively
> field.dataType match {
>   case dt: StructType => Seq(struct(mapStruct(dt, path.tail, 
> Some(curColumn)): _*).as(field.name))
>   case _ => throw new IllegalArgumentException(s"Field 
> '${field.name}' is not a struct type.")
> }
>   }
>   val fieldName = path.head
>   val isLeaf = path.lengthCompare(2) < 0
>   val newColumns = schema.fields.flatMap(field => {
> // 

[jira] [Updated] (SPARK-38088) Kryo DataWritingSparkTaskResult registration error

2022-02-02 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-38088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michał Wieleba updated SPARK-38088:
---
Attachment: (was: image-2022-02-02-10-08-51-144.png)

> Kryo DataWritingSparkTaskResult registration error
> --
>
> Key: SPARK-38088
> URL: https://issues.apache.org/jira/browse/SPARK-38088
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.2
>Reporter: Michał Wieleba
>Priority: Major
> Attachments: image-2022-02-02-10-09-14-858.png
>
>
> Spark 3.1.2, Scala 2.12
> I'm registering classes with _sparkConf.registerKryoClasses(Array( ..._ 
> method. Inside there are spark structured streaming code. Following settings 
> are added as well:
> sparkConf.set("spark.serializer", 
> "org.apache.spark.serializer.KryoSerializer")
> sparkConf.set("spark.kryo.registrationRequired", "true")
> Unfortunately, during execution following error is thrown:
> Caused by: java.lang.IllegalArgumentException: Class is not registered: 
> org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult
> Note: To register this class use: 
> kryo.register(org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult.class);
>  
> As far as I can see in 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala]
>  
> class 
> org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult is 
> private (private[v2] case class DataWritingSparkTaskResult) therefore not 
> available to register.
>  
> !image-2022-02-02-10-09-14-858.png!
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38088) Kryo DataWritingSparkTaskResult registration error

2022-02-02 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-38088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michał Wieleba updated SPARK-38088:
---
Description: 
Spark 3.1.2, Scala 2.12

I'm registering classes with the _sparkConf.registerKryoClasses(Array( ..._ method. 
The job contains Spark Structured Streaming code. The following settings are added 
as well:

sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
sparkConf.set("spark.kryo.registrationRequired", "true")

Unfortunately, during execution the following error is thrown:

Caused by: java.lang.IllegalArgumentException: Class is not registered: 
org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult
Note: To register this class use: 
kryo.register(org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult.class);

As far as I can see in 
[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala]
the class org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult 
is private (private[v2] case class DataWritingSparkTaskResult) and therefore not 
available to register.

 

!image-2022-02-02-10-09-14-858.png!

 

  was:
Spark 3.1.2, Scala 2.12

I'm registering classes with _sparkConf.registerKryoClasses(Array( ..._ method. 
Inside there are spark structured streaming code. Following settings are added 
as well:

sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
sparkConf.set("spark.kryo.registrationRequired", "true")

Unfortunately, during execution following error is thrown:

Caused by: java.lang.IllegalArgumentException: Class is not registered: 
org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult
Note: To register this class use: 
kryo.register(org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult.class);
 

As far as I can see in 
[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala]
 

class org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult 
is private (private[v2] case class DataWritingSparkTaskResult) therefore not 
available to register.

 

!image-2022-02-02-10-08-51-144.png!

 


> Kryo DataWritingSparkTaskResult registration error
> --
>
> Key: SPARK-38088
> URL: https://issues.apache.org/jira/browse/SPARK-38088
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.2
>Reporter: Michał Wieleba
>Priority: Major
> Attachments: image-2022-02-02-10-08-51-144.png, 
> image-2022-02-02-10-09-14-858.png
>
>
> Spark 3.1.2, Scala 2.12
> I'm registering classes with _sparkConf.registerKryoClasses(Array( ..._ 
> method. Inside there are spark structured streaming code. Following settings 
> are added as well:
> sparkConf.set("spark.serializer", 
> "org.apache.spark.serializer.KryoSerializer")
> sparkConf.set("spark.kryo.registrationRequired", "true")
> Unfortunately, during execution following error is thrown:
> Caused by: java.lang.IllegalArgumentException: Class is not registered: 
> org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult
> Note: To register this class use: 
> kryo.register(org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult.class);
>  
> As far as I can see in 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala]
>  
> class 
> org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult is 
> private (private[v2] case class DataWritingSparkTaskResult) therefore not 
> available to register.
>  
> !image-2022-02-02-10-09-14-858.png!
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38088) Kryo DataWritingSparkTaskResult registration error

2022-02-02 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-38088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michał Wieleba updated SPARK-38088:
---
Attachment: image-2022-02-02-10-09-14-858.png

> Kryo DataWritingSparkTaskResult registration error
> --
>
> Key: SPARK-38088
> URL: https://issues.apache.org/jira/browse/SPARK-38088
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.2
>Reporter: Michał Wieleba
>Priority: Major
> Attachments: image-2022-02-02-10-08-51-144.png, 
> image-2022-02-02-10-09-14-858.png
>
>
> Spark 3.1.2, Scala 2.12
> I'm registering classes with _sparkConf.registerKryoClasses(Array( ..._ 
> method. Inside there are spark structured streaming code. Following settings 
> are added as well:
> sparkConf.set("spark.serializer", 
> "org.apache.spark.serializer.KryoSerializer")
> sparkConf.set("spark.kryo.registrationRequired", "true")
> Unfortunately, during execution following error is thrown:
> Caused by: java.lang.IllegalArgumentException: Class is not registered: 
> org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult
> Note: To register this class use: 
> kryo.register(org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult.class);
>  
> As far as I can see in 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala]
>  
> class 
> org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult is 
> private (private[v2] case class DataWritingSparkTaskResult) therefore not 
> available to register.
>  
> !image-2022-02-02-10-08-51-144.png!
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38088) Kryo DataWritingSparkTaskResult registration error

2022-02-02 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-38088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michał Wieleba updated SPARK-38088:
---
Attachment: image-2022-02-02-10-08-51-144.png

> Kryo DataWritingSparkTaskResult registration error
> --
>
> Key: SPARK-38088
> URL: https://issues.apache.org/jira/browse/SPARK-38088
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.2
>Reporter: Michał Wieleba
>Priority: Major
> Attachments: image-2022-02-02-10-08-51-144.png
>
>
> Spark 3.1.2, Scala 2.12
> I'm registering classes with _sparkConf.registerKryoClasses(Array( ..._ 
> method. Inside there are spark structured streaming code. Following settings 
> are added as well:
> sparkConf.set("spark.serializer", 
> "org.apache.spark.serializer.KryoSerializer")
> sparkConf.set("spark.kryo.registrationRequired", "true")
> Unfortunately, during execution following error is thrown:
> Caused by: java.lang.IllegalArgumentException: Class is not registered: 
> org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult
> Note: To register this class use: 
> kryo.register(org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult.class);
>  
> As far as I can see in 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala]
>  
> class 
> org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult is 
> private (private[v2] case class DataWritingSparkTaskResult) therefore not 
> available to register.
>  
> !image-2022-02-02-10-06-32-342.png!
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38088) Kryo DataWritingSparkTaskResult registration error

2022-02-02 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-38088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michał Wieleba updated SPARK-38088:
---
Description: 
Spark 3.1.2, Scala 2.12

I'm registering classes with the _sparkConf.registerKryoClasses(Array( ..._ method. 
The job contains Spark Structured Streaming code. The following settings are added 
as well:

sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
sparkConf.set("spark.kryo.registrationRequired", "true")

Unfortunately, during execution the following error is thrown:

Caused by: java.lang.IllegalArgumentException: Class is not registered: 
org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult
Note: To register this class use: 
kryo.register(org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult.class);

As far as I can see in 
[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala]
the class org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult 
is private (private[v2] case class DataWritingSparkTaskResult) and therefore not 
available to register.

 

!image-2022-02-02-10-08-51-144.png!

 

  was:
Spark 3.1.2, Scala 2.12

I'm registering classes with _sparkConf.registerKryoClasses(Array( ..._ method. 
Inside there are spark structured streaming code. Following settings are added 
as well:

sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
sparkConf.set("spark.kryo.registrationRequired", "true")

Unfortunately, during execution following error is thrown:

Caused by: java.lang.IllegalArgumentException: Class is not registered: 
org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult
Note: To register this class use: 
kryo.register(org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult.class);
 

As far as I can see in 
[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala]
 

class org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult 
is private (private[v2] case class DataWritingSparkTaskResult) therefore not 
available to register.

 

!image-2022-02-02-10-06-32-342.png!

 


> Kryo DataWritingSparkTaskResult registration error
> --
>
> Key: SPARK-38088
> URL: https://issues.apache.org/jira/browse/SPARK-38088
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.2
>Reporter: Michał Wieleba
>Priority: Major
> Attachments: image-2022-02-02-10-08-51-144.png
>
>
> Spark 3.1.2, Scala 2.12
> I'm registering classes with _sparkConf.registerKryoClasses(Array( ..._ 
> method. Inside there are spark structured streaming code. Following settings 
> are added as well:
> sparkConf.set("spark.serializer", 
> "org.apache.spark.serializer.KryoSerializer")
> sparkConf.set("spark.kryo.registrationRequired", "true")
> Unfortunately, during execution following error is thrown:
> Caused by: java.lang.IllegalArgumentException: Class is not registered: 
> org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult
> Note: To register this class use: 
> kryo.register(org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult.class);
>  
> As far as I can see in 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala]
>  
> class 
> org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult is 
> private (private[v2] case class DataWritingSparkTaskResult) therefore not 
> available to register.
>  
> !image-2022-02-02-10-08-51-144.png!
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38088) Kryo DataWritingSparkTaskResult registration error

2022-02-02 Thread Jira
Michał Wieleba created SPARK-38088:
--

 Summary: Kryo DataWritingSparkTaskResult registration error
 Key: SPARK-38088
 URL: https://issues.apache.org/jira/browse/SPARK-38088
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.1.2
Reporter: Michał Wieleba


Spark 3.1.2, Scala 2.12

I'm registering classes with the _sparkConf.registerKryoClasses(Array( ..._ method. 
The job contains Spark Structured Streaming code. The following settings are added 
as well:

sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
sparkConf.set("spark.kryo.registrationRequired", "true")

Unfortunately, during execution the following error is thrown:

Caused by: java.lang.IllegalArgumentException: Class is not registered: 
org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult
Note: To register this class use: 
kryo.register(org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult.class);

As far as I can see in 
[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala]
the class org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult 
is private (private[v2] case class DataWritingSparkTaskResult) and therefore not 
available to register.

 

!image-2022-02-02-10-06-32-342.png!
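
One possible workaround (an assumption on my part, not something confirmed in this ticket) is to register the private class by its fully qualified name via Class.forName, since registerKryoClasses only needs a Class[_] object rather than a compile-time reference; whether other internal classes also need registering when registrationRequired is enabled is not covered here:

{code:scala}
// Hedged sketch of a possible workaround: look the private class up by name at
// runtime instead of referencing it as classOf[...], which scalac rejects
// because the case class is private[v2].
import org.apache.spark.SparkConf

val sparkConf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrationRequired", "true")

sparkConf.registerKryoClasses(Array(
  Class.forName(
    "org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult")
))
{code}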

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37908) Refactoring on pod label test in BasicFeatureStepSuite

2022-02-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-37908.
---
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35209
[https://github.com/apache/spark/pull/35209]

> Refactoring on pod label test in BasicFeatureStepSuite
> --
>
> Key: SPARK-37908
> URL: https://issues.apache.org/jira/browse/SPARK-37908
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Minor
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37908) Refactoring on pod label test in BasicFeatureStepSuite

2022-02-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-37908:
-

Assignee: Yikun Jiang

> Refactoring on pod label test in BasicFeatureStepSuite
> --
>
> Key: SPARK-37908
> URL: https://issues.apache.org/jira/browse/SPARK-37908
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org