[jira] [Comment Edited] (SPARK-38061) security scan issue jackson-databinding HDFS dependency library
[ https://issues.apache.org/jira/browse/SPARK-38061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17486268#comment-17486268 ]

Sujit Biswas edited comment on SPARK-38061 at 2/3/22, 7:49 AM:
---------------------------------------------------------------

Also, if some of the issues are resolved, how do we get a build that has the fixes? Are the jackson-databind and log4j issues fixed in [Spark 3.2.1|https://spark.apache.org/releases/spark-release-3-2-1.html]?

was (Author: JIRAUSER284395):
Also, if some of the issues are resolved, how do we get a build that has the fixes?

> security scan issue jackson-databinding HDFS dependency library
> ----------------------------------------------------------------
>
> Key: SPARK-38061
> URL: https://issues.apache.org/jira/browse/SPARK-38061
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes, Security
> Affects Versions: 3.2.0
> Reporter: Sujit Biswas
> Priority: Major
> Attachments: scan-security-report-spark-3.2.0-jre-11.csv
>
> Hi,
> We are running into security scan issues with a Docker image built on spark-3.2.0-bin-hadoop3.2; is there a way to resolve them?
>
> Most issues relate to https://issues.apache.org/jira/browse/HDFS-15333
>
> Attaching the CVE report.
[jira] [Commented] (SPARK-37936) Use error classes in the parsing errors of intervals
[ https://issues.apache.org/jira/browse/SPARK-37936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17486269#comment-17486269 ]

Apache Spark commented on SPARK-37936:
--------------------------------------

User 'senthh' has created a pull request for this issue:
https://github.com/apache/spark/pull/35386

> Use error classes in the parsing errors of intervals
> -----------------------------------------------------
>
> Key: SPARK-37936
> URL: https://issues.apache.org/jira/browse/SPARK-37936
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.3.0
> Reporter: Max Gekk
> Priority: Major
>
> Modify the following methods in QueryParsingErrors to use error classes:
> * moreThanOneFromToUnitInIntervalLiteralError
> * invalidIntervalLiteralError
> * invalidIntervalFormError
> * invalidFromToUnitValueError
> * fromToIntervalUnsupportedError
> * mixedIntervalUnitsError
> Throw an implementation of SparkThrowable, and write a test for every error in QueryParsingErrorsSuite.
[jira] [Commented] (SPARK-38061) security scan issue jackson-databinding HDFS dependency library
[ https://issues.apache.org/jira/browse/SPARK-38061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17486268#comment-17486268 ]

Sujit Biswas commented on SPARK-38061:
--------------------------------------

Also, if some of the issues are resolved, how do we get a build that has the fixes?
[jira] [Commented] (SPARK-38061) security scan issue jackson-databinding HDFS dependency library
[ https://issues.apache.org/jira/browse/SPARK-38061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17486264#comment-17486264 ]

Sujit Biswas commented on SPARK-38061:
--------------------------------------

Not at all helpful; please point to a valid reason why something like this will not affect Spark:

stop,CRITICAL,false,"Vulnerability found in non-os package type (java) - /opt/spark/jars/log4j-1.2.17.jar (GHSA-2qrg-x229-3v8q - [https://github.com/advisories/GHSA-2qrg-x229-3v8q] )","GHSA-2qrg-x229-3v8q+log4j-1.2.17.jar",package,vulnerabilities
[jira] [Commented] (SPARK-38058) Writing a spark dataframe to Azure Sql Server is causing duplicate records intermittently
[ https://issues.apache.org/jira/browse/SPARK-38058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17486263#comment-17486263 ]

john commented on SPARK-38058:
------------------------------

Since I am working in a production environment I cannot disclose any documents here. This may be a bug in Spark: it happens in roughly 3 out of 5 runs; the other 2 times all the records are inserted correctly, otherwise duplicates are inserted. We have tried all the workarounds and none of them work.

> Writing a spark dataframe to Azure Sql Server is causing duplicate records intermittently
> ------------------------------------------------------------------------------------------
>
> Key: SPARK-38058
> URL: https://issues.apache.org/jira/browse/SPARK-38058
> Project: Spark
> Issue Type: Bug
> Components: PySpark, Spark Core
> Affects Versions: 3.1.0
> Reporter: john
> Priority: Major
>
> We are using the JDBC option to insert transformed data in a Spark DataFrame into a table in Azure SQL Server. Below is the code snippet we are using for this insert. However, we noticed on a few occasions that some records are duplicated in the destination table. This happens for large tables; e.g. if a DataFrame has 600K records, after inserting the data into the table we get around 620K records. We still want to understand why that's happening.
>
> {{DataToLoad.write.jdbc(url = jdbcUrl, table = targetTable, mode = "overwrite", properties = jdbcConnectionProperties)}}
>
> The only reason we could think of is that while inserts are happening in a distributed fashion, if one of the executors fails in between, its tasks are retried and could be inserting duplicate records. This could be totally meaningless, but just to see if that could be an issue.
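For reference, a minimal sketch (not from the ticket) of one way to make such a load idempotent; the staging-table name and connection details are placeholders. Landing the data in a staging table first means a retried executor task can only duplicate rows in the staging copy, which a single server-side statement then deduplicates while promoting it to the target:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("idempotent-jdbc-load").getOrCreate()

# Placeholder connection details -- substitute real values.
jdbc_url = "jdbc:sqlserver://<host>:1433;databaseName=<db>"
props = {
    "user": "<user>",
    "password": "<password>",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
}

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Step 1: write to a staging table; if an executor task is retried,
# only the staging copy can end up with duplicate rows.
df.write.jdbc(url=jdbc_url, table="dbo.target_staging",
              mode="overwrite", properties=props)

# Step 2: promote staging to target in one server-side statement that
# also deduplicates, run from a SQL client or a driver-side JDBC call:
#   SELECT DISTINCT * INTO dbo.target FROM dbo.target_staging;
{code}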
[jira] [Commented] (SPARK-38061) security scan issue jackson-databinding HDFS dependency library
[ https://issues.apache.org/jira/browse/SPARK-38061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17486254#comment-17486254 ]

Hyukjin Kwon commented on SPARK-38061:
--------------------------------------

No, the security report here simply flags issues in the bundled libraries themselves. We don't know whether they actually affect Spark, and we should proceed with each upgrade separately under its own ticket.
[jira] [Commented] (SPARK-38061) security scan issue jackson-databinding HDFS dependency library
[ https://issues.apache.org/jira/browse/SPARK-38061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17486246#comment-17486246 ]

Sujit Biswas commented on SPARK-38061:
--------------------------------------

The info is in the attachment; you can do that.
[jira] [Assigned] (SPARK-38094) Parquet: enable matching schema columns by field id
[ https://issues.apache.org/jira/browse/SPARK-38094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-38094:
------------------------------------

Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-38094) Parquet: enable matching schema columns by field id
[ https://issues.apache.org/jira/browse/SPARK-38094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-38094:
------------------------------------

Assignee: Apache Spark
[jira] [Commented] (SPARK-38094) Parquet: enable matching schema columns by field id
[ https://issues.apache.org/jira/browse/SPARK-38094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17486237#comment-17486237 ]

Apache Spark commented on SPARK-38094:
--------------------------------------

User 'jackierwzhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/35385
[jira] [Commented] (SPARK-38061) security scan issue jackson-databinding HDFS dependency library
[ https://issues.apache.org/jira/browse/SPARK-38061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17486236#comment-17486236 ]

Hyukjin Kwon commented on SPARK-38061:
--------------------------------------

[~sujitbiswas] Let's file a separate ticket for each. We should identify which ones actually affect Spark and upgrade the dependencies one by one, instead of doing it in a batch that pulls unrelated dependency upgrades together.
[jira] [Commented] (SPARK-38061) security scan issue jackson-databinding HDFS dependency library
[ https://issues.apache.org/jira/browse/SPARK-38061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17486231#comment-17486231 ]

Sujit Biswas commented on SPARK-38061:
--------------------------------------

[~hyukjin.kwon] Note that jackson-databind solves only part of the problem; for example, log4j-1.2.17.jar causes a CRITICAL CVE, and there are several other HIGH CVEs. Please see the CSV attached in the bug's attachment section.
[jira] [Commented] (SPARK-38073) NameError: name 'sc' is not defined when running driver with IPython and Python > 3.7
[ https://issues.apache.org/jira/browse/SPARK-38073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17486219#comment-17486219 ]

Hyukjin Kwon commented on SPARK-38073:
--------------------------------------

I think we should fix this.

> NameError: name 'sc' is not defined when running driver with IPython and Python > 3.7
> --------------------------------------------------------------------------------------
>
> Key: SPARK-38073
> URL: https://issues.apache.org/jira/browse/SPARK-38073
> Project: Spark
> Issue Type: Improvement
> Components: PySpark, Spark Shell
> Affects Versions: 3.2.0, 3.3.0
> Reporter: Maciej Szymkiewicz
> Priority: Major
>
> When {{PYSPARK_DRIVER_PYTHON=$(which ipython) bin/pyspark}} is executed with Python >= 3.8, a function registered with atexit seems to be executed in a different scope than in Python 3.7.
> It results in {{NameError: name 'sc' is not defined}} on exit:
> {code:python}
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 3.3.0-SNAPSHOT
>       /_/
>
> Using Python version 3.8.12 (default, Oct 12 2021 21:57:06)
> Spark context Web UI available at http://192.168.0.198:4040
> Spark context available as 'sc' (master = local[*], app id = local-1643555855409).
> SparkSession available as 'spark'.
>
> In [1]:
>
> Do you really want to exit ([y]/n)? y
> Error in atexit._run_exitfuncs:
> Traceback (most recent call last):
>   File "/path/to/spark/python/pyspark/shell.py", line 49, in <lambda>
>     atexit.register(lambda: sc.stop())
> NameError: name 'sc' is not defined
> {code}
> This could be easily fixed by capturing the `sc` instance:
> {code:none}
> diff --git a/python/pyspark/shell.py b/python/pyspark/shell.py
> index f0c487877a..4164e3ab0c 100644
> --- a/python/pyspark/shell.py
> +++ b/python/pyspark/shell.py
> @@ -46,7 +46,7 @@ except Exception:
>
>  sc = spark.sparkContext
>  sql = spark.sql
> -atexit.register(lambda: sc.stop())
> +atexit.register((lambda sc: lambda: sc.stop())(sc))
>
>  # for compatibility
>  sqlContext = spark._wrapped
> {code}
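A minimal standalone sketch (not from the ticket) of why the patch works: the plain lambda resolves the global name only when the exit handler runs, so it breaks if the module's globals have been cleared by then, whereas binding the instance into a closure, or registering the bound method, keeps a direct reference:

{code:python}
import atexit

class FakeContext:
    """Stand-in for SparkContext, just for illustration."""
    def stop(self):
        print("stopped")

sc = FakeContext()

# Fragile: looks up the global name `sc` only when the handler runs; if
# module globals have been cleared by then, this raises NameError.
atexit.register(lambda: sc.stop())

# Robust: bind the instance itself rather than the name.
atexit.register((lambda sc: lambda: sc.stop())(sc))  # the fix in the diff
atexit.register(sc.stop)                             # equivalent bound method
{code}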
[jira] [Resolved] (SPARK-38074) RuntimeError: Java gateway process exited before sending its port number
[ https://issues.apache.org/jira/browse/SPARK-38074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-38074.
----------------------------------

Resolution: Invalid

> RuntimeError: Java gateway process exited before sending its port number
> -------------------------------------------------------------------------
>
> Key: SPARK-38074
> URL: https://issues.apache.org/jira/browse/SPARK-38074
> Project: Spark
> Issue Type: Bug
> Components: PySpark, Spark Shell, Spark Submit
> Affects Versions: 3.2.1
> Reporter: Malla
> Priority: Major
>
> I am getting "RuntimeError: Java gateway process exited before sending its port number" when running Python tests in Docker.
>
> I am using *spark_home* as /usr/lib/python3.9/site-packages/pyspark
>
> sc = SparkContext()
>   File "/usr/lib/python3.9/site-packages/pyspark/context.py", line 144, in __init__
>     SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
>   File "/usr/lib/python3.9/site-packages/pyspark/context.py", line 339, in _ensure_initialized
>     SparkContext._gateway = gateway or launch_gateway(conf)
>   File "/usr/lib/python3.9/site-packages/pyspark/java_gateway.py", line 136, in launch_gateway
>     raise RuntimeError("Java gateway process exited before sending its port number")
> RuntimeError: Java gateway process exited before sending its port number
>
> I can provide additional details. Any help is appreciated.
[jira] [Commented] (SPARK-38074) RuntimeError: Java gateway process exited before sending its port number
[ https://issues.apache.org/jira/browse/SPARK-38074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17486216#comment-17486216 ]

Hyukjin Kwon commented on SPARK-38074:
--------------------------------------

It's very likely a network configuration issue in your environment. It would be great to interact with the mailing list first before filing it as an issue.
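As a hedged aside (not from the ticket), a pre-flight check for the most common cause of this error in Docker images, namely a JVM that PySpark's gateway launcher cannot find:

{code:python}
import os
import shutil

# launch_gateway() spawns a JVM; when it can't, PySpark raises the
# "Java gateway process exited" RuntimeError seen above.
java_home = os.environ.get("JAVA_HOME")
java_bin = shutil.which("java")
print(f"JAVA_HOME={java_home!r}, java on PATH: {java_bin!r}")

if not (java_home or java_bin):
    raise EnvironmentError(
        "No JVM found: install a JDK or set JAVA_HOME in the image "
        "before creating a SparkContext."
    )
{code}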
[jira] [Updated] (SPARK-38074) RuntimeError: Java gateway process exited before sending its port number
[ https://issues.apache.org/jira/browse/SPARK-38074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-38074:
---------------------------------

Priority: Major (was: Blocker)
[jira] [Commented] (SPARK-38087) select doesn't validate if the column already exists
[ https://issues.apache.org/jira/browse/SPARK-38087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17486215#comment-17486215 ]

Deepa Vasanthkumar commented on SPARK-38087:
--------------------------------------------

[~dongjoon] Thank you; not sure whether this is an issue or not.

> select doesn't validate if the column already exists
> -----------------------------------------------------
>
> Key: SPARK-38087
> URL: https://issues.apache.org/jira/browse/SPARK-38087
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.2.1
> Environment: v3.2.1, master = local[*] (reproducible in any environment)
> Reporter: Deepa Vasanthkumar
> Priority: Minor
> Attachments: select vs drop.png
>
> Select doesn't validate whether the alias column is already present in the DataFrame. After that, we cannot do anything with that column in the DataFrame:
> df4 = df2.select(df2.firstname, df2.lastname) --> throws AnalysisException
> df4.show()
>
> However, drop will not let you drop the said column either.
>
> Scenario to reproduce:
> df2 = df1.select("*", (df1.firstname).alias("firstname")) ---> this will add the same column again
> df2.show()
> df2.drop(df2.firstname) --> this gives AnalysisException: Reference 'firstname' is ambiguous, could be: firstname, firstname.
>
> Is this expected behavior? (See the attached select vs drop.png.)
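A self-contained repro of the reported behavior plus the usual workaround, as a sketch (the exact exception text may vary by Spark version):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([("Ada", "Lovelace")], ["firstname", "lastname"])

# "*" already contains `firstname`, so this silently adds a second column
# with the same name -- select does not validate name collisions.
df2 = df1.select("*", df1.firstname.alias("firstname"))
df2.show()

# Both of these now fail with "Reference 'firstname' is ambiguous":
#   df2.select("firstname")
#   df2.drop(df2.firstname)

# Workaround: rename columns by position so the names are unique again.
df3 = df2.toDF("firstname", "lastname", "firstname_2")
df3.show()
{code}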
[jira] [Resolved] (SPARK-38095) HistoryServerDiskManager.appStorePath should use backend-based extensions
[ https://issues.apache.org/jira/browse/SPARK-38095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-38095.
-----------------------------------

Fix Version/s: 3.3.0
Resolution: Fixed

Issue resolved by pull request 35384
[https://github.com/apache/spark/pull/35384]
[jira] [Commented] (SPARK-38082) Update minimum numpy version
[ https://issues.apache.org/jira/browse/SPARK-38082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17486203#comment-17486203 ]

Hyukjin Kwon commented on SPARK-38082:
--------------------------------------

Yeah, we should probably upgrade the minimum version - it's too old.

> Update minimum numpy version
> ----------------------------
>
> Key: SPARK-38082
> URL: https://issues.apache.org/jira/browse/SPARK-38082
> Project: Spark
> Issue Type: Improvement
> Components: ML, MLlib, PySpark
> Affects Versions: 3.3.0
> Reporter: Maciej Szymkiewicz
> Priority: Major
>
> Currently, we set the numpy version in {{extras_require}} to {{>=1.7}}. However, 1.7 was released almost 9 years ago, and since then some methods that we use have been deprecated in favor of new additions, and a new API ({{numpy.typing}}) that is of some interest to us has been added.
> We should update the minimum version requirement to one of the following:
> - {{>=1.9.0}} ‒ the minimum reasonable bound, which will allow us to replace deprecated {{tostring}} calls with {{tobytes}}.
> - {{>=1.15.0}} (released 2018-07-23) ‒ a reasonable bound to match our minimum supported pandas version.
> - {{>=1.20.0}} (released 2021-01-30) ‒ to fully utilize numpy typing.
> The last one might be somewhat controversial, but 1.15 shouldn't require much discussion.
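For context, the {{tostring}}-to-{{tobytes}} replacement that the {{>=1.9.0}} option enables, as a tiny sketch:

{code:python}
import numpy as np

arr = np.array([1.0, 2.0, 3.0])

# Deprecated alias (kept for legacy code, later removed from numpy):
#   raw = arr.tostring()
# Replacement available since numpy 1.9:
raw = arr.tobytes()
print(len(raw))  # 24 bytes: three float64 values
{code}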
[jira] [Resolved] (SPARK-38061) security scan issue jackson-databinding HDFS dependency library
[ https://issues.apache.org/jira/browse/SPARK-38061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-38061.
----------------------------------

Resolution: Duplicate
[jira] [Commented] (SPARK-38061) security scan issue jackson-databinding HDFS dependency library
[ https://issues.apache.org/jira/browse/SPARK-38061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17486202#comment-17486202 ]

Hyukjin Kwon commented on SPARK-38061:
--------------------------------------

That's already upgraded at SPARK-35550.
[jira] [Commented] (SPARK-38058) Writing a spark dataframe to Azure Sql Server is causing duplicate records intermittently
[ https://issues.apache.org/jira/browse/SPARK-38058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17486201#comment-17486201 ]

Hyukjin Kwon commented on SPARK-38058:
--------------------------------------

spark.speculation has been disabled by default for many years, so this should not be the cause. Did you enable it? It is difficult to debug further without details. Do you have more info, e.g. logs or a Spark UI screenshot? Or are you able to reproduce this in another DBMS?
[jira] [Commented] (SPARK-38066) evaluateEquality should ignore attribute without min/max ColumnStat
[ https://issues.apache.org/jira/browse/SPARK-38066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17486200#comment-17486200 ]

Apache Spark commented on SPARK-38066:
--------------------------------------

User 'Stove-hust' has created a pull request for this issue:
https://github.com/apache/spark/pull/35363

> evaluateEquality should ignore attribute without min/max ColumnStat
> --------------------------------------------------------------------
>
> Key: SPARK-38066
> URL: https://issues.apache.org/jira/browse/SPARK-38066
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.2.0
> Reporter: Fencheng Mei
> Priority: Minor
> Original Estimate: 0.5h
> Remaining Estimate: 0.5h
>
> With CBO enabled, when the colStatsMap of an attribute does not have min/max, the evaluateEquality method should return None, not 0.
[jira] [Assigned] (SPARK-38066) evaluateEquality should ignore attribute without min/max ColumnStat
[ https://issues.apache.org/jira/browse/SPARK-38066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-38066:
------------------------------------

Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-38066) evaluateEquality should ignore attribute without min/max ColumnStat
[ https://issues.apache.org/jira/browse/SPARK-38066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-38066:
------------------------------------

Assignee: Apache Spark
[jira] [Commented] (SPARK-38095) HistoryServerDiskManager.appStorePath should use backend-based extensions
[ https://issues.apache.org/jira/browse/SPARK-38095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17486181#comment-17486181 ]

Apache Spark commented on SPARK-38095:
--------------------------------------

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/35384
[jira] [Assigned] (SPARK-38095) HistoryServerDiskManager.appStorePath should use backend-based extensions
[ https://issues.apache.org/jira/browse/SPARK-38095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-38095:
-------------------------------------

Assignee: Dongjoon Hyun
[jira] [Assigned] (SPARK-38095) HistoryServerDiskManager.appStorePath should use backend-based extensions
[ https://issues.apache.org/jira/browse/SPARK-38095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-38095:
-------------------------------------

Assignee: Dongjoon Hyun (was: Apache Spark)
[jira] [Assigned] (SPARK-38095) HistoryServerDiskManager.appStorePath should use backend-based extensions
[ https://issues.apache.org/jira/browse/SPARK-38095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-38095:
------------------------------------

Assignee: (was: Dongjoon Hyun)
[jira] [Commented] (SPARK-38095) HistoryServerDiskManager.appStorePath should use backend-based extensions
[ https://issues.apache.org/jira/browse/SPARK-38095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17486180#comment-17486180 ]

Apache Spark commented on SPARK-38095:
--------------------------------------

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/35384
[jira] [Assigned] (SPARK-38095) HistoryServerDiskManager.appStorePath should use backend-based extensions
[ https://issues.apache.org/jira/browse/SPARK-38095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-38095:
------------------------------------

Assignee: Apache Spark
[jira] [Updated] (SPARK-38095) HistoryServerDiskManager.appStorePath should use backend-based extensions
[ https://issues.apache.org/jira/browse/SPARK-38095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-38095:
----------------------------------

Parent: SPARK-35781
Issue Type: Sub-task (was: Bug)
[jira] [Updated] (SPARK-38095) HistoryServerDiskManager.appStorePath should use backend-based extensions
[ https://issues.apache.org/jira/browse/SPARK-38095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-38095:
----------------------------------

Summary: HistoryServerDiskManager.appStorePath should use backend-based extensions (was: HistoryServerDiskManager.appStorePath should use backend extensions)
[jira] [Updated] (SPARK-38095) HistoryServerDiskManager.appStorePath should use backend extensions
[ https://issues.apache.org/jira/browse/SPARK-38095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-38095:
----------------------------------

Summary: HistoryServerDiskManager.appStorePath should use backend extensions (was: HistoryServerDiskManager should use backend extensions for `apps` directory)
[jira] [Created] (SPARK-38095) HistoryServerDiskManager should use backend extensions for `apps` directory
Dongjoon Hyun created SPARK-38095:
-------------------------------------

Summary: HistoryServerDiskManager should use backend extensions for `apps` directory
Key: SPARK-38095
URL: https://issues.apache.org/jira/browse/SPARK-38095
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 3.3.0
Reporter: Dongjoon Hyun
[jira] [Resolved] (SPARK-37958) Pyspark SparkContext.AddFile() does not respect spark.files.overwrite
[ https://issues.apache.org/jira/browse/SPARK-37958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-37958.
----------------------------------

Fix Version/s: 3.3.0
Resolution: Fixed

Issue resolved by pull request 35377
[https://github.com/apache/spark/pull/35377]

> Pyspark SparkContext.AddFile() does not respect spark.files.overwrite
> ----------------------------------------------------------------------
>
> Key: SPARK-37958
> URL: https://issues.apache.org/jira/browse/SPARK-37958
> Project: Spark
> Issue Type: Bug
> Components: Documentation, Input/Output, Java API
> Affects Versions: 3.1.1
> Reporter: taylor schneider
> Assignee: Leona Yoda
> Priority: Major
> Fix For: 3.3.0
>
> I am currently running Apache Spark 3.1.1 on Kubernetes.
> When I try to re-add a file that has already been added, I see that the updated file is not actually loaded into the cluster. I see the following warning when calling the addFile() function:
> {code:java}
> 22/01/18 19:05:50 WARN SparkContext: The path http://15.4.12.12:80/demo_data.csv has been added already. Overwriting of added paths is not supported in the current version. {code}
> When I display the dataframe that was loaded, I see that the old data is loaded. If I log into the worker pods and delete the file, the same results are observed.
> My SparkConf has the following configurations:
> {code:java}
> ('spark.master', 'k8s://https://15.4.7.11:6443')
> ('spark.app.name', 'spark-jupyter-mlib')
> ('spark.submit.deploy.mode', 'cluster')
> ('spark.kubernetes.container.image', 'tschneider/apache-spark-k8:v7')
> ('spark.kubernetes.namespace', 'spark')
> ('spark.kubernetes.pyspark.pythonVersion', '3')
> ('spark.kubernetes.authenticate.driver.serviceAccountName', 'spark-sa')
> ('spark.kubernetes.authenticate.serviceAccountName', 'spark-sa')
> ('spark.executor.instances', '3')
> ('spark.executor.cores', '2')
> ('spark.executor.memory', '4096m')
> ('spark.executor.memoryOverhead', '1024m')
> ('spark.driver.memory', '1024m')
> ('spark.driver.host', '15.4.12.12')
> ('spark.files.overwrite', 'true')
> ('spark.files.useFetchCache', 'false') {code}
> According to the documentation for 3.1.1, the spark.files.overwrite parameter should in fact load the updated files. The documentation can be found here: [https://spark.apache.org/docs/3.1.1/configuration.html]
> The only workaround is to use a python function to manually delete and re-download the file. Calling addFile still shows the warning in this case. My code for the delete and re-download is as follows:
> {code:java}
> def os_remove(file_path):
>     import socket
>     hostname = socket.gethostname()
>     action = None
>     import os
>     if os.path.exists(file_path):
>         action = "delete"
>         os.remove(file_path)
>     return (hostname, action)
>
> worker_file_path = u"file:///{0}".format(csv_file_name)
> worker_count = int(spark_session.conf.get('spark.executor.instances'))
> rdd = sc.parallelize(range(worker_count)).map(lambda var: os_remove(worker_file_path))
> rdd.collect()
>
> def download_updated_file(file_url):
>     import urllib.parse as parse
>     file_name = os.path.basename(parse.urlparse(csv_file_url).path)
>     local_file_path = "/{0}".format(file_name)
>     import urllib.request as urllib
>     urllib.urlretrieve(file_url, local_file_path)
>
> rdd = sc.parallelize(range(worker_count)).map(lambda var: download_updated_file(csv_file_url))
> rdd.collect(){code}
> I believe this is either a bug or a documentation mistake. Perhaps the configuration parameter has a misleading description?
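For context, a minimal sketch of the addFile/SparkFiles flow the ticket exercises, using the reporter's URL as a placeholder; re-adding the same path only logs the warning and does not refresh the file, which is the behavior being reported:

{code:python}
from pyspark import SparkContext, SparkFiles

sc = SparkContext(master="local[*]", appName="addfile-demo")

url = "http://15.4.12.12:80/demo_data.csv"  # URL from the report
sc.addFile(url)

# Executors resolve their local copy via SparkFiles; calling addFile on
# the same path again logs "has been added already" and does not
# re-download the file, regardless of spark.files.overwrite.
local_path = SparkFiles.get("demo_data.csv")
print(local_path)

sc.addFile(url)  # triggers the warning quoted in the ticket
{code}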
[jira] [Assigned] (SPARK-37958) Pyspark SparkContext.AddFile() does not respect spark.files.overwrite
[ https://issues.apache.org/jira/browse/SPARK-37958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-37958:
------------------------------------

Assignee: Leona Yoda
[jira] [Created] (SPARK-38094) Parquet: enable matching schema columns by field id
Jackie Zhang created SPARK-38094:
------------------------------------

Summary: Parquet: enable matching schema columns by field id
Key: SPARK-38094
URL: https://issues.apache.org/jira/browse/SPARK-38094
Project: Spark
Issue Type: New Feature
Components: Spark Core
Affects Versions: 3.3
Reporter: Jackie Zhang

Field ID is a native field in the Parquet schema ([https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L398]).

After this PR, when the requested schema has field IDs, Parquet readers will first use the field ID to determine which Parquet columns to read, before falling back to using column names as before. This enables matching columns by field ID for supported DWs like Iceberg and Delta.

This PR supports:
* the OSS vectorized reader

It does not support:
* the parquet-mr reader, due to its lack of field ID support (needs a follow-up ticket)
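For reference, a minimal sketch of how a reader could supply field IDs. The StructField metadata key ("parquet.field.id") and the config name ("spark.sql.parquet.fieldId.read.enabled") are assumptions to verify against the merged PR, and the path is a placeholder:

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.getOrCreate()

# Assumed config gate for ID-based matching (verify against the merged PR).
spark.conf.set("spark.sql.parquet.fieldId.read.enabled", "true")

# Field IDs ride along in StructField metadata; columns are then matched
# to Parquet columns by ID first, falling back to name matching.
schema = StructType([
    StructField("id", IntegerType(), True, metadata={"parquet.field.id": 1}),
    StructField("name", StringType(), True, metadata={"parquet.field.id": 2}),
])

df = spark.read.schema(schema).parquet("/tmp/table_with_field_ids")  # placeholder path
df.show()
{code}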
[jira] [Updated] (SPARK-38087) select doesn't validate if the column already exists
[ https://issues.apache.org/jira/browse/SPARK-38087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-38087:
----------------------------------

Fix Version/s: (was: 3.3)
[jira] [Commented] (SPARK-38087) select doesnt validate if the column already exists
[ https://issues.apache.org/jira/browse/SPARK-38087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17486167#comment-17486167 ]

Dongjoon Hyun commented on SPARK-38087:
---------------------------------------

I removed the fixed version field, [~deepa.vasanthkumar].
[jira] [Updated] (SPARK-30062) bug with DB2Driver using mode("overwrite") option("truncate",True)
[ https://issues.apache.org/jira/browse/SPARK-30062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30062: -- Fix Version/s: 3.3.0 (was: 3.3) > bug with DB2Driver using mode("overwrite") option("truncate",True) > -- > > Key: SPARK-30062 > URL: https://issues.apache.org/jira/browse/SPARK-30062 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.4 >Reporter: Guy Huinen >Assignee: Ivan Karol >Priority: Major > Labels: db2, pyspark > Fix For: 3.3.0, 3.2.2 > > > using DB2Driver using mode("overwrite") option("truncate",True) gives sql > error > > {code:java} > dfClient.write\ > .format("jdbc")\ > .mode("overwrite")\ > .option('driver', 'com.ibm.db2.jcc.DB2Driver')\ > .option("url","jdbc:db2://")\ > .option("user","xxx")\ > .option("password","")\ > .option("dbtable","")\ > .option("truncate",True)\{code} > > gives the error below > in summary i belief the semicolon is misplaced or malformated > > {code:java} > EXPO.EXPO#CMR_STG;IMMEDIATE{code} > > > full error > {code:java} > An error occurred while calling o47.save. : > com.ibm.db2.jcc.am.SqlSyntaxErrorException: DB2 SQL Error: SQLCODE=-104, > SQLSTATE=42601, SQLERRMC=END-OF-STATEMENT;LE EXPO.EXPO#CMR_STG;IMMEDIATE, > DRIVER=4.19.77 at com.ibm.db2.jcc.am.b4.a(b4.java:747) at > com.ibm.db2.jcc.am.b4.a(b4.java:66) at com.ibm.db2.jcc.am.b4.a(b4.java:135) > at com.ibm.db2.jcc.am.kh.c(kh.java:2788) at > com.ibm.db2.jcc.am.kh.d(kh.java:2776) at > com.ibm.db2.jcc.am.kh.b(kh.java:2143) at com.ibm.db2.jcc.t4.ab.i(ab.java:226) > at com.ibm.db2.jcc.t4.ab.c(ab.java:48) at com.ibm.db2.jcc.t4.p.b(p.java:38) > at com.ibm.db2.jcc.t4.av.h(av.java:124) at > com.ibm.db2.jcc.am.kh.ak(kh.java:2138) at > com.ibm.db2.jcc.am.kh.a(kh.java:3325) at com.ibm.db2.jcc.am.kh.c(kh.java:765) > at com.ibm.db2.jcc.am.kh.executeUpdate(kh.java:744) at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.truncateTable(JdbcUtils.scala:113) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:56) > at > org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676) > at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78) > at > 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) > at > org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676) at > org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271) at > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) at > py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at > py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at > py4j.Gateway.invoke(Gateway.java:282) at > py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at > py4j.commands.CallCommand.execute(CallCommand.java:79) at > py4j.GatewayConnection.run(GatewayConnection.java:238) at >
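The stack trace shows the statement coming out of JdbcUtils.truncateTable, which asks the active JdbcDialect for the TRUNCATE SQL when option("truncate", True) is set. A minimal sketch of pinning that statement down with a custom dialect in Spark 3.x follows; the dialect object and its registration are illustrative, not the fix that was merged for this ticket.

{code:scala}
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Illustrative custom dialect: takes control of the exact TRUNCATE statement
// Spark emits for DB2 URLs, which helps isolate a malformed statement.
object Db2TruncateDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean =
    url.toLowerCase.startsWith("jdbc:db2")

  // DB2 requires the IMMEDIATE keyword for an unlogged truncate.
  override def getTruncateQuery(table: String, cascade: Option[Boolean]): String =
    s"TRUNCATE TABLE $table IMMEDIATE"
}

// Registered dialects take precedence over the built-in ones.
JdbcDialects.registerDialect(Db2TruncateDialect)
{code}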
[jira] [Updated] (SPARK-38089) Show the root cause exception in TestUtils.assertExceptionMsg
[ https://issues.apache.org/jira/browse/SPARK-38089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38089: -- Summary: Show the root cause exception in TestUtils.assertExceptionMsg (was: Improve assertion failure message in TestUtils.assertExceptionMsg) > Show the root cause exception in TestUtils.assertExceptionMsg > - > > Key: SPARK-38089 > URL: https://issues.apache.org/jira/browse/SPARK-38089 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Tests >Affects Versions: 3.2.1 >Reporter: Erik Krogen >Assignee: Erik Krogen >Priority: Major > Fix For: 3.3.0 > > > {{TestUtils.assertExceptionMsg}} is great, but when the assertion _doesn't_ > match, it can be challenging to tell why, because the exception tree that was > searched isn't printed. Only way I could find to fix it up was to run things > in a debugger and check the exception tree. > It would be very helpful if {{assertExceptionMsg}} printed out the exception > tree in which it was searching (upon failure). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
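The requested improvement boils down to rendering the cause chain. A minimal sketch, assuming a small helper inside TestUtils (the helper name is an assumption, not necessarily the merged code):

{code:scala}
// Walk getCause links and indent each level, so a failed assertion can print
// the whole exception tree that was searched. Assumes no cycles in the chain.
def renderExceptionTree(t: Throwable): String =
  Iterator.iterate(t)(_.getCause)
    .takeWhile(_ != null)
    .zipWithIndex
    .map { case (e, depth) => ("  " * depth) + e.toString }
    .mkString("\n")
{code}

The assertion failure message can then embed renderExceptionTree(e) instead of forcing a debugger session.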
[jira] [Resolved] (SPARK-38089) Improve assertion failure message in TestUtils.assertExceptionMsg
[ https://issues.apache.org/jira/browse/SPARK-38089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-38089. --- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35383 [https://github.com/apache/spark/pull/35383] > Improve assertion failure message in TestUtils.assertExceptionMsg > - > > Key: SPARK-38089 > URL: https://issues.apache.org/jira/browse/SPARK-38089 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Tests >Affects Versions: 3.2.1 >Reporter: Erik Krogen >Assignee: Erik Krogen >Priority: Major > Fix For: 3.3.0 > > > {{TestUtils.assertExceptionMsg}} is great, but when the assertion _doesn't_ > match, it can be challenging to tell why, because the exception tree that was > searched isn't printed. Only way I could find to fix it up was to run things > in a debugger and check the exception tree. > It would be very helpful if {{assertExceptionMsg}} printed out the exception > tree in which it was searching (upon failure). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38089) Show the root cause exception in TestUtils.assertExceptionMsg
[ https://issues.apache.org/jira/browse/SPARK-38089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38089: -- Affects Version/s: 3.3.0 (was: 3.2.1) > Show the root cause exception in TestUtils.assertExceptionMsg > - > > Key: SPARK-38089 > URL: https://issues.apache.org/jira/browse/SPARK-38089 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Tests >Affects Versions: 3.3.0 >Reporter: Erik Krogen >Assignee: Erik Krogen >Priority: Major > Fix For: 3.3.0 > > > {{TestUtils.assertExceptionMsg}} is great, but when the assertion _doesn't_ > match, it can be challenging to tell why, because the exception tree that was > searched isn't printed. Only way I could find to fix it up was to run things > in a debugger and check the exception tree. > It would be very helpful if {{assertExceptionMsg}} printed out the exception > tree in which it was searching (upon failure). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38089) Improve assertion failure message in TestUtils.assertExceptionMsg
[ https://issues.apache.org/jira/browse/SPARK-38089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-38089: - Assignee: Erik Krogen > Improve assertion failure message in TestUtils.assertExceptionMsg > - > > Key: SPARK-38089 > URL: https://issues.apache.org/jira/browse/SPARK-38089 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Tests >Affects Versions: 3.2.1 >Reporter: Erik Krogen >Assignee: Erik Krogen >Priority: Major > > {{TestUtils.assertExceptionMsg}} is great, but when the assertion _doesn't_ > match, it can be challenging to tell why, because the exception tree that was > searched isn't printed. Only way I could find to fix it up was to run things > in a debugger and check the exception tree. > It would be very helpful if {{assertExceptionMsg}} printed out the exception > tree in which it was searching (upon failure). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37996) Contribution guide is stale
[ https://issues.apache.org/jira/browse/SPARK-37996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17486139#comment-17486139 ] Khalid Mammadov commented on SPARK-37996: - Raised PR: https://github.com/apache/spark-website/pull/378 with the following changes: * It describes, in the Pull request section of the Contributing page, the actual procedure and takes a contributor through a step-by-step process. * It removes the optional "Running tests in your forked repository" section on the Developer Tools page, which is obsolete and no longer reflects reality: it says we can test by clicking the “Run workflow” button, which is no longer available because the workflow no longer uses the "workflow_dispatch" event trigger; it was removed in * [[SPARK-35048][INFRA] Distribute GitHub Actions workflows to fork repositories to share the resources spark#32092|https://github.com/apache/spark/pull/32092] * Instead, it documents the new procedure that the above PR introduced, i.e. contributors need to use their own free GitHub workflow credits to test the changes they are proposing, and a Spark Actions workflow will expect that to be completed before the PR is marked ready for review. * Some general wording was copied from the "Running tests in your forked repository" section on the Developer Tools page, but the main content was rewritten to meet the objective. * Also fixed the URL to developer-tools.html so it is resolved by the parser (which converts it into a relative URI) instead of using a hard-coded absolute URL. > Contribution guide is stale > --- > > Key: SPARK-37996 > URL: https://issues.apache.org/jira/browse/SPARK-37996 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.2.0 >Reporter: Khalid Mammadov >Priority: Minor > > The contribution guide mentions the link below for testing on a local repo before > raising a PR, but the process has changed and the documentation does not reflect it. > https://spark.apache.org/developer-tools.html#github-workflow-tests > Only by digging into the git log of > [.github/workflows/build_and_test.yml|https://github.com/apache/spark/commit/2974b70d1efd4b1c5cfe7e2467766f0a9a1fec82#diff-48c0ee97c53013d18d6bbae44648f7fab9af2e0bf5b0dc1ca761e18ec5c478f2] > did I manage to find out what the new process is. It was changed in > [https://github.com/apache/spark/pull/32092] but the documentation was not > updated. > I am happy to contribute a fix, but apparently > [https://spark.apache.org/developer-tools.html] is hosted on the Apache website > rather than in the Spark source code -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37771) Race condition in withHiveState and limited logic in IsolatedClientLoader result in ClassNotFoundException
[ https://issues.apache.org/jira/browse/SPARK-37771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17486107#comment-17486107 ] Ivan Sadikov commented on SPARK-37771: -- I could not manage to work around the issue with Hadoop 3.3.1 binaries, it still persists. Shared prefixes config works; however, I found there are more issues with IsolatedClassLoader which might need to be fixed, e.g. the incorrect parent class loader is passed to IsolatedClassLoader in certain situations - I am debugging this now. No updates on the fix yet, workaround with the config works, and the issue is not blocking me at the moment. > Race condition in withHiveState and limited logic in IsolatedClientLoader > result in ClassNotFoundException > -- > > Key: SPARK-37771 > URL: https://issues.apache.org/jira/browse/SPARK-37771 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0, 3.1.2, 3.2.0 >Reporter: Ivan Sadikov >Priority: Major > > There is a race condition between creating a Hive client and loading classes > that do not appear in shared prefixes config. For example, we confirmed that > the code fails for the following configuration: > {code:java} > spark.sql.hive.metastore.version 0.13.0 > spark.sql.hive.metastore.jars maven > spark.sql.hive.metastore.sharedPrefixes com.amazonaws prefix> > spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem{code} > And code: > {code:java} > -- Prerequisite commands to set up the table > -- drop table if exists ivan_test_2; > -- create table ivan_test_2 (a int, part string) using csv location > 's3://bucket/hive-test' partitioned by (part); > -- insert into ivan_test_2 values (1, 'a'); > -- Command that triggers failure > ALTER TABLE ivan_test_2 ADD PARTITION (part='b') LOCATION > 's3://bucket/hive-test'{code} > > Stacktrace (line numbers might differ): > {code:java} > 21/12/22 04:37:05 DEBUG IsolatedClientLoader: shared class: > org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider > 21/12/22 04:37:05 DEBUG IsolatedClientLoader: shared class: > org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider > 21/12/22 04:37:05 DEBUG IsolatedClientLoader: hive class: > com.amazonaws.auth.EnvironmentVariableCredentialsProvider - null > 21/12/22 04:37:05 ERROR S3AFileSystem: Failed to initialize S3AFileSystem for > path s3://bucket/hive-test > java.io.IOException: From option fs.s3a.aws.credentials.provider > java.lang.ClassNotFoundException: Class > com.amazonaws.auth.EnvironmentVariableCredentialsProvider not found > at > org.apache.hadoop.fs.s3a.S3AUtils.loadAWSProviderClasses(S3AUtils.java:725) > at > org.apache.hadoop.fs.s3a.S3AUtils.createAWSCredentialProviderSet(S3AUtils.java:688) > at > org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:411) > at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3469) > at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174) > at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574) > at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521) > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540) > at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365) > at org.apache.hadoop.hive.metastore.Warehouse.getFs(Warehouse.java:112) > at > org.apache.hadoop.hive.metastore.Warehouse.getDnsPath(Warehouse.java:144) > at > org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createLocationForAddedPartition(HiveMetaStore.java:1993) > at > 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.add_partitions_core(HiveMetaStore.java:1865) > at > org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.add_partitions_req(HiveMetaStore.java:1910) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:105) > at com.sun.proxy.$Proxy58.add_partitions_req(Unknown Source) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.add_partitions(HiveMetaStoreClient.java:457) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at >
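The failure mode is easier to see from the shape of the shared-class check. A simplified sketch of the decision IsolatedClientLoader makes per class (the prefix list and method name are illustrative, not the exact Spark source):

{code:scala}
// Classes matching a shared prefix are served by Spark's base class loader;
// everything else is looked up by the isolated Hive loader. A credentials
// provider class missing from spark.sql.hive.metastore.sharedPrefixes is
// therefore resolved in the isolated loader and can come up empty there.
val sharedPrefixes: Seq[String] =
  Seq("java.lang", "scala", "org.apache.spark", "com.amazonaws")

def isSharedClass(name: String): Boolean =
  sharedPrefixes.exists(name.startsWith)
{code}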
[jira] [Commented] (SPARK-38091) AvroSerializer can cause java.lang.ClassCastException at run time
[ https://issues.apache.org/jira/browse/SPARK-38091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17486092#comment-17486092 ] Erik Krogen commented on SPARK-38091: - [~Zhen-hao] for formatting you need to use the Atlassian markup: [https://jira.atlassian.com/secure/WikiRendererHelpAction.jspa?section=all] Basically replace ` ... ` with \{{ ... }} and replace ``` ... ``` with \{code} ... \{code} > AvroSerializer can cause java.lang.ClassCastException at run time > - > > Key: SPARK-38091 > URL: https://issues.apache.org/jira/browse/SPARK-38091 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, > 3.2.1 >Reporter: Zhenhao Li >Priority: Major > Labels: Avro, serializers > > `AvroSerializer`'s implementation, at least in `newConverter`, was not 100% > based on the `InternalRow` and `SpecializedGetters` interface. It assumes > many implementation details of the interface. > For example, in > ```scala > case (TimestampType, LONG) => avroType.getLogicalType match { > // For backward compatibility, if the Avro type is Long and it is > not logical type > // (the `null` case), output the timestamp value as with > millisecond precision. > case null | _: TimestampMillis => (getter, ordinal) => > > DateTimeUtils.microsToMillis(timestampRebaseFunc(getter.getLong(ordinal))) > case _: TimestampMicros => (getter, ordinal) => > timestampRebaseFunc(getter.getLong(ordinal)) > case other => throw new IncompatibleSchemaException(errorPrefix + > s"SQL type ${TimestampType.sql} cannot be converted to Avro > logical type $other") > } > ``` > it assumes the `InternalRow` instance encodes `TimestampType` as > `java.lang.Long`. That's true for `Unsaferow` but not for > `GenericInternalRow`. > Hence the above code will end up with runtime exceptions when used on an > instance of `GenericInternalRow`, which is the case for Python UDF. > I didn't get time to dig deeper than that. I got the impression that Spark's > optimizer(s) will turn a row into a `UnsafeRow` and Python UDF doesn't > involve the optimizer(s) and hence each row is a `GenericInternalRow`. > It would be great if someone can correct me or offer a better explanation. > > To reproduce the issue, > `git checkout master` and `git cherry-pick --no-commit` > [this-commit|https://github.com/Zhen-hao/spark/commit/1ffe8e8f35273b2f3529f6c2d004822f480e4c88] > and run the test `org.apache.spark.sql.avro.AvroSerdeSuite`. 
> > You will see runtime exceptions like the following one > ``` > - Serialize DecimalType to Avro BYTES with logical type decimal *** FAILED *** > java.lang.ClassCastException: class java.math.BigDecimal cannot be cast to > class org.apache.spark.sql.types.Decimal (java.math.BigDecimal is in module > java.base of loader 'bootstrap'; org.apache.spark.sql.types.Decimal is in > unnamed module of loader 'app') > at > org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getDecimal(rows.scala:45) > at > org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getDecimal$(rows.scala:45) > at > org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getDecimal(rows.scala:195) > at > org.apache.spark.sql.avro.AvroSerializer.$anonfun$newConverter$10(AvroSerializer.scala:136) > at > org.apache.spark.sql.avro.AvroSerializer.$anonfun$newConverter$10$adapted(AvroSerializer.scala:135) > at > org.apache.spark.sql.avro.AvroSerializer.$anonfun$newStructConverter$2(AvroSerializer.scala:283) > at > org.apache.spark.sql.avro.AvroSerializer.serialize(AvroSerializer.scala:60) > at > org.apache.spark.sql.avro.AvroSerdeSuite.$anonfun$new$5(AvroSerdeSuite.scala:82) > at > org.apache.spark.sql.avro.AvroSerdeSuite.$anonfun$new$5$adapted(AvroSerdeSuite.scala:67) > at > org.apache.spark.sql.avro.AvroSerdeSuite.$anonfun$withFieldMatchType$2(AvroSerdeSuite.scala:217) > ``` -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
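The cast in that failure can be reproduced in isolation with Spark's catalyst internals, for example from a spark-shell on a matching version:

{code:scala}
import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
import org.apache.spark.sql.types.Decimal

// BaseGenericInternalRow.getDecimal simply casts whatever sits in the slot,
// so a row built with a raw java.math.BigDecimal instead of Spark's Decimal
// fails at read time, exactly as in the reported test failure.
val good = new GenericInternalRow(Array[Any](Decimal("1.23")))
good.getDecimal(0, 3, 2)   // returns Decimal(1.23)

val bad = new GenericInternalRow(Array[Any](new java.math.BigDecimal("1.23")))
// bad.getDecimal(0, 3, 2) // throws java.lang.ClassCastException
{code}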
[jira] [Created] (SPARK-38093) Set shuffleMergeAllowed to false for a determinate stage after the stage is finalized
Venkata krishnan Sowrirajan created SPARK-38093: --- Summary: Set shuffleMergeAllowed to false for a determinate stage after the stage is finalized Key: SPARK-38093 URL: https://issues.apache.org/jira/browse/SPARK-38093 Project: Spark Issue Type: Sub-task Components: Shuffle Affects Versions: 3.2.1 Reporter: Venkata krishnan Sowrirajan Currently we are setting shuffleMergeAllowed to false before prepareShuffleServicesForShuffleMapStage if the shuffle dependency is already finalized. Ideally it is better to do it right after shuffle dependency finalization for a determinate stage. cc [~mridulm80] -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38092) Check if shuffleMergeId is the same as the current stage's shuffleMergeId before registering MergeStatus
Venkata krishnan Sowrirajan created SPARK-38092: --- Summary: Check if shuffleMergeId is the same as the current stage's shuffleMergeId before registering MergeStatus Key: SPARK-38092 URL: https://issues.apache.org/jira/browse/SPARK-38092 Project: Spark Issue Type: Sub-task Components: Shuffle Affects Versions: 3.2.1 Reporter: Venkata krishnan Sowrirajan Currently this is handled in handleShuffleMergeFinalized during finalization, ensuring the finalize request is indeed for the current stage's shuffle dependency's shuffleMergeId. The same check has to be done before registering merge statuses as well. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
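A hypothetical sketch of the proposed guard (the helper and parameter names are illustrative; the merged change may look different):

{code:scala}
import org.apache.spark.ShuffleDependency

// Register a MergeStatus only when it carries the shuffleMergeId of the
// stage's current shuffle dependency; statuses produced by a stale merge
// attempt from an earlier stage attempt are dropped.
def registerIfCurrent[K, V, C](
    dep: ShuffleDependency[K, V, C],
    incomingShuffleMergeId: Int)(register: => Unit): Unit = {
  if (dep.shuffleMergeId == incomingShuffleMergeId) {
    register
  }
}
{code}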
[jira] [Updated] (SPARK-38047) Add OUTLIER_NO_FALLBACK executor roll policy
[ https://issues.apache.org/jira/browse/SPARK-38047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Holmes updated SPARK-38047: Description: Currently executor rolling will always kill one executor every {{{}spark.kubernetes.executor.rollInterval{}}}. This may not be optimal in cases where the executor metric isn't an outlier compared to other executors. There is a cost associated with killing executors (ramp-up time for new executors for example) which applications may not want to incur for non-outlier executors. This ticket would add the ability to only kill executors if they are outliers, via the introduction of a new roll policy. was: Currently executor rolling will always kill one executor every {{{}spark.kubernetes.executor.rollInterval{}}}. For some of the policies this may not be optimal in cases where the executor metric isn't an outlier compared to other executors. There is a cost associated with killing executors (ramp-up time for new executors for example) which applications may not want to incur for non-outlier executors. This ticket would add the ability to only kill executors if they are outliers. > Add OUTLIER_NO_FALLBACK executor roll policy > > > Key: SPARK-38047 > URL: https://issues.apache.org/jira/browse/SPARK-38047 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Alex Holmes >Assignee: Alex Holmes >Priority: Major > Fix For: 3.3.0 > > > Currently executor rolling will always kill one executor every > {{{}spark.kubernetes.executor.rollInterval{}}}. This may not be optimal in > cases where the executor metric isn't an outlier compared to other executors. > There is a cost associated with killing executors (ramp-up time for new > executors for example) which applications may not want to incur for > non-outlier executors. > > This ticket would add the ability to only kill executors if they are > outliers, via the introduction of a new roll policy. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
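A sketch of what an outlier-only decision could look like (the threshold and selection rule below are assumptions for illustration, not the merged policy):

{code:scala}
// Pick at most one executor to roll, and only if its metric is a statistical
// outlier, here defined as exceeding mean + 2 * stddev across all executors.
def chooseOutlier(metricByExecutor: Map[String, Double]): Option[String] = {
  if (metricByExecutor.size < 2) return None
  val values = metricByExecutor.values
  val mean = values.sum / values.size
  val std = math.sqrt(values.map(v => (v - mean) * (v - mean)).sum / values.size)
  metricByExecutor
    .filter { case (_, v) => v > mean + 2 * std }  // no fallback: may be empty
    .toSeq
    .sortBy { case (_, v) => -v }
    .headOption
    .map { case (id, _) => id }
}
{code}

The "NO_FALLBACK" part corresponds to the empty-result case: when no executor is an outlier, nothing gets rolled.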
[jira] [Commented] (SPARK-38091) AvroSerializer can cause java.lang.ClassCastException at run time
[ https://issues.apache.org/jira/browse/SPARK-38091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17486041#comment-17486041 ] Apache Spark commented on SPARK-38091: -- User 'Zhen-hao' has created a pull request for this issue: https://github.com/apache/spark/pull/35379 > AvroSerializer can cause java.lang.ClassCastException at run time > - > > Key: SPARK-38091 > URL: https://issues.apache.org/jira/browse/SPARK-38091 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, > 3.2.1 >Reporter: Zhenhao Li >Priority: Major > Labels: Avro, serializers > > `AvroSerializer`'s implementation, at least in `newConverter`, was not 100% > based on the `InternalRow` and `SpecializedGetters` interface. It assumes > many implementation details of the interface. > For example, in > ```scala > case (TimestampType, LONG) => avroType.getLogicalType match { > // For backward compatibility, if the Avro type is Long and it is > not logical type > // (the `null` case), output the timestamp value as with > millisecond precision. > case null | _: TimestampMillis => (getter, ordinal) => > > DateTimeUtils.microsToMillis(timestampRebaseFunc(getter.getLong(ordinal))) > case _: TimestampMicros => (getter, ordinal) => > timestampRebaseFunc(getter.getLong(ordinal)) > case other => throw new IncompatibleSchemaException(errorPrefix + > s"SQL type ${TimestampType.sql} cannot be converted to Avro > logical type $other") > } > ``` > it assumes the `InternalRow` instance encodes `TimestampType` as > `java.lang.Long`. That's true for `Unsaferow` but not for > `GenericInternalRow`. > Hence the above code will end up with runtime exceptions when used on an > instance of `GenericInternalRow`, which is the case for Python UDF. > I didn't get time to dig deeper than that. I got the impression that Spark's > optimizer(s) will turn a row into a `UnsafeRow` and Python UDF doesn't > involve the optimizer(s) and hence each row is a `GenericInternalRow`. > It would be great if someone can correct me or offer a better explanation. > > To reproduce the issue, > `git checkout master` and `git cherry-pick --no-commit` > [this-commit|https://github.com/Zhen-hao/spark/commit/1ffe8e8f35273b2f3529f6c2d004822f480e4c88] > and run the test `org.apache.spark.sql.avro.AvroSerdeSuite`. 
> > You will see runtime exceptions like the following one > ``` > - Serialize DecimalType to Avro BYTES with logical type decimal *** FAILED *** > java.lang.ClassCastException: class java.math.BigDecimal cannot be cast to > class org.apache.spark.sql.types.Decimal (java.math.BigDecimal is in module > java.base of loader 'bootstrap'; org.apache.spark.sql.types.Decimal is in > unnamed module of loader 'app') > at > org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getDecimal(rows.scala:45) > at > org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getDecimal$(rows.scala:45) > at > org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getDecimal(rows.scala:195) > at > org.apache.spark.sql.avro.AvroSerializer.$anonfun$newConverter$10(AvroSerializer.scala:136) > at > org.apache.spark.sql.avro.AvroSerializer.$anonfun$newConverter$10$adapted(AvroSerializer.scala:135) > at > org.apache.spark.sql.avro.AvroSerializer.$anonfun$newStructConverter$2(AvroSerializer.scala:283) > at > org.apache.spark.sql.avro.AvroSerializer.serialize(AvroSerializer.scala:60) > at > org.apache.spark.sql.avro.AvroSerdeSuite.$anonfun$new$5(AvroSerdeSuite.scala:82) > at > org.apache.spark.sql.avro.AvroSerdeSuite.$anonfun$new$5$adapted(AvroSerdeSuite.scala:67) > at > org.apache.spark.sql.avro.AvroSerdeSuite.$anonfun$withFieldMatchType$2(AvroSerdeSuite.scala:217) > ``` -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38091) AvroSerializer can cause java.lang.ClassCastException at run time
[ https://issues.apache.org/jira/browse/SPARK-38091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38091: Assignee: Apache Spark > AvroSerializer can cause java.lang.ClassCastException at run time > - > > Key: SPARK-38091 > URL: https://issues.apache.org/jira/browse/SPARK-38091 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, > 3.2.1 >Reporter: Zhenhao Li >Assignee: Apache Spark >Priority: Major > Labels: Avro, serializers > > `AvroSerializer`'s implementation, at least in `newConverter`, was not 100% > based on the `InternalRow` and `SpecializedGetters` interface. It assumes > many implementation details of the interface. > For example, in > ```scala > case (TimestampType, LONG) => avroType.getLogicalType match { > // For backward compatibility, if the Avro type is Long and it is > not logical type > // (the `null` case), output the timestamp value as with > millisecond precision. > case null | _: TimestampMillis => (getter, ordinal) => > > DateTimeUtils.microsToMillis(timestampRebaseFunc(getter.getLong(ordinal))) > case _: TimestampMicros => (getter, ordinal) => > timestampRebaseFunc(getter.getLong(ordinal)) > case other => throw new IncompatibleSchemaException(errorPrefix + > s"SQL type ${TimestampType.sql} cannot be converted to Avro > logical type $other") > } > ``` > it assumes the `InternalRow` instance encodes `TimestampType` as > `java.lang.Long`. That's true for `Unsaferow` but not for > `GenericInternalRow`. > Hence the above code will end up with runtime exceptions when used on an > instance of `GenericInternalRow`, which is the case for Python UDF. > I didn't get time to dig deeper than that. I got the impression that Spark's > optimizer(s) will turn a row into a `UnsafeRow` and Python UDF doesn't > involve the optimizer(s) and hence each row is a `GenericInternalRow`. > It would be great if someone can correct me or offer a better explanation. > > To reproduce the issue, > `git checkout master` and `git cherry-pick --no-commit` > [this-commit|https://github.com/Zhen-hao/spark/commit/1ffe8e8f35273b2f3529f6c2d004822f480e4c88] > and run the test `org.apache.spark.sql.avro.AvroSerdeSuite`. 
> > You will see runtime exceptions like the following one > ``` > - Serialize DecimalType to Avro BYTES with logical type decimal *** FAILED *** > java.lang.ClassCastException: class java.math.BigDecimal cannot be cast to > class org.apache.spark.sql.types.Decimal (java.math.BigDecimal is in module > java.base of loader 'bootstrap'; org.apache.spark.sql.types.Decimal is in > unnamed module of loader 'app') > at > org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getDecimal(rows.scala:45) > at > org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getDecimal$(rows.scala:45) > at > org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getDecimal(rows.scala:195) > at > org.apache.spark.sql.avro.AvroSerializer.$anonfun$newConverter$10(AvroSerializer.scala:136) > at > org.apache.spark.sql.avro.AvroSerializer.$anonfun$newConverter$10$adapted(AvroSerializer.scala:135) > at > org.apache.spark.sql.avro.AvroSerializer.$anonfun$newStructConverter$2(AvroSerializer.scala:283) > at > org.apache.spark.sql.avro.AvroSerializer.serialize(AvroSerializer.scala:60) > at > org.apache.spark.sql.avro.AvroSerdeSuite.$anonfun$new$5(AvroSerdeSuite.scala:82) > at > org.apache.spark.sql.avro.AvroSerdeSuite.$anonfun$new$5$adapted(AvroSerdeSuite.scala:67) > at > org.apache.spark.sql.avro.AvroSerdeSuite.$anonfun$withFieldMatchType$2(AvroSerdeSuite.scala:217) > ``` -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38091) AvroSerializer can cause java.lang.ClassCastException at run time
[ https://issues.apache.org/jira/browse/SPARK-38091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38091: Assignee: (was: Apache Spark) > AvroSerializer can cause java.lang.ClassCastException at run time > - > > Key: SPARK-38091 > URL: https://issues.apache.org/jira/browse/SPARK-38091 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, > 3.2.1 >Reporter: Zhenhao Li >Priority: Major > Labels: Avro, serializers > > `AvroSerializer`'s implementation, at least in `newConverter`, was not 100% > based on the `InternalRow` and `SpecializedGetters` interface. It assumes > many implementation details of the interface. > For example, in > ```scala > case (TimestampType, LONG) => avroType.getLogicalType match { > // For backward compatibility, if the Avro type is Long and it is > not logical type > // (the `null` case), output the timestamp value as with > millisecond precision. > case null | _: TimestampMillis => (getter, ordinal) => > > DateTimeUtils.microsToMillis(timestampRebaseFunc(getter.getLong(ordinal))) > case _: TimestampMicros => (getter, ordinal) => > timestampRebaseFunc(getter.getLong(ordinal)) > case other => throw new IncompatibleSchemaException(errorPrefix + > s"SQL type ${TimestampType.sql} cannot be converted to Avro > logical type $other") > } > ``` > it assumes the `InternalRow` instance encodes `TimestampType` as > `java.lang.Long`. That's true for `Unsaferow` but not for > `GenericInternalRow`. > Hence the above code will end up with runtime exceptions when used on an > instance of `GenericInternalRow`, which is the case for Python UDF. > I didn't get time to dig deeper than that. I got the impression that Spark's > optimizer(s) will turn a row into a `UnsafeRow` and Python UDF doesn't > involve the optimizer(s) and hence each row is a `GenericInternalRow`. > It would be great if someone can correct me or offer a better explanation. > > To reproduce the issue, > `git checkout master` and `git cherry-pick --no-commit` > [this-commit|https://github.com/Zhen-hao/spark/commit/1ffe8e8f35273b2f3529f6c2d004822f480e4c88] > and run the test `org.apache.spark.sql.avro.AvroSerdeSuite`. 
> > You will see runtime exceptions like the following one > ``` > - Serialize DecimalType to Avro BYTES with logical type decimal *** FAILED *** > java.lang.ClassCastException: class java.math.BigDecimal cannot be cast to > class org.apache.spark.sql.types.Decimal (java.math.BigDecimal is in module > java.base of loader 'bootstrap'; org.apache.spark.sql.types.Decimal is in > unnamed module of loader 'app') > at > org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getDecimal(rows.scala:45) > at > org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getDecimal$(rows.scala:45) > at > org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getDecimal(rows.scala:195) > at > org.apache.spark.sql.avro.AvroSerializer.$anonfun$newConverter$10(AvroSerializer.scala:136) > at > org.apache.spark.sql.avro.AvroSerializer.$anonfun$newConverter$10$adapted(AvroSerializer.scala:135) > at > org.apache.spark.sql.avro.AvroSerializer.$anonfun$newStructConverter$2(AvroSerializer.scala:283) > at > org.apache.spark.sql.avro.AvroSerializer.serialize(AvroSerializer.scala:60) > at > org.apache.spark.sql.avro.AvroSerdeSuite.$anonfun$new$5(AvroSerdeSuite.scala:82) > at > org.apache.spark.sql.avro.AvroSerdeSuite.$anonfun$new$5$adapted(AvroSerdeSuite.scala:67) > at > org.apache.spark.sql.avro.AvroSerdeSuite.$anonfun$withFieldMatchType$2(AvroSerdeSuite.scala:217) > ``` -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38091) AvroSerializer can cause java.lang.ClassCastException at run time
[ https://issues.apache.org/jira/browse/SPARK-38091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17486036#comment-17486036 ] Zhenhao Li commented on SPARK-38091: Can someone tell me how to let Jira render markdown? > AvroSerializer can cause java.lang.ClassCastException at run time > - > > Key: SPARK-38091 > URL: https://issues.apache.org/jira/browse/SPARK-38091 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, > 3.2.1 >Reporter: Zhenhao Li >Priority: Major > Labels: Avro, serializers > > `AvroSerializer`'s implementation, at least in `newConverter`, was not 100% > based on the `InternalRow` and `SpecializedGetters` interface. It assumes > many implementation details of the interface. > For example, in > ```scala > case (TimestampType, LONG) => avroType.getLogicalType match { > // For backward compatibility, if the Avro type is Long and it is > not logical type > // (the `null` case), output the timestamp value as with > millisecond precision. > case null | _: TimestampMillis => (getter, ordinal) => > > DateTimeUtils.microsToMillis(timestampRebaseFunc(getter.getLong(ordinal))) > case _: TimestampMicros => (getter, ordinal) => > timestampRebaseFunc(getter.getLong(ordinal)) > case other => throw new IncompatibleSchemaException(errorPrefix + > s"SQL type ${TimestampType.sql} cannot be converted to Avro > logical type $other") > } > ``` > it assumes the `InternalRow` instance encodes `TimestampType` as > `java.lang.Long`. That's true for `Unsaferow` but not for > `GenericInternalRow`. > Hence the above code will end up with runtime exceptions when used on an > instance of `GenericInternalRow`, which is the case for Python UDF. > I didn't get time to dig deeper than that. I got the impression that Spark's > optimizer(s) will turn a row into a `UnsafeRow` and Python UDF doesn't > involve the optimizer(s) and hence each row is a `GenericInternalRow`. > It would be great if someone can correct me or offer a better explanation. > > To reproduce the issue, > `git checkout master` and `git cherry-pick --no-commit` > [this-commit|https://github.com/Zhen-hao/spark/commit/1ffe8e8f35273b2f3529f6c2d004822f480e4c88] > and run the test `org.apache.spark.sql.avro.AvroSerdeSuite`. 
> > You will see runtime exceptions like the following one > ``` > - Serialize DecimalType to Avro BYTES with logical type decimal *** FAILED *** > java.lang.ClassCastException: class java.math.BigDecimal cannot be cast to > class org.apache.spark.sql.types.Decimal (java.math.BigDecimal is in module > java.base of loader 'bootstrap'; org.apache.spark.sql.types.Decimal is in > unnamed module of loader 'app') > at > org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getDecimal(rows.scala:45) > at > org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getDecimal$(rows.scala:45) > at > org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getDecimal(rows.scala:195) > at > org.apache.spark.sql.avro.AvroSerializer.$anonfun$newConverter$10(AvroSerializer.scala:136) > at > org.apache.spark.sql.avro.AvroSerializer.$anonfun$newConverter$10$adapted(AvroSerializer.scala:135) > at > org.apache.spark.sql.avro.AvroSerializer.$anonfun$newStructConverter$2(AvroSerializer.scala:283) > at > org.apache.spark.sql.avro.AvroSerializer.serialize(AvroSerializer.scala:60) > at > org.apache.spark.sql.avro.AvroSerdeSuite.$anonfun$new$5(AvroSerdeSuite.scala:82) > at > org.apache.spark.sql.avro.AvroSerdeSuite.$anonfun$new$5$adapted(AvroSerdeSuite.scala:67) > at > org.apache.spark.sql.avro.AvroSerdeSuite.$anonfun$withFieldMatchType$2(AvroSerdeSuite.scala:217) > ``` -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38091) AvroSerializer can cause java.lang.ClassCastException at run time
Zhenhao Li created SPARK-38091: -- Summary: AvroSerializer can cause java.lang.ClassCastException at run time Key: SPARK-38091 URL: https://issues.apache.org/jira/browse/SPARK-38091 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.1, 3.2.0, 3.1.2, 3.1.1, 3.1.0, 3.0.3, 3.0.2, 3.0.1, 3.0.0 Reporter: Zhenhao Li `AvroSerializer`'s implementation, at least in `newConverter`, was not 100% based on the `InternalRow` and `SpecializedGetters` interface. It assumes many implementation details of the interface. For example, in ```scala case (TimestampType, LONG) => avroType.getLogicalType match { // For backward compatibility, if the Avro type is Long and it is not logical type // (the `null` case), output the timestamp value as with millisecond precision. case null | _: TimestampMillis => (getter, ordinal) => DateTimeUtils.microsToMillis(timestampRebaseFunc(getter.getLong(ordinal))) case _: TimestampMicros => (getter, ordinal) => timestampRebaseFunc(getter.getLong(ordinal)) case other => throw new IncompatibleSchemaException(errorPrefix + s"SQL type ${TimestampType.sql} cannot be converted to Avro logical type $other") } ``` it assumes the `InternalRow` instance encodes `TimestampType` as `java.lang.Long`. That's true for `Unsaferow` but not for `GenericInternalRow`. Hence the above code will end up with runtime exceptions when used on an instance of `GenericInternalRow`, which is the case for Python UDF. I didn't get time to dig deeper than that. I got the impression that Spark's optimizer(s) will turn a row into a `UnsafeRow` and Python UDF doesn't involve the optimizer(s) and hence each row is a `GenericInternalRow`. It would be great if someone can correct me or offer a better explanation. To reproduce the issue, `git checkout master` and `git cherry-pick --no-commit` [this-commit|https://github.com/Zhen-hao/spark/commit/1ffe8e8f35273b2f3529f6c2d004822f480e4c88] and run the test `org.apache.spark.sql.avro.AvroSerdeSuite`. You will see runtime exceptions like the following one ``` - Serialize DecimalType to Avro BYTES with logical type decimal *** FAILED *** java.lang.ClassCastException: class java.math.BigDecimal cannot be cast to class org.apache.spark.sql.types.Decimal (java.math.BigDecimal is in module java.base of loader 'bootstrap'; org.apache.spark.sql.types.Decimal is in unnamed module of loader 'app') at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getDecimal(rows.scala:45) at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getDecimal$(rows.scala:45) at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getDecimal(rows.scala:195) at org.apache.spark.sql.avro.AvroSerializer.$anonfun$newConverter$10(AvroSerializer.scala:136) at org.apache.spark.sql.avro.AvroSerializer.$anonfun$newConverter$10$adapted(AvroSerializer.scala:135) at org.apache.spark.sql.avro.AvroSerializer.$anonfun$newStructConverter$2(AvroSerializer.scala:283) at org.apache.spark.sql.avro.AvroSerializer.serialize(AvroSerializer.scala:60) at org.apache.spark.sql.avro.AvroSerdeSuite.$anonfun$new$5(AvroSerdeSuite.scala:82) at org.apache.spark.sql.avro.AvroSerdeSuite.$anonfun$new$5$adapted(AvroSerdeSuite.scala:67) at org.apache.spark.sql.avro.AvroSerdeSuite.$anonfun$withFieldMatchType$2(AvroSerdeSuite.scala:217) ``` -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38089) Improve assertion failure message in TestUtils.assertExceptionMsg
[ https://issues.apache.org/jira/browse/SPARK-38089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38089: Assignee: (was: Apache Spark) > Improve assertion failure message in TestUtils.assertExceptionMsg > - > > Key: SPARK-38089 > URL: https://issues.apache.org/jira/browse/SPARK-38089 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Tests >Affects Versions: 3.2.1 >Reporter: Erik Krogen >Priority: Major > > {{TestUtils.assertExceptionMsg}} is great, but when the assertion _doesn't_ > match, it can be challenging to tell why, because the exception tree that was > searched isn't printed. Only way I could find to fix it up was to run things > in a debugger and check the exception tree. > It would be very helpful if {{assertExceptionMsg}} printed out the exception > tree in which it was searching (upon failure). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38089) Improve assertion failure message in TestUtils.assertExceptionMsg
[ https://issues.apache.org/jira/browse/SPARK-38089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17486025#comment-17486025 ] Apache Spark commented on SPARK-38089: -- User 'xkrogen' has created a pull request for this issue: https://github.com/apache/spark/pull/35383 > Improve assertion failure message in TestUtils.assertExceptionMsg > - > > Key: SPARK-38089 > URL: https://issues.apache.org/jira/browse/SPARK-38089 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Tests >Affects Versions: 3.2.1 >Reporter: Erik Krogen >Priority: Major > > {{TestUtils.assertExceptionMsg}} is great, but when the assertion _doesn't_ > match, it can be challenging to tell why, because the exception tree that was > searched isn't printed. Only way I could find to fix it up was to run things > in a debugger and check the exception tree. > It would be very helpful if {{assertExceptionMsg}} printed out the exception > tree in which it was searching (upon failure). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38089) Improve assertion failure message in TestUtils.assertExceptionMsg
[ https://issues.apache.org/jira/browse/SPARK-38089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17486027#comment-17486027 ] Apache Spark commented on SPARK-38089: -- User 'xkrogen' has created a pull request for this issue: https://github.com/apache/spark/pull/35383 > Improve assertion failure message in TestUtils.assertExceptionMsg > - > > Key: SPARK-38089 > URL: https://issues.apache.org/jira/browse/SPARK-38089 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Tests >Affects Versions: 3.2.1 >Reporter: Erik Krogen >Priority: Major > > {{TestUtils.assertExceptionMsg}} is great, but when the assertion _doesn't_ > match, it can be challenging to tell why, because the exception tree that was > searched isn't printed. Only way I could find to fix it up was to run things > in a debugger and check the exception tree. > It would be very helpful if {{assertExceptionMsg}} printed out the exception > tree in which it was searching (upon failure). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38089) Improve assertion failure message in TestUtils.assertExceptionMsg
[ https://issues.apache.org/jira/browse/SPARK-38089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38089: Assignee: Apache Spark > Improve assertion failure message in TestUtils.assertExceptionMsg > - > > Key: SPARK-38089 > URL: https://issues.apache.org/jira/browse/SPARK-38089 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Tests >Affects Versions: 3.2.1 >Reporter: Erik Krogen >Assignee: Apache Spark >Priority: Major > > {{TestUtils.assertExceptionMsg}} is great, but when the assertion _doesn't_ > match, it can be challenging to tell why, because the exception tree that was > searched isn't printed. Only way I could find to fix it up was to run things > in a debugger and check the exception tree. > It would be very helpful if {{assertExceptionMsg}} printed out the exception > tree in which it was searching (upon failure). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38090) Make links for stderr/stdout on Spark on Kube configurable
Holden Karau created SPARK-38090: Summary: Make links for stderr/stdout on Spark on Kube configurable Key: SPARK-38090 URL: https://issues.apache.org/jira/browse/SPARK-38090 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 3.3.0, 3.2.2 Reporter: Holden Karau Assignee: Holden Karau Unlike YARN, different clusters store pod logs in different locations. We should allow people to configure the links so that they can go to a web UI for their cluster's stderr/stdout, or print out the kubectl commands for users who don't have a link configured. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
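One plausible shape for this is a URL template expanded per pod, along the lines of the existing spark.ui.custom.executor.log.url mechanism on YARN; the host and template variables below are assumptions about the eventual API, not the final one:

{code:scala}
import org.apache.spark.SparkConf

// Hypothetical sketch: point executor stderr/stdout links at a cluster's
// log UI, with per-pod substitution performed by Spark.
val sparkConf = new SparkConf()
  .set("spark.ui.custom.executor.log.url",
    "https://logs.example.com/{{POD_NAME}}/{{FILE_NAME}}")
{code}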
[jira] [Created] (SPARK-38089) Improve assertion failure message in TestUtils.assertExceptionMsg
Erik Krogen created SPARK-38089: --- Summary: Improve assertion failure message in TestUtils.assertExceptionMsg Key: SPARK-38089 URL: https://issues.apache.org/jira/browse/SPARK-38089 Project: Spark Issue Type: Improvement Components: Spark Core, Tests Affects Versions: 3.2.1 Reporter: Erik Krogen {{TestUtils.assertExceptionMsg}} is great, but when the assertion _doesn't_ match, it can be challenging to tell why, because the exception tree that was searched isn't printed. Only way I could find to fix it up was to run things in a debugger and check the exception tree. It would be very helpful if {{assertExceptionMsg}} printed out the exception tree in which it was searching (upon failure). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37145) Improvement for extending pod feature steps with KubernetesConf
[ https://issues.apache.org/jira/browse/SPARK-37145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-37145. --- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35345 [https://github.com/apache/spark/pull/35345] > Improvement for extending pod feature steps with KubernetesConf > --- > > Key: SPARK-37145 > URL: https://issues.apache.org/jira/browse/SPARK-37145 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.2.0 >Reporter: wangxin201492 >Assignee: Yikun Jiang >Priority: Major > Fix For: 3.3.0 > > > SPARK-33261 provides us with great convenience, but it only constructs a > `KubernetesFeatureConfigStep` with an empty constructor. > It would be better to use a constructor that takes `KubernetesConf` (or, > in more detail: `KubernetesDriverConf` and `KubernetesExecutorConf`) -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37145) Improvement for extending pod feature steps with KubernetesConf
[ https://issues.apache.org/jira/browse/SPARK-37145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-37145: - Assignee: Yikun Jiang > Improvement for extending pod feature steps with KubernetesConf > --- > > Key: SPARK-37145 > URL: https://issues.apache.org/jira/browse/SPARK-37145 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.2.0 >Reporter: wangxin201492 >Assignee: Yikun Jiang >Priority: Major > > SPARK-33261 provides us with great convenience, but it only constructs a > `KubernetesFeatureConfigStep` with an empty constructor. > It would be better to use a constructor that takes `KubernetesConf` (or, > in more detail: `KubernetesDriverConf` and `KubernetesExecutorConf`) -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37145) Add KubernetesCustom[Driver/Executor]FeatureConfigStep developer API
[ https://issues.apache.org/jira/browse/SPARK-37145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-37145: -- Summary: Add KubernetesCustom[Driver/Executor]FeatureConfigStep developer API (was: Improvement for extending pod feature steps with KubernetesConf) > Add KubernetesCustom[Driver/Executor]FeatureConfigStep developer API > > > Key: SPARK-37145 > URL: https://issues.apache.org/jira/browse/SPARK-37145 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.2.0 >Reporter: wangxin201492 >Assignee: Yikun Jiang >Priority: Major > Fix For: 3.3.0 > > > SPARK-33261 provides us with great convenience, but it only constructs a > `KubernetesFeatureConfigStep` with an empty constructor. > It would be better to use a constructor that takes `KubernetesConf` (or, > in more detail: `KubernetesDriverConf` and `KubernetesExecutorConf`) -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
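A sketch of what a user-supplied step could look like under such a developer API (the trait name follows the new summary; the init hook and method names are assumptions that may differ from the merged code):

{code:scala}
import io.fabric8.kubernetes.api.model.PodBuilder
import org.apache.spark.deploy.k8s.{KubernetesDriverConf, SparkPod}
import org.apache.spark.deploy.k8s.features.KubernetesDriverCustomFeatureConfigStep

// A custom driver feature step that receives the KubernetesDriverConf via an
// init hook instead of a constructor parameter, then uses it to label the pod.
class AppIdLabelStep extends KubernetesDriverCustomFeatureConfigStep {
  private var conf: KubernetesDriverConf = _

  override def init(config: KubernetesDriverConf): Unit = conf = config

  override def configurePod(pod: SparkPod): SparkPod = {
    val labeled = new PodBuilder(pod.pod)
      .editOrNewMetadata()
        .addToLabels("spark-app-id", conf.appId)
      .endMetadata()
      .build()
    SparkPod(labeled, pod.container)
  }
}
{code}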
[jira] [Commented] (SPARK-37771) Race condition in withHiveState and limited logic in IsolatedClientLoader result in ClassNotFoundException
[ https://issues.apache.org/jira/browse/SPARK-37771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17485973#comment-17485973 ] Steve Loughran commented on SPARK-37771: [~ivan.sadikov] -any update here? > Race condition in withHiveState and limited logic in IsolatedClientLoader > result in ClassNotFoundException > -- > > Key: SPARK-37771 > URL: https://issues.apache.org/jira/browse/SPARK-37771 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0, 3.1.2, 3.2.0 >Reporter: Ivan Sadikov >Priority: Major > > There is a race condition between creating a Hive client and loading classes > that do not appear in shared prefixes config. For example, we confirmed that > the code fails for the following configuration: > {code:java} > spark.sql.hive.metastore.version 0.13.0 > spark.sql.hive.metastore.jars maven > spark.sql.hive.metastore.sharedPrefixes com.amazonaws prefix> > spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem{code} > And code: > {code:java} > -- Prerequisite commands to set up the table > -- drop table if exists ivan_test_2; > -- create table ivan_test_2 (a int, part string) using csv location > 's3://bucket/hive-test' partitioned by (part); > -- insert into ivan_test_2 values (1, 'a'); > -- Command that triggers failure > ALTER TABLE ivan_test_2 ADD PARTITION (part='b') LOCATION > 's3://bucket/hive-test'{code} > > Stacktrace (line numbers might differ): > {code:java} > 21/12/22 04:37:05 DEBUG IsolatedClientLoader: shared class: > org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider > 21/12/22 04:37:05 DEBUG IsolatedClientLoader: shared class: > org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider > 21/12/22 04:37:05 DEBUG IsolatedClientLoader: hive class: > com.amazonaws.auth.EnvironmentVariableCredentialsProvider - null > 21/12/22 04:37:05 ERROR S3AFileSystem: Failed to initialize S3AFileSystem for > path s3://bucket/hive-test > java.io.IOException: From option fs.s3a.aws.credentials.provider > java.lang.ClassNotFoundException: Class > com.amazonaws.auth.EnvironmentVariableCredentialsProvider not found > at > org.apache.hadoop.fs.s3a.S3AUtils.loadAWSProviderClasses(S3AUtils.java:725) > at > org.apache.hadoop.fs.s3a.S3AUtils.createAWSCredentialProviderSet(S3AUtils.java:688) > at > org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:411) > at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3469) > at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174) > at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574) > at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521) > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540) > at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365) > at org.apache.hadoop.hive.metastore.Warehouse.getFs(Warehouse.java:112) > at > org.apache.hadoop.hive.metastore.Warehouse.getDnsPath(Warehouse.java:144) > at > org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createLocationForAddedPartition(HiveMetaStore.java:1993) > at > org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.add_partitions_core(HiveMetaStore.java:1865) > at > org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.add_partitions_req(HiveMetaStore.java:1910) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at 
java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:105) > at com.sun.proxy.$Proxy58.add_partitions_req(Unknown Source) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.add_partitions(HiveMetaStoreClient.java:457) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89) > at com.sun.proxy.$Proxy59.add_partitions(Unknown Source) > at > org.apache.hadoop.hive.ql.metadata.Hive.createPartitions(Hive.java:1514) > at > org.apache.spark.sql.hive.client.Shim_v0_13.createPartitions(HiveShim.scala:773) > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$createPartitions$1(HiveClientImpl.scala:683) > at
[jira] [Commented] (SPARK-28090) Spark hangs when an execution plan has many projections on nested structs
[ https://issues.apache.org/jira/browse/SPARK-28090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17485917#comment-17485917 ] Apache Spark commented on SPARK-28090: -- User 'peter-toth' has created a pull request for this issue: https://github.com/apache/spark/pull/35382 > Spark hangs when an execution plan has many projections on nested structs > - > > Key: SPARK-28090 > URL: https://issues.apache.org/jira/browse/SPARK-28090 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 2.4.3 > Environment: Tried in > * Spark 2.2.1, Spark 2.4.3 in local mode on Linux, macOS and Windows > * Spark 2.4.3 / Yarn on a Linux cluster >Reporter: Ruslan Yushchenko >Priority: Major > Labels: bulk-closed > > This was already posted (#28016), but the provided example didn't always > reproduce the error. This example consistently reproduces the issue. > Spark applications freeze at the execution plan optimization stage (Catalyst) > when a logical execution plan contains a lot of projections that operate on > nested struct fields. > The code listed below demonstrates the issue. > To reproduce, the Spark app does the following: > * A small dataframe is created from a JSON example. > * Several nested transformations (negation of a number) are applied on > struct fields and each time a new struct field is created. > * Once more than 9 such transformations are applied, the Catalyst optimizer > freezes on optimizing the execution plan. > * You can control the freezing by choosing a different upper bound for the > Range. E.g. it will work fine if the upper bound is 5, but will hang if the > bound is 10. > {code:java} > package com.example > import org.apache.spark.sql._ > import org.apache.spark.sql.functions._ > import org.apache.spark.sql.types.{StructField, StructType} > import scala.collection.mutable.ListBuffer > object SparkApp1IssueSelfContained { > // A sample data for a dataframe with nested structs > val sample: List[String] = > """ { "numerics": {"num1": 101, "num2": 102, "num3": 103, "num4": 104, > "num5": 105, "num6": 106, "num7": 107, "num8": 108, "num9": 109, "num10": > 110, "num11": 111, "num12": 112, "num13": 113, "num14": 114, "num15": 115} } > """ :: > """ { "numerics": {"num1": 201, "num2": 202, "num3": 203, "num4": 204, > "num5": 205, "num6": 206, "num7": 207, "num8": 208, "num9": 209, "num10": > 210, "num11": 211, "num12": 212, "num13": 213, "num14": 214, "num15": 215} } > """ :: > """ { "numerics": {"num1": 301, "num2": 302, "num3": 303, "num4": 304, > "num5": 305, "num6": 306, "num7": 307, "num8": 308, "num9": 309, "num10": > 310, "num11": 311, "num12": 312, "num13": 313, "num14": 314, "num15": 315} } > """ :: > Nil > /** > * Transforms a column inside a nested struct. The transformed value will > be put into a new field of that nested struct > * > * The output column name can omit the full path as the field will be > created at the same level of nesting as the input column. > * > * @param inputColumnName A column name for which to apply the > transformation, e.g. `company.employee.firstName`. > * @param outputColumnName The output column name. The path is optional, > e.g. you can use `transformedName` instead of > `company.employee.transformedName`. > * @param expression A function that applies a transformation to a > column as a Spark expression. > * @return A dataframe with a new field that contains transformed values. 
> */ > def transformInsideNestedStruct(df: DataFrame, > inputColumnName: String, > outputColumnName: String, > expression: Column => Column): DataFrame = { > def mapStruct(schema: StructType, path: Seq[String], parentColumn: > Option[Column] = None): Seq[Column] = { > val mappedFields = new ListBuffer[Column]() > def handleMatchedLeaf(field: StructField, curColumn: Column): > Seq[Column] = { > val newColumn = expression(curColumn).as(outputColumnName) > mappedFields += newColumn > Seq(curColumn) > } > def handleMatchedNonLeaf(field: StructField, curColumn: Column): > Seq[Column] = { > // Non-leaf columns need to be further processed recursively > field.dataType match { > case dt: StructType => Seq(struct(mapStruct(dt, path.tail, > Some(curColumn)): _*).as(field.name)) > case _ => throw new IllegalArgumentException(s"Field > '${field.name}' is not a struct type.") > } > } > val fieldName = path.head > val isLeaf = path.lengthCompare(2) < 0 > val newColumns = schema.fields.flatMap(field => { > //
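The quoted message is cut off above before the reporter's main method. For readers who want to try the reproduction, a driver along the following lines matches the description; this is a hedged sketch, not the original code, and the SparkSession setup, the output column names, and the exact Range bounds are assumptions drawn from the bullet points above:

{code:java}
// Hypothetical driver reconstructed from the description (not the reporter's
// original main): each step negates one numeric field and stores the result
// in a new struct field. Per the report, an upper bound of 5 completes, while
// a bound around 10 makes the Catalyst optimizer hang.
def main(args: Array[String]): Unit = {
  val spark = SparkSession.builder()
    .master("local[2]")
    .appName("SparkApp1IssueSelfContained")
    .getOrCreate()
  import spark.implicits._

  // Build the small dataframe from the JSON sample above
  val df = spark.read.json(sample.toDS())

  // Apply one nested negation per step; raise the bound to trigger the hang
  val transformed = (1 to 10).foldLeft(df) { (acc, i) =>
    transformInsideNestedStruct(acc, s"numerics.num$i", s"num${i}Neg", c => -c)
  }
  transformed.explain(true) // freezes during optimization when the bound is large
}
{code}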
[jira] [Updated] (SPARK-38088) Kryo DataWritingSparkTaskResult registration error
[ https://issues.apache.org/jira/browse/SPARK-38088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michał Wieleba updated SPARK-38088:
---
Attachment: (was: image-2022-02-02-10-08-51-144.png)

> Kryo DataWritingSparkTaskResult registration error
> --
>
> Key: SPARK-38088
> URL: https://issues.apache.org/jira/browse/SPARK-38088
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.1.2
> Reporter: Michał Wieleba
> Priority: Major
> Attachments: image-2022-02-02-10-09-14-858.png
>
> Spark 3.1.2, Scala 2.12
> I'm registering classes with the _sparkConf.registerKryoClasses(Array( ..._ method. Inside there is Spark Structured Streaming code. The following settings are added as well:
> sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
> sparkConf.set("spark.kryo.registrationRequired", "true")
> Unfortunately, during execution the following error is thrown:
> Caused by: java.lang.IllegalArgumentException: Class is not registered: org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult
> Note: To register this class use: kryo.register(org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult.class);
>
> As far as I can see in [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala], the class org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult is private (private[v2] case class DataWritingSparkTaskResult) and is therefore not available to register.
>
> !image-2022-02-02-10-09-14-858.png!

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38088) Kryo DataWritingSparkTaskResult registration error
[ https://issues.apache.org/jira/browse/SPARK-38088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michał Wieleba updated SPARK-38088:
---
Description:
Spark 3.1.2, Scala 2.12

I'm registering classes with the _sparkConf.registerKryoClasses(Array( ..._ method. Inside there is Spark Structured Streaming code. The following settings are added as well:

sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
sparkConf.set("spark.kryo.registrationRequired", "true")

Unfortunately, during execution the following error is thrown:

Caused by: java.lang.IllegalArgumentException: Class is not registered: org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult
Note: To register this class use: kryo.register(org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult.class);

As far as I can see in [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala], the class org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult is private (private[v2] case class DataWritingSparkTaskResult) and is therefore not available to register.

!image-2022-02-02-10-09-14-858.png!

was:
Spark 3.1.2, Scala 2.12

I'm registering classes with the _sparkConf.registerKryoClasses(Array( ..._ method. Inside there is Spark Structured Streaming code. The following settings are added as well:

sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
sparkConf.set("spark.kryo.registrationRequired", "true")

Unfortunately, during execution the following error is thrown:

Caused by: java.lang.IllegalArgumentException: Class is not registered: org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult
Note: To register this class use: kryo.register(org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult.class);

As far as I can see in [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala], the class org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult is private (private[v2] case class DataWritingSparkTaskResult) and is therefore not available to register.

!image-2022-02-02-10-08-51-144.png!

> Kryo DataWritingSparkTaskResult registration error
> --
>
> Key: SPARK-38088
> URL: https://issues.apache.org/jira/browse/SPARK-38088
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.1.2
> Reporter: Michał Wieleba
> Priority: Major
> Attachments: image-2022-02-02-10-08-51-144.png, image-2022-02-02-10-09-14-858.png
>
> Spark 3.1.2, Scala 2.12
> I'm registering classes with the _sparkConf.registerKryoClasses(Array( ..._ method. Inside there is Spark Structured Streaming code. The following settings are added as well:
> sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
> sparkConf.set("spark.kryo.registrationRequired", "true")
> Unfortunately, during execution the following error is thrown:
> Caused by: java.lang.IllegalArgumentException: Class is not registered: org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult
> Note: To register this class use: kryo.register(org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult.class);
>
> As far as I can see in [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala], the class org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult is private (private[v2] case class DataWritingSparkTaskResult) and is therefore not available to register.
>
> !image-2022-02-02-10-09-14-858.png!

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38088) Kryo DataWritingSparkTaskResult registration error
[ https://issues.apache.org/jira/browse/SPARK-38088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michał Wieleba updated SPARK-38088:
---
Attachment: image-2022-02-02-10-09-14-858.png

> Kryo DataWritingSparkTaskResult registration error
> --
>
> Key: SPARK-38088
> URL: https://issues.apache.org/jira/browse/SPARK-38088
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.1.2
> Reporter: Michał Wieleba
> Priority: Major
> Attachments: image-2022-02-02-10-08-51-144.png, image-2022-02-02-10-09-14-858.png
>
> Spark 3.1.2, Scala 2.12
> I'm registering classes with the _sparkConf.registerKryoClasses(Array( ..._ method. Inside there is Spark Structured Streaming code. The following settings are added as well:
> sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
> sparkConf.set("spark.kryo.registrationRequired", "true")
> Unfortunately, during execution the following error is thrown:
> Caused by: java.lang.IllegalArgumentException: Class is not registered: org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult
> Note: To register this class use: kryo.register(org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult.class);
>
> As far as I can see in [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala], the class org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult is private (private[v2] case class DataWritingSparkTaskResult) and is therefore not available to register.
>
> !image-2022-02-02-10-08-51-144.png!

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38088) Kryo DataWritingSparkTaskResult registration error
[ https://issues.apache.org/jira/browse/SPARK-38088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michał Wieleba updated SPARK-38088:
---
Attachment: image-2022-02-02-10-08-51-144.png

> Kryo DataWritingSparkTaskResult registration error
> --
>
> Key: SPARK-38088
> URL: https://issues.apache.org/jira/browse/SPARK-38088
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.1.2
> Reporter: Michał Wieleba
> Priority: Major
> Attachments: image-2022-02-02-10-08-51-144.png
>
> Spark 3.1.2, Scala 2.12
> I'm registering classes with the _sparkConf.registerKryoClasses(Array( ..._ method. Inside there is Spark Structured Streaming code. The following settings are added as well:
> sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
> sparkConf.set("spark.kryo.registrationRequired", "true")
> Unfortunately, during execution the following error is thrown:
> Caused by: java.lang.IllegalArgumentException: Class is not registered: org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult
> Note: To register this class use: kryo.register(org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult.class);
>
> As far as I can see in [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala], the class org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult is private (private[v2] case class DataWritingSparkTaskResult) and is therefore not available to register.
>
> !image-2022-02-02-10-06-32-342.png!

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38088) Kryo DataWritingSparkTaskResult registration error
[ https://issues.apache.org/jira/browse/SPARK-38088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michał Wieleba updated SPARK-38088:
---
Description:
Spark 3.1.2, Scala 2.12

I'm registering classes with the _sparkConf.registerKryoClasses(Array( ..._ method. Inside there is Spark Structured Streaming code. The following settings are added as well:

sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
sparkConf.set("spark.kryo.registrationRequired", "true")

Unfortunately, during execution the following error is thrown:

Caused by: java.lang.IllegalArgumentException: Class is not registered: org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult
Note: To register this class use: kryo.register(org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult.class);

As far as I can see in [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala], the class org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult is private (private[v2] case class DataWritingSparkTaskResult) and is therefore not available to register.

!image-2022-02-02-10-08-51-144.png!

was:
Spark 3.1.2, Scala 2.12

I'm registering classes with the _sparkConf.registerKryoClasses(Array( ..._ method. Inside there is Spark Structured Streaming code. The following settings are added as well:

sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
sparkConf.set("spark.kryo.registrationRequired", "true")

Unfortunately, during execution the following error is thrown:

Caused by: java.lang.IllegalArgumentException: Class is not registered: org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult
Note: To register this class use: kryo.register(org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult.class);

As far as I can see in [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala], the class org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult is private (private[v2] case class DataWritingSparkTaskResult) and is therefore not available to register.

!image-2022-02-02-10-06-32-342.png!

> Kryo DataWritingSparkTaskResult registration error
> --
>
> Key: SPARK-38088
> URL: https://issues.apache.org/jira/browse/SPARK-38088
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.1.2
> Reporter: Michał Wieleba
> Priority: Major
> Attachments: image-2022-02-02-10-08-51-144.png
>
> Spark 3.1.2, Scala 2.12
> I'm registering classes with the _sparkConf.registerKryoClasses(Array( ..._ method. Inside there is Spark Structured Streaming code. The following settings are added as well:
> sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
> sparkConf.set("spark.kryo.registrationRequired", "true")
> Unfortunately, during execution the following error is thrown:
> Caused by: java.lang.IllegalArgumentException: Class is not registered: org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult
> Note: To register this class use: kryo.register(org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult.class);
>
> As far as I can see in [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala], the class org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult is private (private[v2] case class DataWritingSparkTaskResult) and is therefore not available to register.
>
> !image-2022-02-02-10-08-51-144.png!

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38088) Kryo DataWritingSparkTaskResult registration error
Michał Wieleba created SPARK-38088:
--
Summary: Kryo DataWritingSparkTaskResult registration error
Key: SPARK-38088
URL: https://issues.apache.org/jira/browse/SPARK-38088
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 3.1.2
Reporter: Michał Wieleba

Spark 3.1.2, Scala 2.12

I'm registering classes with the _sparkConf.registerKryoClasses(Array( ..._ method. Inside there is Spark Structured Streaming code. The following settings are added as well:

sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
sparkConf.set("spark.kryo.registrationRequired", "true")

Unfortunately, during execution the following error is thrown:

Caused by: java.lang.IllegalArgumentException: Class is not registered: org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult
Note: To register this class use: kryo.register(org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult.class);

As far as I can see in [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala], the class org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult is private (private[v2] case class DataWritingSparkTaskResult) and is therefore not available to register.

!image-2022-02-02-10-06-32-342.png!

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
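Since the case class is private[v2] it cannot be named directly in user code, but SparkConf.registerKryoClasses takes Array[Class[_]], so the class can still be loaded reflectively. A minimal workaround sketch follows, assuming the fully qualified class name is stable in the Spark version in use; this is untested here and not an officially documented approach:

{code:java}
import org.apache.spark.SparkConf

val sparkConf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrationRequired", "true")

// DataWritingSparkTaskResult cannot be referenced by name from user code
// because it is private[v2]; Class.forName can still load it so it can be
// passed to registerKryoClasses like any other class.
sparkConf.registerKryoClasses(Array(
  Class.forName("org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult")
))
{code}

Note that other private result or commit-message classes on the same write path may also need registering the same way once registrationRequired is enabled.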
[jira] [Resolved] (SPARK-37908) Refactoring on pod label test in BasicFeatureStepSuite
[ https://issues.apache.org/jira/browse/SPARK-37908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-37908.
---
Fix Version/s: 3.3.0
Resolution: Fixed

Issue resolved by pull request 35209
[https://github.com/apache/spark/pull/35209]

> Refactoring on pod label test in BasicFeatureStepSuite
> --
>
> Key: SPARK-37908
> URL: https://issues.apache.org/jira/browse/SPARK-37908
> Project: Spark
> Issue Type: Test
> Components: Kubernetes
> Affects Versions: 3.3.0
> Reporter: Yikun Jiang
> Assignee: Yikun Jiang
> Priority: Minor
> Fix For: 3.3.0

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37908) Refactoring on pod label test in BasicFeatureStepSuite
[ https://issues.apache.org/jira/browse/SPARK-37908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-37908:
-
Assignee: Yikun Jiang

> Refactoring on pod label test in BasicFeatureStepSuite
> --
>
> Key: SPARK-37908
> URL: https://issues.apache.org/jira/browse/SPARK-37908
> Project: Spark
> Issue Type: Test
> Components: Kubernetes
> Affects Versions: 3.3.0
> Reporter: Yikun Jiang
> Assignee: Yikun Jiang
> Priority: Minor

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org