[jira] [Updated] (SPARK-45421) Catch AnalysisException over InlineCTE
[ https://issues.apache.org/jira/browse/SPARK-45421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45421: --- Labels: pull-request-available (was: ) > Catch AnalysisException over InlineCTE > -- > > Key: SPARK-45421 > URL: https://issues.apache.org/jira/browse/SPARK-45421 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45421) Catch AnalysisException over InlineCTE
Rui Wang created SPARK-45421: Summary: Catch AnalysisException over InlineCTE Key: SPARK-45421 URL: https://issues.apache.org/jira/browse/SPARK-45421 Project: Spark Issue Type: Task Components: SQL Affects Versions: 4.0.0 Reporter: Rui Wang Assignee: Rui Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45420) Add DataType.fromDDL into PySpark
Hyukjin Kwon created SPARK-45420: Summary: Add DataType.fromDDL into PySpark Key: SPARK-45420 URL: https://issues.apache.org/jira/browse/SPARK-45420 Project: Spark Issue Type: Improvement Components: Connect, PySpark Affects Versions: 4.0.0 Reporter: Hyukjin Kwon Same feature as DataType.fromDDL in Scala. This is used quite often. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
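For context on the proposal above, a minimal PySpark sketch. The {{DataType.fromDDL}} name mirrors the Scala API and is an assumption here; until it lands, a DDL string can already be parsed by letting Spark build a schema from it.
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Existing workaround: let Spark parse the DDL schema string.
schema = spark.createDataFrame([], "name STRING, age INT").schema
print(schema)  # StructType with 'name' (string) and 'age' (int) fields

# Proposed (hypothetical) PySpark equivalent of Scala's DataType.fromDDL:
# from pyspark.sql.types import DataType
# dt = DataType.fromDDL("array<int>")
{code}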
[jira] [Updated] (SPARK-45417) Make InheritableThread inherit the active session
[ https://issues.apache.org/jira/browse/SPARK-45417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45417: --- Labels: pull-request-available (was: ) > Make InheritableThread inherit the active session > - > > Key: SPARK-45417 > URL: https://issues.apache.org/jira/browse/SPARK-45417 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Chungmin Lee >Priority: Major > Labels: pull-request-available > > Repro: > {code:java} > # repro.py > from multiprocessing.pool import ThreadPool > from pyspark import inheritable_thread_target > from pyspark.sql import SparkSession > spark = SparkSession.builder.appName("Test").getOrCreate() > spark.sparkContext.setLogLevel("ERROR") > def f(i, spark): > print(f"{i} spark = {spark}") > print(f"{i} active session = {SparkSession.getActiveSession()}") > print(f"{i} local property foo = > {spark.sparkContext.getLocalProperty('foo')}") > spark = SparkSession.builder.appName("Test").getOrCreate() > print(f"{i} spark = {spark}") > print(f"{i} active session = {SparkSession.getActiveSession()}") > pool = ThreadPool(4) > spark.sparkContext.setLocalProperty("foo", "bar") > pool.starmap(inheritable_thread_target(f), [(i, spark) for i in > range(4)]){code} > Run as: {{./bin/spark-submit repro.py}} > {{getOrCreate()}} doesn't set the active session either. The only way is > calling the Java function directly: > {{spark._jsparkSession.setActiveSession(spark._jsparkSession)}}. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
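A minimal sketch of the workaround mentioned in the description above, assuming a classic (non-Connect) PySpark session where {{spark._jsparkSession}} is available; it is an internal handle and may change between versions.
{code:python}
from pyspark.sql import SparkSession

def f_with_active_session(i, spark):
    # Re-establish the JVM-side active session inside the worker thread
    # before relying on SparkSession.getActiveSession().
    spark._jsparkSession.setActiveSession(spark._jsparkSession)
    print(f"{i} active session = {SparkSession.getActiveSession()}")
{code}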
[jira] [Updated] (SPARK-45417) Make InheritableThread inherit the active session
[ https://issues.apache.org/jira/browse/SPARK-45417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chungmin Lee updated SPARK-45417: - Description: Repro: {code:java} # repro.py from multiprocessing.pool import ThreadPool from pyspark import inheritable_thread_target from pyspark.sql import SparkSession spark = SparkSession.builder.appName("Test").getOrCreate() spark.sparkContext.setLogLevel("ERROR") def f(i, spark): print(f"{i} spark = {spark}") print(f"{i} active session = {SparkSession.getActiveSession()}") print(f"{i} local property foo = {spark.sparkContext.getLocalProperty('foo')}") spark = SparkSession.builder.appName("Test").getOrCreate() print(f"{i} spark = {spark}") print(f"{i} active session = {SparkSession.getActiveSession()}") pool = ThreadPool(4) spark.sparkContext.setLocalProperty("foo", "bar") pool.starmap(inheritable_thread_target(f), [(i, spark) for i in range(4)]){code} Run as: {{./bin/spark-submit repro.py}} {{getOrCreate()}} doesn't set the active session either. The only way is calling the Java function directly: {{spark._jsparkSession.setActiveSession(spark._jsparkSession)}}. was: Repro: {code:java} from multiprocessing.pool import ThreadPool from pyspark import inheritable_thread_target from pyspark.sql import SparkSession spark = SparkSession.builder.appName("Test").getOrCreate() spark.sparkContext.setLogLevel("ERROR") def f(i, spark): print(f"{i} spark = {spark}") print(f"{i} active session = {SparkSession.getActiveSession()}") print(f"{i} local property foo = {spark.sparkContext.getLocalProperty('foo')}") spark = SparkSession.builder.appName("Test").getOrCreate() print(f"{i} spark = {spark}") print(f"{i} active session = {SparkSession.getActiveSession()}") pool = ThreadPool(4) spark.sparkContext.setLocalProperty("foo", "bar") pool.starmap(inheritable_thread_target(f), [(i, spark) for i in range(4)]){code} {{getOrCreate()}} doesn't set the active session either. The only way is calling the Java function directly: {{spark._jsparkSession.setActiveSession(spark._jsparkSession)}}. > Make InheritableThread inherit the active session > - > > Key: SPARK-45417 > URL: https://issues.apache.org/jira/browse/SPARK-45417 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Chungmin Lee >Priority: Major > > Repro: > {code:java} > # repro.py > from multiprocessing.pool import ThreadPool > from pyspark import inheritable_thread_target > from pyspark.sql import SparkSession > spark = SparkSession.builder.appName("Test").getOrCreate() > spark.sparkContext.setLogLevel("ERROR") > def f(i, spark): > print(f"{i} spark = {spark}") > print(f"{i} active session = {SparkSession.getActiveSession()}") > print(f"{i} local property foo = > {spark.sparkContext.getLocalProperty('foo')}") > spark = SparkSession.builder.appName("Test").getOrCreate() > print(f"{i} spark = {spark}") > print(f"{i} active session = {SparkSession.getActiveSession()}") > pool = ThreadPool(4) > spark.sparkContext.setLocalProperty("foo", "bar") > pool.starmap(inheritable_thread_target(f), [(i, spark) for i in > range(4)]){code} > Run as: {{./bin/spark-submit repro.py}} > {{getOrCreate()}} doesn't set the active session either. The only way is > calling the Java function directly: > {{spark._jsparkSession.setActiveSession(spark._jsparkSession)}}. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45417) Make InheritableThread inherit the active session
[ https://issues.apache.org/jira/browse/SPARK-45417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chungmin Lee updated SPARK-45417: - Summary: Make InheritableThread inherit the active session (was: Make InheritableThread inherit active session) > Make InheritableThread inherit the active session > - > > Key: SPARK-45417 > URL: https://issues.apache.org/jira/browse/SPARK-45417 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Chungmin Lee >Priority: Major > > Repro: > {code:java} > from multiprocessing.pool import ThreadPool > from pyspark import inheritable_thread_target > from pyspark.sql import SparkSession > spark = SparkSession.builder.appName("Test").getOrCreate() > spark.sparkContext.setLogLevel("ERROR") > def f(i, spark): > print(f"{i} spark = {spark}") > print(f"{i} active session = {SparkSession.getActiveSession()}") > print(f"{i} local property foo = > {spark.sparkContext.getLocalProperty('foo')}") > spark = SparkSession.builder.appName("Test").getOrCreate() > print(f"{i} spark = {spark}") > print(f"{i} active session = {SparkSession.getActiveSession()}") > pool = ThreadPool(4) > spark.sparkContext.setLocalProperty("foo", "bar") > pool.starmap(inheritable_thread_target(f), [(i, spark) for i in > range(4)]){code} > {{getOrCreate()}} doesn't set the active session either. The only way is > calling the Java function directly: > {{spark._jsparkSession.setActiveSession(spark._jsparkSession)}}. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45417) Make InheritableThread inherit active session
[ https://issues.apache.org/jira/browse/SPARK-45417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chungmin Lee updated SPARK-45417: - Summary: Make InheritableThread inherit active session (was: InheritableThread doesn't inherit active session) > Make InheritableThread inherit active session > - > > Key: SPARK-45417 > URL: https://issues.apache.org/jira/browse/SPARK-45417 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Chungmin Lee >Priority: Major > > Repro: > {code:java} > from multiprocessing.pool import ThreadPool > from pyspark import inheritable_thread_target > from pyspark.sql import SparkSession > spark = SparkSession.builder.appName("Test").getOrCreate() > spark.sparkContext.setLogLevel("ERROR") > def f(i, spark): > print(f"{i} spark = {spark}") > print(f"{i} active session = {SparkSession.getActiveSession()}") > print(f"{i} local property foo = > {spark.sparkContext.getLocalProperty('foo')}") > spark = SparkSession.builder.appName("Test").getOrCreate() > print(f"{i} spark = {spark}") > print(f"{i} active session = {SparkSession.getActiveSession()}") > pool = ThreadPool(4) > spark.sparkContext.setLocalProperty("foo", "bar") > pool.starmap(inheritable_thread_target(f), [(i, spark) for i in > range(4)]){code} > {{getOrCreate()}} doesn't set the active session either. The only way is > calling the Java function directly: > {{spark._jsparkSession.setActiveSession(spark._jsparkSession)}}. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45386) Correctness issue when persisting using StorageLevel.NONE
[ https://issues.apache.org/jira/browse/SPARK-45386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-45386. -- Fix Version/s: 3.5.1 Resolution: Fixed Issue resolved by pull request 43213 [https://github.com/apache/spark/pull/43213] > Correctness issue when persisting using StorageLevel.NONE > - > > Key: SPARK-45386 > URL: https://issues.apache.org/jira/browse/SPARK-45386 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0, 4.0.0 >Reporter: Emil Ejbyfeldt >Assignee: Emil Ejbyfeldt >Priority: Major > Labels: pull-request-available > Fix For: 3.5.1 > > > When using spark 3.5.0 this code > {code:java} > import org.apache.spark.storage.StorageLevel > spark.createDataset(Seq(1,2,3)).persist(StorageLevel.NONE).count() {code} > incorrectly returns 0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
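The same check expressed in PySpark, as a hedged sketch: persisting with {{StorageLevel.NONE}} must not change query results, so the count should be 3 (that the Python API hits the same JVM code path is an assumption here).
{code:python}
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (2,), (3,)], "v INT")
# Before the fix this incorrectly returned 0; after it, 3.
assert df.persist(StorageLevel.NONE).count() == 3
{code}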
[jira] [Assigned] (SPARK-45386) Correctness issue when persisting using StorageLevel.NONE
[ https://issues.apache.org/jira/browse/SPARK-45386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-45386: Assignee: Emil Ejbyfeldt > Correctness issue when persisting using StorageLevel.NONE > - > > Key: SPARK-45386 > URL: https://issues.apache.org/jira/browse/SPARK-45386 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0, 4.0.0 >Reporter: Emil Ejbyfeldt >Assignee: Emil Ejbyfeldt >Priority: Major > Labels: pull-request-available > > When using spark 3.5.0 this code > {code:java} > import org.apache.spark.storage.StorageLevel > spark.createDataset(Seq(1,2,3)).persist(StorageLevel.NONE).count() {code} > incorrectly returns 0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45412) Validate the plan and session in DataFrame.__init__
[ https://issues.apache.org/jira/browse/SPARK-45412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-45412. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43215 [https://github.com/apache/spark/pull/43215] > Validate the plan and session in DataFrame.__init__ > --- > > Key: SPARK-45412 > URL: https://issues.apache.org/jira/browse/SPARK-45412 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45412) Validate the plan and session in DataFrame.__init__
[ https://issues.apache.org/jira/browse/SPARK-45412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-45412: Assignee: Ruifeng Zheng > Validate the plan and session in DataFrame.__init__ > --- > > Key: SPARK-45412 > URL: https://issues.apache.org/jira/browse/SPARK-45412 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43657) reuse SPARK_CONF_DIR config maps between driver and executor
[ https://issues.apache.org/jira/browse/SPARK-43657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-43657: --- Labels: pull-request-available (was: ) > reuse SPARK_CONF_DIR config maps between driver and executor > > > Key: SPARK-43657 > URL: https://issues.apache.org/jira/browse/SPARK-43657 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.2.4, 3.3.2, 3.4.0 >Reporter: YE >Priority: Major > Labels: pull-request-available > > Currently, Spark on K8s in cluster mode creates two config maps per application: one > for the driver and another for the executor. However, the config map for the > executor is almost identical to the one for the driver, so there is no need to > create two duplicate config maps. As ConfigMaps are objects on K8s, this > duplication runs into some limitations: > # More config maps mean more objects in etcd and more overhead on the API > server. > # The Spark driver pod might run under limited permissions, which means it > might not be allowed to create resources other than exec pods. Therefore the > driver might not be allowed to create config maps. > I will submit a PR to reuse the SPARK_CONF_DIR config map when running Spark in > K8s cluster mode. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42629) Update the description of default data source in the document
[ https://issues.apache.org/jira/browse/SPARK-42629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-42629: --- Labels: pull-request-available (was: ) > Update the description of default data source in the document > - > > Key: SPARK-42629 > URL: https://issues.apache.org/jira/browse/SPARK-42629 > Project: Spark > Issue Type: Task > Components: Documentation >Affects Versions: 3.5.0 >Reporter: xiaoping.huang >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42727) Support executing spark commands in the root directory when local mode is specified
[ https://issues.apache.org/jira/browse/SPARK-42727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-42727: --- Labels: pull-request-available (was: ) > Support executing spark commands in the root directory when local mode is > specified > --- > > Key: SPARK-42727 > URL: https://issues.apache.org/jira/browse/SPARK-42727 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: xiaoping.huang >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44025) CSV Table Read Error with CharType(length) column
[ https://issues.apache.org/jira/browse/SPARK-44025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-44025: --- Labels: pull-request-available (was: ) > CSV Table Read Error with CharType(length) column > - > > Key: SPARK-44025 > URL: https://issues.apache.org/jira/browse/SPARK-44025 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 > Environment: {{apache/spark:v3.4.0 image}} >Reporter: Fengyu Cao >Priority: Major > Labels: pull-request-available > > Problem: > # Read a CSV-format table. > # The table has a `CharType(length)` column. > # Reading the table fails with the exception: `org.apache.spark.SparkException: Job > aborted due to stage failure: Task 0 in stage 36.0 failed 4 times, most > recent failure: Lost task 0.3 in stage 36.0 (TID 72) (10.113.9.208 executor > 11): java.lang.IllegalArgumentException: requirement failed: requiredSchema > (struct<...>) should be the subset of dataSchema > (struct<...>).` > > Reproduce with the official image: > # {{docker run -it apache/spark:v3.4.0 /opt/spark/bin/spark-sql}} > # {{CREATE TABLE csv_bug (name STRING, age INT, job CHAR(4)) USING CSV > OPTIONS ('header' = 'true', 'sep' = ';') LOCATION > "/opt/spark/examples/src/main/resources/people.csv";}} > # SELECT * FROM csv_bug; > # ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) > java.lang.IllegalArgumentException: requirement failed: requiredSchema > (struct<...>) should be the subset of dataSchema > (struct<...>). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
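A hedged workaround sketch for the issue above: declaring the CHAR column as STRING in an explicit read schema avoids the requiredSchema/dataSchema mismatch. Column names follow the repro; behavior on other versions is an assumption.
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = (spark.read
      .option("header", "true")
      .option("sep", ";")
      .schema("name STRING, age INT, job STRING")  # read CHAR(4) as STRING
      .csv("/opt/spark/examples/src/main/resources/people.csv"))
df.show()
{code}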
[jira] [Updated] (SPARK-43255) Assign a name to the error class _LEGACY_ERROR_TEMP_2020
[ https://issues.apache.org/jira/browse/SPARK-43255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-43255: --- Labels: pull-request-available starter (was: starter) > Assign a name to the error class _LEGACY_ERROR_TEMP_2020 > > > Key: SPARK-43255 > URL: https://issues.apache.org/jira/browse/SPARK-43255 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Max Gekk >Priority: Minor > Labels: pull-request-available, starter > > Choose a proper name for the error class *_LEGACY_ERROR_TEMP_2020* defined in > {*}core/src/main/resources/error/error-classes.json{*}. The name should be > short but complete (look at the examples in error-classes.json). > Add a test that triggers the error from user code if such a test doesn't > exist yet. Check the exception fields by using {*}checkError(){*}. That function > checks the valuable error fields only and avoids depending on the error text > message, so tech editors can modify the error format in > error-classes.json without worrying about Spark's internal tests. Migrate other > tests that might trigger the error to checkError(). > If you cannot reproduce the error from user space (using a SQL query), replace > the error with an internal error; see {*}SparkException.internalError(){*}. > Improve the error message format in error-classes.json if the current one is not > clear. Propose a solution to users for how to avoid and fix such errors. > Please look at the PRs below as examples: > * [https://github.com/apache/spark/pull/38685] > * [https://github.com/apache/spark/pull/38656] > * [https://github.com/apache/spark/pull/38490] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44082) Generate operator does not update reference set properly
[ https://issues.apache.org/jira/browse/SPARK-44082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-44082: --- Labels: pull-request-available (was: ) > Generate operator does not update reference set properly > > > Key: SPARK-44082 > URL: https://issues.apache.org/jira/browse/SPARK-44082 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > Labels: pull-request-available > > Before > ``` > == Optimized Logical Plan == > Project [col1#2, col2#19] > +- Generate replicaterows(sum#17L, col1#2, col2#3), [2], false, [col1#2, > col2#3] >+- Filter (isnotnull(sum#17L) AND (sum#17L > 0)) > +- Aggregate [col1#2, col2#19], [col1#2, col2#19, sum(vcol#14L) AS > sum#17L] > +- Union false, false > :- Aggregate [col1#2], [1 AS vcol#14L, col1#2, first(col2#3, > false) AS col2#19] > : +- LogicalRDD [col1#2, col2#3], false > +- Project [-1 AS vcol#15L, col1#8, col2#9] >+- LogicalRDD [col1#8, col2#9], false > ``` > Couldn't find col2#3 in [col1#2,col2#19,sum#17L] > after > ``` > == Optimized Logical Plan == > Project [col1#2, col2#19] > +- Generate replicaterows(sum#17L, col1#2, col2#19), [2], false, [col1#2, > col2#19] >+- Filter (isnotnull(sum#17L) AND (sum#17L > 0)) > +- Aggregate [col1#2, col2#19], [col1#2, col2#19, sum(vcol#14L) AS > sum#17L] > +- Union false, false > :- Aggregate [col1#2], [1 AS vcol#14L, col1#2, first(col2#3, > false) AS col2#19] > : +- LogicalRDD [col1#2, col2#3], false > +- Project [-1 AS vcol#15L, col1#8, col2#9] >+- LogicalRDD [col1#8, col2#9], false > ``` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44137) Change handling of iterable objects for on field in joins
[ https://issues.apache.org/jira/browse/SPARK-44137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-44137: --- Labels: pull-request-available (was: ) > Change handling of iterable objects for on field in joins > - > > Key: SPARK-44137 > URL: https://issues.apache.org/jira/browse/SPARK-44137 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.5.0 >Reporter: John Haberstroh >Priority: Minor > Labels: pull-request-available > > The {{on}} parameter complained when I passed it a tuple. That's because the > code checks for {{list}} exactly, so the tuple was wrapped into a list like > {{[on]}}, leading to immediate failure. This was surprising: typically, > tuple and list are interchangeable, and tuple is often the more readily > accepted type. I have proposed a change that moves towards the > principle of least surprise for this situation. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
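A minimal sketch of the surprise described above: a list for {{on}} works, while a tuple is wrapped into {{[on]}} and fails (the exact error text varies by version and is not shown here).
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
left = spark.createDataFrame([(1, "a")], ["id", "x"])
right = spark.createDataFrame([(1, "b")], ["id", "y"])

left.join(right, on=["id"]).show()  # list: works
try:
    left.join(right, on=("id",)).show()  # tuple: wrapped as [("id",)], fails
except Exception as e:
    print(type(e).__name__, e)
{code}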
[jira] [Resolved] (SPARK-45345) Refactor release-build.sh
[ https://issues.apache.org/jira/browse/SPARK-45345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-45345. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43212 [https://github.com/apache/spark/pull/43212] > Refactor release-build.sh > - > > Key: SPARK-45345 > URL: https://issues.apache.org/jira/browse/SPARK-45345 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Hyukjin Kwon >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45345) Refactor release-build.sh
[ https://issues.apache.org/jira/browse/SPARK-45345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-45345: Assignee: Hyukjin Kwon > Refactor release-build.sh > - > > Key: SPARK-45345 > URL: https://issues.apache.org/jira/browse/SPARK-45345 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Hyukjin Kwon >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45009) Correlated EXISTS subqueries in join ON condition unsupported and fail with internal error
[ https://issues.apache.org/jira/browse/SPARK-45009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45009: --- Labels: pull-request-available (was: ) > Correlated EXISTS subqueries in join ON condition unsupported and fail with > internal error > -- > > Key: SPARK-45009 > URL: https://issues.apache.org/jira/browse/SPARK-45009 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Jack Chen >Priority: Major > Labels: pull-request-available > > They are not handled in decorrelation, and we also don’t have any checks to > block them, so these queries have outer references in the query plan, leading > to internal errors: > {code:java} > CREATE TEMP VIEW x(x1, x2) AS VALUES (0, 1), (1, 2); > CREATE TEMP VIEW y(y1, y2) AS VALUES (0, 2), (0, 3); > CREATE TEMP VIEW z(z1, z2) AS VALUES (0, 2), (0, 3); > select * from x left join y on x1 = y1 and exists (select * from z where z1 = > x1) > Error occurred during query planning: > org.apache.spark.sql.catalyst.plans.logical.Filter cannot be cast to > org.apache.spark.sql.execution.SparkPlan {code} > PullupCorrelatedPredicates#apply and RewritePredicateSubquery only handle > subqueries in UnaryNode; they seem to assume that subqueries cannot occur elsewhere, > like in a join ON condition. > We will need to decide whether to block them properly in analysis (i.e. give > a proper error for them), or see if we can add support for them. > Also note, scalar subqueries in the ON condition are unsupported too, but they > return a proper error. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
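Until this is supported or properly blocked, a hedged workaround sketch: precompute the EXISTS predicate with a join before the outer join, so no correlated subquery appears in the ON condition. It reuses the x/y/z views from the repro above.
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sql("""
  WITH x_flag AS (
    SELECT x.*, (z.z1 IS NOT NULL) AS has_z
    FROM x LEFT JOIN (SELECT DISTINCT z1 FROM z) z ON x.x1 = z.z1
  )
  SELECT x1, x2, y1, y2
  FROM x_flag LEFT JOIN y ON x1 = y1 AND has_z
""").show()
{code}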
[jira] [Resolved] (SPARK-45396) Add doc entry for `pyspark.ml.connect` module
[ https://issues.apache.org/jira/browse/SPARK-45396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-45396. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43210 [https://github.com/apache/spark/pull/43210] > Add doc entry for `pyspark.ml.connect` module > - > > Key: SPARK-45396 > URL: https://issues.apache.org/jira/browse/SPARK-45396 > Project: Spark > Issue Type: Sub-task > Components: Connect, ML, PySpark >Affects Versions: 3.5.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Add doc entry for `pyspark.ml.connect` module -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45396) Add doc entry for `pyspark.ml.connect` module
[ https://issues.apache.org/jira/browse/SPARK-45396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-45396: Assignee: Weichen Xu > Add doc entry for `pyspark.ml.connect` module > - > > Key: SPARK-45396 > URL: https://issues.apache.org/jira/browse/SPARK-45396 > Project: Spark > Issue Type: Sub-task > Components: Connect, ML, PySpark >Affects Versions: 3.5.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > Labels: pull-request-available > > Add doc entry for `pyspark.ml.connect` module -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45410) Add Python GitHub Action Daily Job
[ https://issues.apache.org/jira/browse/SPARK-45410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-45410. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43209 [https://github.com/apache/spark/pull/43209] > Add Python GitHub Action Daily Job > -- > > Key: SPARK-45410 > URL: https://issues.apache.org/jira/browse/SPARK-45410 > Project: Spark > Issue Type: Test > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28973) Add TimeType to Catalyst
[ https://issues.apache.org/jira/browse/SPARK-28973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-28973: --- Labels: pull-request-available (was: ) > Add TimeType to Catalyst > > > Key: SPARK-28973 > URL: https://issues.apache.org/jira/browse/SPARK-28973 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Max Gekk >Priority: Major > Labels: pull-request-available > > The time type should represent local time with microsecond precision and a valid > range of values [00:00:00.000000, 23:59:59.999999]. Internally, a time can be > stored as the number of microseconds since 00:00:00.000000. > Support `java.time.LocalTime` as the external type for Catalyst's TimeType. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
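A small worked example of the proposed encoding, in plain Python (not a Spark API): a local time stored as the number of microseconds since 00:00:00.000000.
{code:python}
import datetime

def time_to_micros(t: datetime.time) -> int:
    # Microseconds since midnight, matching the proposed internal encoding.
    return ((t.hour * 60 + t.minute) * 60 + t.second) * 1_000_000 + t.microsecond

assert time_to_micros(datetime.time(0, 0, 0, 0)) == 0
assert time_to_micros(datetime.time(23, 59, 59, 999999)) == 86_399_999_999
{code}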
[jira] [Commented] (SPARK-45418) Change CURRENT_SCHEMA() column alias to match function name
[ https://issues.apache.org/jira/browse/SPARK-45418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17771992#comment-17771992 ] Michael Zhang commented on SPARK-45418: --- I am currently working on this > Change CURRENT_SCHEMA() column alias to match function name > --- > > Key: SPARK-45418 > URL: https://issues.apache.org/jira/browse/SPARK-45418 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Michael Zhang >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45418) Change CURRENT_SCHEMA() column alias to match function name
Michael Zhang created SPARK-45418: - Summary: Change CURRENT_SCHEMA() column alias to match function name Key: SPARK-45418 URL: https://issues.apache.org/jira/browse/SPARK-45418 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.0 Reporter: Michael Zhang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
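A minimal sketch of the mismatch being reported, assuming the current behavior where CURRENT_SCHEMA() is an alias of current_database() and the result column is therefore named after the latter; the expected alias would match the function name.
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sql("SELECT CURRENT_SCHEMA()").printSchema()
# root
#  |-- current_database(): string   <- reported alias; current_schema() expected
{code}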
[jira] [Updated] (SPARK-45408) [CORE] Add RPC SSL settings to TransportConf
[ https://issues.apache.org/jira/browse/SPARK-45408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45408: --- Labels: pull-request-available (was: ) > [CORE] Add RPC SSL settings to TransportConf > > > Key: SPARK-45408 > URL: https://issues.apache.org/jira/browse/SPARK-45408 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Hasnain Lakhani >Priority: Major > Labels: pull-request-available > > Add support for the settings for SSL RPC support to TransportConf and some > associated tests + sample configs used by other tests -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45416) Sanity check that Spark Connect returns arrow batches in order
[ https://issues.apache.org/jira/browse/SPARK-45416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45416: --- Labels: pull-request-available (was: ) > Sanity check that Spark Connect returns arrow batches in order > -- > > Key: SPARK-45416 > URL: https://issues.apache.org/jira/browse/SPARK-45416 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 4.0.0 >Reporter: Juliusz Sompolski >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45417) InheritableThread doesn't inherit active session
Chungmin Lee created SPARK-45417: Summary: InheritableThread doesn't inherit active session Key: SPARK-45417 URL: https://issues.apache.org/jira/browse/SPARK-45417 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.5.0 Reporter: Chungmin Lee Repro: {code:java} from multiprocessing.pool import ThreadPool from pyspark import inheritable_thread_target from pyspark.sql import SparkSession spark = SparkSession.builder.appName("Test").getOrCreate() spark.sparkContext.setLogLevel("ERROR") def f(i, spark): print(f"{i} spark = {spark}") print(f"{i} active session = {SparkSession.getActiveSession()}") print(f"{i} local property foo = {spark.sparkContext.getLocalProperty('foo')}") spark = SparkSession.builder.appName("Test").getOrCreate() print(f"{i} spark = {spark}") print(f"{i} active session = {SparkSession.getActiveSession()}") pool = ThreadPool(4) spark.sparkContext.setLocalProperty("foo", "bar") pool.starmap(inheritable_thread_target(f), [(i, spark) for i in range(4)]){code} {{getOrCreate()}} doesn't set the active session either. The only way is calling the Java function directly: {{spark._jsparkSession.setActiveSession(spark._jsparkSession)}}. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45416) Sanity check that Spark Connect returns arrow batches in order
Juliusz Sompolski created SPARK-45416: - Summary: Sanity check that Spark Connect returns arrow batches in order Key: SPARK-45416 URL: https://issues.apache.org/jira/browse/SPARK-45416 Project: Spark Issue Type: Improvement Components: Connect Affects Versions: 4.0.0 Reporter: Juliusz Sompolski -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-45391) spark-connect-repl is not working on macOS
[ https://issues.apache.org/jira/browse/SPARK-45391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17771928#comment-17771928 ] Michael Baker commented on SPARK-45391: --- I'm getting this same error after installing spark-connect-repl in an arm64 ubuntu:focal Docker container. > spark-connect-repl is not working on macOS > -- > > Key: SPARK-45391 > URL: https://issues.apache.org/jira/browse/SPARK-45391 > Project: Spark > Issue Type: Bug > Components: Connect Contrib >Affects Versions: 3.5.0 > Environment: MacBook M2 > cs version > 2.1.7 > scala -version > Scala code runner version 2.12.18 -- Copyright 2002-2023, LAMP/EPFL and > Lightbend, Inc. >Reporter: Vu Tan >Priority: Major > > I followed > [https://spark.apache.org/docs/latest/spark-connect-overview.html#use-spark-connect-for-interactive-analysis] > to try spark-connect-repl on my local PC but got the following error: > > --- > spark-connect-repl > Exception in thread "main" java.lang.NoClassDefFoundError: > org/sparkproject/connect/client/com/google/common/io/BaseEncoding > at > org.sparkproject.connect.client.io.grpc.Metadata.<clinit>(Metadata.java:114) > at > org.apache.spark.sql.connect.client.SparkConnectClient$.<init>(SparkConnectClient.scala:329) > at > org.apache.spark.sql.connect.client.SparkConnectClient$.<clinit>(SparkConnectClient.scala) > at > org.apache.spark.sql.application.ConnectRepl$.doMain(ConnectRepl.scala:61) > at > org.apache.spark.sql.application.ConnectRepl$.main(ConnectRepl.scala:50) > at org.apache.spark.sql.application.ConnectRepl.main(ConnectRepl.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at coursier.bootstrap.launcher.a.a(Unknown Source) > at coursier.bootstrap.launcher.Launcher.main(Unknown Source) > Caused by: java.lang.ClassNotFoundException: > org.sparkproject.connect.client.com.google.common.io.BaseEncoding > at java.net.URLClassLoader.findClass(URLClassLoader.java:387) > at java.lang.ClassLoader.loadClass(ClassLoader.java:418) > at java.lang.ClassLoader.loadClass(ClassLoader.java:351) > ... 12 more > --- > > Do you have any idea why this is happening and how to solve it? > Thank you. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38200) [SQL] Spark JDBC Savemode Supports Upsert
[ https://issues.apache.org/jira/browse/SPARK-38200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17771921#comment-17771921 ] Hudson commented on SPARK-38200: User 'EnricoMi' has created a pull request for this issue: https://github.com/apache/spark/pull/41518 > [SQL] Spark JDBC Savemode Supports Upsert > - > > Key: SPARK-38200 > URL: https://issues.apache.org/jira/browse/SPARK-38200 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: melin >Priority: Major > > Upsert SQL for different databases; most databases support MERGE SQL: > sqlserver merge into sql: > [https://github.com/apache/incubator-seatunnel/blob/dev/seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/sqlserver/SqlServerDialect.java] > mysql: > [https://github.com/apache/incubator-seatunnel/blob/dev/seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/mysql/MysqlDialect.java] > oracle merge into sql: > [https://github.com/apache/incubator-seatunnel/blob/dev/seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/oracle/OracleDialect.java] > postgres: > [https://github.com/apache/incubator-seatunnel/blob/dev/seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/psql/PostgresDialect.java] > postgres merge into sql: > [https://www.postgresql.org/docs/current/sql-merge.html] > db2 merge into sql: > [https://www.ibm.com/docs/en/db2-for-zos/12?topic=statements-merge] > derby merge into sql: > [https://db.apache.org/derby/docs/10.14/ref/rrefsqljmerge.html] > h2 merge into sql: > [https://www.tutorialspoint.com/h2_database/h2_database_merge.htm] > > [~yao] > > https://github.com/melin/datatunnel/tree/master/plugins/jdbc/src/main/scala/com/superior/datatunnel/plugin/jdbc/support/dialect > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
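For illustration, a hedged sketch of the upsert statement such a save mode could emit for the MySQL dialect; the table and column names are placeholders, and this is standard MySQL syntax rather than an existing Spark API.
{code:python}
row = {"id": 1, "name": "melin"}  # placeholder row; "target" is a placeholder table
cols = ", ".join(row)
placeholders = ", ".join(["?"] * len(row))
updates = ", ".join(f"{c} = VALUES({c})" for c in row if c != "id")

mysql_upsert = (
    f"INSERT INTO target ({cols}) VALUES ({placeholders}) "
    f"ON DUPLICATE KEY UPDATE {updates}"
)
print(mysql_upsert)
# INSERT INTO target (id, name) VALUES (?, ?) ON DUPLICATE KEY UPDATE name = VALUES(name)
{code}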
[jira] [Commented] (SPARK-43991) Use the value of spark.eventLog.compression.codec set by user when write compact file
[ https://issues.apache.org/jira/browse/SPARK-43991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17771920#comment-17771920 ] Hudson commented on SPARK-43991: User 'shuyouZZ' has created a pull request for this issue: https://github.com/apache/spark/pull/41491 > Use the value of spark.eventLog.compression.codec set by user when write > compact file > - > > Key: SPARK-43991 > URL: https://issues.apache.org/jira/browse/SPARK-43991 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 3.4.0 >Reporter: shuyouZZ >Priority: Major > > Currently, if rolling event logs are enabled in SHS, only {{originalFilePath}} is used to > determine the path of the compact file. > {code:java} > override val logPath: String = originalFilePath.toUri.toString + > EventLogFileWriter.COMPACTED > {code} > If the user sets {{spark.eventLog.compression.codec}} in sparkConf to a value > different from the default, then when log compaction is triggered, the old event > log file is compacted using the default codec. > {code:java} > protected val compressionCodec = > if (shouldCompress) { > Some(CompressionCodec.createCodec(sparkConf, > sparkConf.get(EVENT_LOG_COMPRESSION_CODEC))) > } else { > None > } > private[history] val compressionCodecName = compressionCodec.map { c => > CompressionCodec.getShortName(c.getClass.getName) > } > {code} > However, the compression codec used by EventLogFileReader to read the log is > derived from the log path, so EventLogFileReader cannot read > the compacted log file normally. > {code:java} > def codecName(log: Path): Option[String] = { > // Compression codec is encoded as an extension, e.g. app_123.lzf > // Since we sanitize the app ID to not include periods, it is safe to > split on it > val logName = log.getName.stripSuffix(COMPACTED).stripSuffix(IN_PROGRESS) > logName.split("\\.").tail.lastOption > } > {code} > So we should override the {{shouldCompress}} and {{compressionCodec}} > variables in class {{CompactedEventLogFileWriter}} to use the compression > codec set by the user. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
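A small Python illustration (not Spark code) of how the reader derives the codec from the file name, mirroring {{codecName}} above: the codec is the last dot-separated extension after stripping the .compact/.inprogress suffixes, which is why a compact file written with a different codec than its name advertises cannot be read back.
{code:python}
def codec_name(log_name: str):
    # Mirrors EventLogFileReader.codecName: strip suffixes, take last extension.
    name = log_name.removesuffix(".compact").removesuffix(".inprogress")
    parts = name.split(".")
    return parts[-1] if len(parts) > 1 else None

# The compact file keeps the original extension, so the reader still expects
# the original codec even if the compact writer used the default one.
assert codec_name("app_123.lzf.compact") == "lzf"
assert codec_name("app_123") is None
{code}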
[jira] [Resolved] (SPARK-45404) Support AWS_ENDPOINT_URL env variable
[ https://issues.apache.org/jira/browse/SPARK-45404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-45404. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43205 [https://github.com/apache/spark/pull/43205] > Support AWS_ENDPOINT_URL env variable > - > > Key: SPARK-45404 > URL: https://issues.apache.org/jira/browse/SPARK-45404 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45404) Support AWS_ENDPOINT_URL env variable
[ https://issues.apache.org/jira/browse/SPARK-45404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-45404: - Assignee: Dongjoon Hyun > Support AWS_ENDPOINT_URL env variable > - > > Key: SPARK-45404 > URL: https://issues.apache.org/jira/browse/SPARK-45404 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44735) Log a warning when inserting columns with the same name by row that don't match up
[ https://issues.apache.org/jira/browse/SPARK-44735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-44735: --- Labels: pull-request-available (was: ) > Log a warning when inserting columns with the same name by row that don't > match up > -- > > Key: SPARK-44735 > URL: https://issues.apache.org/jira/browse/SPARK-44735 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.2, 3.5.0, 4.0.0 >Reporter: Holden Karau >Priority: Minor > Labels: pull-request-available > > With SPARK-42750 people can now insert by name, but sometimes people forget > it. We should log a warning when it *looks like* someone forgot it (e.g. an > insert by column position where the source columns all share the target's > names *but* do not line up positionally). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
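A minimal sketch of the situation that should trigger the warning, assuming a Spark version with BY NAME support (SPARK-42750): the source columns all match the target's names but not its order, so a positional insert silently swaps the values.
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sql("CREATE TABLE t (a INT, b INT) USING parquet")
spark.sql("INSERT INTO t SELECT 2 AS b, 1 AS a")          # positional: a=2, b=1
spark.sql("INSERT INTO t BY NAME SELECT 2 AS b, 1 AS a")  # by name:    a=1, b=2
{code}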
[jira] [Updated] (SPARK-45413) Add warning for prepare drop LevelDB support
[ https://issues.apache.org/jira/browse/SPARK-45413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jia Fan updated SPARK-45413: Summary: Add warning for prepare drop LevelDB support (was: Drop leveldb support for `spark.history.store.hybridStore.diskBackend`) > Add warning for prepare drop LevelDB support > > > Key: SPARK-45413 > URL: https://issues.apache.org/jira/browse/SPARK-45413 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Jia Fan >Priority: Major > Labels: pull-request-available > > Remove leveldb support for `spark.history.store.hybridStore.diskBackend` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45414) spark-xml misplaces string tag content
[ https://issues.apache.org/jira/browse/SPARK-45414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Giuseppe Ceravolo updated SPARK-45414: -- Description:
h1. Intro
Hi all! Please expect some degree of incompleteness in this issue as this is the very first one I post, and feel free to edit it as you like - I welcome your feedback. My goal is to provide you with as many details and indications as I can on this issue that I am currently facing with a Client of mine on its Production environment (we use Azure Databricks DBR 11.3 LTS). I was told by [Sean Owen|https://github.com/srowen], who maintains the spark-xml repository on GitHub ([https://github.com/srowen/spark-xml]), to post an issue here because "This code has been ported to Apache Spark now anyway so won't be updated here" (refer to his comment [here|#issuecomment-1744792958]).
h1. Issue
When I write a DataFrame into XML format via the spark-xml library, either (1) I get an error if empty string columns sit in between non-string nested ones, or (2) if I put all string columns at the end, I get a wrong XML where the content of string tags is misplaced into the following tags.
h1. Code to reproduce the issue
Please find below the end-to-end code snippet that results in the error.
h2. CASE (1): ERROR
When empty strings are in between non-string nested ones, the write fails with the following error.
_Caused by: java.lang.IllegalArgumentException: Failed to convert value MyDescription (class of class java.lang.String) in type ArrayType(StructType(StructField(_ID,StringType,true),StructField(_Level,StringType,true)),true) to XML._
Please find attached the full trace of the error.
{code:python}
fake_file_df = spark \
    .sql(
        """SELECT
        CAST(STRUCT('ItemId' AS `_Type`, '123' AS `_VALUE`) AS STRUCT<_Type: STRING, _VALUE: STRING>) AS ItemID,
        CAST(STRUCT('UPC' AS `_Type`, '123' AS `_VALUE`) AS STRUCT<_Type: STRING, _VALUE: STRING>) AS UPC,
        CAST('' AS STRING) AS _SerialNumberFlag,
        CAST('MyDescription' AS STRING) AS Description,
        CAST(ARRAY(STRUCT(NULL AS `_ID`, NULL AS `_Level`)) AS ARRAY<STRUCT<_ID: STRING, _Level: STRING>>) AS MerchandiseHierarchy,
        CAST(ARRAY(STRUCT(NULL AS `_ValueTypeCode`, NULL AS `_VALUE`)) AS ARRAY<STRUCT<_ValueTypeCode: STRING, _VALUE: STRING>>) AS ItemPrice,
        CAST('' AS STRING) AS Color,
        CAST('' AS STRING) AS IntendedIndustry,
        CAST(STRUCT(NULL AS `Name`) AS STRUCT<Name: STRING>) AS Manufacturer,
        CAST(STRUCT(NULL AS `Season`) AS STRUCT<Season: STRING>) AS Marketing,
        CAST(STRUCT(NULL AS `_Name`) AS STRUCT<_Name: STRING>) AS BrandOwner,
        CAST(ARRAY(STRUCT('Attribute1' AS `_Name`, 'Value1' AS `_VALUE`)) AS ARRAY<STRUCT<_Name: STRING, _VALUE: STRING>>) AS ItemAttribute_culinary,
        CAST(ARRAY(STRUCT(NULL AS `_Name`, ARRAY(ARRAY(STRUCT(NULL AS `AttributeCode`, NULL AS `AttributeValue`))) AS `_VALUE`)) AS ARRAY<STRUCT<_Name: STRING, _VALUE: ARRAY<ARRAY<STRUCT<AttributeCode: STRING, AttributeValue: STRING>>>>>) AS ItemAttribute_noculinary,
        CAST(STRUCT(STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Depth`, STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Height`, STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Width`, STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Diameter`) AS STRUCT<Depth: STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>, Height: STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>, Width: STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>, Diameter: STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>>) AS ItemMeasurements,
        CAST(STRUCT('GroupA' AS `TaxGroupID`, 'CodeA' AS `TaxExemptCode`, '1' AS `TaxAmount`) AS STRUCT<TaxGroupID: STRING, TaxExemptCode: STRING, TaxAmount: STRING>) AS TaxInformation,
        CAST('' AS STRING) AS ItemImageUrl,
        CAST(ARRAY(ARRAY(STRUCT(NULL AS `_action`, NULL AS `_franchiseeId`, NULL AS `_franchiseeName`))) AS ARRAY<ARRAY<STRUCT<_action: STRING, _franchiseeId: STRING, _franchiseeName: STRING>>>) AS ItemFranchisees,
        CAST('Add' AS STRING) AS _Action
        ;"""
    )
# fake_file_df.display()
fake_file_df \
    .coalesce(1) \
    .write \
    .format('com.databricks.spark.xml') \
    .option('declaration', 'version="1.0" encoding="UTF-8"') \
    .option("nullValue", "") \
    .option('rootTag', "root_tag") \
    .option('rowTag', "row_tag") \
    .mode('overwrite') \
    .save(xml_folder_path)
{code}
I noticed that it works if I try to write all columns up to "Color" (excluded), namely:
{code:python}
fake_file_df \
    .select(
        "ItemID",
        "UPC",
        "_SerialNumberFlag",
        "Description",
        "MerchandiseHierarchy",
        "ItemPrice"
    ) \
    .coalesce(1) \
    .write \
    .format('com.databricks.spark.xml') \
    .option('declaration', 'version="1.0" encoding="UTF-8"') \
    .option("nullValue", "") \
    .option('rootTag', "root_tag") \
    .option('rowTag', "row_tag") \
    .mode('overwrite') \
    .save(xml_folder_path)
{code}
h2. CASE (2): MISPLACED XML
When I put all string columns at the end of the 1-row DataFrame, it mistakenly writes the content of one column into the tag right after it.
{code:python}
fake_file_df =
[jira] [Reopened] (SPARK-45088) Make `getitem` work with duplicated columns
[ https://issues.apache.org/jira/browse/SPARK-45088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reopened SPARK-45088: --- Assignee: (was: Ruifeng Zheng) > Make `getitem` work with duplicated columns > --- > > Key: SPARK-45088 > URL: https://issues.apache.org/jira/browse/SPARK-45088 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45414) spark-xml misplaces string tag content
[ https://issues.apache.org/jira/browse/SPARK-45414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Giuseppe Ceravolo updated SPARK-45414: -- Description:
h1. Intro
Hi all! Please expect some degree of incompleteness in this issue as this is the very first one I post, and feel free to edit it as you like - I welcome your feedback. My goal is to provide you with as many details and indications as I can on this issue that I am currently facing with a Client of mine on its Production environment (we use Azure Databricks DBR 11.3 LTS). I was told by [Sean Owen|https://github.com/srowen], who maintains the spark-xml repository on GitHub ([https://github.com/srowen/spark-xml]), to post an issue here because "This code has been ported to Apache Spark now anyway so won't be updated here" (refer to his comment [here|#issuecomment-1744792958]).
h1. Issue
When I write a DataFrame into XML format via the spark-xml library, either (1) I get an error if empty string columns sit in between non-string nested ones, or (2) if I put all string columns at the end, I get a wrong XML where the content of string tags is misplaced into the following tags.
h1. Code to reproduce the issue
Please find below the end-to-end code snippet that results in the error.
h2. CASE (1): ERROR
When empty strings are in between non-string nested ones, the write fails with the following error.
_Caused by: java.lang.IllegalArgumentException: Failed to convert value MyDescription (class of class java.lang.String) in type ArrayType(StructType(StructField(_ID,StringType,true),StructField(_Level,StringType,true)),true) to XML._
Please find attached the full trace of the error.
{code:python}
fake_file_df = spark \
    .sql(
        """SELECT
        CAST(STRUCT('ItemId' AS `_Type`, '123' AS `_VALUE`) AS STRUCT<_Type: STRING, _VALUE: STRING>) AS ItemID,
        CAST(STRUCT('UPC' AS `_Type`, '123' AS `_VALUE`) AS STRUCT<_Type: STRING, _VALUE: STRING>) AS UPC,
        CAST('' AS STRING) AS _SerialNumberFlag,
        CAST('MyDescription' AS STRING) AS Description,
        CAST(ARRAY(STRUCT(NULL AS `_ID`, NULL AS `_Level`)) AS ARRAY<STRUCT<_ID: STRING, _Level: STRING>>) AS MerchandiseHierarchy,
        CAST(ARRAY(STRUCT(NULL AS `_ValueTypeCode`, NULL AS `_VALUE`)) AS ARRAY<STRUCT<_ValueTypeCode: STRING, _VALUE: STRING>>) AS ItemPrice,
        CAST('' AS STRING) AS Color,
        CAST('' AS STRING) AS IntendedIndustry,
        CAST(STRUCT(NULL AS `Name`) AS STRUCT<Name: STRING>) AS Manufacturer,
        CAST(STRUCT(NULL AS `Season`) AS STRUCT<Season: STRING>) AS Marketing,
        CAST(STRUCT(NULL AS `_Name`) AS STRUCT<_Name: STRING>) AS BrandOwner,
        CAST(ARRAY(STRUCT('Attribute1' AS `_Name`, 'Value1' AS `_VALUE`)) AS ARRAY<STRUCT<_Name: STRING, _VALUE: STRING>>) AS ItemAttribute_culinary,
        CAST(ARRAY(STRUCT(NULL AS `_Name`, ARRAY(ARRAY(STRUCT(NULL AS `AttributeCode`, NULL AS `AttributeValue`))) AS `_VALUE`)) AS ARRAY<STRUCT<_Name: STRING, _VALUE: ARRAY<ARRAY<STRUCT<AttributeCode: STRING, AttributeValue: STRING>>>>>) AS ItemAttribute_noculinary,
        CAST(STRUCT(STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Depth`, STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Height`, STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Width`, STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Diameter`) AS STRUCT<Depth: STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>, Height: STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>, Width: STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>, Diameter: STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>>) AS ItemMeasurements,
        CAST(STRUCT('GroupA' AS `TaxGroupID`, 'CodeA' AS `TaxExemptCode`, '1' AS `TaxAmount`) AS STRUCT<TaxGroupID: STRING, TaxExemptCode: STRING, TaxAmount: STRING>) AS TaxInformation,
        CAST('' AS STRING) AS ItemImageUrl,
        CAST(ARRAY(ARRAY(STRUCT(NULL AS `_action`, NULL AS `_franchiseeId`, NULL AS `_franchiseeName`))) AS ARRAY<ARRAY<STRUCT<_action: STRING, _franchiseeId: STRING, _franchiseeName: STRING>>>) AS ItemFranchisees,
        CAST('Add' AS STRING) AS _Action
        ;"""
    )
# fake_file_df.display()
fake_file_df \
    .coalesce(1) \
    .write \
    .format('com.databricks.spark.xml') \
    .option('declaration', 'version="1.0" encoding="UTF-8"') \
    .option("nullValue", "") \
    .option('rootTag', "root_tag") \
    .option('rowTag', "row_tag") \
    .mode('overwrite') \
    .save(xml_folder_path)
{code}
I noticed that it works if I try to write all columns up to "Color" (excluded), namely:
{code:python}
fake_file_df \
    .select(
        "ItemID",
        "UPC",
        "_SerialNumberFlag",
        "Description",
        "MerchandiseHierarchy",
        "ItemPrice"
    ) \
    .coalesce(1) \
    .write \
    .format('com.databricks.spark.xml') \
    .option('declaration', 'version="1.0" encoding="UTF-8"') \
    .option("nullValue", "") \
    .option('rootTag', "root_tag") \
    .option('rowTag', "row_tag") \
    .mode('overwrite') \
    .save(xml_folder_path)
{code}
h2. CASE (2): MISPLACED XML
When I put all string columns at the end of the 1-row DataFrame, it mistakenly writes the content of one column into the tag right after it.
{code:python}
[jira] [Updated] (SPARK-45414) spark-xml misplaces string tag content
[ https://issues.apache.org/jira/browse/SPARK-45414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Giuseppe Ceravolo updated SPARK-45414: -- Attachment: IllegalArgumentException.txt > spark-xml misplaces string tag content > -- > > Key: SPARK-45414 > URL: https://issues.apache.org/jira/browse/SPARK-45414 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core > Affects Versions: 3.3.0 > Reporter: Giuseppe Ceravolo > Priority: Critical > Attachments: IllegalArgumentException.txt > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45414) spark-xml misplaces string tag content
Giuseppe Ceravolo created SPARK-45414: - Summary: spark-xml misplaces string tag content Key: SPARK-45414 URL: https://issues.apache.org/jira/browse/SPARK-45414 Project: Spark Issue Type: Bug Components: PySpark, Spark Core Affects Versions: 3.3.0 Reporter: Giuseppe Ceravolo -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-45093) AddArtifacts should give proper error messages if it fails
[ https://issues.apache.org/jira/browse/SPARK-45093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17771836#comment-17771836 ] Nikita Awasthi commented on SPARK-45093: User 'cdkrot' has created a pull request for this issue: https://github.com/apache/spark/pull/43216
> AddArtifacts should give proper error messages if it fails
> --
>
> Key: SPARK-45093
> URL: https://issues.apache.org/jira/browse/SPARK-45093
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 3.5.0
> Reporter: Alice Sayutina
> Assignee: Alice Sayutina
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> I was testing UDFs that use code from another module, so AddArtifacts is necessary.
>
> I got the following error:
>
> {code:java}
> Traceback (most recent call last):
>   File "/Users/alice.sayutina/db-connect-playground/udf2.py", line 5, in <module>
>     spark.addArtifacts("udf2_support.py", pyfile=True)
>   File "/Users/alice.sayutina/db-connect-venv/lib/python3.10/site-packages/pyspark/sql/connect/session.py", line 744, in addArtifacts
>     self._client.add_artifacts(*path, pyfile=pyfile, archive=archive, file=file)
>   File "/Users/alice.sayutina/db-connect-venv/lib/python3.10/site-packages/pyspark/sql/connect/client/core.py", line 1582, in add_artifacts
>     self._artifact_manager.add_artifacts(*path, pyfile=pyfile, archive=archive, file=file)
>   File "/Users/alice.sayutina/db-connect-venv/lib/python3.10/site-packages/pyspark/sql/connect/client/artifact.py", line 283, in add_artifacts
>     self._request_add_artifacts(requests)
>   File "/Users/alice.sayutina/db-connect-venv/lib/python3.10/site-packages/pyspark/sql/connect/client/artifact.py", line 259, in _request_add_artifacts
>     response: proto.AddArtifactsResponse = self._retrieve_responses(requests)
>   File "/Users/alice.sayutina/db-connect-venv/lib/python3.10/site-packages/pyspark/sql/connect/client/artifact.py", line 256, in _retrieve_responses
>     return self._stub.AddArtifacts(requests, metadata=self._metadata)
>   File "/Users/alice.sayutina/db-connect-venv/lib/python3.10/site-packages/grpc/_channel.py", line 1246, in __call__
>     return _end_unary_response_blocking(state, call, False, None)
>   File "/Users/alice.sayutina/db-connect-venv/lib/python3.10/site-packages/grpc/_channel.py", line 910, in _end_unary_response_blocking
>     raise _InactiveRpcError(state)  # pytype: disable=not-instantiable
> grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
>     status = StatusCode.UNKNOWN
>     details = "Exception iterating requests!"
>     debug_error_string = "None"
> {code}
>
> This doesn't give any clue about what happened. Only after considerable investigation did I find the problem: I was specifying the wrong path, so the artifact failed to upload. Specifically, ArtifactManager doesn't read the file immediately; it creates an iterator object that incrementally generates the requests to send. This iterator is passed to gRPC's stream_unary to consume and actually send, and while gRPC catches the error (see above), it suppresses the underlying exception.
> I think we should improve the PySpark user experience. One possible fix is to wrap ArtifactManager._create_requests with an iterator wrapper that logs the throwable to the Spark Connect logger, so that the user would see something like the message below, at least when debug mode is on.
> > {code:java} > FileNotFoundError: [Errno 2] No such file or directory: > '/Users/alice.sayutina/udf2_support.py' {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
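For illustration, a minimal sketch of the iterator-wrapper idea described above, assuming Python's standard logging module stands in for the Spark Connect logger; the function name _log_exceptions and its placement are hypothetical, not the actual patch in the linked pull request:

{code:python}
import logging

logger = logging.getLogger("pyspark.sql.connect.client.artifact")


def _log_exceptions(requests):
    """Hypothetical wrapper around the request iterator produced by
    ArtifactManager._create_requests: log and re-raise any error raised
    while gRPC consumes the iterator, so the root cause (for example a
    FileNotFoundError for a bad path) is not silently swallowed."""
    try:
        for request in requests:
            yield request
    except Exception:
        logger.exception("add_artifacts failed while generating requests")
        raise

# Usage sketch, mirroring the call in the traceback above:
# self._stub.AddArtifacts(_log_exceptions(requests), metadata=self._metadata)
{code}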
[jira] [Updated] (SPARK-45413) Drop leveldb support for `spark.history.store.hybridStore.diskBackend`
[ https://issues.apache.org/jira/browse/SPARK-45413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45413: --- Labels: pull-request-available (was: ) > Drop leveldb support for `spark.history.store.hybridStore.diskBackend` > -- > > Key: SPARK-45413 > URL: https://issues.apache.org/jira/browse/SPARK-45413 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Jia Fan >Priority: Major > Labels: pull-request-available > > Remove leveldb support for `spark.history.store.hybridStore.diskBackend` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
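For context, a minimal sketch of the History Server configuration involved, assuming ROCKSDB remains the only supported value once LevelDB support is removed (spark-defaults.conf style):

{code}
spark.history.store.hybridStore.enabled      true
spark.history.store.hybridStore.diskBackend  ROCKSDB
{code}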
[jira] [Updated] (SPARK-45394) Retry handling for add_artifact
[ https://issues.apache.org/jira/browse/SPARK-45394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45394: --- Labels: pull-request-available (was: ) > Retry handling for add_artifact > --- > > Key: SPARK-45394 > URL: https://issues.apache.org/jira/browse/SPARK-45394 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.5.0, 4.0.0 >Reporter: Alice Sayutina >Priority: Major > Labels: pull-request-available > > There is no retry handling within add_artifact -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
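For illustration, a minimal sketch of what client-side retry handling around add_artifacts could look like, assuming gRPC status codes decide retryability; the helper name, the retryable set, and the backoff policy are assumptions, not the actual implementation:

{code:python}
import time

import grpc

# Status codes worth retrying; an assumption, not PySpark's actual policy.
_RETRYABLE = {grpc.StatusCode.UNAVAILABLE, grpc.StatusCode.DEADLINE_EXCEEDED}


def add_artifacts_with_retry(client, *paths, attempts=3, backoff=1.0, **kwargs):
    """Hypothetical helper: retry client.add_artifacts() on transient
    gRPC errors with exponential backoff, re-raising anything else."""
    for attempt in range(1, attempts + 1):
        try:
            return client.add_artifacts(*paths, **kwargs)
        except grpc.RpcError as e:
            if e.code() not in _RETRYABLE or attempt == attempts:
                raise
            time.sleep(backoff * 2 ** (attempt - 1))
{code}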
[jira] [Commented] (SPARK-44223) Drop leveldb support
[ https://issues.apache.org/jira/browse/SPARK-44223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17771824#comment-17771824 ] Jia Fan commented on SPARK-44223: - A follow-up task to drop leveldb support for `spark.shuffle.service.db.backend` will be created after 4.0.0 is released. > Drop leveldb support > > > Key: SPARK-44223 > URL: https://issues.apache.org/jira/browse/SPARK-44223 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Major > Labels: pull-request-available > > The leveldb project seems to be no longer maintained, and we can always > replace it with rocksdb. I think we can remove support for and dependencies on > leveldb in Spark 4.0 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45413) Drop leveldb support for `spark.history.store.hybridStore.diskBackend`
Jia Fan created SPARK-45413: --- Summary: Drop leveldb support for `spark.history.store.hybridStore.diskBackend` Key: SPARK-45413 URL: https://issues.apache.org/jira/browse/SPARK-45413 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 4.0.0 Reporter: Jia Fan Remove leveldb support for `spark.history.store.hybridStore.diskBackend` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45412) Validate the plan and session in DataFrame.__init__
[ https://issues.apache.org/jira/browse/SPARK-45412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45412: --- Labels: pull-request-available (was: ) > Validate the plan and session in DataFrame.__init__ > --- > > Key: SPARK-45412 > URL: https://issues.apache.org/jira/browse/SPARK-45412 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45412) Validate the plan and session in DataFrame.__init__
Ruifeng Zheng created SPARK-45412: - Summary: Validate the plan and session in DataFrame.__init__ Key: SPARK-45412 URL: https://issues.apache.org/jira/browse/SPARK-45412 Project: Spark Issue Type: Improvement Components: Connect, PySpark Affects Versions: 4.0.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45355) Fix function groups in Scala Doc
[ https://issues.apache.org/jira/browse/SPARK-45355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-45355: - Assignee: Ruifeng Zheng > Fix function groups in Scala Doc > > > Key: SPARK-45355 > URL: https://issues.apache.org/jira/browse/SPARK-45355 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45355) Fix function groups in Scala Doc
[ https://issues.apache.org/jira/browse/SPARK-45355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-45355. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43148 [https://github.com/apache/spark/pull/43148] > Fix function groups in Scala Doc > > > Key: SPARK-45355 > URL: https://issues.apache.org/jira/browse/SPARK-45355 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43656) Fix pyspark.sql.column._to_java_column to accept Connect Column
[ https://issues.apache.org/jira/browse/SPARK-43656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-43656: --- Labels: pull-request-available (was: ) > Fix pyspark.sql.column._to_java_column to accept Connect Column > --- > > Key: SPARK-43656 > URL: https://issues.apache.org/jira/browse/SPARK-43656 > Project: Spark > Issue Type: Sub-task > Components: Connect, Pandas API on Spark >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Priority: Major > Labels: pull-request-available > > Run `NumPyCompatParityTests.test_np_spark_compat_frame` to repro. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45386) Correctness issue when persisting using StorageLevel.NONE
[ https://issues.apache.org/jira/browse/SPARK-45386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-45386: -- Assignee: (was: Apache Spark) > Correctness issue when persisting using StorageLevel.NONE > - > > Key: SPARK-45386 > URL: https://issues.apache.org/jira/browse/SPARK-45386 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0, 4.0.0 >Reporter: Emil Ejbyfeldt >Priority: Major > Labels: pull-request-available > > When using Spark 3.5.0, this code > {code:java} > import org.apache.spark.storage.StorageLevel > spark.createDataset(Seq(1,2,3)).persist(StorageLevel.NONE).count() {code} > incorrectly returns 0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45386) Correctness issue when persisting using StorageLevel.NONE
[ https://issues.apache.org/jira/browse/SPARK-45386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-45386: -- Assignee: Apache Spark > Correctness issue when persisting using StorageLevel.NONE > - > > Key: SPARK-45386 > URL: https://issues.apache.org/jira/browse/SPARK-45386 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0, 4.0.0 >Reporter: Emil Ejbyfeldt >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > > When using Spark 3.5.0, this code > {code:java} > import org.apache.spark.storage.StorageLevel > spark.createDataset(Seq(1,2,3)).persist(StorageLevel.NONE).count() {code} > incorrectly returns 0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
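For reference, a PySpark sketch equivalent to the Scala repro above. Whether the Python API reproduces the wrong count in the affected versions is an assumption here; the report uses the Scala Dataset API.
{code:python}
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repro").getOrCreate()
df = spark.createDataFrame([(1,), (2,), (3,)], ["v"])
# Expected: 3. Per the report, the Scala equivalent incorrectly
# returns 0 in Spark 3.5.0 when persisted with StorageLevel.NONE.
print(df.persist(StorageLevel.NONE).count())
{code}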
[jira] [Assigned] (SPARK-45355) Fix function groups in Scala Doc
[ https://issues.apache.org/jira/browse/SPARK-45355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-45355: -- Assignee: (was: Apache Spark) > Fix function groups in Scala Doc > > > Key: SPARK-45355 > URL: https://issues.apache.org/jira/browse/SPARK-45355 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45355) Fix function groups in Scala Doc
[ https://issues.apache.org/jira/browse/SPARK-45355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-45355: -- Assignee: Apache Spark > Fix function groups in Scala Doc > > > Key: SPARK-45355 > URL: https://issues.apache.org/jira/browse/SPARK-45355 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45345) Refactor release-build.sh
[ https://issues.apache.org/jira/browse/SPARK-45345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-45345: -- Assignee: Apache Spark > Refactor release-build.sh > - > > Key: SPARK-45345 > URL: https://issues.apache.org/jira/browse/SPARK-45345 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45345) Refactor release-build.sh
[ https://issues.apache.org/jira/browse/SPARK-45345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-45345: -- Assignee: (was: Apache Spark) > Refactor release-build.sh > - > > Key: SPARK-45345 > URL: https://issues.apache.org/jira/browse/SPARK-45345 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Deleted] (SPARK-45411) Fix snapshot build in publish_snapshot.yml
[ https://issues.apache.org/jira/browse/SPARK-45411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon deleted SPARK-45411: - > Fix snapshot build in publish_snapshot.yml > -- > > Key: SPARK-45411 > URL: https://issues.apache.org/jira/browse/SPARK-45411 > Project: Spark > Issue Type: Sub-task >Reporter: Hyukjin Kwon >Priority: Major > > Snapshot builds at https://github.com/apache/spark/actions/workflows/publish_snapshot.yml > are failing. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45411) Fix snapshot build in publish_snapshot.yml
Hyukjin Kwon created SPARK-45411: Summary: Fix snapshot build in publish_snapshot.yml Key: SPARK-45411 URL: https://issues.apache.org/jira/browse/SPARK-45411 Project: Spark Issue Type: Sub-task Components: Build Affects Versions: 4.0.0 Reporter: Hyukjin Kwon Snapshot builds at https://github.com/apache/spark/actions/workflows/publish_snapshot.yml are failing. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45345) Refactor release-build.sh
[ https://issues.apache.org/jira/browse/SPARK-45345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45345: --- Labels: pull-request-available (was: ) > Refactor release-build.sh > - > > Key: SPARK-45345 > URL: https://issues.apache.org/jira/browse/SPARK-45345 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45398) Include `ESCAPE` to `sql()` of `Like`
[ https://issues.apache.org/jira/browse/SPARK-45398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-45398. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43196 [https://github.com/apache/spark/pull/43196] > Include `ESCAPE` to `sql()` of `Like` > - > > Key: SPARK-45398 > URL: https://issues.apache.org/jira/browse/SPARK-45398 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Fix the `sql()` method of the `Like` expression and append the `ESCAPE` > clause. That makes it consistent with `toString` and fixes the issue: > {code:sql} > spark-sql (default)> create temp view tbl as (SELECT 'a|_' like 'a||_' escape > '|', 'a|_' like 'a||_' escape 'a'); > [COLUMN_ALREADY_EXISTS] The column `a|_ like a||_` already exists. Consider > to choose another name or rename the existing column. > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45398) Include `ESCAPE` to `sql()` of `Like`
[ https://issues.apache.org/jira/browse/SPARK-45398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45398: --- Labels: pull-request-available (was: ) > Include `ESCAPE` to `sql()` of `Like` > - > > Key: SPARK-45398 > URL: https://issues.apache.org/jira/browse/SPARK-45398 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Labels: pull-request-available > > Fix the `sql()` method of the `Like` expression and append the `ESCAPE` > clause. That makes it consistent with `toString` and fixes the issue: > {code:sql} > spark-sql (default)> create temp view tbl as (SELECT 'a|_' like 'a||_' escape > '|', 'a|_' like 'a||_' escape 'a'); > [COLUMN_ALREADY_EXISTS] The column `a|_ like a||_` already exists. Consider > to choose another name or rename the existing column. > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
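The collision above happens because sql() drops the ESCAPE clause, so both projected columns render to the same generated name. A hedged sketch of the behavior, assuming an active SparkSession named spark; the exact generated column names are illustrative:
{code:python}
# Before the fix, both columns are named `a|_ like a||_`, so creating a
# view over this query fails with COLUMN_ALREADY_EXISTS. With ESCAPE
# included in sql(), the generated names differ and the view can be created.
spark.sql(
    "SELECT 'a|_' like 'a||_' escape '|', 'a|_' like 'a||_' escape 'a'"
).printSchema()
{code}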
[jira] [Updated] (SPARK-45400) Refer to the unescaping rules from expression descriptions
[ https://issues.apache.org/jira/browse/SPARK-45400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45400: --- Labels: pull-request-available (was: ) > Refer to the unescaping rules from expression descriptions > -- > > Key: SPARK-45400 > URL: https://issues.apache.org/jira/browse/SPARK-45400 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 4.0.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Minor > Labels: pull-request-available > > Update the expression/function descriptions and refer to the unescaping rules > in the items where regexp parameters are described. This should make them less > confusing for users. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
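As context for the documentation change: the unescaping rules matter because SQL string literals are unescaped once by the parser before the regexp engine sees them (with spark.sql.parser.escapedStringLiterals at its default of false). A minimal sketch of the user-facing effect, assuming an active SparkSession named spark:
{code:python}
# The SQL literal '\\d' is unescaped by the parser to the regexp \d,
# so matching a digit requires doubling the backslash in the literal.
spark.sql(r"SELECT regexp_extract('a1', '\\d', 0)").show()  # extracts '1'
{code}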
[jira] [Commented] (SPARK-44223) Drop leveldb support
[ https://issues.apache.org/jira/browse/SPARK-44223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17771736#comment-17771736 ] Dongjoon Hyun commented on SPARK-44223: --- This will happen in Apache Spark 4.1.0 because SPARK-45351 is targeting Apache Spark 4.0.0. > Drop leveldb support > > > Key: SPARK-44223 > URL: https://issues.apache.org/jira/browse/SPARK-44223 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Major > Labels: pull-request-available > > The leveldb project seems to be no longer maintained, and we can always > replace it with rocksdb. I think we can remove support and dependencies on > leveldb in Spark 4.0 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44223) Drop leveldb support
[ https://issues.apache.org/jira/browse/SPARK-44223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-44223: -- Parent: (was: SPARK-44111) Issue Type: Task (was: Sub-task) > Drop leveldb support > > > Key: SPARK-44223 > URL: https://issues.apache.org/jira/browse/SPARK-44223 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Major > Labels: pull-request-available > > The leveldb project seems to be no longer maintained, and we can always > replace it with rocksdb. I think we can remove support and dependencies on > leveldb in Spark 4.0 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45409) Pin `torch<=2.0.1`
[ https://issues.apache.org/jira/browse/SPARK-45409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-45409. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43207 [https://github.com/apache/spark/pull/43207] > Pin `torch<=2.0.1` > -- > > Key: SPARK-45409 > URL: https://issues.apache.org/jira/browse/SPARK-45409 > Project: Spark > Issue Type: Test > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45409) Pin `torch<=2.0.1`
[ https://issues.apache.org/jira/browse/SPARK-45409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-45409: - Assignee: Dongjoon Hyun > Pin `torch<=2.0.1` > -- > > Key: SPARK-45409 > URL: https://issues.apache.org/jira/browse/SPARK-45409 > Project: Spark > Issue Type: Test > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45407) Skip Unidoc in SparkR GitHub Action Job
[ https://issues.apache.org/jira/browse/SPARK-45407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-45407: - Assignee: Dongjoon Hyun > Skip Unidoc in SparkR GitHub Action Job > --- > > Key: SPARK-45407 > URL: https://issues.apache.org/jira/browse/SPARK-45407 > Project: Spark > Issue Type: Test > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45407) Skip Unidoc in SparkR GitHub Action Job
[ https://issues.apache.org/jira/browse/SPARK-45407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-45407. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43208 [https://github.com/apache/spark/pull/43208] > Skip Unidoc in SparkR GitHub Action Job > --- > > Key: SPARK-45407 > URL: https://issues.apache.org/jira/browse/SPARK-45407 > Project: Spark > Issue Type: Test > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org