[jira] [Resolved] (SPARK-43719) Handle missing row.excludedInStages field
[ https://issues.apache.org/jira/browse/SPARK-43719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-43719. --- Fix Version/s: 3.3.3 3.5.0 3.4.1 Resolution: Fixed Issue resolved by pull request 41266 [https://github.com/apache/spark/pull/41266] > Handle missing row.excludedInStages field > - > > Key: SPARK-43719 > URL: https://issues.apache.org/jira/browse/SPARK-43719 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.2, 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 3.3.3, 3.5.0, 3.4.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43719) Handle missing row.excludedInStages field
[ https://issues.apache.org/jira/browse/SPARK-43719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-43719: - Assignee: Dongjoon Hyun > Handle missing row.excludedInStages field > - > > Key: SPARK-43719 > URL: https://issues.apache.org/jira/browse/SPARK-43719 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.2, 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor
[jira] [Updated] (SPARK-43719) Handle missing row.excludedInStages field
[ https://issues.apache.org/jira/browse/SPARK-43719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-43719: -- Affects Version/s: 3.3.2 > Handle missing row.excludedInStages field > - > > Key: SPARK-43719 > URL: https://issues.apache.org/jira/browse/SPARK-43719 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.2, 3.4.0 >Reporter: Dongjoon Hyun >Priority: Minor
[jira] [Resolved] (SPARK-43718) References to a specific side's key in a USING join can have wrong nullability
[ https://issues.apache.org/jira/browse/SPARK-43718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-43718. --- Fix Version/s: 3.3.3 3.5.0 3.4.1 Resolution: Fixed Issue resolved by pull request 41267 [https://github.com/apache/spark/pull/41267] > References to a specific side's key in a USING join can have wrong nullability > -- > > Key: SPARK-43718 > URL: https://issues.apache.org/jira/browse/SPARK-43718 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.2, 3.4.0, 3.5.0 >Reporter: Bruce Robbins >Assignee: Bruce Robbins >Priority: Major > Labels: correctness > Fix For: 3.3.3, 3.5.0, 3.4.1 > > > Assume this data: > {noformat} > create or replace temp view t1 as values (1), (2), (3) as (c1); > create or replace temp view t2 as values (2), (3), (4) as (c1); > {noformat} > The following query produces incorrect results: > {noformat} > spark-sql (default)> select explode(array(t1.c1, t2.c1)) as x1 > from t1 > full outer join t2 > using (c1); > 1 > -1 <== should be null > 2 > 2 > 3 > 3 > -1 <== should be null > 4 > Time taken: 0.663 seconds, Fetched 8 row(s) > spark-sql (default)> > {noformat} > Similar issues occur with right outer join and left outer join. > {{t1.c1}} and {{t2.c1}} have the wrong nullability at the time the array is > resolved, so the array's {{containsNull}} value is incorrect. > Queries that don't use arrays also can get wrong results. 
Assume this data: > {noformat} > create or replace temp view t1 as values (0), (1), (2) as (c1); > create or replace temp view t2 as values (1), (2), (3) as (c1); > create or replace temp view t3 as values (1, 2), (3, 4), (4, 5) as (a, b); > {noformat} > The following query produces incorrect results: > {noformat} > select t1.c1 as t1_c1, t2.c1 as t2_c1, b > from t1 > full outer join t2 > using (c1), > lateral ( > select b > from t3 > where a = coalesce(t2.c1, 1) > ) lt3; > 1 1 2 > NULL 3 4 > Time taken: 2.395 seconds, Fetched 2 row(s) > spark-sql (default)> > {noformat} > The result should be the following: > {noformat} > 0 NULL 2 > 1 1 2 > NULL 3 4 > {noformat}
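The nullability bug above can be illustrated with a small pure-Python sketch (an illustration of the semantics, not Spark's implementation; the function name `full_outer_join` is made up for this example). In a FULL OUTER ... USING join, rows that exist on only one side leave the other side's key null, so any expression over `t1.c1` or `t2.c1` — such as the array in the report — must be treated as nullable (`containsNull = true`):

```python
# Hypothetical sketch of FULL OUTER JOIN ... USING (c1) semantics over key
# lists; the unmatched side of a row becomes None, just as t1.c1 / t2.c1
# become NULL in the Spark query above.
def full_outer_join(left, right):
    """Full outer join of two key lists; a missing side is None."""
    keys = sorted(set(left) | set(right))
    return [(k if k in left else None, k if k in right else None) for k in keys]

t1 = [1, 2, 3]
t2 = [2, 3, 4]
rows = full_outer_join(t1, t2)
# rows == [(1, None), (2, 2), (3, 3), (None, 4)]

# An array built from (t1.c1, t2.c1) therefore contains nulls; marking its
# containsNull flag false is what produced the bogus -1 values in the report.
arrays = [[a, b] for a, b in rows]
```

The fix in SPARK-43718 corrects the nullability of the side-specific key references so that downstream expressions like `array(t1.c1, t2.c1)` resolve with the correct `containsNull`.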
[jira] [Assigned] (SPARK-43718) References to a specific side's key in a USING join can have wrong nullability
[ https://issues.apache.org/jira/browse/SPARK-43718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-43718: - Assignee: Bruce Robbins > References to a specific side's key in a USING join can have wrong nullability > -- > > Key: SPARK-43718 > URL: https://issues.apache.org/jira/browse/SPARK-43718 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.2, 3.4.0, 3.5.0 >Reporter: Bruce Robbins >Assignee: Bruce Robbins >Priority: Major > Labels: correctness > > Assume this data: > {noformat} > create or replace temp view t1 as values (1), (2), (3) as (c1); > create or replace temp view t2 as values (2), (3), (4) as (c1); > {noformat} > The following query produces incorrect results: > {noformat} > spark-sql (default)> select explode(array(t1.c1, t2.c1)) as x1 > from t1 > full outer join t2 > using (c1); > 1 > -1 <== should be null > 2 > 2 > 3 > 3 > -1 <== should be null > 4 > Time taken: 0.663 seconds, Fetched 8 row(s) > spark-sql (default)> > {noformat} > Similar issues occur with right outer join and left outer join. > {{t1.c1}} and {{t2.c1}} have the wrong nullability at the time the array is > resolved, so the array's {{containsNull}} value is incorrect. > Queries that don't use arrays also can get wrong results. 
Assume this data: > {noformat} > create or replace temp view t1 as values (0), (1), (2) as (c1); > create or replace temp view t2 as values (1), (2), (3) as (c1); > create or replace temp view t3 as values (1, 2), (3, 4), (4, 5) as (a, b); > {noformat} > The following query produces incorrect results: > {noformat} > select t1.c1 as t1_c1, t2.c1 as t2_c1, b > from t1 > full outer join t2 > using (c1), > lateral ( > select b > from t3 > where a = coalesce(t2.c1, 1) > ) lt3; > 1 1 2 > NULL 3 4 > Time taken: 2.395 seconds, Fetched 2 row(s) > spark-sql (default)> > {noformat} > The result should be the following: > {noformat} > 0 NULL 2 > 1 1 2 > NULL 3 4 > {noformat}
[jira] [Resolved] (SPARK-43590) Make `CheckConnectJvmClientCompatibility` to compare client and protobuf
[ https://issues.apache.org/jira/browse/SPARK-43590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie resolved SPARK-43590. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41235 [https://github.com/apache/spark/pull/41235] > Make `CheckConnectJvmClientCompatibility` to compare client and protobuf > - > > Key: SPARK-43590 > URL: https://issues.apache.org/jira/browse/SPARK-43590 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.5.0
[jira] [Assigned] (SPARK-43590) Make `CheckConnectJvmClientCompatibility` to compare client and protobuf
[ https://issues.apache.org/jira/browse/SPARK-43590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie reassigned SPARK-43590: Assignee: Yang Jie > Make `CheckConnectJvmClientCompatibility` to compare client and protobuf > - > > Key: SPARK-43590 > URL: https://issues.apache.org/jira/browse/SPARK-43590 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major
[jira] [Commented] (SPARK-43739) Upgrade commons-io to 2.12.0
[ https://issues.apache.org/jira/browse/SPARK-43739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725200#comment-17725200 ] Snoot.io commented on SPARK-43739: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/41271 > Upgrade commons-io to 2.12.0 > > > Key: SPARK-43739 > URL: https://issues.apache.org/jira/browse/SPARK-43739 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Priority: Minor
[jira] [Commented] (SPARK-43603) Rebalance pyspark.pandas.DataFrame Unit Tests
[ https://issues.apache.org/jira/browse/SPARK-43603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725198#comment-17725198 ] Snoot.io commented on SPARK-43603: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/41258 > Rebalance pyspark.pandas.DataFrame Unit Tests > - > > Key: SPARK-43603 > URL: https://issues.apache.org/jira/browse/SPARK-43603 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark, Tests >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Priority: Minor
[jira] [Commented] (SPARK-43384) Make `df.show` print a nice string for MapType
[ https://issues.apache.org/jira/browse/SPARK-43384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725195#comment-17725195 ] Snoot.io commented on SPARK-43384: -- User 'Yikf' has created a pull request for this issue: https://github.com/apache/spark/pull/41065 > Make `df.show` print a nice string for MapType > -- > > Key: SPARK-43384 > URL: https://issues.apache.org/jira/browse/SPARK-43384 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: yikaifei >Priority: Minor > > Make `df.show` print a nice string for MapType.
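The kind of rendering SPARK-43384 asks for can be sketched in a few lines of pure Python (an assumed illustration of the `{key -> value}` style Spark uses for map values elsewhere, not the actual `df.show` implementation; `format_map` is a hypothetical name):

```python
# Hypothetical formatter rendering a map as "{key -> value, ...}",
# the readable style the issue proposes for MapType columns in df.show.
def format_map(m):
    return "{" + ", ".join(f"{k} -> {v}" for k, v in m.items()) + "}"

print(format_map({1: "a", 2: "b"}))  # {1 -> a, 2 -> b}
```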
[jira] [Commented] (SPARK-43719) Handle missing row.excludedInStages field
[ https://issues.apache.org/jira/browse/SPARK-43719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725193#comment-17725193 ] Snoot.io commented on SPARK-43719: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/41266 > Handle missing row.excludedInStages field > - > > Key: SPARK-43719 > URL: https://issues.apache.org/jira/browse/SPARK-43719 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Minor
[jira] [Commented] (SPARK-43738) Upgrade dropwizard metrics 4.2.18
[ https://issues.apache.org/jira/browse/SPARK-43738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725191#comment-17725191 ] Snoot.io commented on SPARK-43738: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/41270 > Upgrade dropwizard metrics 4.2.18 > - > > Key: SPARK-43738 > URL: https://issues.apache.org/jira/browse/SPARK-43738 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Priority: Minor
[jira] [Created] (SPARK-43740) Hide unsupported session methods from auto-completion
Ruifeng Zheng created SPARK-43740: - Summary: Hide unsupported session methods from auto-completion Key: SPARK-43740 URL: https://issues.apache.org/jira/browse/SPARK-43740 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.5.0 Reporter: Ruifeng Zheng
[jira] [Commented] (SPARK-43586) There will be many invalid tasks when `Range.numSlices` > `Range.numElements`
[ https://issues.apache.org/jira/browse/SPARK-43586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725185#comment-17725185 ] Snoot.io commented on SPARK-43586: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/41230 > There will be many invalid tasks when `Range.numSlices` > `Range.numElements` > - > > Key: SPARK-43586 > URL: https://issues.apache.org/jira/browse/SPARK-43586 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Yang Jie >Priority: Minor > Attachments: image-2023-05-19-13-01-19-589.png > > > For example, start a spark shell with `--master "local[100]"`, then run > `spark.range(10).map(_ + 1).reduce(_ + _)`, there will be 100 tasks in the > job, although there are only 10 elements in the Range: > !image-2023-05-19-13-01-19-589.png|width=733,height=203!
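Why 100 tasks appear for only 10 elements can be seen with a small pure-Python sketch of contiguous range slicing (an approximation of how `spark.range` partitions elements, not Spark's actual code; `range_slices` is a made-up name):

```python
# Split [0, num_elements) into num_slices contiguous slices. When
# num_slices > num_elements, many slices are empty (start == end),
# yet each slice still becomes a task.
def range_slices(num_elements, num_slices):
    slices = []
    for i in range(num_slices):
        start = i * num_elements // num_slices
        end = (i + 1) * num_elements // num_slices
        slices.append((start, end))
    return slices

slices = range_slices(10, 100)
empty = sum(1 for s, e in slices if s == e)
# With 10 elements and 100 slices, 90 slices cover no elements:
# 90 of the 100 tasks do no useful work, matching the issue report.
```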
[jira] [Commented] (SPARK-42958) Refactor `CheckConnectJvmClientCompatibility` to compare client and avro
[ https://issues.apache.org/jira/browse/SPARK-42958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725184#comment-17725184 ] Snoot.io commented on SPARK-42958: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/41233 > Refactor `CheckConnectJvmClientCompatibility` to compare client and avro > > > Key: SPARK-42958 > URL: https://issues.apache.org/jira/browse/SPARK-42958 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.5.0
[jira] [Commented] (SPARK-43590) Make `CheckConnectJvmClientCompatibility` to compare client and protobuf
[ https://issues.apache.org/jira/browse/SPARK-43590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725183#comment-17725183 ] Snoot.io commented on SPARK-43590: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/41235 > Make `CheckConnectJvmClientCompatibility` to compare client and protobuf > - > > Key: SPARK-43590 > URL: https://issues.apache.org/jira/browse/SPARK-43590 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.5.0 >Reporter: Yang Jie >Priority: Major
[jira] [Commented] (SPARK-43612) Python: Artifact transfer from Scala/JVM client to Server
[ https://issues.apache.org/jira/browse/SPARK-43612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725182#comment-17725182 ] Snoot.io commented on SPARK-43612: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/41250 > Python: Artifact transfer from Scala/JVM client to Server > - > > Key: SPARK-43612 > URL: https://issues.apache.org/jira/browse/SPARK-43612 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.5.0 > > > Should implement https://issues.apache.org/jira/browse/SPARK-42653 in Python > Spark Connect client.
[jira] [Commented] (SPARK-43625) Document the difference between `Drop(column)` and `Drop(columnName)`
[ https://issues.apache.org/jira/browse/SPARK-43625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725181#comment-17725181 ] Snoot.io commented on SPARK-43625: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/41273 > Document the difference between `Drop(column)` and `Drop(columnName)` > - > > Key: SPARK-43625 > URL: https://issues.apache.org/jira/browse/SPARK-43625 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Priority: Major
[jira] [Resolved] (SPARK-43612) Python: Artifact transfer from Scala/JVM client to Server
[ https://issues.apache.org/jira/browse/SPARK-43612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-43612. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41250 [https://github.com/apache/spark/pull/41250] > Python: Artifact transfer from Scala/JVM client to Server > - > > Key: SPARK-43612 > URL: https://issues.apache.org/jira/browse/SPARK-43612 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.5.0 > > > Should implement https://issues.apache.org/jira/browse/SPARK-42653 in Python > Spark Connect client.
[jira] [Assigned] (SPARK-43612) Python: Artifact transfer from Scala/JVM client to Server
[ https://issues.apache.org/jira/browse/SPARK-43612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-43612: Assignee: Hyukjin Kwon > Python: Artifact transfer from Scala/JVM client to Server > - > > Key: SPARK-43612 > URL: https://issues.apache.org/jira/browse/SPARK-43612 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > Should implement https://issues.apache.org/jira/browse/SPARK-42653 in Python > Spark Connect client.
[jira] [Commented] (SPARK-43540) Add working directory into classpath on the driver in K8S cluster mode
[ https://issues.apache.org/jira/browse/SPARK-43540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725178#comment-17725178 ] Snoot.io commented on SPARK-43540: -- User 'turboFei' has created a pull request for this issue: https://github.com/apache/spark/pull/41201 > Add working directory into classpath on the driver in K8S cluster mode > -- > > Key: SPARK-43540 > URL: https://issues.apache.org/jira/browse/SPARK-43540 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Fei Wang >Priority: Major > > In YARN cluster mode, the passed files/jars are accessible to the > classloader, but this appears not to be the case in Kubernetes cluster mode. > After SPARK-33782, spark.files and spark.jars are placed under > the current working directory on the driver in K8S cluster mode, but they > do not seem to be accessible to the classloader. > > We need to add the current working directory to the classpath.
[jira] [Commented] (SPARK-43625) Document the difference between `Drop(column)` and `Drop(columnName)`
[ https://issues.apache.org/jira/browse/SPARK-43625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725179#comment-17725179 ] Snoot.io commented on SPARK-43625: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/41273 > Document the difference between `Drop(column)` and `Drop(columnName)` > - > > Key: SPARK-43625 > URL: https://issues.apache.org/jira/browse/SPARK-43625 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Priority: Major
[jira] [Commented] (SPARK-43737) Upgrade zstd-jni to 1.5.5-3
[ https://issues.apache.org/jira/browse/SPARK-43737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725177#comment-17725177 ] Snoot.io commented on SPARK-43737: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/41269 > Upgrade zstd-jni to 1.5.5-3 > --- > > Key: SPARK-43737 > URL: https://issues.apache.org/jira/browse/SPARK-43737 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Priority: Minor
[jira] [Updated] (SPARK-43625) Document the difference between `Drop(column)` and `Drop(columnName)`
[ https://issues.apache.org/jira/browse/SPARK-43625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng updated SPARK-43625: -- Component/s: (was: SQL) Documentation > Document the difference between `Drop(column)` and `Drop(columnName)` > - > > Key: SPARK-43625 > URL: https://issues.apache.org/jira/browse/SPARK-43625 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Priority: Major
[jira] [Commented] (SPARK-43651) Assign a name to the error class _LEGACY_ERROR_TEMP_2403
[ https://issues.apache.org/jira/browse/SPARK-43651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725168#comment-17725168 ] Snoot.io commented on SPARK-43651: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/41252 > Assign a name to the error class _LEGACY_ERROR_TEMP_2403 > > > Key: SPARK-43651 > URL: https://issues.apache.org/jira/browse/SPARK-43651 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: jiaan.geng >Priority: Major
[jira] [Commented] (SPARK-43651) Assign a name to the error class _LEGACY_ERROR_TEMP_2403
[ https://issues.apache.org/jira/browse/SPARK-43651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725167#comment-17725167 ] Snoot.io commented on SPARK-43651: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/41252 > Assign a name to the error class _LEGACY_ERROR_TEMP_2403 > > > Key: SPARK-43651 > URL: https://issues.apache.org/jira/browse/SPARK-43651 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: jiaan.geng >Priority: Major
[jira] [Commented] (SPARK-43649) Assign a name to the error class _LEGACY_ERROR_TEMP_2401
[ https://issues.apache.org/jira/browse/SPARK-43649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725166#comment-17725166 ] Snoot.io commented on SPARK-43649: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/41252 > Assign a name to the error class _LEGACY_ERROR_TEMP_2401 > > > Key: SPARK-43649 > URL: https://issues.apache.org/jira/browse/SPARK-43649 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: jiaan.geng >Priority: Major
[jira] [Commented] (SPARK-43650) Assign a name to the error class _LEGACY_ERROR_TEMP_2402
[ https://issues.apache.org/jira/browse/SPARK-43650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725165#comment-17725165 ] Snoot.io commented on SPARK-43650: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/41252 > Assign a name to the error class _LEGACY_ERROR_TEMP_2402 > > > Key: SPARK-43650 > URL: https://issues.apache.org/jira/browse/SPARK-43650 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: jiaan.geng >Priority: Major
[jira] [Commented] (SPARK-41532) DF operations that involve multiple data frames should fail if sessions don't match
[ https://issues.apache.org/jira/browse/SPARK-41532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725164#comment-17725164 ] Snoot.io commented on SPARK-41532: -- User 'Hisoka-X' has created a pull request for this issue: https://github.com/apache/spark/pull/41259 > DF operations that involve multiple data frames should fail if sessions don't > match > --- > > Key: SPARK-41532 > URL: https://issues.apache.org/jira/browse/SPARK-41532 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Assignee: Jia Fan >Priority: Major > Fix For: 3.5.0 > > > We do not support joining for example two data frames from different Spark > Connect Sessions. To avoid exceptions, the client should clearly fail when it > tries to construct such a composition.
[jira] [Commented] (SPARK-43649) Assign a name to the error class _LEGACY_ERROR_TEMP_2401
[ https://issues.apache.org/jira/browse/SPARK-43649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725163#comment-17725163 ] Snoot.io commented on SPARK-43649: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/41252 > Assign a name to the error class _LEGACY_ERROR_TEMP_2401 > > > Key: SPARK-43649 > URL: https://issues.apache.org/jira/browse/SPARK-43649 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: jiaan.geng >Priority: Major
[jira] [Commented] (SPARK-43604) Refactor `INVALID_SQL_SYNTAX` for avoiding to embed error's text in source code
[ https://issues.apache.org/jira/browse/SPARK-43604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725162#comment-17725162 ] Snoot.io commented on SPARK-43604: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/41254 > Refactor `INVALID_SQL_SYNTAX` for avoiding to embed error's text in source > code > --- > > Key: SPARK-43604 > URL: https://issues.apache.org/jira/browse/SPARK-43604 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Priority: Minor > > As discussed in PR(https://github.com/apache/spark/pull/41214), embedding error > text in source code is unfriendly, for example it hinders the internationalization > of error messages.
[jira] [Updated] (SPARK-43739) Upgrade commons-io to 2.12.0
[ https://issues.apache.org/jira/browse/SPARK-43739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BingKun Pan updated SPARK-43739: Summary: Upgrade commons-io to 2.12.0 (was: Upgrade commons-io & commons-crypto to newest version) > Upgrade commons-io to 2.12.0 > > > Key: SPARK-43739 > URL: https://issues.apache.org/jira/browse/SPARK-43739 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Priority: Minor
[jira] [Updated] (SPARK-43739) Upgrade commons-io & commons-crypto to newest version
[ https://issues.apache.org/jira/browse/SPARK-43739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BingKun Pan updated SPARK-43739: Summary: Upgrade commons-io & commons-crypto to newest version (was: Upgrade 'commons-io' & 'commons-crypto' to newest version) > Upgrade commons-io & commons-crypto to newest version > - > > Key: SPARK-43739 > URL: https://issues.apache.org/jira/browse/SPARK-43739 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Priority: Minor
[jira] [Updated] (SPARK-43739) Upgrade 'commons-io' & 'commons-crypto' to newest version
[ https://issues.apache.org/jira/browse/SPARK-43739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BingKun Pan updated SPARK-43739: Summary: Upgrade 'commons-io' & 'commons-crypto' to newest version (was: Upgrade `commons-io` & `commons-crypto` to newest version) > Upgrade 'commons-io' & 'commons-crypto' to newest version > - > > Key: SPARK-43739 > URL: https://issues.apache.org/jira/browse/SPARK-43739 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Priority: Minor
[jira] [Created] (SPARK-43739) Upgrade `commons-io` & `commons-crypto` to newest version
BingKun Pan created SPARK-43739: --- Summary: Upgrade `commons-io` & `commons-crypto` to newest version Key: SPARK-43739 URL: https://issues.apache.org/jira/browse/SPARK-43739 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.5.0 Reporter: BingKun Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43738) Upgrade dropwizard metrics 4.2.18
BingKun Pan created SPARK-43738: --- Summary: Upgrade dropwizard metrics 4.2.18 Key: SPARK-43738 URL: https://issues.apache.org/jira/browse/SPARK-43738 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.5.0 Reporter: BingKun Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42996) Assign JIRA tickets and add comments for all failing tests.
[ https://issues.apache.org/jira/browse/SPARK-42996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-42996. --- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41255 [https://github.com/apache/spark/pull/41255] > Assign JIRA tickets and add comments for all failing tests. > --- > > Key: SPARK-42996 > URL: https://issues.apache.org/jira/browse/SPARK-42996 > Project: Spark > Issue Type: Sub-task > Components: Connect, Pandas API on Spark >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Fix For: 3.5.0 > > > Adding details to parity tests instead of just "Fails in Spark Connect, > should enable". -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42996) Assign JIRA tickets and add comments for all failing tests.
[ https://issues.apache.org/jira/browse/SPARK-42996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-42996: - Assignee: Haejoon Lee > Assign JIRA tickets and add comments for all failing tests. > --- > > Key: SPARK-42996 > URL: https://issues.apache.org/jira/browse/SPARK-42996 > Project: Spark > Issue Type: Sub-task > Components: Connect, Pandas API on Spark >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > > Adding details to parity tests instead of just "Fails in Spark Connect, > should enable". -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43737) Upgrade zstd-jni to 1.5.5-3
BingKun Pan created SPARK-43737: --- Summary: Upgrade zstd-jni to 1.5.5-3 Key: SPARK-43737 URL: https://issues.apache.org/jira/browse/SPARK-43737 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.5.0 Reporter: BingKun Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43514) Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML features caused by certain SQL functions
[ https://issues.apache.org/jira/browse/SPARK-43514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725144#comment-17725144 ] Ritika Maheshwari commented on SPARK-43514: --- Unable to reproduce the error on 3.4.0 See the attachment. !Screen Shot 2023-05-22 at 5.39.55 PM.png! > Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML > features caused by certain SQL functions > -- > > Key: SPARK-43514 > URL: https://issues.apache.org/jira/browse/SPARK-43514 > Project: Spark > Issue Type: Bug > Components: ML, SQL >Affects Versions: 3.3.2, 3.4.0 > Environment: Scala version: 2.12.17 > Test examples were executed inside Zeppelin 0.10.1 with Spark 3.4.0. > Spark 3.3.2 deployed on cluster was used to check the issue on real data. >Reporter: Svyatoslav Semenyuk >Priority: Major > Labels: ml, sql > Attachments: Screen Shot 2023-05-22 at 5.39.55 PM.png > > > We designed a function that joins two DFs on common column with some > similarity. All next code will be on Scala 2.12. > I've added {{show}} calls for demonstration purposes. > {code:scala} > import org.apache.spark.ml.Pipeline > import org.apache.spark.ml.feature.{HashingTF, MinHashLSH, NGram, > RegexTokenizer, MinHashLSHModel} > import org.apache.spark.sql.{DataFrame, Column} > /** > * Joins two data frames on a string column using LSH algorithm > * for similarity computation. > * > * If input data frames have columns with identical names, > * the resulting dataframe will have columns from them both > * with prefixes `datasetA` and `datasetB` respectively. > * > * For example, if both dataframes have a column with name `myColumn`, > * then the result will have columns `datasetAMyColumn` and > `datasetBMyColumn`. 
> */ > def similarityJoin( > df: DataFrame, > anotherDf: DataFrame, > joinExpr: String, > threshold: Double = 0.8, > ): DataFrame = { > df.show(false) > anotherDf.show(false) > val pipeline = new Pipeline().setStages(Array( > new RegexTokenizer() > .setPattern("") > .setMinTokenLength(1) > .setInputCol(joinExpr) > .setOutputCol("tokens"), > new NGram().setN(3).setInputCol("tokens").setOutputCol("ngrams"), > new HashingTF().setInputCol("ngrams").setOutputCol("vectors"), > new MinHashLSH().setInputCol("vectors").setOutputCol("lsh"), > ) > ) > val model = pipeline.fit(df) > val storedHashed = model.transform(df) > val landedHashed = model.transform(anotherDf) > val commonColumns = df.columns.toSet & anotherDf.columns.toSet > /** > * Converts column name from a data frame to the column of resulting > dataset. > */ > def convertColumn(datasetName: String)(columnName: String): Column = { > val newName = > if (commonColumns.contains(columnName)) > s"$datasetName${columnName.capitalize}" > else columnName > col(s"$datasetName.$columnName") as newName > } > val columnsToSelect = df.columns.map(convertColumn("datasetA")) ++ > anotherDf.columns.map(convertColumn("datasetB")) > val result = model > .stages > .last > .asInstanceOf[MinHashLSHModel] > .approxSimilarityJoin(storedHashed, landedHashed, threshold, > "confidence") > .select(columnsToSelect.toSeq: _*) > result.show(false) > result > } > {code} > Now consider such simple example: > {code:scala} > val inputDF1 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df1" > val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2" > similarityJoin(inputDF1, inputDF2, "name", 0.6) > {code} > This example runs with no errors and outputs 3 empty DFs. 
Let's add > {{distinct}} method to one data frame: > {code:scala} > val inputDF1 = Seq("", null).toDF("name").distinct().filter(length($"name") > > 2) as "df1" > val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2" > similarityJoin(inputDF1, inputDF2, "name", 0.6) > {code} > This example outputs two empty DFs and then fails at {{result.show(false)}}. > Error: > {code:none} > org.apache.spark.SparkException: [FAILED_EXECUTE_UDF] Failed to execute user > defined function (LSHModel$$Lambda$3769/0x000101804840: > (struct<type:tinyint,size:int,indices:array<int>,values:array<double>>) => > array<struct<type:tinyint,size:int,indices:array<int>,values:array<double>>>). > ... many elided > Caused by: java.lang.IllegalArgumentException: requirement failed: Must have > at least 1 non zero entry. > at scala.Predef$.require(Predef.scala:281) > at > org.apache.spark.ml.feature.MinHashLSHModel.hashFunction(MinHashLSH.scala:61) > at org.apache.spark.ml.feature.LSHModel.$anonfun$transform$1(LSH.scala:99) > ... many more > {code}
[jira] [Updated] (SPARK-43514) Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML features caused by certain SQL functions
[ https://issues.apache.org/jira/browse/SPARK-43514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ritika Maheshwari updated SPARK-43514: -- Attachment: Screen Shot 2023-05-22 at 5.39.55 PM.png > Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML > features caused by certain SQL functions > -- > > Key: SPARK-43514 > URL: https://issues.apache.org/jira/browse/SPARK-43514 > Project: Spark > Issue Type: Bug > Components: ML, SQL >Affects Versions: 3.3.2, 3.4.0 > Environment: Scala version: 2.12.17 > Test examples were executed inside Zeppelin 0.10.1 with Spark 3.4.0. > Spark 3.3.2 deployed on cluster was used to check the issue on real data. >Reporter: Svyatoslav Semenyuk >Priority: Major > Labels: ml, sql > Attachments: Screen Shot 2023-05-22 at 5.39.55 PM.png > > > We designed a function that joins two DFs on common column with some > similarity. All next code will be on Scala 2.12. > I've added {{show}} calls for demonstration purposes. > {code:scala} > import org.apache.spark.ml.Pipeline > import org.apache.spark.ml.feature.{HashingTF, MinHashLSH, NGram, > RegexTokenizer, MinHashLSHModel} > import org.apache.spark.sql.{DataFrame, Column} > /** > * Joins two data frames on a string column using LSH algorithm > * for similarity computation. > * > * If input data frames have columns with identical names, > * the resulting dataframe will have columns from them both > * with prefixes `datasetA` and `datasetB` respectively. > * > * For example, if both dataframes have a column with name `myColumn`, > * then the result will have columns `datasetAMyColumn` and > `datasetBMyColumn`. 
> */ > def similarityJoin( > df: DataFrame, > anotherDf: DataFrame, > joinExpr: String, > threshold: Double = 0.8, > ): DataFrame = { > df.show(false) > anotherDf.show(false) > val pipeline = new Pipeline().setStages(Array( > new RegexTokenizer() > .setPattern("") > .setMinTokenLength(1) > .setInputCol(joinExpr) > .setOutputCol("tokens"), > new NGram().setN(3).setInputCol("tokens").setOutputCol("ngrams"), > new HashingTF().setInputCol("ngrams").setOutputCol("vectors"), > new MinHashLSH().setInputCol("vectors").setOutputCol("lsh"), > ) > ) > val model = pipeline.fit(df) > val storedHashed = model.transform(df) > val landedHashed = model.transform(anotherDf) > val commonColumns = df.columns.toSet & anotherDf.columns.toSet > /** > * Converts column name from a data frame to the column of resulting > dataset. > */ > def convertColumn(datasetName: String)(columnName: String): Column = { > val newName = > if (commonColumns.contains(columnName)) > s"$datasetName${columnName.capitalize}" > else columnName > col(s"$datasetName.$columnName") as newName > } > val columnsToSelect = df.columns.map(convertColumn("datasetA")) ++ > anotherDf.columns.map(convertColumn("datasetB")) > val result = model > .stages > .last > .asInstanceOf[MinHashLSHModel] > .approxSimilarityJoin(storedHashed, landedHashed, threshold, > "confidence") > .select(columnsToSelect.toSeq: _*) > result.show(false) > result > } > {code} > Now consider such simple example: > {code:scala} > val inputDF1 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df1" > val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2" > similarityJoin(inputDF1, inputDF2, "name", 0.6) > {code} > This example runs with no errors and outputs 3 empty DFs. 
Let's add > {{distinct}} method to one data frame: > {code:scala} > val inputDF1 = Seq("", null).toDF("name").distinct().filter(length($"name") > > 2) as "df1" > val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2" > similarityJoin(inputDF1, inputDF2, "name", 0.6) > {code} > This example outputs two empty DFs and then fails at {{result.show(false)}}. > Error: > {code:none} > org.apache.spark.SparkException: [FAILED_EXECUTE_UDF] Failed to execute user > defined function (LSHModel$$Lambda$3769/0x000101804840: > (struct<type:tinyint,size:int,indices:array<int>,values:array<double>>) => > array<struct<type:tinyint,size:int,indices:array<int>,values:array<double>>>). > ... many elided > Caused by: java.lang.IllegalArgumentException: requirement failed: Must have > at least 1 non zero entry. > at scala.Predef$.require(Predef.scala:281) > at > org.apache.spark.ml.feature.MinHashLSHModel.hashFunction(MinHashLSH.scala:61) > at org.apache.spark.ml.feature.LSHModel.$anonfun$transform$1(LSH.scala:99) > ... many more > {code}
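The failure quoted above can be reproduced in miniature without Spark. MinHashLSHModel.hashFunction rejects vectors with no non-zero entries, and an empty string tokenizes to no 3-grams, so HashingTF emits an all-zero vector. A plain-Python sketch of that requirement check — the coefficients `a` and `b` below are illustrative stand-ins, not Spark's actual randomly drawn values (Spark's implementation does use the prime 2038074743):

```python
# Illustrative sketch of why MinHashLSH rejects all-zero vectors.
# min_hash() mimics MinHashLSHModel.hashFunction's precondition; the
# hash coefficients are made up for the example.

HASH_PRIME = 2038074743  # the prime MinHashLSH uses internally

def min_hash(active_indices, a=7, b=13):
    """MinHash over the set of non-zero indices of a sparse vector."""
    if not active_indices:
        # Mirrors the `require` that raises
        # "requirement failed: Must have at least 1 non zero entry."
        raise ValueError("Must have at least 1 non zero entry.")
    return min(((1 + i) * a + b) % HASH_PRIME for i in active_indices)

# An empty string -> no tokens -> no 3-grams -> all-zero TF vector:
tokens = list("")                                         # RegexTokenizer on ""
ngrams = [tokens[i:i + 3] for i in range(len(tokens) - 2)]  # no 3-grams
active = list(range(len(ngrams)))                         # no non-zero indices
```

Calling `min_hash(active)` here raises exactly the error in the stack trace. This suggests the `distinct()` variant fails because the optimizer may reorder the `length($"name") > 2` filter relative to the UDF evaluation, letting the LSH hash function see the empty-string row — a hypothesis consistent with the report, not a confirmed root cause.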
[jira] [Commented] (SPARK-43718) References to a specific side's key in a USING join can have wrong nullability
[ https://issues.apache.org/jira/browse/SPARK-43718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725143#comment-17725143 ] Bruce Robbins commented on SPARK-43718: --- PR here: https://github.com/apache/spark/pull/41267 > References to a specific side's key in a USING join can have wrong nullability > -- > > Key: SPARK-43718 > URL: https://issues.apache.org/jira/browse/SPARK-43718 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.2, 3.4.0, 3.5.0 >Reporter: Bruce Robbins >Priority: Major > Labels: correctness > > Assume this data: > {noformat} > create or replace temp view t1 as values (1), (2), (3) as (c1); > create or replace temp view t2 as values (2), (3), (4) as (c1); > {noformat} > The following query produces incorrect results: > {noformat} > spark-sql (default)> select explode(array(t1.c1, t2.c1)) as x1 > from t1 > full outer join t2 > using (c1); > 1 > -1 <== should be null > 2 > 2 > 3 > 3 > -1 <== should be null > 4 > Time taken: 0.663 seconds, Fetched 8 row(s) > spark-sql (default)> > {noformat} > Similar issues occur with right outer join and left outer join. > {{t1.c1}} and {{t2.c1}} have the wrong nullability at the time the array is > resolved, so the array's {{containsNull}} value is incorrect. > Queries that don't use arrays also can get wrong results. 
Assume this data: > {noformat} > create or replace temp view t1 as values (0), (1), (2) as (c1); > create or replace temp view t2 as values (1), (2), (3) as (c1); > create or replace temp view t3 as values (1, 2), (3, 4), (4, 5) as (a, b); > {noformat} > The following query produces incorrect results: > {noformat} > select t1.c1 as t1_c1, t2.c1 as t2_c1, b > from t1 > full outer join t2 > using (c1), > lateral ( > select b > from t3 > where a = coalesce(t2.c1, 1) > ) lt3; > 1 1 2 > NULL 3 4 > Time taken: 2.395 seconds, Fetched 2 row(s) > spark-sql (default)> > {noformat} > The result should be the following: > {noformat} > 0 NULL2 > 1 1 2 > NULL 3 4 > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
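The expected null behavior in the first query can be made concrete with a small emulation. `full_outer_join_using` below is a hypothetical plain-Python stand-in for the `FULL OUTER JOIN ... USING (c1)` above; per the report, a reference to a specific side's key must be null (None) on rows where that side had no match, which is what exploding `array(t1.c1, t2.c1)` should yield:

```python
def full_outer_join_using(left, right):
    """Emulate FULL OUTER JOIN USING on single-column rows.
    Returns (left_c1, right_c1) pairs; None marks the unmatched side."""
    out = []
    for v in left:
        out.append((v, v if v in right else None))
    for v in right:
        if v not in left:
            out.append((None, v))
    return out

rows = full_outer_join_using([1, 2, 3], [2, 3, 4])
# Exploding array(t1.c1, t2.c1) should therefore interleave the pair
# values, with None (not a bogus value) for the unmatched rows 1 and 4.
exploded = [x for pair in rows for x in pair]
```

This matches the expected output in the report: nulls for the unmatched keys 1 and 4, which the buggy nullability drops.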
[jira] [Updated] (SPARK-43546) Complete parity tests of Pandas UDF
[ https://issues.apache.org/jira/browse/SPARK-43546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-43546: - Summary: Complete parity tests of Pandas UDF (was: Complete Pandas UDF parity tests) > Complete parity tests of Pandas UDF > --- > > Key: SPARK-43546 > URL: https://issues.apache.org/jira/browse/SPARK-43546 > Project: Spark > Issue Type: Test > Components: Connect, PySpark >Affects Versions: 3.5.0 >Reporter: Xinrong Meng >Priority: Major > > Tests as shown below should be added to Connect. > test_pandas_udf_grouped_agg.py > test_pandas_udf_scalar.py > test_pandas_udf_window.py -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43735) Enable SeriesDateTimeTests.test_weekday for pandas 2.0.0.
Haejoon Lee created SPARK-43735: --- Summary: Enable SeriesDateTimeTests.test_weekday for pandas 2.0.0. Key: SPARK-43735 URL: https://issues.apache.org/jira/browse/SPARK-43735 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 3.5.0 Reporter: Haejoon Lee -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43733) Enable SeriesDateTimeTests.test_second for pandas 2.0.0.
Haejoon Lee created SPARK-43733: --- Summary: Enable SeriesDateTimeTests.test_second for pandas 2.0.0. Key: SPARK-43733 URL: https://issues.apache.org/jira/browse/SPARK-43733 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 3.5.0 Reporter: Haejoon Lee -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43736) Enable SeriesDateTimeTests.test_year for pandas 2.0.0.
Haejoon Lee created SPARK-43736: --- Summary: Enable SeriesDateTimeTests.test_year for pandas 2.0.0. Key: SPARK-43736 URL: https://issues.apache.org/jira/browse/SPARK-43736 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 3.5.0 Reporter: Haejoon Lee -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43734) Expression "(v)" within a window function doesn't raise an AnalysisException
Xinrong Meng created SPARK-43734: Summary: Expression "(v)" within a window function doesn't raise an AnalysisException Key: SPARK-43734 URL: https://issues.apache.org/jira/browse/SPARK-43734 Project: Spark Issue Type: Improvement Components: Connect, PySpark Affects Versions: 3.5.0 Reporter: Xinrong Meng Expression "(v)" within a window function doesn't raise an AnalysisException See PandasUDFWindowParityTests.test_invalid_args for reproduction. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43731) Enable SeriesDateTimeTests.test_month for pandas 2.0.0.
Haejoon Lee created SPARK-43731: --- Summary: Enable SeriesDateTimeTests.test_month for pandas 2.0.0. Key: SPARK-43731 URL: https://issues.apache.org/jira/browse/SPARK-43731 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 3.5.0 Reporter: Haejoon Lee -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43732) Enable SeriesDateTimeTests.test_quarter for pandas 2.0.0.
Haejoon Lee created SPARK-43732: --- Summary: Enable SeriesDateTimeTests.test_quarter for pandas 2.0.0. Key: SPARK-43732 URL: https://issues.apache.org/jira/browse/SPARK-43732 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 3.5.0 Reporter: Haejoon Lee -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43729) Enable SeriesDateTimeTests.test_microsecond for pandas 2.0.0.
Haejoon Lee created SPARK-43729: --- Summary: Enable SeriesDateTimeTests.test_microsecond for pandas 2.0.0. Key: SPARK-43729 URL: https://issues.apache.org/jira/browse/SPARK-43729 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 3.5.0 Reporter: Haejoon Lee -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43728) Enable SeriesDateTimeTests.test_hour for pandas 2.0.0.
Haejoon Lee created SPARK-43728: --- Summary: Enable SeriesDateTimeTests.test_hour for pandas 2.0.0. Key: SPARK-43728 URL: https://issues.apache.org/jira/browse/SPARK-43728 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 3.5.0 Reporter: Haejoon Lee -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43730) Enable SeriesDateTimeTests.test_minute for pandas 2.0.0.
Haejoon Lee created SPARK-43730: --- Summary: Enable SeriesDateTimeTests.test_minute for pandas 2.0.0. Key: SPARK-43730 URL: https://issues.apache.org/jira/browse/SPARK-43730 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 3.5.0 Reporter: Haejoon Lee -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43727) Parity returnType check in Spark Connect
Xinrong Meng created SPARK-43727: Summary: Parity returnType check in Spark Connect Key: SPARK-43727 URL: https://issues.apache.org/jira/browse/SPARK-43727 Project: Spark Issue Type: Sub-task Components: Connect, PySpark Affects Versions: 3.5.0 Reporter: Xinrong Meng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43726) Enable SeriesDateTimeTests.test_daysinmonth for pandas 2.0.0.
Haejoon Lee created SPARK-43726: --- Summary: Enable SeriesDateTimeTests.test_daysinmonth for pandas 2.0.0. Key: SPARK-43726 URL: https://issues.apache.org/jira/browse/SPARK-43726 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 3.5.0 Reporter: Haejoon Lee -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43725) Enable SeriesDateTimeTests.test_days_in_month for pandas 2.0.0.
Haejoon Lee created SPARK-43725: --- Summary: Enable SeriesDateTimeTests.test_days_in_month for pandas 2.0.0. Key: SPARK-43725 URL: https://issues.apache.org/jira/browse/SPARK-43725 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 3.5.0 Reporter: Haejoon Lee -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43724) Enable SeriesDateTimeTests.test_dayofyear for pandas 2.0.0.
Haejoon Lee created SPARK-43724: --- Summary: Enable SeriesDateTimeTests.test_dayofyear for pandas 2.0.0. Key: SPARK-43724 URL: https://issues.apache.org/jira/browse/SPARK-43724 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 3.5.0 Reporter: Haejoon Lee -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43723) Enable SeriesDateTimeTests.test_dayofweek for pandas 2.0.0.
Haejoon Lee created SPARK-43723: --- Summary: Enable SeriesDateTimeTests.test_dayofweek for pandas 2.0.0. Key: SPARK-43723 URL: https://issues.apache.org/jira/browse/SPARK-43723 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 3.5.0 Reporter: Haejoon Lee -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43722) Enable SeriesDateTimeTests.test_day for pandas 2.0.0.
[ https://issues.apache.org/jira/browse/SPARK-43722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-43722: Summary: Enable SeriesDateTimeTests.test_day for pandas 2.0.0. (was: Enable SeriesTests.test_day for pandas 2.0.0.) > Enable SeriesDateTimeTests.test_day for pandas 2.0.0. > - > > Key: SPARK-43722 > URL: https://issues.apache.org/jira/browse/SPARK-43722 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43721) Enable DataFramePlotMatplotlibTests.test_kde_plot for pandas 2.0.0.
Haejoon Lee created SPARK-43721: --- Summary: Enable DataFramePlotMatplotlibTests.test_kde_plot for pandas 2.0.0. Key: SPARK-43721 URL: https://issues.apache.org/jira/browse/SPARK-43721 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 3.5.0 Reporter: Haejoon Lee -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43722) Enable SeriesTests.test_day for pandas 2.0.0.
Haejoon Lee created SPARK-43722: --- Summary: Enable SeriesTests.test_day for pandas 2.0.0. Key: SPARK-43722 URL: https://issues.apache.org/jira/browse/SPARK-43722 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 3.5.0 Reporter: Haejoon Lee -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43720) Enable DataFramePlotMatplotlibTests.test_hist_plot for pandas 2.0.0.
Haejoon Lee created SPARK-43720: --- Summary: Enable DataFramePlotMatplotlibTests.test_hist_plot for pandas 2.0.0. Key: SPARK-43720 URL: https://issues.apache.org/jira/browse/SPARK-43720 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 3.5.0 Reporter: Haejoon Lee -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43715) Add spark DataFrame binary file reader / writer
[ https://issues.apache.org/jira/browse/SPARK-43715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu reassigned SPARK-43715: -- Assignee: Weichen Xu > Add spark DataFrame binary file reader / writer > --- > > Key: SPARK-43715 > URL: https://issues.apache.org/jira/browse/SPARK-43715 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 3.5.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > > In new distributed spark ML module (designed to support spark connect and > support local inference) > We need to save ML model to hadoop file system using custom binary file > format, the reason is: > * We often submit a spark application to spark cluster for running the > training model job, we need to save trained model to hadoop file system > before the spark application completes. > * But we want to support local model inference, that means if we save the > model by current spark DataFrame writer (e.g. parquet format), when loading > model we have to rely on the spark service. But we hope we can load model > without spark service. So we want the model being saved as the original > binary format that our ML code can handle. > so we need to add a DataFrame reader / writer format, that can load / save > binary files, the API is like: > > {*}Writer API{*}: > Supposing we have a dataframe with schema: > [file_path: String, content: binary], > we can save the dataframe to a hadoop path, each row we will save it as a > file under the hadoop path, the saved file path is \{hadoop > path}/\{file_path}, "file_path" can be a multiple part path. > > {*}Reader API{*}: > `spark.read.format("binaryFileV2").load(...)` > > It will return a spark dataframe , each row contains the file path and the > file content binary string. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43538) Spark Homebrew Formulae currently depends on non-officially-supported Java 20
[ https://issues.apache.org/jira/browse/SPARK-43538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-43538: Issue Type: Bug (was: Request) > Spark Homebrew Formulae currently depends on non-officially-supported Java 20 > - > > Key: SPARK-43538 > URL: https://issues.apache.org/jira/browse/SPARK-43538 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 3.2.4, 3.3.2, 3.4.0 > Environment: Homebrew (e.g., macOS) >Reporter: Ghislain Fourny >Assignee: Yuming Wang >Priority: Minor > Fix For: 3.4.0 > > > I am not sure if homebrew-related issues can also be reported here? The > Homebrew formula for apache-spark runs on (latest) openjdk 20. > [https://formulae.brew.sh/formula/apache-spark] > However, Apache Spark is documented to work with Java 8/11/17: > [https://spark.apache.org/docs/latest/] > Is this an oversight, or is Java 20 officially supported, too? > Thanks! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43538) Spark Homebrew Formulae currently depends on non-officially-supported Java 20
[ https://issues.apache.org/jira/browse/SPARK-43538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-43538. - Fix Version/s: 3.4.0 Assignee: Yuming Wang Resolution: Fixed Issue resolved by pull request [https://github.com/Homebrew/homebrew-core/pull/131189]. Please reinstall it if you have installed the Spark Homebrew formula. > Spark Homebrew Formulae currently depends on non-officially-supported Java 20 > - > > Key: SPARK-43538 > URL: https://issues.apache.org/jira/browse/SPARK-43538 > Project: Spark > Issue Type: Request > Components: Java API >Affects Versions: 3.2.4, 3.3.2, 3.4.0 > Environment: Homebrew (e.g., macOS) >Reporter: Ghislain Fourny >Assignee: Yuming Wang >Priority: Minor > Fix For: 3.4.0 > > > I am not sure if homebrew-related issues can also be reported here? The > Homebrew formula for apache-spark runs on (latest) openjdk 20. > [https://formulae.brew.sh/formula/apache-spark] > However, Apache Spark is documented to work with Java 8/11/17: > [https://spark.apache.org/docs/latest/] > Is this an oversight, or is Java 20 officially supported, too? > Thanks! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43719) Handle missing row.excludedInStages field
Dongjoon Hyun created SPARK-43719: - Summary: Handle missing row.excludedInStages field Key: SPARK-43719 URL: https://issues.apache.org/jira/browse/SPARK-43719 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.4.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40500) Use `pd.items` instead of `pd.iteritems`
[ https://issues.apache.org/jira/browse/SPARK-40500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725129#comment-17725129 ] Saiwing Yeung commented on SPARK-40500: --- This would also be useful for me. If it's useful I can make a PR that patches 3.2 (same scope as https://github.com/apache/spark/pull/37947). > Use `pd.items` instead of `pd.iteritems` > > > Key: SPARK-40500 > URL: https://issues.apache.org/jira/browse/SPARK-40500 > Project: Spark > Issue Type: Improvement > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Minor > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43718) References to a specific side's key in a USING join can have wrong nullability
[ https://issues.apache.org/jira/browse/SPARK-43718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-43718: -- Description: Assume this data: {noformat} create or replace temp view t1 as values (1), (2), (3) as (c1); create or replace temp view t2 as values (2), (3), (4) as (c1); {noformat} The following query produces incorrect results: {noformat} spark-sql (default)> select explode(array(t1.c1, t2.c1)) as x1 from t1 full outer join t2 using (c1); 1 -1 <== should be null 2 2 3 3 -1 <== should be null 4 Time taken: 0.663 seconds, Fetched 8 row(s) spark-sql (default)> {noformat} Similar issues occur with right outer join and left outer join. {{t1.c1}} and {{t2.c1}} have the wrong nullability at the time the array is resolved, so the array's {{containsNull}} value is incorrect. Queries that don't use arrays also can get wrong results. Assume this data: {noformat} create or replace temp view t1 as values (0), (1), (2) as (c1); create or replace temp view t2 as values (1), (2), (3) as (c1); create or replace temp view t3 as values (1, 2), (3, 4), (4, 5) as (a, b); {noformat} The following query produces incorrect results: {noformat} select t1.c1 as t1_c1, t2.c1 as t2_c1, b from t1 full outer join t2 using (c1), lateral ( select b from t3 where a = coalesce(t2.c1, 1) ) lt3; 1 1 2 NULL3 4 Time taken: 2.395 seconds, Fetched 2 row(s) spark-sql (default)> {noformat} The result should be the following: {noformat} 0 NULL2 1 1 2 NULL3 4 {noformat} was: Assume this data: {noformat} create or replace temp view t1 as values (1), (2), (3) as (c1); create or replace temp view t2 as values (2), (3), (4) as (c1); {noformat} The following query produces incorrect results: {noformat} spark-sql (default)> select explode(array(t1.c1, t2.c1)) as x1 from t1 full outer join t2 using (c1); 1 -1 <== should be null 2 2 3 3 -1 <== should be null 4 Time taken: 0.663 seconds, Fetched 8 row(s) spark-sql (default)> {noformat} Similar issues occur with 
right outer join and left outer join. {{t1.c1}} and {{t2.c1}} have the wrong nullability at the time the array is resolved, so the array's {{containsNull}} value is incorrect. > References to a specific side's key in a USING join can have wrong nullability > -- > > Key: SPARK-43718 > URL: https://issues.apache.org/jira/browse/SPARK-43718 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.2, 3.4.0, 3.5.0 >Reporter: Bruce Robbins >Priority: Major > Labels: correctness > > Assume this data: > {noformat} > create or replace temp view t1 as values (1), (2), (3) as (c1); > create or replace temp view t2 as values (2), (3), (4) as (c1); > {noformat} > The following query produces incorrect results: > {noformat} > spark-sql (default)> select explode(array(t1.c1, t2.c1)) as x1 > from t1 > full outer join t2 > using (c1); > 1 > -1 <== should be null > 2 > 2 > 3 > 3 > -1 <== should be null > 4 > Time taken: 0.663 seconds, Fetched 8 row(s) > spark-sql (default)> > {noformat} > Similar issues occur with right outer join and left outer join. > {{t1.c1}} and {{t2.c1}} have the wrong nullability at the time the array is > resolved, so the array's {{containsNull}} value is incorrect. > Queries that don't use arrays also can get wrong results. 
Assume this data: > {noformat} > create or replace temp view t1 as values (0), (1), (2) as (c1); > create or replace temp view t2 as values (1), (2), (3) as (c1); > create or replace temp view t3 as values (1, 2), (3, 4), (4, 5) as (a, b); > {noformat} > The following query produces incorrect results: > {noformat} > select t1.c1 as t1_c1, t2.c1 as t2_c1, b > from t1 > full outer join t2 > using (c1), > lateral ( > select b > from t3 > where a = coalesce(t2.c1, 1) > ) lt3; > 1 1 2 > NULL 3 4 > Time taken: 2.395 seconds, Fetched 2 row(s) > spark-sql (default)> > {noformat} > The result should be the following: > {noformat} > 0 NULL 2 > 1 1 2 > NULL 3 4 > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43718) References to a specific side's key in a USING join can have wrong nullability
[ https://issues.apache.org/jira/browse/SPARK-43718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-43718: -- Affects Version/s: 3.3.2 > References to a specific side's key in a USING join can have wrong nullability > -- > > Key: SPARK-43718 > URL: https://issues.apache.org/jira/browse/SPARK-43718 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.2, 3.4.0, 3.5.0 >Reporter: Bruce Robbins >Priority: Major > Labels: correctness > > Assume this data: > {noformat} > create or replace temp view t1 as values (1), (2), (3) as (c1); > create or replace temp view t2 as values (2), (3), (4) as (c1); > {noformat} > The following query produces incorrect results: > {noformat} > spark-sql (default)> select explode(array(t1.c1, t2.c1)) as x1 > from t1 > full outer join t2 > using (c1); > 1 > -1 <== should be null > 2 > 2 > 3 > 3 > -1 <== should be null > 4 > Time taken: 0.663 seconds, Fetched 8 row(s) > spark-sql (default)> > {noformat} > Similar issues occur with right outer join and left outer join. > {{t1.c1}} and {{t2.c1}} have the wrong nullability at the time the array is > resolved, so the array's {{containsNull}} value is incorrect. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43718) References to a specific side's key in a USING join can have wrong nullability
[ https://issues.apache.org/jira/browse/SPARK-43718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-43718: -- Affects Version/s: 3.4.0 > References to a specific side's key in a USING join can have wrong nullability > -- > > Key: SPARK-43718 > URL: https://issues.apache.org/jira/browse/SPARK-43718 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0, 3.5.0 >Reporter: Bruce Robbins >Priority: Major > Labels: correctness > > Assume this data: > {noformat} > create or replace temp view t1 as values (1), (2), (3) as (c1); > create or replace temp view t2 as values (2), (3), (4) as (c1); > {noformat} > The following query produces incorrect results: > {noformat} > spark-sql (default)> select explode(array(t1.c1, t2.c1)) as x1 > from t1 > full outer join t2 > using (c1); > 1 > -1 <== should be null > 2 > 2 > 3 > 3 > -1 <== should be null > 4 > Time taken: 0.663 seconds, Fetched 8 row(s) > spark-sql (default)> > {noformat} > Similar issues occur with right outer join and left outer join. > {{t1.c1}} and {{t2.c1}} have the wrong nullability at the time the array is > resolved, so the array's {{containsNull}} value is incorrect. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43718) References to a specific side's key in a USING join can have wrong nullability
[ https://issues.apache.org/jira/browse/SPARK-43718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725122#comment-17725122 ] Bruce Robbins commented on SPARK-43718: --- I think I have a handle on this. I will submit a PR in the coming days. > References to a specific side's key in a USING join can have wrong nullability > -- > > Key: SPARK-43718 > URL: https://issues.apache.org/jira/browse/SPARK-43718 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Bruce Robbins >Priority: Major > Labels: correctness > > Assume this data: > {noformat} > create or replace temp view t1 as values (1), (2), (3) as (c1); > create or replace temp view t2 as values (2), (3), (4) as (c1); > {noformat} > The following query produces incorrect results: > {noformat} > spark-sql (default)> select explode(array(t1.c1, t2.c1)) as x1 > from t1 > full outer join t2 > using (c1); > 1 > -1 <== should be null > 2 > 2 > 3 > 3 > -1 <== should be null > 4 > Time taken: 0.663 seconds, Fetched 8 row(s) > spark-sql (default)> > {noformat} > Similar issues occur with right outer join and left outer join. > {{t1.c1}} and {{t2.c1}} have the wrong nullability at the time the array is > resolved, so the array's {{containsNull}} value is incorrect. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43716) Revert scala-maven-plugin to 4.8.0
[ https://issues.apache.org/jira/browse/SPARK-43716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-43716: -- Summary: Revert scala-maven-plugin to 4.8.0 (was: Revert scala-maven-plugin upgrade) > Revert scala-maven-plugin to 4.8.0 > -- > > Key: SPARK-43716 > URL: https://issues.apache.org/jira/browse/SPARK-43716 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.5.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Major > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42809) Upgrade scala-maven-plugin from 4.8.0 to 4.8.1
[ https://issues.apache.org/jira/browse/SPARK-42809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725121#comment-17725121 ] Dongjoon Hyun commented on SPARK-42809: --- This is logically reverted via SPARK-43716 > Upgrade scala-maven-plugin from 4.8.0 to 4.8.1 > -- > > Key: SPARK-42809 > URL: https://issues.apache.org/jira/browse/SPARK-42809 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43716) Revert scala-maven-plugin upgrade
[ https://issues.apache.org/jira/browse/SPARK-43716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-43716. --- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41261 [https://github.com/apache/spark/pull/41261] > Revert scala-maven-plugin upgrade > - > > Key: SPARK-43716 > URL: https://issues.apache.org/jira/browse/SPARK-43716 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.5.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Major > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43716) Revert scala-maven-plugin upgrade
[ https://issues.apache.org/jira/browse/SPARK-43716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-43716: - Assignee: Bjørn Jørgensen > Revert scala-maven-plugin upgrade > - > > Key: SPARK-43716 > URL: https://issues.apache.org/jira/browse/SPARK-43716 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.5.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43718) References to a specific side's key in a USING join can have wrong nullability
[ https://issues.apache.org/jira/browse/SPARK-43718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-43718: -- Labels: correctness (was: ) > References to a specific side's key in a USING join can have wrong nullability > -- > > Key: SPARK-43718 > URL: https://issues.apache.org/jira/browse/SPARK-43718 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Bruce Robbins >Priority: Major > Labels: correctness > > Assume this data: > {noformat} > create or replace temp view t1 as values (1), (2), (3) as (c1); > create or replace temp view t2 as values (2), (3), (4) as (c1); > {noformat} > The following query produces incorrect results: > {noformat} > spark-sql (default)> select explode(array(t1.c1, t2.c1)) as x1 > from t1 > full outer join t2 > using (c1); > 1 > -1 <== should be null > 2 > 2 > 3 > 3 > -1 <== should be null > 4 > Time taken: 0.663 seconds, Fetched 8 row(s) > spark-sql (default)> > {noformat} > Similar issues occur with right outer join and left outer join. > {{t1.c1}} and {{t2.c1}} have the wrong nullability at the time the array is > resolved, so the array's {{containsNull}} value is incorrect. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43718) References to a specific side's key in a USING join can have wrong nullability
[ https://issues.apache.org/jira/browse/SPARK-43718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-43718: -- Description: Assume this data: {noformat} create or replace temp view t1 as values (1), (2), (3) as (c1); create or replace temp view t2 as values (2), (3), (4) as (c1); {noformat} The following query produces incorrect results: {noformat} spark-sql (default)> select explode(array(t1.c1, t2.c1)) as x1 from t1 full outer join t2 using (c1); 1 -1 <== should be null 2 2 3 3 -1 <== should be null 4 Time taken: 0.663 seconds, Fetched 8 row(s) spark-sql (default)> {noformat} Similar issues occur with right outer join and left outer join. {{t1.c1}} and {{t2.c1}} have the wrong nullability at the time the array is resolved, so the array's {{containsNull}} value is incorrect. was: Assume this data: {noformat} create or replace temp view t1 as values (1), (2), (3) as (c1); create or replace temp view t2 as values (2), (3), (4) as (c1); {noformat} The following query produces the wrong result: {noformat} spark-sql (default)> select explode(array(t1.c1, t2.c1)) as x1 from t1 full outer join t2 using (c1); 1 -1 <== should be null 2 2 3 3 -1 <== should be null 4 Time taken: 0.663 seconds, Fetched 8 row(s) spark-sql (default)> {noformat} Similar issues occur with right outer join and left outer join. {{t1.c1}} and {{t2.c1}} have the wrong nullability at the time the array is resolved, so the array's {{containsNull}} value is incorrect. 
> References to a specific side's key in a USING join can have wrong nullability > -- > > Key: SPARK-43718 > URL: https://issues.apache.org/jira/browse/SPARK-43718 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Bruce Robbins >Priority: Major > > Assume this data: > {noformat} > create or replace temp view t1 as values (1), (2), (3) as (c1); > create or replace temp view t2 as values (2), (3), (4) as (c1); > {noformat} > The following query produces incorrect results: > {noformat} > spark-sql (default)> select explode(array(t1.c1, t2.c1)) as x1 > from t1 > full outer join t2 > using (c1); > 1 > -1 <== should be null > 2 > 2 > 3 > 3 > -1 <== should be null > 4 > Time taken: 0.663 seconds, Fetched 8 row(s) > spark-sql (default)> > {noformat} > Similar issues occur with right outer join and left outer join. > {{t1.c1}} and {{t2.c1}} have the wrong nullability at the time the array is > resolved, so the array's {{containsNull}} value is incorrect. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43591) Assign a name to the error class _LEGACY_ERROR_TEMP_0013
[ https://issues.apache.org/jira/browse/SPARK-43591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-43591. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41236 [https://github.com/apache/spark/pull/41236] > Assign a name to the error class _LEGACY_ERROR_TEMP_0013 > > > Key: SPARK-43591 > URL: https://issues.apache.org/jira/browse/SPARK-43591 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43718) References to a specific side's key in a USING join can have wrong nullability
[ https://issues.apache.org/jira/browse/SPARK-43718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-43718: -- Description: Assume this data: {noformat} create or replace temp view t1 as values (1), (2), (3) as (c1); create or replace temp view t2 as values (2), (3), (4) as (c1); {noformat} The following query produces the wrong result: {noformat} spark-sql (default)> select explode(array(t1.c1, t2.c1)) as x1 from t1 full outer join t2 using (c1); 1 -1 <== should be null 2 2 3 3 -1 <== should be null 4 Time taken: 0.663 seconds, Fetched 8 row(s) spark-sql (default)> {noformat} Similar issues occur with right outer join and left outer join. {{t1.c1}} and {{t2.c1}} have the wrong nullability at the time the array is resolved, so the array's {{containsNull}} value is incorrect. was: Assume this data: {noformat} create or replace temp view t1 as values (1), (2), (3) as (c1); create or replace temp view t2 as values (2), (3), (4) as (c1); {noformat} The following query produces the wrong result: {noformat} spark-sql (default)> select explode(array(t1.c1, t2.c1)) as x1 from t1 full outer join t2 using (c1); 1 -1 <== should be null 2 2 3 3 -1 <== should be null 4 Time taken: 0.663 seconds, Fetched 8 row(s) spark-sql (default)> {noformat} Similar issues occur with right outer join and left outer join. 
> References to a specific side's key in a USING join can have wrong nullability > -- > > Key: SPARK-43718 > URL: https://issues.apache.org/jira/browse/SPARK-43718 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Bruce Robbins >Priority: Major > > Assume this data: > {noformat} > create or replace temp view t1 as values (1), (2), (3) as (c1); > create or replace temp view t2 as values (2), (3), (4) as (c1); > {noformat} > The following query produces the wrong result: > {noformat} > spark-sql (default)> select explode(array(t1.c1, t2.c1)) as x1 > from t1 > full outer join t2 > using (c1); > 1 > -1 <== should be null > 2 > 2 > 3 > 3 > -1 <== should be null > 4 > Time taken: 0.663 seconds, Fetched 8 row(s) > spark-sql (default)> > {noformat} > Similar issues occur with right outer join and left outer join. > {{t1.c1}} and {{t2.c1}} have the wrong nullability at the time the array is > resolved, so the array's {{containsNull}} value is incorrect. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43591) Assign a name to the error class _LEGACY_ERROR_TEMP_0013
[ https://issues.apache.org/jira/browse/SPARK-43591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-43591: Assignee: BingKun Pan > Assign a name to the error class _LEGACY_ERROR_TEMP_0013 > > > Key: SPARK-43591 > URL: https://issues.apache.org/jira/browse/SPARK-43591 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43583) When encryption is enabled on the External Shuffle Service, then processing of push meta requests throws NPE
[ https://issues.apache.org/jira/browse/SPARK-43583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-43583: -- Affects Version/s: 3.4.0 3.3.2 3.2.4 (was: 3.2.0) > When encryption is enabled on the External Shuffle Service, then processing > of push meta requests throws NPE > > > Key: SPARK-43583 > URL: https://issues.apache.org/jira/browse/SPARK-43583 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 3.2.4, 3.3.2, 3.4.0 >Reporter: Chandni Singh >Assignee: Chandni Singh >Priority: Major > Fix For: 3.5.0 > > > After enabling support for over-the-wire encryption for spark shuffle > services, the meta requests for push-merged blocks fail with this error: > {code:java} > java.lang.RuntimeException: java.lang.NullPointerException > at > org.apache.spark.network.server.AbstractAuthRpcHandler.getMergedBlockMetaReqHandler(AbstractAuthRpcHandler.java:110) > at > org.apache.spark.network.crypto.AuthRpcHandler.getMergedBlockMetaReqHandler(AuthRpcHandler.java:144) > at > org.apache.spark.network.server.TransportRequestHandler.processMergedBlockMetaRequest(TransportRequestHandler.java:275) > at > org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:117) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53) > at > org.sparkproject.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99) > at > org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) > at > org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) > at > org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) 
> at > org.sparkproject.io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286) > > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43718) References to a specific side's key in a USING join can have wrong nullability
Bruce Robbins created SPARK-43718: - Summary: References to a specific side's key in a USING join can have wrong nullability Key: SPARK-43718 URL: https://issues.apache.org/jira/browse/SPARK-43718 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.0 Reporter: Bruce Robbins Assume this data: {noformat} create or replace temp view t1 as values (1), (2), (3) as (c1); create or replace temp view t2 as values (2), (3), (4) as (c1); {noformat} The following query produces the wrong result: {noformat} spark-sql (default)> select explode(array(t1.c1, t2.c1)) as x1 from t1 full outer join t2 using (c1); 1 -1 <== should be null 2 2 3 3 -1 <== should be null 4 Time taken: 0.663 seconds, Fetched 8 row(s) spark-sql (default)> {noformat} Similar issues occur with right outer join and left outer join. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43583) When encryption is enabled on the External Shuffle Service, then processing of push meta requests throws NPE
[ https://issues.apache.org/jira/browse/SPARK-43583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-43583: - Assignee: Chandni Singh > When encryption is enabled on the External Shuffle Service, then processing > of push meta requests throws NPE > > > Key: SPARK-43583 > URL: https://issues.apache.org/jira/browse/SPARK-43583 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 3.2.0 >Reporter: Chandni Singh >Assignee: Chandni Singh >Priority: Major > > After enabling support for over-the-wire encryption for spark shuffle > services, the meta requests for push-merged blocks fail with this error: > {code:java} > java.lang.RuntimeException: java.lang.NullPointerException > at > org.apache.spark.network.server.AbstractAuthRpcHandler.getMergedBlockMetaReqHandler(AbstractAuthRpcHandler.java:110) > at > org.apache.spark.network.crypto.AuthRpcHandler.getMergedBlockMetaReqHandler(AuthRpcHandler.java:144) > at > org.apache.spark.network.server.TransportRequestHandler.processMergedBlockMetaRequest(TransportRequestHandler.java:275) > at > org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:117) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53) > at > org.sparkproject.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99) > at > org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) > at > org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) > at > org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) > at > 
org.sparkproject.io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286) > > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43583) When encryption is enabled on the External Shuffle Service, then processing of push meta requests throws NPE
[ https://issues.apache.org/jira/browse/SPARK-43583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-43583. --- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41225 [https://github.com/apache/spark/pull/41225] > When encryption is enabled on the External Shuffle Service, then processing > of push meta requests throws NPE > > > Key: SPARK-43583 > URL: https://issues.apache.org/jira/browse/SPARK-43583 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 3.2.0 >Reporter: Chandni Singh >Assignee: Chandni Singh >Priority: Major > Fix For: 3.5.0 > > > After enabling support for over-the-wire encryption for spark shuffle > services, the meta requests for push-merged blocks fail with this error: > {code:java} > java.lang.RuntimeException: java.lang.NullPointerException > at > org.apache.spark.network.server.AbstractAuthRpcHandler.getMergedBlockMetaReqHandler(AbstractAuthRpcHandler.java:110) > at > org.apache.spark.network.crypto.AuthRpcHandler.getMergedBlockMetaReqHandler(AuthRpcHandler.java:144) > at > org.apache.spark.network.server.TransportRequestHandler.processMergedBlockMetaRequest(TransportRequestHandler.java:275) > at > org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:117) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53) > at > org.sparkproject.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99) > at > org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) > at > org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) > at > 
org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) > at > org.sparkproject.io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286) > > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43487) Wrong error message used for `ambiguousRelationAliasNameInNestedCTEError`
[ https://issues.apache.org/jira/browse/SPARK-43487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-43487. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41155 [https://github.com/apache/spark/pull/41155] > Wrong error message used for `ambiguousRelationAliasNameInNestedCTEError` > - > > Key: SPARK-43487 > URL: https://issues.apache.org/jira/browse/SPARK-43487 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Johan Lasperas >Assignee: Johan Lasperas >Priority: Minor > Fix For: 3.5.0 > > > The batch of errors migrated to error classes as part of SPARK-40540 contains > an error that got mixed up with the wrong error message: > [ambiguousRelationAliasNameInNestedCTEError|https://github.com/apache/spark/commit/43a6b932759865c45ccf36f3e9cf6898c1b762da#diff-744ac13f6fe074fddeab09b407404bffa2386f54abc83c501e6e1fe618f6db56R1983] > uses the same error message as the following > commandUnsupportedInV2TableError: > > {code:java} > WITH t AS (SELECT 1), t2 AS ( WITH t AS (SELECT 2) SELECT * FROM t) SELECT * > FROM t2; > AnalysisException: t is not supported for v2 tables > {code} > The error should be: > {code:java} > AnalysisException: Name t is ambiguous in nested CTE. > Please set spark.sql.legacy.ctePrecedencePolicy to CORRECTED so that name > defined in inner CTE takes precedence. If set it to LEGACY, outer CTE > definitions will take precedence. See more details in SPARK-28228.{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43487) Wrong error message used for `ambiguousRelationAliasNameInNestedCTEError`
[ https://issues.apache.org/jira/browse/SPARK-43487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-43487: Assignee: Johan Lasperas > Wrong error message used for `ambiguousRelationAliasNameInNestedCTEError` > - > > Key: SPARK-43487 > URL: https://issues.apache.org/jira/browse/SPARK-43487 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Johan Lasperas >Assignee: Johan Lasperas >Priority: Minor > > The batch of errors migrated to error classes as part of SPARK-40540 contains > an error that got mixed up with the wrong error message: > [ambiguousRelationAliasNameInNestedCTEError|https://github.com/apache/spark/commit/43a6b932759865c45ccf36f3e9cf6898c1b762da#diff-744ac13f6fe074fddeab09b407404bffa2386f54abc83c501e6e1fe618f6db56R1983] > uses the same error message as the following > commandUnsupportedInV2TableError: > > {code:java} > WITH t AS (SELECT 1), t2 AS ( WITH t AS (SELECT 2) SELECT * FROM t) SELECT * > FROM t2; > AnalysisException: t is not supported for v2 tables > {code} > The error should be: > {code:java} > AnalysisException: Name t is ambiguous in nested CTE. > Please set spark.sql.legacy.ctePrecedencePolicy to CORRECTED so that name > defined in inner CTE takes precedence. If set it to LEGACY, outer CTE > definitions will take precedence. See more details in SPARK-28228.{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43290) Support IV and AAD optional parameters for aes_encrypt
[ https://issues.apache.org/jira/browse/SPARK-43290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-43290. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40970 [https://github.com/apache/spark/pull/40970] > Support IV and AAD optional parameters for aes_encrypt > -- > > Key: SPARK-43290 > URL: https://issues.apache.org/jira/browse/SPARK-43290 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Steve Weis >Assignee: Steve Weis >Priority: Minor > Fix For: 3.5.0 > > > There are some use cases where callers to aes_encrypt may want to provide > initialization vectors (IVs) or additional authenticated data (AAD). The most > common cases will be: > 1. Ensuring that ciphertext matches values that have been encrypted by > external tools. In those cases, the caller will need to provide an identical > IV value. > 2. For AES-CBC mode, there are some cases where callers want to generate > deterministic encrypted output. > 3. For AES-GCM mode, providing AAD fields allows callers to bind additional > data to an encrypted ciphertext so that it can only be decrypted by a caller > providing the same value. This is often used to enforce some context. > The proposed new API is the following: > * aes_encrypt(expr, key [, mode [, padding [, iv [, aad) > * aes_decrypt(expr, key [, mode [, padding [, aad]]]) > These fields are only supported for specific modes: > * ECB: Does not support either IV or AAD and will return an error if either > are provided. > * CBC: Only supports an IV and will return an error if an AAD is provided > * GCM: Supports either IV, AAD, or both. > If a caller is only providing an AAD to GCM mode, they would need to pass a > null value in the IV field. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43290) Support IV and AAD optional parameters for aes_encrypt
[ https://issues.apache.org/jira/browse/SPARK-43290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-43290: Assignee: Steve Weis > Support IV and AAD optional parameters for aes_encrypt > -- > > Key: SPARK-43290 > URL: https://issues.apache.org/jira/browse/SPARK-43290 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Steve Weis >Assignee: Steve Weis >Priority: Minor > > There are some use cases where callers to aes_encrypt may want to provide > initialization vectors (IVs) or additional authenticated data (AAD). The most > common cases will be: > 1. Ensuring that ciphertext matches values that have been encrypted by > external tools. In those cases, the caller will need to provide an identical > IV value. > 2. For AES-CBC mode, there are some cases where callers want to generate > deterministic encrypted output. > 3. For AES-GCM mode, providing AAD fields allows callers to bind additional > data to an encrypted ciphertext so that it can only be decrypted by a caller > providing the same value. This is often used to enforce some context. > The proposed new API is the following: > * aes_encrypt(expr, key [, mode [, padding [, iv [, aad) > * aes_decrypt(expr, key [, mode [, padding [, aad]]]) > These fields are only supported for specific modes: > * ECB: Does not support either IV or AAD and will return an error if either > are provided. > * CBC: Only supports an IV and will return an error if an AAD is provided > * GCM: Supports either IV, AAD, or both. > If a caller is only providing an AAD to GCM mode, they would need to pass a > null value in the IV field. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
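The per-mode rules listed in SPARK-43290 above (ECB: no IV/AAD; CBC: IV only; GCM: IV, AAD, or both) can be sketched as a small validation routine. This is a pure-Python illustration of the stated rules, not Spark's implementation, and `validate_aes_args` is a hypothetical name:

```python
# Sketch of the mode/parameter compatibility rules from the proposal:
# ECB rejects both IV and AAD, CBC accepts only an IV, GCM accepts either.
def validate_aes_args(mode, iv=None, aad=None):
    mode = mode.upper()
    if mode == "ECB":
        if iv is not None or aad is not None:
            raise ValueError("ECB mode supports neither IV nor AAD")
    elif mode == "CBC":
        if aad is not None:
            raise ValueError("CBC mode supports an IV but not AAD")
    elif mode == "GCM":
        pass  # IV, AAD, or both are allowed
    else:
        raise ValueError(f"unsupported mode: {mode}")
    return True
```

Note the API shape above also implies that a GCM caller supplying only AAD passes a null IV, since `iv` precedes `aad` in the positional argument list.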
[jira] [Resolved] (SPARK-43597) Assign a name to the error class _LEGACY_ERROR_TEMP_0017
[ https://issues.apache.org/jira/browse/SPARK-43597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-43597. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41241 [https://github.com/apache/spark/pull/41241] > Assign a name to the error class _LEGACY_ERROR_TEMP_0017 > > > Key: SPARK-43597 > URL: https://issues.apache.org/jira/browse/SPARK-43597 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43597) Assign a name to the error class _LEGACY_ERROR_TEMP_0017
[ https://issues.apache.org/jira/browse/SPARK-43597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-43597: Assignee: BingKun Pan > Assign a name to the error class _LEGACY_ERROR_TEMP_0017 > > > Key: SPARK-43597 > URL: https://issues.apache.org/jira/browse/SPARK-43597 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43717) Scala Client Dataset#reduce failed to handle null partitions for scala primitive types
Zhen Li created SPARK-43717: --- Summary: Scala Client Dataset#reduce failed to handle null partitions for scala primitive types Key: SPARK-43717 URL: https://issues.apache.org/jira/browse/SPARK-43717 Project: Spark Issue Type: Bug Components: Connect Affects Versions: 3.5.0 Reporter: Zhen Li Scala client failed with NPE when running: assert(spark.range(0, 5, 1, 10).as[Long].reduce(_ + _) == 10) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
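The failing example above (`spark.range(0, 5, 1, 10)` spreads five values across ten partitions, so several partitions are empty) illustrates a common reduce pitfall: per-partition partial results can be null for empty partitions and must be skipped before the final combine. A minimal pure-Python sketch of that pattern, with hypothetical names, not the Spark Connect client code:

```python
# Sketch: reduce each partition, skipping empty partitions whose partial
# result would otherwise be null/None and break the final combine step.
from functools import reduce

def reduce_over_partitions(partitions, f):
    partials = [reduce(f, p) for p in partitions if p]  # drop empty partitions
    if not partials:
        raise ValueError("reduce on an empty dataset")
    return reduce(f, partials)
```

With five values 0..4 spread over partitions (some empty), the result is still 10, matching the assertion in the report.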
[jira] [Commented] (SPARK-43333) Name union type members after types
[ https://issues.apache.org/jira/browse/SPARK-43333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725055#comment-17725055 ] Siying Dong commented on SPARK-43333: - https://github.com/apache/spark/pull/41263/ > Name union type members after types > --- > > Key: SPARK-43333 > URL: https://issues.apache.org/jira/browse/SPARK-43333 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 3.3.2 >Reporter: Jose Gonzalez >Priority: Major > > Spark converts Avro union types into record types, where each member of the > union type corresponds to a field in the record type. The current behaviour > is to name the record fields "member0", "member1", etc, for each member of the > union type. We propose having the option to instead use the member type > name. > The purpose of this is twofold: > # To allow adding or removing types to the union without affecting the > record names of other member types. If the new or removed type is not ordered > last, then existing queries referencing "member2" may need to be rewritten to > reference "member1" or "member3". > # Referencing the type name in the query is more readable than referencing > "member0". > For example, our system produces an avro schema from a Java type structure > where subtyping maps to union types whose members are ordered > lexicographically. Adding a subtype can therefore easily result in all > references to "member2" needing to be updated to "member3". -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
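The naming instability described above can be shown with a tiny sketch of the two naming schemes (hypothetical helper, not the Spark Avro converter): index-based `memberN` names shift when a union member is inserted, while type-derived names stay stable.

```python
# Sketch of the proposal: name union-member fields either by position
# ("member0", "member1", ...) or by the member's type name.
def union_field_names(member_type_names, use_type_names=False):
    if use_type_names:
        return list(member_type_names)
    return [f"member{i}" for i in range(len(member_type_names))]
```

Inserting `Bird` before `Cat` and `Dog` moves `Dog` from `member1` to `member2` under the index scheme, so any query referencing `member1` silently changes meaning; the type-name scheme is unaffected.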
[jira] [Created] (SPARK-43716) Revert scala-maven-plugin upgrade
Bjørn Jørgensen created SPARK-43716: --- Summary: Revert scala-maven-plugin upgrade Key: SPARK-43716 URL: https://issues.apache.org/jira/browse/SPARK-43716 Project: Spark Issue Type: Dependency upgrade Components: Build Affects Versions: 3.5.0 Reporter: Bjørn Jørgensen -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42981) Add direct arrow serialization
[ https://issues.apache.org/jira/browse/SPARK-42981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724993#comment-17724993 ] Herman van Hövell commented on SPARK-42981: --- Nope, I need to pick this up again. Will do so in the next few days. > Add direct arrow serialization > -- > > Key: SPARK-42981 > URL: https://issues.apache.org/jira/browse/SPARK-42981 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43647) Maven test failed in ClientE2ETestSuite/CatalogSuite/StreamingQuerySuite without -Phive
[ https://issues.apache.org/jira/browse/SPARK-43647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-43647: - Description: {code:java} build/mvn clean install -DskipTests build/mvn test -pl connector/connect/client/jvm{code} 13 test failed with similar reasons: {code:java} - read and write *** FAILED *** io.grpc.StatusRuntimeException: INTERNAL: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.hive.execution.HiveFileFormat could not be instantiated at io.grpc.Status.asRuntimeException(Status.java:535) at io.grpc.stub.ClientCalls$BlockingResponseStream.hasNext(ClientCalls.java:660) at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45) at scala.collection.Iterator.toStream(Iterator.scala:1417) at scala.collection.Iterator.toStream$(Iterator.scala:1416) at scala.collection.AbstractIterator.toStream(Iterator.scala:1431) at scala.collection.TraversableOnce.toSeq(TraversableOnce.scala:354) at scala.collection.TraversableOnce.toSeq$(TraversableOnce.scala:354) at scala.collection.AbstractIterator.toSeq(Iterator.scala:1431) at org.apache.spark.sql.SparkSession.execute(SparkSession.scala:489) ... 
{code} was: {code:java} build/mvn clean install -DskipTests build/mvn test -pl connector/connect/server {code} 13 test failed with similar reasons: {code:java} - read and write *** FAILED *** io.grpc.StatusRuntimeException: INTERNAL: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.hive.execution.HiveFileFormat could not be instantiated at io.grpc.Status.asRuntimeException(Status.java:535) at io.grpc.stub.ClientCalls$BlockingResponseStream.hasNext(ClientCalls.java:660) at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45) at scala.collection.Iterator.toStream(Iterator.scala:1417) at scala.collection.Iterator.toStream$(Iterator.scala:1416) at scala.collection.AbstractIterator.toStream(Iterator.scala:1431) at scala.collection.TraversableOnce.toSeq(TraversableOnce.scala:354) at scala.collection.TraversableOnce.toSeq$(TraversableOnce.scala:354) at scala.collection.AbstractIterator.toSeq(Iterator.scala:1431) at org.apache.spark.sql.SparkSession.execute(SparkSession.scala:489) ... 
{code} > Maven test failed in ClientE2ETestSuite/CatalogSuite/StreamingQuerySuite > without -Phive > --- > > Key: SPARK-43647 > URL: https://issues.apache.org/jira/browse/SPARK-43647 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.5.0 >Reporter: Yang Jie >Priority: Major > > > {code:java} > build/mvn clean install -DskipTests > build/mvn test -pl connector/connect/client/jvm{code} > > 13 test failed with similar reasons: > > {code:java} > - read and write *** FAILED *** > io.grpc.StatusRuntimeException: INTERNAL: > org.apache.spark.sql.sources.DataSourceRegister: Provider > org.apache.spark.sql.hive.execution.HiveFileFormat could not be instantiated > at io.grpc.Status.asRuntimeException(Status.java:535) > at > io.grpc.stub.ClientCalls$BlockingResponseStream.hasNext(ClientCalls.java:660) > at > scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45) > at scala.collection.Iterator.toStream(Iterator.scala:1417) > at scala.collection.Iterator.toStream$(Iterator.scala:1416) > at scala.collection.AbstractIterator.toStream(Iterator.scala:1431) > at scala.collection.TraversableOnce.toSeq(TraversableOnce.scala:354) > at scala.collection.TraversableOnce.toSeq$(TraversableOnce.scala:354) > at scala.collection.AbstractIterator.toSeq(Iterator.scala:1431) > at org.apache.spark.sql.SparkSession.execute(SparkSession.scala:489) > ... {code} > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43338) Support modify the SESSION_CATALOG_NAME value
[ https://issues.apache.org/jira/browse/SPARK-43338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724942#comment-17724942 ] melin commented on SPARK-43338: --- kyuubi verified it: [https://kyuubi.readthedocs.io/en/v1.7.1-rc0/connector/spark/hive.html] kyuubi is implemented based on HiveSessionCatalog. If there are Hudi tables in the Hive database, another Hudi catalog needs to be registered. The same hms has two catalog names, which does not meet my requirements. > Support modify the SESSION_CATALOG_NAME value > -- > > Key: SPARK-43338 > URL: https://issues.apache.org/jira/browse/SPARK-43338 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: melin >Priority: Major > > {code:java} > private[sql] object CatalogManager { > val SESSION_CATALOG_NAME: String = "spark_catalog" > }{code} > > The SESSION_CATALOG_NAME value cannot be modified. > If multiple Hive Metastores exist, the platform manages multiple hms metadata > and classifies them by catalogName. A different catalog name is required > [~fanjia] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43338) Support modify the SESSION_CATALOG_NAME value
[ https://issues.apache.org/jira/browse/SPARK-43338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724939#comment-17724939 ] Jia Fan commented on SPARK-43338: - `Assign each hms a unique catalogname only so that the meta tableId is unique: catalog.database.table.` I think the Datasource V2 can do that. But I didn't verify it. > Support modify the SESSION_CATALOG_NAME value > -- > > Key: SPARK-43338 > URL: https://issues.apache.org/jira/browse/SPARK-43338 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: melin >Priority: Major > > {code:java} > private[sql] object CatalogManager { > val SESSION_CATALOG_NAME: String = "spark_catalog" > }{code} > > The SESSION_CATALOG_NAME value cannot be modified. > If multiple Hive Metastores exist, the platform manages multiple hms metadata > and classifies them by catalogName. A different catalog name is required > [~fanjia] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37351) Supports write data flow control
[ https://issues.apache.org/jira/browse/SPARK-37351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724934#comment-17724934 ] Jia Fan commented on SPARK-37351: - Do you want data flow control in micro-batch or batch mode? This is a big feature; maybe you should create an SPIP and send it to the dev mailing list. And I'm not sure the community will accept this change. cc [~cloud_fan] > Supports write data flow control > > > Key: SPARK-37351 > URL: https://issues.apache.org/jira/browse/SPARK-37351 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: melin >Priority: Major > > The hive table data is written to a relational database, generally an online > production database. If the writing speed has no traffic control, it can > easily affect the stability of the online system. It is recommended to add > traffic control parameters > [~fanjia] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
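The traffic-control parameter requested in SPARK-37351 above would typically be backed by a rate limiter such as a token bucket. A minimal sketch, with a clock passed in explicitly for determinism; the class and its parameters are hypothetical, not an existing Spark API:

```python
# Token-bucket sketch: writes acquire tokens at a bounded refill rate,
# capping the sustained write throughput against the target database.
class TokenBucket:
    def __init__(self, rate, capacity, now):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = now

    def try_acquire(self, n, now):
        # Refill based on elapsed time, then spend n tokens if available.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False
```

A writer would call `try_acquire(rows_in_batch, time.monotonic())` before each flush and back off when it returns False.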
[jira] [Commented] (SPARK-43338) Support modify the SESSION_CATALOG_NAME value
[ https://issues.apache.org/jira/browse/SPARK-43338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724933#comment-17724933 ] melin commented on SPARK-43338: --- I don't need to access multiple hms in the same SparkSession; I only need to access one of them. Assign each hms a unique catalogname only so that the meta tableId is unique: catalog.database.table. > Support modify the SESSION_CATALOG_NAME value > -- > > Key: SPARK-43338 > URL: https://issues.apache.org/jira/browse/SPARK-43338 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: melin >Priority: Major > > {code:java} > private[sql] object CatalogManager { > val SESSION_CATALOG_NAME: String = "spark_catalog" > }{code} > > The SESSION_CATALOG_NAME value cannot be modified. > If multiple Hive Metastores exist, the platform manages multiple hms metadata > and classifies them by catalogName. A different catalog name is required > [~fanjia] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
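The scheme discussed in this thread, one unique catalog name per Hive Metastore so that every table identifier `catalog.database.table` is globally unique, can be sketched as a trivial registry. All names here are hypothetical, for illustration only:

```python
# Sketch: one catalog name per metastore; table identifiers are qualified
# with the catalog so the same db.table in two metastores never collides.
hms_catalogs = {
    "hms_east": "thrift://east-metastore:9083",  # hypothetical URIs
    "hms_west": "thrift://west-metastore:9083",
}

def qualified_table_id(catalog, database, table):
    if catalog not in hms_catalogs:
        raise KeyError(f"unknown catalog: {catalog}")
    return f"{catalog}.{database}.{table}"
```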
[jira] [Updated] (SPARK-43715) Add spark DataFrame binary file reader / writer
[ https://issues.apache.org/jira/browse/SPARK-43715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu updated SPARK-43715: --- Description: In new distributed spark ML module (designed to support spark connect and support local inference) We need to save ML model to hadoop file system using custom binary file format, the reason is: * We often submit a spark application to spark cluster for running the training model job, we need to save trained model to hadoop file system before the spark application completes. * But we want to support local model inference, that means if we save the model by current spark DataFrame writer (e.g. parquet format), when loading model we have to rely on the spark service. But we hope we can load model without spark service. So we want the model being saved as the original binary format that our ML code can handle. so we need to add a DataFrame reader / writer format, that can load / save binary files, the API is like: {*}Writer API{*}: Supposing we have a dataframe with schema: [file_path: String, content: binary], we can save the dataframe to a hadoop path, each row we will save it as a file under the hadoop path, the saved file path is \{hadoop path}/\{file_path}, "file_path" can be a multiple part path. {*}Reader API{*}: `spark.read.format("binaryFileV2").load(...)` It will return a spark dataframe , each row contains the file path and the file content binary string. was: In new distributed spark ML module (designed to support spark connect and support local inference) We need to save ML model to hadoop file system using custom binary file format, the reason is: * The training model job is a spark job, we need to save trained model to hadoop file sytem after the job completes. * But we want to support local model inference, that means if we save the model by current spark DataFrame writer (e.g. parquet format), when loading model we have to rely on the spark service. 
But we hope we can load model without spark service. So we want the model being saved as the original binary format that our ML code can handle. so we need to add a DataFrame reader / writer format, that can load / save binary files, the API is like: {*}Writer API{*}: Supposing we have a dataframe with schema: [file_path: String, content: binary], we can save the dataframe to a hadoop path, each row we will save it as a file under the hadoop path, the saved file path is \{hadoop path}/\{file_path}, "file_path" can be a multiple part path. {*}Reader API{*}: `spark.read.format("binaryFileV2").load(...)` It will return a spark dataframe , each row contains the file path and the file content binary string. > Add spark DataFrame binary file reader / writer > --- > > Key: SPARK-43715 > URL: https://issues.apache.org/jira/browse/SPARK-43715 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 3.5.0 >Reporter: Weichen Xu >Priority: Major > > In new distributed spark ML module (designed to support spark connect and > support local inference) > We need to save ML model to hadoop file system using custom binary file > format, the reason is: > * We often submit a spark application to spark cluster for running the > training model job, we need to save trained model to hadoop file system > before the spark application completes. > * But we want to support local model inference, that means if we save the > model by current spark DataFrame writer (e.g. parquet format), when loading > model we have to rely on the spark service. But we hope we can load model > without spark service. So we want the model being saved as the original > binary format that our ML code can handle. 
> so we need to add a DataFrame reader / writer format, that can load / save > binary files, the API is like: > > {*}Writer API{*}: > Supposing we have a dataframe with schema: > [file_path: String, content: binary], > we can save the dataframe to a hadoop path, each row we will save it as a > file under the hadoop path, the saved file path is \{hadoop > path}/\{file_path}, "file_path" can be a multiple part path. > > {*}Reader API{*}: > `spark.read.format("binaryFileV2").load(...)` > > It will return a spark dataframe , each row contains the file path and the > file content binary string. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43715) Add spark DataFrame binary file reader / writer
[ https://issues.apache.org/jira/browse/SPARK-43715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu updated SPARK-43715: --- Description: In new distributed spark ML module (designed to support spark connect and support local inference) We need to save ML model to hadoop file system using custom binary file format, the reason is: * The training model job is a spark job, we need to save trained model to hadoop file system after the job completes. * But we want to support local model inference, that means if we save the model by current spark DataFrame writer (e.g. parquet format), when loading model we have to rely on the spark service. But we hope we can load model without spark service. So we want the model being saved as the original binary format that our ML code can handle. so we need to add a DataFrame reader / writer format, that can load / save binary files, the API is like: {*}Writer API{*}: Supposing we have a dataframe with schema: [file_path: String, content: binary], we can save the dataframe to a hadoop path, each row we will save it as a file under the hadoop path, the saved file path is \{hadoop path}/\{file_path}, "file_path" can be a multiple part path. {*}Reader API{*}: `spark.read.format("binaryFileV2").load(...)` It will return a spark dataframe, each row contains the file path and the file content binary string. was: In new distributed spark ML module (designed to support spark connect and support local inference) We need to save ML model to hadoop file system using custom binary file format, the reason is: * The training model job is a spark job, we need to save trained model to hadoop file system after the job completes. * But we want to support local model inference, that means if we save the model by current spark DataFrame writer (e.g. parquet format), when loading model we have to rely on the spark service. But we hope we can load model without spark service. 
So we want the model being saved as the original binary format that our ML code can handle. so we need to add a DataFrame reader / writer format, that can load / save binary files, the API is like: {*}Writer API{*}: Supposing we have a dataframe with schema: [file_path: String, content: binary], we can save the dataframe to a hadoop path, each row we will save it as a file under the hadoop path, the saved file path is \{hadoop path}/\{file_path}, "file_path" can be a multiple part path. Reader API: `spark.read.format("binaryFileV2").load(...)` It will return a spark dataframe, each row contains the file path and the file content binary string. > Add spark DataFrame binary file reader / writer > --- > > Key: SPARK-43715 > URL: https://issues.apache.org/jira/browse/SPARK-43715 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 3.5.0 >Reporter: Weichen Xu >Priority: Major > > In new distributed spark ML module (designed to support spark connect and > support local inference) > We need to save ML model to hadoop file system using custom binary file > format, the reason is: > * The training model job is a spark job, we need to save trained model to > hadoop file system after the job completes. > * But we want to support local model inference, that means if we save the > model by current spark DataFrame writer (e.g. parquet format), when loading > model we have to rely on the spark service. But we hope we can load model > without spark service. So we want the model being saved as the original > binary format that our ML code can handle. > so we need to add a DataFrame reader / writer format, that can load / save > binary files, the API is like: > > {*}Writer API{*}: > Supposing we have a dataframe with schema: > [file_path: String, content: binary], > we can save the dataframe to a hadoop path, each row we will save it as a > file under the hadoop path, the saved file path is \{hadoop > path}/\{file_path}, "file_path" can be a multiple part path. 
> > {*}Reader API{*}: > `spark.read.format("binaryFileV2").load(...)` > > It will return a spark dataframe , each row contains the file path and the > file content binary string. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
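The writer semantics proposed in SPARK-43715 above (each `[file_path, content]` row becomes one file under a base path, where `file_path` may be multi-part) can be sketched in plain Python. This is a local-filesystem illustration of the described behaviour, not Spark code; `write_binary_rows` is a hypothetical name, and the `binaryFileV2` format name is taken from the proposal:

```python
# Sketch: write each (file_path, content) row as a binary file under
# base_path, creating intermediate directories for multi-part paths.
import os

def write_binary_rows(base_path, rows):
    written = []
    for file_path, content in rows:
        full = os.path.join(base_path, file_path)  # file_path may be "a/b/c.bin"
        parent = os.path.dirname(full)
        if parent:
            os.makedirs(parent, exist_ok=True)
        with open(full, "wb") as f:
            f.write(content)
        written.append(full)
    return written
```

A matching reader would walk `base_path` and yield one (relative path, bytes) row per file, which is what makes the saved model loadable without a running Spark service.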