[jira] [Commented] (SPARK-44110) Forked unit tests don't receive proxy system properties

2023-06-21 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17735978#comment-17735978
 ] 

Snoot.io commented on SPARK-44110:
--

User 'snmvaughan' has created a pull request for this issue:
https://github.com/apache/spark/pull/41678

> Forked unit tests don't receive proxy system properties
> ---
>
> Key: SPARK-44110
> URL: https://issues.apache.org/jira/browse/SPARK-44110
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.3.2, 3.4.0
> Environment: Executing unit tests using SBT in an enterprise 
> environment that requires the use of a proxy for external network access.
>Reporter: Steve Vaughan
>Priority: Major
>
> Spark unit tests fail in an enterprise environment that requires the use of a 
> proxy for external network access.  These tests are configured to run as 
> forked JVMs, and don't receive the system properties used to configure the 
> proxy settings for Java.
> Updating the SBT build to propagate the proxy-related system properties 
> addresses the issue.  The tested patch checks for any system property that 
> starts with `http.` or `https.` and adds it to the `jvmOptions` of the forked 
> process.
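> A minimal sketch of that approach in the SBT build definition (the exact 
> setting name below is an illustrative assumption, not the actual patch):
> {code:scala}
> // Forward proxy-related JVM system properties to forked test JVMs.
> Test / javaOptions ++= sys.props.toSeq
>   .filter { case (k, _) => k.startsWith("http.") || k.startsWith("https.") }
>   .map { case (k, v) => s"-D$k=$v" }
> {code}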



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44132) nesting full outer joins confuses code generator

2023-06-21 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17735976#comment-17735976
 ] 

Bruce Robbins commented on SPARK-44132:
---

[~steven.aerts] Go for it!

> nesting full outer joins confuses code generator
> 
>
> Key: SPARK-44132
> URL: https://issues.apache.org/jira/browse/SPARK-44132
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.4.0, 3.5.0
> Environment: We verified the existence of this bug from spark 3.3 
> until spark 3.5.
>Reporter: Steven Aerts
>Priority: Major
>
> We are seeing issues with the code generator when querying java bean encoded 
> data with 2 nested joins.
> {code:java}
> dsA.join(dsB, seq("id"), "full_outer").join(dsC, seq("id"), "full_outer"); 
> {code}
> generates invalid code in the code generator and, depending on the data used, 
> can produce stack traces such as:
> {code:java}
>  Caused by: java.lang.NullPointerException
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown
>  Source)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.wholestagecodegen_findNextJoinRows_0$(Unknown
>  Source)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown
>  Source)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> {code}
> Or:
> {code:java}
>  Caused by: java.lang.AssertionError: index (2) should < 2
>         at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.assertIndexIsValid(UnsafeRow.java:118)
>         at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.isNullAt(UnsafeRow.java:315)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown
>  Source)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown
>  Source)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> {code}
> When we look at the generated code we see that the code generator seems to be 
> mixing up parameters.  For example:
> {code:java}
> if (smj_leftOutputRow_0 != null) {  //< null check for the wrong (left) parameter
>   boolean smj_isNull_12 = smj_rightOutputRow_0.isNullAt(1); //< causes NPE on the right parameter here
> {code}
> It is as if the nesting of 2 full outer joins is confusing the code generator 
> and as such generates invalid code.
> There is one other strange thing.  We found this issue when using data sets 
> that use the java bean encoder.  We tried to reproduce it in the spark shell 
> or with scala case classes but were unable to do so. 
> We made a reproduction scenario as unit tests (one for each of the stack 
> traces above) on the spark code base and made it available as a [pull 
> request|https://github.com/apache/spark/pull/41688] attached to this case.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44132) nesting full outer joins confuses code generator

2023-06-21 Thread Steven Aerts (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17735973#comment-17735973
 ] 

Steven Aerts commented on SPARK-44132:
--

[~bersprockets] I did not think of that.

Are you already working on a fix?  If not, I can give it a go.  With your hint, 
I now know where to look.

Thanks

> nesting full outer joins confuses code generator
> 
>
> Key: SPARK-44132
> URL: https://issues.apache.org/jira/browse/SPARK-44132
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.4.0, 3.5.0
> Environment: We verified the existence of this bug from spark 3.3 
> until spark 3.5.
>Reporter: Steven Aerts
>Priority: Major
>
> We are seeing issues with the code generator when querying java bean encoded 
> data with 2 nested joins.
> {code:java}
> dsA.join(dsB, seq("id"), "full_outer").join(dsC, seq("id"), "full_outer"); 
> {code}
> generates invalid code in the code generator and, depending on the data used, 
> can produce stack traces such as:
> {code:java}
>  Caused by: java.lang.NullPointerException
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown
>  Source)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.wholestagecodegen_findNextJoinRows_0$(Unknown
>  Source)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown
>  Source)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> {code}
> Or:
> {code:java}
>  Caused by: java.lang.AssertionError: index (2) should < 2
>         at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.assertIndexIsValid(UnsafeRow.java:118)
>         at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.isNullAt(UnsafeRow.java:315)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown
>  Source)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown
>  Source)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> {code}
> When we look at the generated code we see that the code generator seems to be 
> mixing up parameters.  For example:
> {code:java}
> if (smj_leftOutputRow_0 != null) {  //< null check for the wrong (left) parameter
>   boolean smj_isNull_12 = smj_rightOutputRow_0.isNullAt(1); //< causes NPE on the right parameter here
> {code}
> It is as if the nesting of 2 full outer joins is confusing the code generator 
> and as such generates invalid code.
> There is one other strange thing.  We found this issue when using data sets 
> that use the java bean encoder.  We tried to reproduce it in the spark shell 
> or with scala case classes but were unable to do so. 
> We made a reproduction scenario as unit tests (one for each of the stack 
> traces above) on the spark code base and made it available as a [pull 
> request|https://github.com/apache/spark/pull/41688] attached to this case.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-44132) nesting full outer joins confuses code generator

2023-06-21 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17735944#comment-17735944
 ] 

Bruce Robbins edited comment on SPARK-44132 at 6/22/23 1:51 AM:


You may have this figured out already, but in case not, here's a clue.

You can replicate the NPE in {{spark-shell}} as follows:
{noformat}
val dsA = Seq((1, 1)).toDF("id", "a")
val dsB = Seq((2, 2)).toDF("id", "a")
val dsC = Seq((3, 3)).toDF("id", "a")

val joined = dsA.join(dsB, Stream("id"), "full_outer").join(dsC, Stream("id"), 
"full_outer");
joined.collectAsList
{noformat}

I think it's because the join column sequence {{idSeq}} (in your unit test) is 
provided as a {{Stream}}. {{toSeq}} in {{JavaConverters}} returns a Stream:
{noformat}
scala> scala.collection.JavaConverters.collectionAsScalaIterableConverter(
Collections.singletonList("id")
).asScala.toSeq;
 |  | res2: Seq[String] = Stream(id, ?)

scala> 
{noformat}
This seems to be a bug in the handling of the join columns, but only in the case 
where it's provided as a {{Stream}} (see similar bugs SPARK-38308, SPARK-38528, 
SPARK-38221, SPARK-26680).


was (Author: bersprockets):
You may have this figured out already, but in case not, here's a clue.

You can replicate the NPE in {{spark-shell}} as follows:
{noformat}
val dsA = Seq((1, 1)).toDF("id", "a")
val dsB = Seq((2, 2)).toDF("id", "a")
val dsC = Seq((3, 3)).toDF("id", "a")

val joined = dsA.join(dsB, Stream("id"), "full_outer").join(dsC, Stream("id"), 
"full_outer");
joined.collectAsList
{noformat}

I think it's because the join column sequence {{idSeq}} (in your unit test) is 
provided as a {{Stream}}. {{toSeq}} in {{JavaConverters}} returns a Stream:
{noformat}
scala> scala.collection.JavaConverters.collectionAsScalaIterableConverter(
Collections.singletonList("id")
).asScala.toSeq;
 |  | res2: Seq[String] = Stream(id, ?)

scala> 
{noformat}
This seems to be a bug in the handling of the join columns, but only in the case 
where it's provided as a {{Stream}} (see similar bugs SPARK-38308, SPARK-38528, 
SPARK-38221).

> nesting full outer joins confuses code generator
> 
>
> Key: SPARK-44132
> URL: https://issues.apache.org/jira/browse/SPARK-44132
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.4.0, 3.5.0
> Environment: We verified the existence of this bug from spark 3.3 
> until spark 3.5.
>Reporter: Steven Aerts
>Priority: Major
>
> We are seeing issues with the code generator when querying java bean encoded 
> data with 2 nested joins.
> {code:java}
> dsA.join(dsB, seq("id"), "full_outer").join(dsC, seq("id"), "full_outer"); 
> {code}
> generates invalid code in the code generator and, depending on the data used, 
> can produce stack traces such as:
> {code:java}
>  Caused by: java.lang.NullPointerException
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown
>  Source)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.wholestagecodegen_findNextJoinRows_0$(Unknown
>  Source)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown
>  Source)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> {code}
> Or:
> {code:java}
>  Caused by: java.lang.AssertionError: index (2) should < 2
>         at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.assertIndexIsValid(UnsafeRow.java:118)
>         at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.isNullAt(UnsafeRow.java:315)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown
>  Source)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown
>  Source)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> {code}
> When we look at the generated code we see that the code generator seems to be 
> mixing up parameters.  For example:
> {code:java}
> if (smj_leftOutputRow_0 != null) {  //< null check for the wrong (left) parameter
>   boolean smj_isNull_12 = smj_rightOutputRow_0.isNullAt(1); //< causes NPE on the right parameter here
> {code}
> It is as if the nesting of 2 full outer joins is confusing the code generator 
> and as such generates invalid code.
> There is one other strange thing.  We found this issue when using data sets 
> that use the java bean encoder.  We tried to reproduce it in the 
> spark 

[jira] [Commented] (SPARK-44132) nesting full outer joins confuses code generator

2023-06-21 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17735944#comment-17735944
 ] 

Bruce Robbins commented on SPARK-44132:
---

You may have this figured out already, but in case not, here's a clue.

You can replicate the NPE in {{spark-shell}} as follows:
{noformat}
val dsA = Seq((1, 1)).toDF("id", "a")
val dsB = Seq((2, 2)).toDF("id", "a")
val dsC = Seq((3, 3)).toDF("id", "a")

val joined = dsA.join(dsB, Stream("id"), "full_outer").join(dsC, Stream("id"), 
"full_outer");
joined.collectAsList
{noformat}

I think it's because the join column sequence {{idSeq}} (in your unit test) is 
provided as a {{Stream}}. {{toSeq}} in {{JavaConverters}} returns a Stream:
{noformat}
scala> scala.collection.JavaConverters.collectionAsScalaIterableConverter(
Collections.singletonList("id")
).asScala.toSeq;
 |  | res2: Seq[String] = Stream(id, ?)

scala> 
{noformat}
This seems to be a bug in the handling of the join columns, but only in the case 
where it's provided as a {{Stream}} (see similar bugs SPARK-38308, SPARK-38528, 
SPARK-38221).
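
A minimal sketch of a workaround under that assumption (names follow the repro 
above; the idea is to force the lazy {{Stream}} into a strict collection before 
joining):
{noformat}
val idSeq: Seq[String] = Stream("id").toList   // materialize the join columns
val joined = dsA.join(dsB, idSeq, "full_outer").join(dsC, idSeq, "full_outer")
joined.collectAsList
{noformat}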

> nesting full outer joins confuses code generator
> 
>
> Key: SPARK-44132
> URL: https://issues.apache.org/jira/browse/SPARK-44132
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.4.0, 3.5.0
> Environment: We verified the existence of this bug from spark 3.3 
> until spark 3.5.
>Reporter: Steven Aerts
>Priority: Major
>
> We are seeing issues with the code generator when querying java bean encoded 
> data with 2 nested joins.
> {code:java}
> dsA.join(dsB, seq("id"), "full_outer").join(dsC, seq("id"), "full_outer"); 
> {code}
> generates invalid code in the code generator and, depending on the data used, 
> can produce stack traces such as:
> {code:java}
>  Caused by: java.lang.NullPointerException
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown
>  Source)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.wholestagecodegen_findNextJoinRows_0$(Unknown
>  Source)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown
>  Source)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> {code}
> Or:
> {code:java}
>  Caused by: java.lang.AssertionError: index (2) should < 2
>         at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.assertIndexIsValid(UnsafeRow.java:118)
>         at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.isNullAt(UnsafeRow.java:315)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown
>  Source)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown
>  Source)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> {code}
> When we look at the generated code we see that the code generator seems to be 
> mixing up parameters.  For example:
> {code:java}
> if (smj_leftOutputRow_0 != null) {  //< null check for the wrong (left) parameter
>   boolean smj_isNull_12 = smj_rightOutputRow_0.isNullAt(1); //< causes NPE on the right parameter here
> {code}
> It is as if the nesting of 2 full outer joins is confusing the code generator 
> and as such generates invalid code.
> There is one other strange thing.  We found this issue when using data sets 
> that use the java bean encoder.  We tried to reproduce it in the spark shell 
> or with scala case classes but were unable to do so. 
> We made a reproduction scenario as unit tests (one for each of the stack 
> traces above) on the spark code base and made it available as a [pull 
> request|https://github.com/apache/spark/pull/41688] attached to this case.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44078) Add support for classloader/resource isolation

2023-06-21 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-44078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hövell resolved SPARK-44078.
---
Fix Version/s: 3.5.0
 Assignee: Venkata Sai Akhil Gudesa
   Resolution: Fixed

> Add support for classloader/resource isolation
> --
>
> Key: SPARK-44078
> URL: https://issues.apache.org/jira/browse/SPARK-44078
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect, Spark Core
>Affects Versions: 3.5.0
>Reporter: Venkata Sai Akhil Gudesa
>Assignee: Venkata Sai Akhil Gudesa
>Priority: Major
> Fix For: 3.5.0
>
>
> A current limitation of Scala UDFs is that a Spark cluster can only support a 
> single REPL at a time, because the classloaders of different Spark Sessions 
> (and therefore, Spark Connect sessions) aren't isolated from each other. 
> Without isolation, REPL-generated classfiles as well as user-added JARs may 
> conflict if there are multiple users of the cluster.
> Thus, we need a mechanism to support isolated sessions (i.e. isolated 
> resources/classloaders) so that each REPL user does not conflict with other 
> users on the same cluster.
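> A rough conceptual sketch of the idea (not the actual implementation): give 
> each session its own classloader over its own artifacts, parented on the 
> shared Spark classloader, so sessions cannot see each other's classes.
> {code:scala}
> import java.net.{URL, URLClassLoader}
> 
> // Hypothetical helper: a session's REPL classfiles and user-added JARs are
> // only visible through that session's loader.
> def sessionClassLoader(sessionJars: Seq[URL], shared: ClassLoader): ClassLoader =
>   new URLClassLoader(sessionJars.toArray, shared)
> {code}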



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44136) StateManager may get materialized in executor instead of driver in FlatMapGroupsWithStateExec

2023-06-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-44136:


Assignee: Bo Gao

> StateManager may get materialized in executor instead of driver in 
> FlatMapGroupsWithStateExec
> -
>
> Key: SPARK-44136
> URL: https://issues.apache.org/jira/browse/SPARK-44136
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Bo Gao
>Assignee: Bo Gao
>Priority: Major
>
> StateManager may get materialized in executor instead of driver in 
> FlatMapGroupsWithStateExec because of a previous change 
> https://issues.apache.org/jira/browse/SPARK-40411



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44136) StateManager may get materialized in executor instead of driver in FlatMapGroupsWithStateExec

2023-06-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-44136.
--
Fix Version/s: 3.5.0
   3.4.1
   Resolution: Fixed

Issue resolved by pull request 41693
[https://github.com/apache/spark/pull/41693]

> StateManager may get materialized in executor instead of driver in 
> FlatMapGroupsWithStateExec
> -
>
> Key: SPARK-44136
> URL: https://issues.apache.org/jira/browse/SPARK-44136
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Bo Gao
>Assignee: Bo Gao
>Priority: Major
> Fix For: 3.5.0, 3.4.1
>
>
> StateManager may get materialized in executor instead of driver in 
> FlatMapGroupsWithStateExec because of a previous change 
> https://issues.apache.org/jira/browse/SPARK-40411



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44139) Discard completely pushed down filters in group-based MERGE operations

2023-06-21 Thread Anton Okolnychyi (Jira)
Anton Okolnychyi created SPARK-44139:


 Summary: Discard completely pushed down filters in group-based 
MERGE operations
 Key: SPARK-44139
 URL: https://issues.apache.org/jira/browse/SPARK-44139
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.5.0
Reporter: Anton Okolnychyi


We need to discard completely pushed down filters in group-based MERGE 
operations to avoid evaluating them again on the Spark side for performance 
reasons.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43710) Support functions.date_part for Spark Connect

2023-06-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-43710:


Assignee: Haejoon Lee

> Support functions.date_part for Spark Connect
> -
>
> Key: SPARK-43710
> URL: https://issues.apache.org/jira/browse/SPARK-43710
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Pandas API on Spark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> Repro: run `TimedeltaIndexParityTests.test_properties`



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44128) Upgrade netty from 4.1.92 to 4.1.93

2023-06-21 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-44128:

Summary: Upgrade netty from 4.1.92 to 4.1.93  (was: Upgrade netty from 
4.1.92 to 4.1.94)

> Upgrade netty from 4.1.92 to 4.1.93
> ---
>
> Key: SPARK-44128
> URL: https://issues.apache.org/jira/browse/SPARK-44128
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44136) StateManager may get materialized in executor instead of driver in FlatMapGroupsWithStateExec

2023-06-21 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17735926#comment-17735926
 ] 

Jungtaek Lim commented on SPARK-44136:
--

PR: https://github.com/apache/spark/pull/41693

> StateManager may get materialized in executor instead of driver in 
> FlatMapGroupsWithStateExec
> -
>
> Key: SPARK-44136
> URL: https://issues.apache.org/jira/browse/SPARK-44136
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Bo Gao
>Priority: Major
>
> StateManager may get materialized in executor instead of driver in 
> FlatMapGroupsWithStateExec because of a previous change 
> https://issues.apache.org/jira/browse/SPARK-40411



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44138) Prohibit non-deterministic expressions, subqueries and aggregates in MERGE conditions

2023-06-21 Thread Anton Okolnychyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anton Okolnychyi updated SPARK-44138:
-
Summary: Prohibit non-deterministic expressions, subqueries and aggregates 
in MERGE conditions  (was: Prohibit non-deterministic conditions, subqueries 
and aggregates in MERGE conditions)

> Prohibit non-deterministic expressions, subqueries and aggregates in MERGE 
> conditions
> -
>
> Key: SPARK-44138
> URL: https://issues.apache.org/jira/browse/SPARK-44138
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Anton Okolnychyi
>Priority: Major
>
> We need to prohibit non-deterministic conditions, subqueries and aggregates 
> in MERGE conditions that are rewritten into executable plans.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44138) Prohibit non-deterministic conditions, subqueries and aggregates in MERGE conditions

2023-06-21 Thread Anton Okolnychyi (Jira)
Anton Okolnychyi created SPARK-44138:


 Summary: Prohibit non-deterministic conditions, subqueries and 
aggregates in MERGE conditions
 Key: SPARK-44138
 URL: https://issues.apache.org/jira/browse/SPARK-44138
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.5.0
Reporter: Anton Okolnychyi


We need to prohibit non-deterministic conditions, subqueries and aggregates in 
MERGE conditions that are rewritten into executable plans.
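
For illustration, a condition of the kind that would be rejected (hypothetical 
table names, shown via the SQL API):
{code:scala}
spark.sql("""
  MERGE INTO target t
  USING source s
  ON t.id = s.id AND rand() > 0.5   -- non-deterministic merge condition
  WHEN MATCHED THEN UPDATE SET t.value = s.value
""")
{code}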



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44137) Change handling of iterable objects for on field in joins

2023-06-21 Thread John Haberstroh (Jira)
John Haberstroh created SPARK-44137:
---

 Summary: Change handling of iterable objects for on field in joins
 Key: SPARK-44137
 URL: https://issues.apache.org/jira/browse/SPARK-44137
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.5.0
Reporter: John Haberstroh


> The {{on}} field complained when I passed it a tuple. That's because the code 
> checks for {{list}} exactly, and so wrapped the tuple into a list like 
> {{[on]}}, leading to immediate failure. This was surprising -- typically, 
> tuple and list should be interchangeable, and tuple is often the more readily 
> accepted type. I have proposed a change that moves towards the principle of 
> least surprise for this situation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44136) StateManager may get materialized in executor instead of driver in FlatMapGroupsWithStateExec

2023-06-21 Thread Bo Gao (Jira)
Bo Gao created SPARK-44136:
--

 Summary: StateManager may get materialized in executor instead of 
driver in FlatMapGroupsWithStateExec
 Key: SPARK-44136
 URL: https://issues.apache.org/jira/browse/SPARK-44136
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 3.3.0
Reporter: Bo Gao


StateManager may get materialized in executor instead of driver in 
FlatMapGroupsWithStateExec because of a previous change 
https://issues.apache.org/jira/browse/SPARK-40411



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44133) Upgrade MyPy from 0.920 to 0.982

2023-06-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-44133:
-

Assignee: Hyukjin Kwon

> Upgrade MyPy from 0.920 to 0.982
> 
>
> Key: SPARK-44133
> URL: https://issues.apache.org/jira/browse/SPARK-44133
> Project: Spark
>  Issue Type: Task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> We're using MyPy 0.920. We should use a higher version that has bug fixes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44133) Upgrade MyPy from 0.920 to 0.982

2023-06-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-44133.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41690
[https://github.com/apache/spark/pull/41690]

> Upgrade MyPy from 0.920 to 0.982
> 
>
> Key: SPARK-44133
> URL: https://issues.apache.org/jira/browse/SPARK-44133
> Project: Spark
>  Issue Type: Task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.5.0
>
>
> We're using MyPy 0.920. We should use a higher version that has bug fixes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39375) SPIP: Spark Connect - A client and server interface for Apache Spark

2023-06-21 Thread Michael Allman (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17735876#comment-17735876
 ] 

Michael Allman commented on SPARK-39375:


To be clear, Spark Connect will be an alternative to or an augmentation of the 
current method of connecting to a Spark cluster, not a replacement, right? RDDs 
are a strictly more powerful interface than DataFrames, and certain 
architectures, like GraphX, cannot be implemented in DataFrames.

FWIW, we have been connecting to Spark from Jupyter for years. We load/run 
PySpark in the Jupyter kernel.

> SPIP: Spark Connect - A client and server interface for Apache Spark
> 
>
> Key: SPARK-39375
> URL: https://issues.apache.org/jira/browse/SPARK-39375
> Project: Spark
>  Issue Type: Epic
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Martin Grund
>Assignee: Martin Grund
>Priority: Critical
>  Labels: SPIP
>
> Please find the full document for discussion here: [Spark Connect 
> SPIP|https://docs.google.com/document/d/1Mnl6jmGszixLW4KcJU5j9IgpG9-UabS0dcM6PM2XGDc/edit#heading=h.wmsrrfealhrj]
>  Below, we have just referenced the introduction.
> h2. What are you trying to do?
> While Spark is used extensively, it was designed nearly a decade ago, which, 
> in the age of serverless computing and ubiquitous programming language use, 
> poses a number of limitations. Most of the limitations stem from the tightly 
> coupled Spark driver architecture and fact that clusters are typically shared 
> across users: (1) {*}Lack of built-in remote connectivity{*}: the Spark 
> driver runs both the client application and scheduler, which results in a 
> heavyweight architecture that requires proximity to the cluster. There is no 
> built-in capability to  remotely connect to a Spark cluster in languages 
> other than SQL and users therefore rely on external solutions such as the 
> inactive project [Apache Livy|https://livy.apache.org/]. (2) {*}Lack of rich 
> developer experience{*}: The current architecture and APIs do not cater for 
> interactive data exploration (as done with Notebooks), or allow for building 
> out rich developer experience common in modern code editors. (3) 
> {*}Stability{*}: with the current shared driver architecture, users causing 
> critical exceptions (e.g. OOM) bring the whole cluster down for all users. 
> (4) {*}Upgradability{*}: the current entangling of platform and client APIs 
> (e.g. first and third-party dependencies in the classpath) does not allow for 
> seamless upgrades between Spark versions (and with that, hinders new feature 
> adoption).
>  
> We propose to overcome these challenges by building on the DataFrame API and 
> the underlying unresolved logical plans. The DataFrame API is widely used and 
> makes it very easy to iteratively express complex logic. We will introduce 
> {_}Spark Connect{_}, a remote option of the DataFrame API that separates the 
> client from the Spark server. With Spark Connect, Spark will become 
> decoupled, allowing for built-in remote connectivity: The decoupled client 
> SDK can be used to run interactive data exploration and connect to the server 
> for DataFrame operations. 
>  
> Spark Connect will benefit Spark developers in different ways: The decoupled 
> architecture will result in improved stability, as clients are separated from 
> the driver. From the Spark Connect client perspective, Spark will be (almost) 
> versionless, and thus enable seamless upgradability, as server APIs can 
> evolve without affecting the client API. The decoupled client-server 
> architecture can be leveraged to build close integrations with local 
> developer tooling. Finally, separating the client process from the Spark 
> server process will improve Spark’s overall security posture by avoiding the 
> tight coupling of the client inside the Spark runtime environment.
>  
> Spark Connect will strengthen Spark’s position as the modern unified engine 
> for large-scale data analytics and expand applicability to use cases and 
> developers we could not reach with the current setup: Spark will become 
> ubiquitously usable as the DataFrame API can be used with (almost) any 
> programming language.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42962) Allow access to stopped streaming queries

2023-06-21 Thread Wei Liu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17735867#comment-17735867
 ] 

Wei Liu commented on SPARK-42962:
-

Closing this as a duplicate of SPARK-42940. Please kindly reopen the 
ticket if this is a mistake! 

> Allow access to stopped streaming queries
> -
>
> Key: SPARK-42962
> URL: https://issues.apache.org/jira/browse/SPARK-42962
> Project: Spark
>  Issue Type: Task
>  Components: Connect, Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Raghu Angadi
>Priority: Major
>
> In Spark Connect, when a query is stopped, the client can't access it 
> anymore. That implies it cannot fetch any information about the query 
> after that. 
> The server might have to cache the queries for some time (up to the time the 
> session is closed). 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42962) Allow access to stopped streaming queries

2023-06-21 Thread Wei Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Liu resolved SPARK-42962.
-
Resolution: Duplicate

> Allow access to stopped streaming queries
> -
>
> Key: SPARK-42962
> URL: https://issues.apache.org/jira/browse/SPARK-42962
> Project: Spark
>  Issue Type: Task
>  Components: Connect, Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Raghu Angadi
>Priority: Major
>
> In Spark Connect, when a query is stopped, the client can't access it 
> anymore. That implies it cannot fetch any information about the query 
> after that. 
> The server might have to cache the queries for some time (up to the time the 
> session is closed). 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42941) Add support for streaming listener in Python

2023-06-21 Thread Wei Liu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17735866#comment-17735866
 ] 

Wei Liu commented on SPARK-42941:
-

Hmmm this is actually still in progress. I was thinking two PRs could share the 
same Jira ticket... sorry if I did it wrong

> Add support for streaming listener in Python
> 
>
> Key: SPARK-42941
> URL: https://issues.apache.org/jira/browse/SPARK-42941
> Project: Spark
>  Issue Type: Task
>  Components: Connect, Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Raghu Angadi
>Assignee: Wei Liu
>Priority: Major
> Fix For: 3.5.0
>
>
> Add support for a streaming listener in Python. 
> This likely requires a design doc to hash out the details. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-42941) Add support for streaming listener in Python

2023-06-21 Thread Wei Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Liu reopened SPARK-42941:
-

> Add support for streaming listener in Python
> 
>
> Key: SPARK-42941
> URL: https://issues.apache.org/jira/browse/SPARK-42941
> Project: Spark
>  Issue Type: Task
>  Components: Connect, Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Raghu Angadi
>Assignee: Wei Liu
>Priority: Major
> Fix For: 3.5.0
>
>
> Add support for a streaming listener in Python. 
> This likely requires a design doc to hash out the details. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43915) Assign names to the error class _LEGACY_ERROR_TEMP_[2438-2445]

2023-06-21 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-43915:


Assignee: jiaan.geng

> Assign names to the error class _LEGACY_ERROR_TEMP_[2438-2445]
> --
>
> Key: SPARK-43915
> URL: https://issues.apache.org/jira/browse/SPARK-43915
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43915) Assign names to the error class _LEGACY_ERROR_TEMP_[2438-2445]

2023-06-21 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-43915.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41553
[https://github.com/apache/spark/pull/41553]

> Assign names to the error class _LEGACY_ERROR_TEMP_[2438-2445]
> --
>
> Key: SPARK-43915
> URL: https://issues.apache.org/jira/browse/SPARK-43915
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43924) Add misc functions to Scala and Python

2023-06-21 Thread Ignite TC Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17735789#comment-17735789
 ] 

Ignite TC Bot commented on SPARK-43924:
---

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/41689

> Add misc functions to Scala and Python
> --
>
> Key: SPARK-43924
> URL: https://issues.apache.org/jira/browse/SPARK-43924
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark, SQL
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Priority: Major
>
> Add the following functions (see the usage sketch after this list):
> * uuid
> * aes_encrypt
> * aes_decrypt
> * sha
> * input_file_block_length
> * input_file_block_start
> * reflect
> * java_method
> * version
> * typeof
> * stack
> * random
> to:
> * Scala API
> * Python API
> * Spark Connect Scala Client
> * Spark Connect Python Client
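> As a rough illustration (these functions are already callable through SQL 
> expressions; the dedicated API wrappers are what this task adds), assuming a 
> running {{spark}} session:
> {code:scala}
> import org.apache.spark.sql.functions.expr
> 
> val df = spark.range(1).select(
>   expr("uuid()").as("id"),            // random UUID string
>   expr("version()").as("spark_ver"),  // Spark version string
>   expr("typeof(1.5)").as("type")      // type name of the literal
> )
> {code}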



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44135) Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" java.nio.file.NoSuchFileException: , although jar is present in the location

2023-06-21 Thread Ramakrishna (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ramakrishna updated SPARK-44135:

Description: 
 
I have a spark application that is deployed using k8s and it is of version 
3.3.2. Recently there were some vulnerabilities in spark 3.3.2.

I changed my dockerfile to download 3.4.0 instead of 3.3.2 and also my 
application jar is built on spark 3.4.0

However while deploying, I get this error

        

*{{Exception in thread "main" java.nio.file.NoSuchFileException: 
/spark-assembly-1.0.jar}}*

 

I have this in deployment.yaml of the app

 

*mainApplicationFile: "local:spark-assembly-1.0.jar"*

 

 

 

 

and I have not changed anything related to that. I see that some code has 
changed in spark 3.4.0 core's source code regarding jar location.

Has it really changed the functionality? Is there anyone who is facing the same 
issue as me? Should the path be specified in a different way?

  was:
 
I have a spark application that is deployed using k8s and it is of version 
3.3.2. Recently there were some vulnerabilities in spark 3.3.2.

I changed my dockerfile to download 3.4.0 instead of 3.3.2 and also my 
application jar is built on spark 3.4.0

However while deploying, I get this error

        

*{{Exception in thread "main" java.nio.file.NoSuchFileException: 
/spark-assembly-1.0.jar}}*

 

I have this in deployment.yaml of the app

 

*mainApplicationFile: "local:spark-assembly-1.0.jar"*

 

 

 

 

and I have not changed anything related to that. I see that some code has 
changed in spark 3.4.0 core's source code regarding jar location.

Has it really changed the functionality? Is there anyone who is facing the same 
issue as me? Should the path be specified in a different way?

{{}}

{{}}


> Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" 
> java.nio.file.NoSuchFileException: , although jar is present in the location
> ---
>
> Key: SPARK-44135
> URL: https://issues.apache.org/jira/browse/SPARK-44135
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Ramakrishna
>Priority: Blocker
>
>  
> I have a spark application that is deployed using k8s and it is of version 
> 3.3.2. Recently there were some vulnerabilities in spark 3.3.2.
> I changed my dockerfile to download 3.4.0 instead of 3.3.2 and also my 
> application jar is built on spark 3.4.0
> However while deploying, I get this error
>         
> *{{Exception in thread "main" java.nio.file.NoSuchFileException: 
> /spark-assembly-1.0.jar}}*
>  
> I have this in deployment.yaml of the app
>  
> *mainApplicationFile: "local:spark-assembly-1.0.jar"*
>  
>  
>  
>  
> and I have not changed anything related to that. I see that some code has 
> changed in spark 3.4.0 core's source code regarding jar location.
> Has it really changed the functionality? Is there anyone who is facing the same 
> issue as me? Should the path be specified in a different way?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44135) Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" java.nio.file.NoSuchFileException: , although jar is present in the location

2023-06-21 Thread Ramakrishna (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ramakrishna updated SPARK-44135:

Priority: Blocker  (was: Critical)

> Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" 
> java.nio.file.NoSuchFileException: , although jar is present in the location
> ---
>
> Key: SPARK-44135
> URL: https://issues.apache.org/jira/browse/SPARK-44135
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Ramakrishna
>Priority: Blocker
>
>  
> I have a spark application that is deployed using k8s and it is of version 
> 3.3.2. Recently there were some vulnerabilities in spark 3.3.2.
> I changed my dockerfile to download 3.4.0 instead of 3.3.2 and also my 
> application jar is built on spark 3.4.0
> However while deploying, I get this error
>         
> *{{Exception in thread "main" java.nio.file.NoSuchFileException: 
> /spark-assembly-1.0.jar}}*
>  
> I have this in deployment.yaml of the app
>  
> *mainApplicationFile: "local:spark-assembly-1.0.jar"*
>  
>  
>  
>  
> and I have not changed anything related to that. I see that some code has 
> changed in spark 3.4.0 core's source code regarding jar location.
> Has it really changed the functionality? Is there anyone who is facing the same 
> issue as me? Should the path be specified in a different way?
> {{}}
> {{}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44135) Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" java.nio.file.NoSuchFileException: , although jar is present in the location

2023-06-21 Thread Ramakrishna (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ramakrishna updated SPARK-44135:

Issue Type: Bug  (was: Improvement)

> Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" 
> java.nio.file.NoSuchFileException: , although jar is present in the location
> ---
>
> Key: SPARK-44135
> URL: https://issues.apache.org/jira/browse/SPARK-44135
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Ramakrishna
>Priority: Critical
>
>  
> I have a spark application that is deployed using k8s and it is of version 
> 3.3.2. Recently there were some vulnerabilities in spark 3.3.2.
> I changed my dockerfile to download 3.4.0 instead of 3.3.2 and also my 
> application jar is built on spark 3.4.0
> However while deploying, I get this error
>         
> *{{Exception in thread "main" java.nio.file.NoSuchFileException: 
> /spark-assembly-1.0.jar}}*
>  
> I have this in deployment.yaml of the app
>  
> *mainApplicationFile: "local:spark-assembly-1.0.jar"*
>  
>  
>  
>  
> and I have not changed anything related to that. I see that some code has 
> changed in spark 3.4.0 core's source code regarding jar location.
> Has it really changed the functionality? Is there anyone who is facing the same 
> issue as me? Should the path be specified in a different way?
> {{}}
> {{}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44135) Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" java.nio.file.NoSuchFileException: , although jar is present in the location

2023-06-21 Thread Ramakrishna (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ramakrishna updated SPARK-44135:

Description: 
 
I have a spark application that is deployed using k8s and it is of version 
3.3.2. Recently there were some vulnerabilities in spark 3.3.2.

I changed my dockerfile to download 3.4.0 instead of 3.3.2 and also my 
application jar is built on spark 3.4.0

However while deploying, I get this error

        

{{Exception in thread "main" java.nio.file.NoSuchFileException: 
/spark-assembly-1.0.jar}}

 

I have this in deployment.yaml of the app

 

{{ mainApplicationFile: "local:spark-assembly-1.0.jar"}}

 

 

 

and I have not changed anything related to that. I see that some code has 
changed in spark 3.4.0 core's source code regarding jar location.

Has it really changed the functionality? Is there anyone who is facing the same 
issue as me? Should the path be specified in a different way?

{{}}

{{}}

  was:
 
I have a spark application that is deployed using k8s and it is of version 
3.3.2. Recently there were some vulnerabilities in spark 3.3.2.

I changed my dockerfile to download 3.4.0 instead of 3.3.2 and also my 
application jar is built on spark 3.4.0

However while deploying, I get this error

        

{{Exception in thread "main" java.nio.file.NoSuchFileException: 
/spark-assembly-1.0.jar}}

{{}}

{{}}

I have this in deployment.yaml of the app

 

{{ mainApplicationFile: "local:spark-assembly-1.0.jar"}}

 

 

 

and I have not changed anything related to that. I see that some code has 
changed in spark 3.4.0 core's source code regarding jar location.

Has it really changed the functionality? Is there anyone who is facing the same 
issue as me? Should the path be specified in a different way?

{{}}

{{}}


> Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" 
> java.nio.file.NoSuchFileException: , although jar is present in the location
> ---
>
> Key: SPARK-44135
> URL: https://issues.apache.org/jira/browse/SPARK-44135
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Ramakrishna
>Priority: Critical
>
>  
> I have a spark application that is deployed using k8s and it is of version 
> 3.3.2. Recently there were some vulnerabilities in spark 3.3.2.
> I changed my dockerfile to download 3.4.0 instead of 3.3.2 and also my 
> application jar is built on spark 3.4.0
> However while deploying, I get this error
>         
> {{Exception in thread "main" java.nio.file.NoSuchFileException: 
> /spark-assembly-1.0.jar}}
>  
> I have this in deployment.yaml of the app
>  
> {{ mainApplicationFile: "local:spark-assembly-1.0.jar"}}
>  
>  
>  
> and I have not changed anything related to that. I see that some code has 
> changed in spark 3.4.0 core's source code regarding jar location.
> Has it really changed the functionality? Is there anyone who is facing the same 
> issue as me? Should the path be specified in a different way?
> {{}}
> {{}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44135) Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" java.nio.file.NoSuchFileException: , although jar is present in the location

2023-06-21 Thread Ramakrishna (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ramakrishna updated SPARK-44135:

Description: 
 
I have a spark application that is deployed using k8s and it is of version 
3.3.2. Recently there were some vulnerabilities in spark 3.3.2.

I changed my dockerfile to download 3.4.0 instead of 3.3.2 and also my 
application jar is built on spark 3.4.0

However while deploying, I get this error

        

*{{Exception in thread "main" java.nio.file.NoSuchFileException: 
/spark-assembly-1.0.jar}}*

 

I have this in deployment.yaml of the app

 

*mainApplicationFile: "local:spark-assembly-1.0.jar"*

 

 

 

 

and I have not changed anything related to that. I see that some code has 
changed in spark 3.4.0 core's source code regarding jar location.

Has it really changed the functionality? Is there anyone who is facing the same 
issue as me? Should the path be specified in a different way?

{{}}

{{}}

  was:
 
I have a spark application that is deployed using k8s and it is of version 
3.3.2. Recently there were some vulnerabilities in spark 3.3.2.

I changed my dockerfile to download 3.4.0 instead of 3.3.2 and also my 
application jar is built on spark 3.4.0

However while deploying, I get this error

        

{{Exception in thread "main" java.nio.file.NoSuchFileException: 
/spark-assembly-1.0.jar}}

 

I have this in deployment.yaml of the app

 

{{ mainApplicationFile: "local:spark-assembly-1.0.jar"}}

 

 

 

and I have not changed anything related to that. I see that some code has 
changed in spark 3.4.0 core's source code regarding jar location.

Has it really changed the functionality? Is there anyone who is facing the same 
issue as me? Should the path be specified in a different way?

{{}}

{{}}


> Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" 
> java.nio.file.NoSuchFileException: , although jar is present in the location
> ---
>
> Key: SPARK-44135
> URL: https://issues.apache.org/jira/browse/SPARK-44135
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Ramakrishna
>Priority: Critical
>
>  
> I have a spark application that is deployed using k8s and it is of version 
> 3.3.2. Recently there were some vulnerabilities in spark 3.3.2.
> I changed my dockerfile to download 3.4.0 instead of 3.3.2 and also my 
> application jar is built on spark 3.4.0
> However while deploying, I get this error
>         
> *{{Exception in thread "main" java.nio.file.NoSuchFileException: 
> /spark-assembly-1.0.jar}}*
>  
> I have this in deployment.yaml of the app
>  
> *mainApplicationFile: "local:spark-assembly-1.0.jar"*
>  
>  
>  
>  
> and I have not changed anything related to that. I see that some code has 
> changed in spark 3.4.0 core's source code regarding jar location.
> Has it really changed the functionality? Is there anyone who is facing the same 
> issue as me? Should the path be specified in a different way?
> {{}}
> {{}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44135) Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" java.nio.file.NoSuchFileException: , although jar is present in the location

2023-06-21 Thread Ramakrishna (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ramakrishna updated SPARK-44135:

Component/s: Spark Core
 (was: Shuffle)

> Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" 
> java.nio.file.NoSuchFileException: , although jar is present in the location
> ---
>
> Key: SPARK-44135
> URL: https://issues.apache.org/jira/browse/SPARK-44135
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Ramakrishna
>Priority: Critical
>
>  
> I have a Spark application that is deployed using k8s, and it is on version 
> 3.3.2. Recently there were some vulnerabilities in Spark 3.3.2.
> I changed my dockerfile to download 3.4.0 instead of 3.3.2 and also my 
> application jar is built on spark 3.4.0
> However while deploying, I get this error
>         
> {{Exception in thread "main" java.nio.file.NoSuchFileException: 
> /spark-assembly-1.0.jar}}
> I have this in deployment.yaml of the app
>  
> {{ mainApplicationFile: "local:spark-assembly-1.0.jar"}}
>  
>  
>  
> and I have not changed anything related to that. I see that some code has 
> changed in spark 3.4.0 core's source code regarding jar location.
> Has the functionality really changed? Is anyone else facing the same issue? 
> Should the path be specified in a different way?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44135) Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" java.nio.file.NoSuchFileException: , although jar is present in the location

2023-06-21 Thread Ramakrishna (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ramakrishna updated SPARK-44135:

Affects Version/s: (was: 3.2.0)

> Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" 
> java.nio.file.NoSuchFileException: , although jar is present in the location
> ---
>
> Key: SPARK-44135
> URL: https://issues.apache.org/jira/browse/SPARK-44135
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 3.4.0
>Reporter: Ramakrishna
>Priority: Critical
>
>  
> I have a Spark application that is deployed using k8s, and it is on version 
> 3.3.2. Recently there were some vulnerabilities in Spark 3.3.2.
> I changed my dockerfile to download 3.4.0 instead of 3.3.2 and also my 
> application jar is built on spark 3.4.0
> However while deploying, I get this error
>         
> {{Exception in thread "main" java.nio.file.NoSuchFileException: 
> /spark-assembly-1.0.jar}}
> I have this in deployment.yaml of the app
>  
> {{ mainApplicationFile: "local:spark-assembly-1.0.jar"}}
>  
>  
>  
> and I have not changed anything related to that. I see that some code has 
> changed in spark 3.4.0 core's source code regarding jar location.
> Has the functionality really changed? Is anyone else facing the same issue? 
> Should the path be specified in a different way?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44135) Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" java.nio.file.NoSuchFileException: , although jar is present in the location

2023-06-21 Thread Ramakrishna (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ramakrishna updated SPARK-44135:

Description: 
 
I have a Spark application that is deployed using k8s, and it is on version 
3.3.2. Recently there were some vulnerabilities in Spark 3.3.2.

I changed my Dockerfile to download 3.4.0 instead of 3.3.2, and my 
application jar is also built on Spark 3.4.0.

However, while deploying, I get this error:

        

{{Exception in thread "main" java.nio.file.NoSuchFileException: 
/spark-assembly-1.0.jar}}


I have this in deployment.yaml of the app

 

{{ mainApplicationFile: "local:spark-assembly-1.0.jar"}}

 

 

 

and I have not changed anything related to that. I see that some code regarding 
jar location has changed in Spark 3.4.0's core source code.

Has the functionality really changed? Is anyone else facing the same issue? 
Should the path be specified in a different way?


> Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" 
> java.nio.file.NoSuchFileException: , although jar is present in the location
> ---
>
> Key: SPARK-44135
> URL: https://issues.apache.org/jira/browse/SPARK-44135
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 3.2.0, 3.4.0
>Reporter: Ramakrishna
>Priority: Critical
>
>  
> I have a Spark application that is deployed using k8s, and it is on version 
> 3.3.2. Recently there were some vulnerabilities in Spark 3.3.2.
> I changed my dockerfile to download 3.4.0 instead of 3.3.2 and also my 
> application jar is built on spark 3.4.0
> However while deploying, I get this error
>         
> {{Exception in thread "main" java.nio.file.NoSuchFileException: 
> /spark-assembly-1.0.jar}}
> I have this in deployment.yaml of the app
>  
> {{ mainApplicationFile: "local:spark-assembly-1.0.jar"}}
>  
>  
>  
> and I have not changed anything related to that. I see that some code has 
> changed in spark 3.4.0 core's source code regarding jar location.
> Has the functionality really changed? Is anyone else facing the same issue? 
> Should the path be specified in a different way?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44135) Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" java.nio.file.NoSuchFileException: , although jar is present in the location

2023-06-21 Thread Ramakrishna (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ramakrishna updated SPARK-44135:

Description: (was: In our production environment, 
_finalizeShuffleMerge_ processing took a longer time (p90 is around 20s) than 
other RPC requests. This is due to _finalizeShuffleMerge_ invoking IO 
operations like truncate and file open/close.

More importantly, processing _finalizeShuffleMerge_ can block other 
critical lightweight messages like authentication, which can cause 
authentication timeouts as well as fetch failures. Those timeouts and fetch 
failures affect the stability of Spark job executions. )

> Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" 
> java.nio.file.NoSuchFileException: , although jar is present in the location
> ---
>
> Key: SPARK-44135
> URL: https://issues.apache.org/jira/browse/SPARK-44135
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 3.2.0, 3.4.0
>Reporter: Ramakrishna
>Priority: Critical
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44135) Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" java.nio.file.NoSuchFileException: , although jar is present in the location

2023-06-21 Thread Ramakrishna (Jira)
Ramakrishna created SPARK-44135:
---

 Summary: Upgrade to spark 3.4.0 from 3.3.2 gives Exception in 
thread "main" java.nio.file.NoSuchFileException: , although jar is present in 
the location
 Key: SPARK-44135
 URL: https://issues.apache.org/jira/browse/SPARK-44135
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Affects Versions: 3.2.0, 3.4.0
Reporter: Ramakrishna


In our production environment, _finalizeShuffleMerge_ processing took a longer 
time (p90 is around 20s) than other RPC requests. This is due to 
_finalizeShuffleMerge_ invoking IO operations like truncate and file 
open/close.

More importantly, processing _finalizeShuffleMerge_ can block other 
critical lightweight messages like authentication, which can cause 
authentication timeouts as well as fetch failures. Those timeouts and fetch 
failures affect the stability of Spark job executions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44056) Improve error message when UDF execution fails

2023-06-21 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-44056:


Assignee: Rob Reeves

> Improve error message when UDF execution fails
> --
>
> Key: SPARK-44056
> URL: https://issues.apache.org/jira/browse/SPARK-44056
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Rob Reeves
>Assignee: Rob Reeves
>Priority: Major
>
> If a user has multiple UDFs defined with the same method signature, it is hard 
> to figure out which one caused the issue from the function class alone. For 
> example, in Spark 3.1.1:
> {code}
> Caused by: org.apache.spark.SparkException: Failed to execute user defined 
> function(UDFRegistration$$Lambda$666/1969461119: (bigint, string) => string)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.subExpr_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.FilterExec.$anonfun$doExecute$3(basicPhysicalOperators.scala:249)
>   at 
> org.apache.spark.sql.execution.FilterExec.$anonfun$doExecute$3$adapted(basicPhysicalOperators.scala:248)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:513)
>   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithKeys_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
>   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:131)
>   at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
>   at org.apache.spark.scheduler.Task.run(Task.scala:131)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:523)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1535)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:526)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.NullPointerException
> This is the end of the stack trace. I didn't truncate it.
> {code}
> If the SQL API is used, the ScalaUDF will have a name. It should be part of 
> the error to help with debugging.
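A minimal sketch of the ambiguity described above (function names, data, and bodies here 
are made up, not taken from the report): two registered UDFs share the (bigint, string) 
=> string signature, so the anonymous lambda class in the error message does not tell 
you which one threw.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Both UDFs share the (Long, String) => String signature. If either one hits a
// null name and throws, the error only identifies an anonymous lambda class
// (UDFRegistration$$Lambda$...), not the registered name.
spark.udf.register("mask_id", (id: Long, s: String) => s.trim + "#" + id)
spark.udf.register("tag_id",  (id: Long, s: String) => s.toUpperCase + ":" + id)

val people = Seq((1L, "alice"), (2L, null.asInstanceOf[String])).toDF("id", "name")
people.selectExpr("mask_id(id, name)", "tag_id(id, name)").show()
{code}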



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44056) Improve error message when UDF execution fails

2023-06-21 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-44056.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41599
[https://github.com/apache/spark/pull/41599]

> Improve error message when UDF execution fails
> --
>
> Key: SPARK-44056
> URL: https://issues.apache.org/jira/browse/SPARK-44056
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Rob Reeves
>Assignee: Rob Reeves
>Priority: Major
> Fix For: 3.5.0
>
>
> If a user has multiple UDFs defined with the same method signature, it is hard 
> to figure out which one caused the issue from the function class alone. For 
> example, in Spark 3.1.1:
> {code}
> Caused by: org.apache.spark.SparkException: Failed to execute user defined 
> function(UDFRegistration$$Lambda$666/1969461119: (bigint, string) => string)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.subExpr_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.FilterExec.$anonfun$doExecute$3(basicPhysicalOperators.scala:249)
>   at 
> org.apache.spark.sql.execution.FilterExec.$anonfun$doExecute$3$adapted(basicPhysicalOperators.scala:248)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:513)
>   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithKeys_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
>   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:131)
>   at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
>   at org.apache.spark.scheduler.Task.run(Task.scala:131)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:523)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1535)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:526)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.NullPointerException
> This is the end of the stack trace. I didn't truncate it.
> {code}
> If the SQL API is used, the ScalaUDF will have a name. It should be part of 
> the error to help with debugging.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44109) Remove duplicate preferred locations of each RDD partition

2023-06-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-44109.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41676
[https://github.com/apache/spark/pull/41676]

> Remove duplicate preferred locations of each RDD partition
> --
>
> Key: SPARK-44109
> URL: https://issues.apache.org/jira/browse/SPARK-44109
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Wan Kun
>Assignee: Wan Kun
>Priority: Major
> Fix For: 3.5.0
>
>
> DAGScheduler will get the preferred locations for each RDD partition and try 
> to allocate the task on the preferred locations.
> We can remove the duplicate preferred locations to save memory.
> For example, if reduce 0 needs to fetch map0 output and map1 output on host-A, 
> then the preferred locations can be Array("host-A").
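Illustrative only (this is not the actual DAGScheduler code): the deduplication amounts 
to collapsing repeated hosts before the locality hint is stored.

{code:scala}
// Map outputs 0 and 1 both live on host-A, so the raw preferred locations repeat.
val rawLocations = Seq("host-A", "host-A")
val deduplicated = rawLocations.distinct   // Seq("host-A"): the same hint, stored once
{code}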



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44109) Remove duplicate preferred locations of each RDD partition

2023-06-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-44109:
-

Assignee: Wan Kun

> Remove duplicate preferred locations of each RDD partition
> --
>
> Key: SPARK-44109
> URL: https://issues.apache.org/jira/browse/SPARK-44109
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Wan Kun
>Assignee: Wan Kun
>Priority: Major
>
> DAGScheduler will get the preferred locations for each RDD partition and try 
> to allocate the task on the preferred locations.
> We can remove the duplicate preferred locations to save memory.
> For example, if reduce 0 needs to fetch map0 output and map1 output on host-A, 
> then the preferred locations can be Array("host-A").



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44042) SPIP: PySpark Test Framework

2023-06-21 Thread Amanda Liu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17735747#comment-17735747
 ] 

Amanda Liu commented on SPARK-44042:


[~ste...@apache.org] Thank you for the comment! I agree that test output 
messages are critical here. Thanks also for the hadoop-api-shim example; that's 
helpful to look at.

> SPIP: PySpark Test Framework
> 
>
> Key: SPARK-44042
> URL: https://issues.apache.org/jira/browse/SPARK-44042
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Amanda Liu
>Priority: Major
>
> Currently, there's no official PySpark test framework, but only various 
> open-source repos and blog posts. Many of these open-source resources are 
> very popular, which demonstrates user-demand for PySpark testing 
> capabilities. 
> [spark-testing-base|https://github.com/holdenk/spark-testing-base] has 1.4k 
> stars, and [chispa|https://github.com/MrPowers/chispa] has 532k 
> downloads/month. However, it can be confusing for users to piece together 
> disparate resources to write their own PySpark tests (see [The Elephant in 
> the Room: How to Write PySpark 
> Tests|https://towardsdatascience.com/the-elephant-in-the-room-how-to-write-pyspark-unit-tests-a5073acabc34]).
>  We can streamline and simplify the testing process by incorporating test 
> features, such as a PySpark Test Base class (which allows tests to share 
> Spark sessions) and test util functions (for example, asserting dataframe and 
> schema equality). Please see the full SPIP document attached: 
> [https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44134) Can't set resources (GPU/FPGA) to 0 when they are set to positive value in spark-defaults.conf

2023-06-21 Thread Thomas Graves (Jira)
Thomas Graves created SPARK-44134:
-

 Summary: Can't set resources (GPU/FPGA) to 0 when they are set to 
positive value in spark-defaults.conf
 Key: SPARK-44134
 URL: https://issues.apache.org/jira/browse/SPARK-44134
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.2.0
Reporter: Thomas Graves


With resource aware scheduling, if you specify a default value in the 
spark-defaults.conf, a user can't override that to set it to 0.

Meaning spark-defaults.conf has something like:

{{spark.executor.resource.\{resourceName}.amount=1}}

{{spark.task.resource.\{resourceName}.amount=1}}

If the user tries to override when submitting an application with 
{{spark.executor.resource.\{resourceName}.amount=0}} and 
{{spark.task.resource.\{resourceName}.amount=0}}, it gives the 
user an error:
{code:java}
23/06/21 09:12:57 ERROR Main: Failed to initialize Spark session.
org.apache.spark.SparkException: No executor resource configs were not 
specified for the following task configs: gpu
        at 
org.apache.spark.resource.ResourceProfile.calculateTasksAndLimitingResource(ResourceProfile.scala:206)
        at 
org.apache.spark.resource.ResourceProfile.$anonfun$limitingResource$1(ResourceProfile.scala:139)
        at scala.Option.getOrElse(Option.scala:189)
        at 
org.apache.spark.resource.ResourceProfile.limitingResource(ResourceProfile.scala:138)
        at 
org.apache.spark.resource.ResourceProfileManager.addResourceProfile(ResourceProfileManager.scala:95)
        at 
org.apache.spark.resource.ResourceProfileManager.(ResourceProfileManager.scala:49)
        at org.apache.spark.SparkContext.(SparkContext.scala:455)
        at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2704)
        at 
org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:953){code}
This used to work; my guess is that it may have gotten broken by the 
stage-level scheduling feature.
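For concreteness, a sketch of the override being attempted (the resource name gpu is 
assumed); with a positive amount in spark-defaults.conf, this currently fails during 
SparkSession creation with the error above instead of disabling the resource:

{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Try to zero out the GPU resources that spark-defaults.conf sets to 1.
val conf = new SparkConf()
  .set("spark.executor.resource.gpu.amount", "0")
  .set("spark.task.resource.gpu.amount", "0")

// Fails today in ResourceProfile.calculateTasksAndLimitingResource.
val spark = SparkSession.builder().config(conf).getOrCreate()
{code}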



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44134) Can't set resources (GPU/FPGA) to 0 when they are set to positive value in spark-defaults.conf

2023-06-21 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17735746#comment-17735746
 ] 

Thomas Graves commented on SPARK-44134:
---

I'm working on a fix for this

> Can't set resources (GPU/FPGA) to 0 when they are set to positive value in 
> spark-defaults.conf
> --
>
> Key: SPARK-44134
> URL: https://issues.apache.org/jira/browse/SPARK-44134
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Thomas Graves
>Priority: Major
>
> With resource aware scheduling, if you specify a default value in the 
> spark-defaults.conf, a user can't override that to set it to 0.
> Meaning spark-defaults.conf has something like:
> {{spark.executor.resource.\{resourceName}.amount=1}}
> {{spark.task.resource.\{resourceName}.amount=1}}
> If the user tries to override when submitting an application with 
> {{spark.executor.resource.\{resourceName}.amount=0}} and 
> {{spark.task.resource.\{resourceName}.amount=0}}, it gives the 
> user an error:
> {code:java}
> 23/06/21 09:12:57 ERROR Main: Failed to initialize Spark session.
> org.apache.spark.SparkException: No executor resource configs were not 
> specified for the following task configs: gpu
>         at 
> org.apache.spark.resource.ResourceProfile.calculateTasksAndLimitingResource(ResourceProfile.scala:206)
>         at 
> org.apache.spark.resource.ResourceProfile.$anonfun$limitingResource$1(ResourceProfile.scala:139)
>         at scala.Option.getOrElse(Option.scala:189)
>         at 
> org.apache.spark.resource.ResourceProfile.limitingResource(ResourceProfile.scala:138)
>         at 
> org.apache.spark.resource.ResourceProfileManager.addResourceProfile(ResourceProfileManager.scala:95)
>         at 
> org.apache.spark.resource.ResourceProfileManager.(ResourceProfileManager.scala:49)
>         at org.apache.spark.SparkContext.(SparkContext.scala:455)
>         at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2704)
>         at 
> org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:953){code}
> This used to work; my guess is that it may have gotten broken by the 
> stage-level scheduling feature.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41599) Memory leak in FileSystem.CACHE when submitting apps to secure cluster using InProcessLauncher

2023-06-21 Thread Xieming Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17735743#comment-17735743
 ] 

Xieming Li commented on SPARK-41599:


Hi, [~ste...@apache.org], Thank you very much for your advice.

After reviewing the code, I think the following PR should be able to fix the 
issue.

May I ask you to take a look at it when you have time, please?

[https://github.com/apache/spark/pull/41692]

 

> Memory leak in FileSystem.CACHE when submitting apps to secure cluster using 
> InProcessLauncher
> --
>
> Key: SPARK-41599
> URL: https://issues.apache.org/jira/browse/SPARK-41599
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, YARN
>Affects Versions: 3.1.2
>Reporter: Maciej Smolenski
>Priority: Major
> Attachments: InProcLaunchFsIssue.scala, 
> SPARK-41599-fixes-to-limit-FileSystem-CACHE-size-when-using-InProcessLauncher.diff
>
>
> When submitting a Spark application in a Kerberos environment, the credentials of 
> the 'current user' (UserGroupInformation.getCurrentUser()) are modified.
> FileSystem.CACHE entries contain the 'current user' (with user credentials) as a 
> key.
> Submitting many Spark applications using InProcessLauncher causes 
> FileSystem.CACHE to grow bigger and bigger.
> Finally the process exits because of an OutOfMemory error.
> Code for reproduction attached.
>  
> Output from running 'jmap -histo' on the reproduction JVM shows that the number 
> of FileSystem$Cache$Key instances increases over time:
> time: #instances class
> 1671533274: 2 org.apache.hadoop.fs.FileSystem$Cache$Key
> 167155: 11 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671533395: 21 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671533455: 30 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671533515: 39 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671533576: 48 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671533636: 57 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671533696: 66 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671533757: 75 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671533817: 84 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671533877: 93 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671533937: 102 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671533998: 111 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671534058: 120 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671534118: 135 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671534178: 140 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671534239: 150 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671534299: 159 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671534359: 168 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671534419: 177 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671534480: 186 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671534540: 195 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671534600: 204 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671534661: 213 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671534721: 222 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671534781: 231 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671534841: 240 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671534902: 249 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671534962: 257 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671535022: 264 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671535083: 273 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671535143: 282 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671535203: 291 org.apache.hadoop.fs.FileSystem$Cache$Key
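Separate from the fix in the pull request above, a common mitigation sketch for this 
kind of growth is to evict the cache entries tied to each submission's UGI once the 
in-process launch completes (this assumes the launcher creates a fresh 
UserGroupInformation per submission; it is a workaround sketch, not the patch itself):

{code:scala}
import java.security.PrivilegedExceptionAction
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.security.UserGroupInformation

// Run one submission under its own UGI, then drop the FileSystem.CACHE entries
// keyed by that UGI so FileSystem$Cache$Key instances cannot pile up.
def submitAndCleanUp(ugi: UserGroupInformation)(submit: => Unit): Unit = {
  try {
    ugi.doAs(new PrivilegedExceptionAction[Unit] {
      override def run(): Unit = submit
    })
  } finally {
    FileSystem.closeAllForUGI(ugi)
  }
}
{code}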



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42740) Fix the bug that pushdown offset or paging is invalid for some built-in dialect

2023-06-21 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng resolved SPARK-42740.

Resolution: Resolved

> Fix the bug that pushdown offset or paging is invalid for some built-in 
> dialect 
> 
>
> Key: SPARK-42740
> URL: https://issues.apache.org/jira/browse/SPARK-42740
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> Currently, the default pushed-down offset is expressed as OFFSET n, but some built-in 
> dialects don't support that syntax. So when Spark pushes the offset down into 
> these databases, they throw errors.
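For readers unfamiliar with the feature, a sketch of the kind of query involved (the 
connection URL and table name are assumptions): when the offset is pushed down, Spark 
emits an OFFSET clause against the remote database, which not every built-in dialect 
accepts.

{code:scala}
val orders = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://db-host:3306/shop")   // hypothetical connection
  .option("dbtable", "orders")
  .option("pushDownOffset", "true")                   // explicitly enable OFFSET pushdown
  .load()

// With pushdown enabled, the skipped rows are expressed as an OFFSET clause in the
// generated query rather than being filtered on the Spark side.
val page = orders.orderBy("id").offset(10).limit(20)
{code}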



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42740) Fix the bug that pushdown offset or paging is invalid for some built-in dialect

2023-06-21 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17735688#comment-17735688
 ] 

jiaan.geng commented on SPARK-42740:


Resolved by https://github.com/apache/spark/pull/40359

> Fix the bug that pushdown offset or paging is invalid for some built-in 
> dialect 
> 
>
> Key: SPARK-42740
> URL: https://issues.apache.org/jira/browse/SPARK-42740
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> Currently, the default pushed-down offset is expressed as OFFSET n, but some built-in 
> dialects don't support that syntax. So when Spark pushes the offset down into 
> these databases, they throw errors.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44042) SPIP: PySpark Test Framework

2023-06-21 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17735646#comment-17735646
 ] 

Steve Loughran commented on SPARK-44042:


* You can create an independent git repo for this (ASF self service) and so 
have a different release timetable and get it out to older versions. 
* This does require testing across versions, but separate modules can do that; 
see https://github.com/apache/hadoop-api-shim for an example. The base module has 
the tests, and a separate mvn module runs those tests against a given version.
* Testing test runners is fun.
* Anything which can be done to provide meaningful reports, stack traces, *and 
set a good example to users* is critical. The design goal should be (for all 
test frameworks, IMO): "does the output report on its own provide enough 
information to diagnose and fix the failure?", which the bare junit "assert 
failure at line 308" doesn't.

> SPIP: PySpark Test Framework
> 
>
> Key: SPARK-44042
> URL: https://issues.apache.org/jira/browse/SPARK-44042
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Amanda Liu
>Priority: Major
>
> Currently, there's no official PySpark test framework, but only various 
> open-source repos and blog posts. Many of these open-source resources are 
> very popular, which demonstrates user-demand for PySpark testing 
> capabilities. 
> [spark-testing-base|https://github.com/holdenk/spark-testing-base] has 1.4k 
> stars, and [chispa|https://github.com/MrPowers/chispa] has 532k 
> downloads/month. However, it can be confusing for users to piece together 
> disparate resources to write their own PySpark tests (see [The Elephant in 
> the Room: How to Write PySpark 
> Tests|https://towardsdatascience.com/the-elephant-in-the-room-how-to-write-pyspark-unit-tests-a5073acabc34]).
>  We can streamline and simplify the testing process by incorporating test 
> features, such as a PySpark Test Base class (which allows tests to share 
> Spark sessions) and test util functions (for example, asserting dataframe and 
> schema equality). Please see the full SPIP document attached: 
> [https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43903) Improve ArrayType input support in Arrow Python UDF

2023-06-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17735623#comment-17735623
 ] 

ASF GitHub Bot commented on SPARK-43903:


User 'xinrong-meng' has created a pull request for this issue:
https://github.com/apache/spark/pull/41603

> Improve ArrayType input support in Arrow Python UDF
> ---
>
> Key: SPARK-43903
> URL: https://issues.apache.org/jira/browse/SPARK-43903
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43903) Improve ArrayType input support in Arrow Python UDF

2023-06-21 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-43903.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41603
[https://github.com/apache/spark/pull/41603]

> Improve ArrayType input support in Arrow Python UDF
> ---
>
> Key: SPARK-43903
> URL: https://issues.apache.org/jira/browse/SPARK-43903
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43903) Improve ArrayType input support in Arrow Python UDF

2023-06-21 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-43903:
-

Assignee: Xinrong Meng

> Improve ArrayType input support in Arrow Python UDF
> ---
>
> Key: SPARK-43903
> URL: https://issues.apache.org/jira/browse/SPARK-43903
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44133) Upgrade MyPy from 0.920 to 0.982

2023-06-21 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-44133:


 Summary: Upgrade MyPy from 0.920 to 0.982
 Key: SPARK-44133
 URL: https://issues.apache.org/jira/browse/SPARK-44133
 Project: Spark
  Issue Type: Task
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Hyukjin Kwon


We're using MyPy 0.920. We should use a newer version that includes bug fixes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44132) nesting full outer joins confuses code generator

2023-06-21 Thread Steven Aerts (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Aerts updated SPARK-44132:
-
Description: 
We are seeing issues with the code generator when querying java bean encoded 
data with 2 nested joins.
{code:java}
dsA.join(dsB, seq("id"), "full_outer").join(dsC, seq("id"), "full_outer"); 
{code}
will generate invalid code in the code generator and, depending on the 
data used, can produce stack traces like:
{code:java}
 Caused by: java.lang.NullPointerException
        at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown
 Source)
        at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.wholestagecodegen_findNextJoinRows_0$(Unknown
 Source)
        at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown
 Source)
        at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
{code}
Or:
{code:java}
 Caused by: java.lang.AssertionError: index (2) should < 2
        at 
org.apache.spark.sql.catalyst.expressions.UnsafeRow.assertIndexIsValid(UnsafeRow.java:118)
        at 
org.apache.spark.sql.catalyst.expressions.UnsafeRow.isNullAt(UnsafeRow.java:315)
        at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown
 Source)
        at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown
 Source)
        at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
{code}
When we look at the generated code we see that the code generator seems to be 
mixing up parameters.  For example:
{code:java}
if (smj_leftOutputRow_0 != null) {  //< null check 
for wrong/left parameter
  boolean smj_isNull_12 = smj_rightOutputRow_0.isNullAt(1); //< causes NPE 
on right parameter here{code}
It is as if the nesting of 2 full outer joins confuses the code 
generator, causing it to generate invalid code.

There is one other strange thing.  We found this issue when using data sets 
which were using the java bean encoder.  We tried to reproduce this in the 
spark shell or using scala case classes but were unable to do so. 

We made a reproduction scenario as unit tests (one for each of the stacktrace 
above) on the spark code base and made it available as a [pull 
request|https://github.com/apache/spark/pull/41688] to this case.

  was:
We are seeing issues with the code generator when querying java bean encoded 
data with 2 nested joins.
{code:java}
dsA.join(dsB, seq("id"), "full_outer").join(dsC, seq("id"), "full_outer"); 
{code}
will generate invalid code in the code generator and, depending on the 
data used, can produce stack traces like:
{code:java}
 Caused by: java.lang.NullPointerException
        at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown
 Source)
        at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.wholestagecodegen_findNextJoinRows_0$(Unknown
 Source)
        at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown
 Source)
        at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
{code}
Or:
{code:java}
 Caused by: java.lang.AssertionError: index (2) should < 2
        at 
org.apache.spark.sql.catalyst.expressions.UnsafeRow.assertIndexIsValid(UnsafeRow.java:118)
        at 
org.apache.spark.sql.catalyst.expressions.UnsafeRow.isNullAt(UnsafeRow.java:315)
        at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown
 Source)
        at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown
 Source)
        at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
{code}
When we look at the generated code we see that the code generator seems to be 
mixing up parameters.  For example:
{code:java}
if (smj_leftOutputRow_0 != null) {  //< null check 
for wrong/left parameter
  boolean smj_isNull_12 = smj_rightOutputRow_0.isNullAt(1); //< causes NPE 
on right parameter here{code}
It is as if the nesting of 2 full outer joins confuses the code 
generator, causing it to generate invalid code.

There is one other strange thing.  We found this issue when using data sets 
which were using the java bean encoder.  We tried to reproduce this in the 
spark shell or using scala case classes but were unable to do so. 

We made a reproduction scenario as unit 

[jira] [Updated] (SPARK-44132) nesting full outer joins confuses code generator

2023-06-21 Thread Steven Aerts (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Aerts updated SPARK-44132:
-
Description: 
We are seeing issues with the code generator when querying java bean encoded 
data with 2 nested joins.
{code:java}
dsA.join(dsB, seq("id"), "full_outer").join(dsC, seq("id"), "full_outer"); 
{code}
will generate invalid code in the code generator and, depending on the 
data used, can produce stack traces like:
{code:java}
 Caused by: java.lang.NullPointerException
        at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown
 Source)
        at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.wholestagecodegen_findNextJoinRows_0$(Unknown
 Source)
        at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown
 Source)
        at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
{code}
Or:
{code:java}
 Caused by: java.lang.AssertionError: index (2) should < 2
        at 
org.apache.spark.sql.catalyst.expressions.UnsafeRow.assertIndexIsValid(UnsafeRow.java:118)
        at 
org.apache.spark.sql.catalyst.expressions.UnsafeRow.isNullAt(UnsafeRow.java:315)
        at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown
 Source)
        at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown
 Source)
        at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
{code}
When we look at the generated code we see that the code generator seems to be 
mixing up parameters.  For example:
{code:java}
if (smj_leftOutputRow_0 != null) {  //< null check 
for wrong/left parameter
  boolean smj_isNull_12 = smj_rightOutputRow_0.isNullAt(1); //< causes NPE 
on right parameter here{code}
It is as if the nesting of 2 full outer joins confuses the code 
generator, causing it to generate invalid code.

There is one other strange thing.  We found this issue when using data sets 
which were using the java bean encoder.  We tried to reproduce this in the 
spark shell or using scala case classes but were unable to do so. 

We made a reproduction scenario as unit tests (one for each of the stack traces 
above) on the Spark code base and made it available as a [pull 
request|https://github.com/apache/spark/pull/41688] to this case.

  was:
We are seeing issues with the code generator when querying java bean encoded 
data with 2 nested joins.
{code:java}
dsA.join(dsB, seq("id"), "full_outer").join(dsC, seq("id"), "full_outer"); 
{code}
will generate invalid code in the code generator and, depending on the 
data used, can produce stack traces like:
{code:java}
 Caused by: java.lang.NullPointerException
        at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown
 Source)
        at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.wholestagecodegen_findNextJoinRows_0$(Unknown
 Source)
        at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown
 Source)
        at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
{code}
Or:
{code:java}
 Caused by: java.lang.AssertionError: index (2) should < 2
        at 
org.apache.spark.sql.catalyst.expressions.UnsafeRow.assertIndexIsValid(UnsafeRow.java:118)
        at 
org.apache.spark.sql.catalyst.expressions.UnsafeRow.isNullAt(UnsafeRow.java:315)
        at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown
 Source)
        at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown
 Source)
        at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
{code}
When we look at the generated code we see that the code generator seems to be 
mixing up parameters.  For example:
{code:java}
if (smj_leftOutputRow_0 != null) {  //< null check 
for wrong/left parameter
  boolean smj_isNull_12 = smj_rightOutputRow_0.isNullAt(1); //< causes NPE 
on right parameter here{code}
It is as if the nesting of 2 full outer joins confuses the code 
generator, causing it to generate invalid code.

There is one other strange thing.  We found this issue when using data sets 
which were using the java bean encoder.  We tried to reproduce this in the 
spark shell or using scala case classes but were unable to do so. 

We made a reproduction scenario as unit 

[jira] [Updated] (SPARK-44132) nesting full outer joins confuses code generator

2023-06-21 Thread Steven Aerts (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Aerts updated SPARK-44132:
-
Summary: nesting full outer joins confuses code generator  (was: nesting 
full outer joins confuses confuses code generator)

> nesting full outer joins confuses code generator
> 
>
> Key: SPARK-44132
> URL: https://issues.apache.org/jira/browse/SPARK-44132
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.4.0, 3.5.0
> Environment: We verified the existence of this bug from spark 3.3 
> until spark 3.5.
>Reporter: Steven Aerts
>Priority: Major
>
> We are seeing issues with the code generator when querying java bean encoded 
> data with 2 nested joins.
> {code:java}
> dsA.join(dsB, seq("id"), "full_outer").join(dsC, seq("id"), "full_outer"); 
> {code}
> will generate invalid code in the code generator and, depending on the 
> data used, can produce stack traces like:
> {code:java}
>  Caused by: java.lang.NullPointerException
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown
>  Source)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.wholestagecodegen_findNextJoinRows_0$(Unknown
>  Source)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown
>  Source)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> {code}
> Or:
> {code:java}
>  Caused by: java.lang.AssertionError: index (2) should < 2
>         at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.assertIndexIsValid(UnsafeRow.java:118)
>         at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.isNullAt(UnsafeRow.java:315)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown
>  Source)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown
>  Source)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> {code}
> When we look at the generated code we see that the code generator seems to be 
> mixing up parameters.  For example:
> {code:java}
> if (smj_leftOutputRow_0 != null) {  //< null 
> check for wrong/left parameter
>   boolean smj_isNull_12 = smj_rightOutputRow_0.isNullAt(1); //< causes 
> NPE on right parameter here{code}
> It is as if the nesting of 2 full outer joins confuses the code 
> generator, causing it to generate invalid code.
> There is one other strange thing.  We found this issue when using data sets 
> which were using the java bean encoder.  We tried to reproduce this in the 
> spark shell or using scala case classes but were unable to do so. 
> We made a reproduction scenario as unit tests (one for each of the stacktrace 
> above) on the spark code base and will make it available as a pull request to 
> this case.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44004) Assign name & improve error message for frequent LEGACY errors.

2023-06-21 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-44004:
-
Fix Version/s: 3.5.0
   (was: 4.0.0)

> Assign name & improve error message for frequent LEGACY errors.
> ---
>
> Key: SPARK-44004
> URL: https://issues.apache.org/jira/browse/SPARK-44004
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.5.0
>
>
> This addresses _LEGACY_ERROR_TEMP_1333, _LEGACY_ERROR_TEMP_2331, 
> _LEGACY_ERROR_TEMP_0023, _LEGACY_ERROR_TEMP_1157, _LEGACY_ERROR_TEMP_2308, 
> _LEGACY_ERROR_TEMP_1051, _LEGACY_ERROR_TEMP_1029, _LEGACY_ERROR_TEMP_1318



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44004) Assign name & improve error message for frequent LEGACY errors.

2023-06-21 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-44004:


Assignee: Haejoon Lee

> Assign name & improve error message for frequent LEGACY errors.
> ---
>
> Key: SPARK-44004
> URL: https://issues.apache.org/jira/browse/SPARK-44004
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> This addresses _LEGACY_ERROR_TEMP_1333, _LEGACY_ERROR_TEMP_2331, 
> _LEGACY_ERROR_TEMP_0023, _LEGACY_ERROR_TEMP_1157, _LEGACY_ERROR_TEMP_2308, 
> _LEGACY_ERROR_TEMP_1051, _LEGACY_ERROR_TEMP_1029, _LEGACY_ERROR_TEMP_1318



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44004) Assign name & improve error message for frequent LEGACY errors.

2023-06-21 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-44004.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 41504
[https://github.com/apache/spark/pull/41504]

> Assign name & improve error message for frequent LEGACY errors.
> ---
>
> Key: SPARK-44004
> URL: https://issues.apache.org/jira/browse/SPARK-44004
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 4.0.0
>
>
> This addresses _LEGACY_ERROR_TEMP_1333, _LEGACY_ERROR_TEMP_2331, 
> _LEGACY_ERROR_TEMP_0023, _LEGACY_ERROR_TEMP_1157, _LEGACY_ERROR_TEMP_2308, 
> _LEGACY_ERROR_TEMP_1051, _LEGACY_ERROR_TEMP_1029, _LEGACY_ERROR_TEMP_1318



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44132) nesting full outer joins confuses confuses code generator

2023-06-21 Thread Steven Aerts (Jira)
Steven Aerts created SPARK-44132:


 Summary: nesting full outer joins confuses confuses code generator
 Key: SPARK-44132
 URL: https://issues.apache.org/jira/browse/SPARK-44132
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.0, 3.3.0, 3.5.0
 Environment: We verified the existence of this bug from spark 3.3 
until spark 3.5.
Reporter: Steven Aerts


We are seeing issues with the code generator when querying java bean encoded 
data with 2 nested joins.
{code:java}
dsA.join(dsB, seq("id"), "full_outer").join(dsC, seq("id"), "full_outer"); 
{code}
will generate invalid code in the code generator and, depending on the 
data used, can produce stack traces like:
{code:java}
 Caused by: java.lang.NullPointerException
        at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown
 Source)
        at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.wholestagecodegen_findNextJoinRows_0$(Unknown
 Source)
        at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown
 Source)
        at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
{code}
Or:
{code:java}
 Caused by: java.lang.AssertionError: index (2) should < 2
        at 
org.apache.spark.sql.catalyst.expressions.UnsafeRow.assertIndexIsValid(UnsafeRow.java:118)
        at 
org.apache.spark.sql.catalyst.expressions.UnsafeRow.isNullAt(UnsafeRow.java:315)
        at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown
 Source)
        at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown
 Source)
        at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
{code}
When we look at the generated code we see that the code generator seems to be 
mixing up parameters.  For example:
{code:java}
if (smj_leftOutputRow_0 != null) {  //< null check 
for wrong/left parameter
  boolean smj_isNull_12 = smj_rightOutputRow_0.isNullAt(1); //< causes NPE 
on right parameter here{code}
It is as if the nesting of 2 full outer joins confuses the code 
generator, causing it to generate invalid code.

There is one other strange thing.  We found this issue when using data sets 
which were using the java bean encoder.  We tried to reproduce this in the 
spark shell or using scala case classes but were unable to do so. 

We made a reproduction scenario as unit tests (one for each of the stacktrace 
above) on the spark code base and will make it available as a pull request to 
this case.
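To make the bean-encoder part of the setup concrete, here is a sketch of its shape 
(class, field, and value names are assumptions; the actual reproduction tests live in 
the pull request mentioned above). The point is that the three datasets are 
bean-encoded rather than product-encoded before the two nested full outer joins:

{code:scala}
import scala.beans.BeanProperty
import org.apache.spark.sql.{Dataset, Encoders, SparkSession}

// Hypothetical JavaBean-style class: public no-arg constructor plus
// getters/setters, which is what Encoders.bean expects.
class Event {
  @BeanProperty var id: Long = 0L
  @BeanProperty var value: String = _
}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
val enc = Encoders.bean(classOf[Event])

def mk(ids: Long*): Dataset[Event] =
  spark.createDataset(ids.map { i => val e = new Event; e.setId(i); e })(enc)

val dsA = mk(1L, 2L)
val dsB = mk(2L, 3L)
val dsC = mk(3L, 4L)

// Two nested full outer joins on the shared key; per the report this is what
// drives whole-stage codegen into emitting the invalid code shown above.
dsA.join(dsB, Seq("id"), "full_outer")
   .join(dsC, Seq("id"), "full_outer")
   .collect()
{code}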



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-44107) Hide unsupported Column methods from auto-completion

2023-06-21 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reopened SPARK-44107:
---

> Hide unsupported Column methods from auto-completion
> 
>
> Key: SPARK-44107
> URL: https://issues.apache.org/jira/browse/SPARK-44107
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44107) Hide unsupported Column methods from auto-completion

2023-06-21 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-44107.
---
Resolution: Fixed

> Hide unsupported Column methods from auto-completion
> 
>
> Key: SPARK-44107
> URL: https://issues.apache.org/jira/browse/SPARK-44107
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44107) Hide unsupported Column methods from auto-completion

2023-06-21 Thread Ruifeng Zheng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17735566#comment-17735566
 ] 

Ruifeng Zheng commented on SPARK-44107:
---

resolved in https://github.com/apache/spark/pull/41675

> Hide unsupported Column methods from auto-completion
> 
>
> Key: SPARK-44107
> URL: https://issues.apache.org/jira/browse/SPARK-44107
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44107) Hide unsupported Column methods from auto-completion

2023-06-21 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-44107.
---
Resolution: Resolved

> Hide unsupported Column methods from auto-completion
> 
>
> Key: SPARK-44107
> URL: https://issues.apache.org/jira/browse/SPARK-44107
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44106) Add `__repr__` for `GroupedData`

2023-06-21 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-44106.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41674
[https://github.com/apache/spark/pull/41674]

> Add `__repr__` for `GroupedData`
> 
>
> Key: SPARK-44106
> URL: https://issues.apache.org/jira/browse/SPARK-44106
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44106) Add `__repr__` for `GroupedData`

2023-06-21 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-44106:
-

Assignee: Ruifeng Zheng

> Add `__repr__` for `GroupedData`
> 
>
> Key: SPARK-44106
> URL: https://issues.apache.org/jira/browse/SPARK-44106
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44114) Upgrade built-in Hive to 3+

2023-06-21 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17735557#comment-17735557
 ] 

Yuming Wang commented on SPARK-44114:
-

Upgrading Hive would require a lot of work, but it would not significantly 
improve the user experience.

> Upgrade built-in Hive to 3+
> ---
>
> Key: SPARK-44114
> URL: https://issues.apache.org/jira/browse/SPARK-44114
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, SQL
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43511) Implemented State APIs for Spark Connect Scala

2023-06-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-43511.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41558
[https://github.com/apache/spark/pull/41558]

> Implemented State APIs for Spark Connect Scala
> --
>
> Key: SPARK-43511
> URL: https://issues.apache.org/jira/browse/SPARK-43511
> Project: Spark
>  Issue Type: Task
>  Components: Connect, Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Bo Gao
>Assignee: Bo Gao
>Priority: Major
> Fix For: 3.5.0
>
>
> Implemented MapGroupsWithState and FlatMapGroupsWithState APIs for Spark 
> Connect Scala



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43511) Implemented State APIs for Spark Connect Scala

2023-06-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-43511:


Assignee: Bo Gao

> Implemented State APIs for Spark Connect Scala
> --
>
> Key: SPARK-43511
> URL: https://issues.apache.org/jira/browse/SPARK-43511
> Project: Spark
>  Issue Type: Task
>  Components: Connect, Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Bo Gao
>Assignee: Bo Gao
>Priority: Major
>
> Implemented MapGroupsWithState and FlatMapGroupsWithState APIs for Spark 
> Connect Scala



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org