[jira] [Resolved] (SPARK-43926) Add array_agg, array_size, cardinality, count_min_sketch,mask,named_struct,json_* to Scala and Python

2023-06-29 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-43926.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41718
[https://github.com/apache/spark/pull/41718]

> Add array_agg, array_size, cardinality, 
> count_min_sketch,mask,named_struct,json_* to Scala and Python
> -
>
> Key: SPARK-43926
> URL: https://issues.apache.org/jira/browse/SPARK-43926
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark, SQL
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Assignee: Tengfei Huang
>Priority: Major
> Fix For: 3.5.0
>
>
> Add the following functions (see the usage sketch after this list):
> * array_agg
> * array_size
> * cardinality
> * count_min_sketch
> * named_struct
> * json_array_length
> * json_object_keys
> * mask
>   to:
> * Scala API
> * Python API
> * Spark Connect Scala Client
> * Spark Connect Python Client
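
For reference, a minimal usage sketch of the new functions, assuming a Spark
3.5+ session where they are exposed in org.apache.spark.sql.functions (the
Python API in pyspark.sql.functions mirrors the same names):

{noformat}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val df = Seq((Seq("a", "b", "c"), """{"k1": 1, "k2": 2}""", "AbCD-1234"))
  .toDF("arr", "js", "card")

df.select(
  array_size($"arr").as("arr_size"),        // 3
  cardinality($"arr").as("arr_card"),       // same as array_size for arrays
  json_array_length(lit("[1, 2, 3]")).as("json_len"),
  json_object_keys($"js").as("json_keys"),  // ["k1", "k2"]
  mask($"card").as("masked"),               // letters/digits replaced with defaults
  named_struct(lit("k"), $"card").as("s")   // struct with a single field "k"
).show()

// array_agg and count_min_sketch are aggregate functions.
df.agg(array_agg($"card").as("all_cards")).show()
{noformat}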






[jira] [Assigned] (SPARK-43926) Add array_agg, array_size, cardinality, count_min_sketch,mask,named_struct,json_* to Scala and Python

2023-06-29 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-43926:
-

Assignee: Tengfei Huang

> Add array_agg, array_size, cardinality, 
> count_min_sketch,mask,named_struct,json_* to Scala and Python
> -
>
> Key: SPARK-43926
> URL: https://issues.apache.org/jira/browse/SPARK-43926
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark, SQL
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Assignee: Tengfei Huang
>Priority: Major
>
> Add the following functions:
> * array_agg
> * array_size
> * cardinality
> * count_min_sketch
> * named_struct
> * json_array_length
> * json_object_keys
> * mask
>   to:
> * Scala API
> * Python API
> * Spark Connect Scala Client
> * Spark Connect Python Client






[jira] [Updated] (SPARK-44259) Make `connect-jvm-client` module pass except arrow-related ones in Java 21

2023-06-29 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-44259:
-
Summary: Make `connect-jvm-client` module pass except arrow-related ones in 
Java 21  (was: Ignore all Arrow-based connect tests for Java 21)

> Make `connect-jvm-client` module pass except arrow-related ones in Java 21
> --
>
> Key: SPARK-44259
> URL: https://issues.apache.org/jira/browse/SPARK-44259
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Major
>







[jira] [Commented] (SPARK-44223) Drop leveldb support

2023-06-29 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17738867#comment-17738867
 ] 

Dongjoon Hyun commented on SPARK-44223:
---

+1

> Drop leveldb support
> 
>
> Key: SPARK-44223
> URL: https://issues.apache.org/jira/browse/SPARK-44223
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Major
>
> The leveldb project no longer appears to be maintained, and we can always 
> replace it with rocksdb. I think we can remove leveldb support and its 
> dependencies in Spark 4.0.






[jira] [Commented] (SPARK-44124) Upgrade AWS SDK to v2

2023-06-29 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17738866#comment-17738866
 ] 

Dongjoon Hyun commented on SPARK-44124:
---

Hi, [~cltlfcjin]. We want to align with Apache Hadoop.

> Upgrade AWS SDK to v2
> -
>
> Key: SPARK-44124
> URL: https://issues.apache.org/jira/browse/SPARK-44124
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>







[jira] [Commented] (SPARK-44124) Upgrade AWS SDK to v2

2023-06-29 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17738865#comment-17738865
 ] 

Lantao Jin commented on SPARK-44124:


Hi [~dongjoon], this is Lantao from AWS. We have recently been working on 
upgrading the AWS SDK to v2 for Spark, and I am going to take this issue. I will 
upload a document next week.

> Upgrade AWS SDK to v2
> -
>
> Key: SPARK-44124
> URL: https://issues.apache.org/jira/browse/SPARK-44124
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>







[jira] [Commented] (SPARK-44259) Ignore all Arrow-based connect tests for Java 21

2023-06-29 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17738864#comment-17738864
 ] 

Snoot.io commented on SPARK-44259:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/41805

> Ignore all Arrow-based connect tests for Java 21
> 
>
> Key: SPARK-44259
> URL: https://issues.apache.org/jira/browse/SPARK-44259
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Major
>







[jira] [Created] (SPARK-44260) Assign names to the error class _LEGACY_ERROR_TEMP_[1215-1245-2329] & Use checkError() to check Exception in *CharVarchar*Suite

2023-06-29 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-44260:
---

 Summary: Assign names to the error class 
_LEGACY_ERROR_TEMP_[1215-1245-2329] & Use checkError() to check Exception in 
*CharVarchar*Suite
 Key: SPARK-44260
 URL: https://issues.apache.org/jira/browse/SPARK-44260
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.5.0
Reporter: BingKun Pan









[jira] [Created] (SPARK-44259) Ignore all Arrow-based connect tests for Java 21

2023-06-29 Thread Yang Jie (Jira)
Yang Jie created SPARK-44259:


 Summary: Ignore all Arrow-based connect tests for Java 21
 Key: SPARK-44259
 URL: https://issues.apache.org/jira/browse/SPARK-44259
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.5.0
Reporter: Yang Jie









[jira] [Updated] (SPARK-43974) Upgrade buf to v1.23.0

2023-06-29 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-43974:

Summary: Upgrade buf to v1.23.0  (was: Upgrade buf to v1.22.0)

> Upgrade buf to v1.23.0
> --
>
> Key: SPARK-43974
> URL: https://issues.apache.org/jira/browse/SPARK-43974
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Priority: Minor
>







[jira] [Resolved] (SPARK-44090) Add a Java 21 build task in `build_and_test.yml`

2023-06-29 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie resolved SPARK-44090.
--
Resolution: Duplicate

> Add a Java 21 build task in `build_and_test.yml`
> 
>
> Key: SPARK-44090
> URL: https://issues.apache.org/jira/browse/SPARK-44090
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Major
>







[jira] [Created] (SPARK-44258) Move Metadata to sql/api

2023-06-29 Thread Rui Wang (Jira)
Rui Wang created SPARK-44258:


 Summary: Move Metadata to sql/api
 Key: SPARK-44258
 URL: https://issues.apache.org/jira/browse/SPARK-44258
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.5.0
Reporter: Rui Wang
Assignee: Rui Wang









[jira] [Commented] (SPARK-44257) Update some maven plugins & scalafmt to newest version

2023-06-29 Thread BingKun Pan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17738838#comment-17738838
 ] 

BingKun Pan commented on SPARK-44257:
-

I will work on it.

> Update some maven plugins & scalafmt to newest version
> --
>
> Key: SPARK-44257
> URL: https://issues.apache.org/jira/browse/SPARK-44257
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Priority: Minor
>







[jira] [Created] (SPARK-44257) Update some maven plugins & scalafmt to newest version

2023-06-29 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-44257:
---

 Summary: Update some maven plugins & scalafmt to newest version
 Key: SPARK-44257
 URL: https://issues.apache.org/jira/browse/SPARK-44257
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.5.0
Reporter: BingKun Pan









[jira] [Created] (SPARK-44256) Upgrade rocksdbjni to 8.3.2

2023-06-29 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-44256:
---

 Summary: Upgrade rocksdbjni to 8.3.2
 Key: SPARK-44256
 URL: https://issues.apache.org/jira/browse/SPARK-44256
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.5.0
Reporter: BingKun Pan









[jira] [Updated] (SPARK-44150) Explicit Arrow casting for mismatched return type in Arrow Python UDF

2023-06-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-44150:
--
Fix Version/s: (was: 3.5.0)

> Explicit Arrow casting for mismatched return type in Arrow Python UDF
> -
>
> Key: SPARK-44150
> URL: https://issues.apache.org/jira/browse/SPARK-44150
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Xinrong Meng
>Priority: Major
>







[jira] [Reopened] (SPARK-44150) Explicit Arrow casting for mismatched return type in Arrow Python UDF

2023-06-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reopened SPARK-44150:
---
  Assignee: (was: Xinrong Meng)

> Explicit Arrow casting for mismatched return type in Arrow Python UDF
> -
>
> Key: SPARK-44150
> URL: https://issues.apache.org/jira/browse/SPARK-44150
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Xinrong Meng
>Priority: Major
> Fix For: 3.5.0
>
>







[jira] [Commented] (SPARK-44252) Add error class for the case when loading state from DFS fails

2023-06-29 Thread Lucy Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17738807#comment-17738807
 ] 

Lucy Yao commented on SPARK-44252:
--

I'm working on this

> Add error class for the case when loading state from DFS fails
> --
>
> Key: SPARK-44252
> URL: https://issues.apache.org/jira/browse/SPARK-44252
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Lucy Yao
>Priority: Major
>
> This is part of https://github.com/apache/spark/pull/41705.
> Wrap the exception thrown while loading state so that an error class can be 
> assigned properly. By assigning error classes we can categorize the errors, 
> which helps us determine which errors customers struggle with the most. 
> StateStoreProvider.getStore() and StateStoreProvider.getReadStore() are the 
> entry points.
> This ticket also covers failedToReadDeltaFileError and 
> failedToReadSnapshotFileError from 
> https://issues.apache.org/jira/browse/SPARK-36305.
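
As a rough illustration of the wrapping described above (the helper and error
class name below are hypothetical assumptions, not the actual Spark code):

{noformat}
import org.apache.spark.SparkException

// Hypothetical helper: wrap DFS read failures raised while loading state so a
// stable error class is attached and failures can be classified later.
def withStateLoadErrorClass[T](version: Long)(loadFromDfs: => T): T = {
  try {
    loadFromDfs   // e.g. reading delta/snapshot files for the state store
  } catch {
    case e: java.io.IOException =>
      throw new SparkException(
        errorClass = "CANNOT_LOAD_STATE_STORE",                 // assumed name
        messageParameters = Map("version" -> version.toString),
        cause = e)
  }
}
{noformat}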






[jira] [Resolved] (SPARK-44248) Kafka Source v2 should return preferred locations

2023-06-29 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-44248.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41790
[https://github.com/apache/spark/pull/41790]

> Kafka Source v2 should return preferred locations
> -
>
> Key: SPARK-44248
> URL: https://issues.apache.org/jira/browse/SPARK-44248
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Siying Dong
>Assignee: Siying Dong
>Priority: Major
> Fix For: 3.5.0
>
>
> The DSv2 Kafka streaming source seems to miss setting the preferred location, 
> which defeats the purpose of the cache for the Kafka consumer (connection) and 
> the fetched data.
> For DSv1, we set the preferred location in the RDD.
> For DSv2, the information should be provided in the input partition, but we do 
> not add it to KafkaBatchInputPartition.
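
For illustration, a rough sketch of how a DSv2 input partition can expose
preferred locations (the class and field names here are assumptions, not the
actual KafkaBatchInputPartition):

{noformat}
import org.apache.spark.sql.connector.read.InputPartition

// Sketch: an input partition that reports the executor expected to hold the
// cached Kafka consumer, so the scheduler can co-locate the task with it.
case class KafkaInputPartitionSketch(
    topic: String,
    partition: Int,
    startOffset: Long,
    endOffset: Long,
    preferredLoc: Option[String]) extends InputPartition {

  override def preferredLocations(): Array[String] = preferredLoc.toArray
}
{noformat}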






[jira] [Assigned] (SPARK-44248) Kafka Source v2 should return preferred locations

2023-06-29 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-44248:


Assignee: Siying Dong

> Kafka Source v2 should return preferred locations
> -
>
> Key: SPARK-44248
> URL: https://issues.apache.org/jira/browse/SPARK-44248
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Siying Dong
>Assignee: Siying Dong
>Priority: Major
>
> The DSv2 Kafka streaming source seems to miss setting the preferred location, 
> which defeats the purpose of the cache for the Kafka consumer (connection) and 
> the fetched data.
> For DSv1, we set the preferred location in the RDD.
> For DSv2, the information should be provided in the input partition, but we do 
> not add it to KafkaBatchInputPartition.






[jira] [Created] (SPARK-44255) Relocate StorageLevel to common/utils

2023-06-29 Thread Rui Wang (Jira)
Rui Wang created SPARK-44255:


 Summary: Relocate StorageLevel to common/utils
 Key: SPARK-44255
 URL: https://issues.apache.org/jira/browse/SPARK-44255
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0
Reporter: Rui Wang
Assignee: Rui Wang









[jira] [Updated] (SPARK-44255) Relocate StorageLevel to common/utils

2023-06-29 Thread Rui Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Wang updated SPARK-44255:
-
Issue Type: Task  (was: Bug)

> Relocate StorageLevel to common/utils
> -
>
> Key: SPARK-44255
> URL: https://issues.apache.org/jira/browse/SPARK-44255
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
>







[jira] [Updated] (SPARK-44253) Potential memory leak when temp views created from DF created by structured streaming

2023-06-29 Thread Attila Zsolt Piros (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-44253:
---
Description: 
If the user registers a temporary view from a dataframe created by Structured 
Streaming and tries to drop the temporary view via their original SparkSession, 
memory will leak.

The reason is that Structured Streaming has its own SparkSession (a clone of the 
original SparkSession; for details see 
https://issues.apache.org/jira/browse/SPARK-26586 and 
https://github.com/apache/spark/blob/branch-3.4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L193-L194):
the created temporary view belongs to the cloned SparkSession, so the temporary 
view must also be dropped via the cloned SparkSession.

Example for the {*}memory leak{*}:
{noformat}
streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  val view = s"tempView_$batchId"
  batchDF.createOrReplaceTempView(view)
  ...
  spark.catalog.dropTempView(view)
}
{noformat}
*Workaround* (the _dropTempView_ must be called on SparkSession accessed from 
dataframe created by streaming):
{noformat}
streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  val view = s"tempView_$batchId"
  batchDF.createOrReplaceTempView(view)
  ...
  batchDF.sparkSession.catalog.dropTempView(view)
 }
{noformat}
h4. Example heap dump

The SparkSession with the leak:

!1.png|width=807,height=120!

The two SparkSession instances, where the first one is the original 
SparkSession created by the user and the second is the clone:
!2.png|width=813,height=157!

  was:
If the user registers a temporary view from a dataframe created by Structured 
Streaming and tries to drop the temporary view via their original SparkSession, 
memory will leak.

The reason is that Structured Streaming has its own SparkSession (a clone of the 
original SparkSession; for details see 
https://issues.apache.org/jira/browse/SPARK-26586 and 
https://github.com/apache/spark/blob/branch-3.4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L193-L194):
the created temporary view belongs to the cloned SparkSession, so the temporary 
view must also be dropped via the cloned SparkSession.

Example for the {*}memory leak{*}:
{noformat}
streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  val view = s"tempView_$batchId"
  batchDF.createOrReplaceTempView(view)
  ...
  spark.catalog.dropTempView(view)
}
{noformat}
*Workaround* (the _dropTempView_ must be called on SparkSession accessed from 
dataframe created by streaming):
{noformat}
streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  val view = s"tempView_$batchId"
  batchDF.createOrReplaceTempView(view)
  ...
  batchDF.sparkSession.catalog.dropTempView(view)
 }
{noformat}
h4. Example heap dump

The SparkSession with the leak:

!1.png|width=807,height=120!

The two SparkSession instances, where the first one is the original 
SparkSession created by the user and the second is the clone:
!2.png|width=813,height=157!


> Potential memory leak when temp views created from DF created by structured 
> streaming
> -
>
> Key: SPARK-44253
> URL: https://issues.apache.org/jira/browse/SPARK-44253
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.8, 3.0.3, 3.1.3, 3.2.4, 3.3.2, 3.4.0, 3.4.1
>Reporter: Attila Zsolt Piros
>Priority: Major
> Attachments: 1.png, 2.png
>
>
> If the user registers a temporary view from a dataframe created by Structured 
> Streaming and tries to drop the temporary view via their original SparkSession, 
> memory will leak.
> The reason is that Structured Streaming has its own SparkSession (a clone of the 
> original SparkSession; for details see 
> https://issues.apache.org/jira/browse/SPARK-26586 and 
> https://github.com/apache/spark/blob/branch-3.4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L193-L194):
> the created temporary view belongs to the cloned SparkSession, so the temporary 
> view must also be dropped via the cloned SparkSession.
> Example for the {*}memory leak{*}:
> {noformat}
> streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
>   val view = s"tempView_$batchId"
>   batchDF.createOrReplaceTempView(view)
>   ...
>   spark.catalog.dropTempView(view)
> }
> {noformat}
> *Workaround* (the _dropTempView_ must be called on SparkSession accessed from 
> dataframe created by streaming):
> {noformat}
> streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
>   val view = s"tempView_$batchId"

[jira] [Updated] (SPARK-44253) Potential memory leak when temp views created from DF created by structured streaming

2023-06-29 Thread Attila Zsolt Piros (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-44253:
---
Description: 
If the user registers a temporary view from a dataframe created by Structured 
Streaming and tries to drop the temporary view via their original SparkSession, 
memory will leak.

The reason is that Structured Streaming has its own SparkSession (a clone of the 
original SparkSession; for details see 
https://issues.apache.org/jira/browse/SPARK-26586 and 
https://github.com/apache/spark/blob/branch-3.4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L193-L194):
the created temporary view belongs to the cloned SparkSession, so the temporary 
view must also be dropped via the cloned SparkSession.

Example for the {*}memory leak{*}:
{noformat}
streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  val view = s"tempView_$batchId"
  batchDF.createOrReplaceTempView(view)
  ...
  spark.catalog.dropTempView(view)
}
{noformat}
*Workaround* (the _dropTempView_ must be called on SparkSession accessed from 
dataframe created by streaming):
{noformat}
streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  val view = s"tempView_$batchId"
  batchDF.createOrReplaceTempView(view)
  ...
  batchDF.sparkSession.catalog.dropTempView(view)
 }
{noformat}
h4. Example heap dump

The SparkSession with the leak:

!1.png|width=807,height=120!

The two SparkSession instances, where the first one is the original 
SparkSession created by the user and the second is the clone:
!2.png|width=813,height=157!

  was:
If the user registers a temporary view from a dataframe created by Structured 
Streaming and tries to drop the temporary view via their original SparkSession, 
memory will leak.

The reason is that Structured Streaming has its own SparkSession (a clone of the 
original SparkSession; for details see 
https://issues.apache.org/jira/browse/SPARK-26586 and 
https://github.com/apache/spark/blob/branch-3.4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L193-L194):
the created temporary view belongs to the cloned SparkSession, so the temporary 
view must also be dropped via the cloned SparkSession.

Example for the *memory leak*:

{noformat}
streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  val view = s"tempView_$batchId"
  batchDF.createOrReplaceTempView(view)
  ...
  spark.catalog.dropTempView(view)
}
{noformat}

*Workaround* (the _dropTempView_ must be called on SparkSession accessed from 
dataframe created by streaming):
{noformat}
streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  val view = s"tempView_$batchId"
  batchDF.createOrReplaceTempView(view)
  ...
  batchDF.sparkSession.catalog.dropTempView(view)
 }
{noformat}

Example heapdump:
 !1.png! 








> Potential memory leak when temp views created from DF created by structured 
> streaming
> -
>
> Key: SPARK-44253
> URL: https://issues.apache.org/jira/browse/SPARK-44253
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.8, 3.0.3, 3.1.3, 3.2.4, 3.3.2, 3.4.0, 3.4.1
>Reporter: Attila Zsolt Piros
>Priority: Major
> Attachments: 1.png, 2.png
>
>
> If the user registers a temporary view from a dataframe created by Structured 
> Streaming and tries to drop the temporary view via their original SparkSession, 
> memory will leak.
> The reason is that Structured Streaming has its own SparkSession (a clone of the 
> original SparkSession; for details see 
> https://issues.apache.org/jira/browse/SPARK-26586 and 
> https://github.com/apache/spark/blob/branch-3.4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L193-L194):
> the created temporary view belongs to the cloned SparkSession, so the temporary 
> view must also be dropped via the cloned SparkSession.
> Example for the {*}memory leak{*}:
> {noformat}
> streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
>   val view = s"tempView_$batchId"
>   batchDF.createOrReplaceTempView(view)
>   ...
>   spark.catalog.dropTempView(view)
> }
> {noformat}
> *Workaround* (the _dropTempView_ must be called on SparkSession accessed from 
> dataframe created by streaming):
> {noformat}
> streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
>   val view = s"tempView_$batchId"
>   batchDF.createOrReplaceTempView(view)
>   ...
>   batchDF.sparkSession.catalog.dropTempView(view)
>  }
> {noformat}
> h4. Example heap dump
> The SparkSession with the leak:
> !1.png|width=807,height=120!
> The 

[jira] [Created] (SPARK-44254) Move QueryExecutionErrors to sql/api

2023-06-29 Thread Rui Wang (Jira)
Rui Wang created SPARK-44254:


 Summary: Move QueryExecutionErrors to sql/api
 Key: SPARK-44254
 URL: https://issues.apache.org/jira/browse/SPARK-44254
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.5.0
Reporter: Rui Wang
Assignee: Rui Wang









[jira] [Updated] (SPARK-44253) Potential memory leak when temp views created from DF created by structured streaming

2023-06-29 Thread Attila Zsolt Piros (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-44253:
---
Attachment: 2.png

> Potential memory leak when temp views created from DF created by structured 
> streaming
> -
>
> Key: SPARK-44253
> URL: https://issues.apache.org/jira/browse/SPARK-44253
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.8, 3.0.3, 3.1.3, 3.2.4, 3.3.2, 3.4.0, 3.4.1
>Reporter: Attila Zsolt Piros
>Priority: Major
> Attachments: 1.png, 2.png
>
>
> If the user registers a temporary view from a dataframe created by Structured 
> Streaming and tries to drop the temporary view via their original SparkSession, 
> memory will leak.
> The reason is that Structured Streaming has its own SparkSession (a clone of the 
> original SparkSession; for details see 
> https://issues.apache.org/jira/browse/SPARK-26586 and 
> https://github.com/apache/spark/blob/branch-3.4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L193-L194):
> the created temporary view belongs to the cloned SparkSession, so the temporary 
> view must also be dropped via the cloned SparkSession.
> Example for the *memory leak*:
> {noformat}
> streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
>   val view = s"tempView_$batchId"
>   batchDF.createOrReplaceTempView(view)
>   ...
>   spark.catalog.dropTempView(view)
> }
> {noformat}
> *Workaround* (the _dropTempView_ must be called on SparkSession accessed from 
> dataframe created by streaming):
> {noformat}
> streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
>   val view = s"tempView_$batchId"
>   batchDF.createOrReplaceTempView(view)
>   ...
>   batchDF.sparkSession.catalog.dropTempView(view)
>  }
> {noformat}






[jira] [Updated] (SPARK-44253) Potential memory leak when temp views created from DF created by structured streaming

2023-06-29 Thread Attila Zsolt Piros (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-44253:
---
Description: 
If the user registers a temporary view from a dataframe created by Structured 
Streaming and tries to drop the temporary view via their original SparkSession, 
memory will leak.

The reason is that Structured Streaming has its own SparkSession (a clone of the 
original SparkSession; for details see 
https://issues.apache.org/jira/browse/SPARK-26586 and 
https://github.com/apache/spark/blob/branch-3.4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L193-L194):
the created temporary view belongs to the cloned SparkSession, so the temporary 
view must also be dropped via the cloned SparkSession.

Example for the *memory leak*:

{noformat}
streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  val view = s"tempView_$batchId"
  batchDF.createOrReplaceTempView(view)
  ...
  spark.catalog.dropTempView(view)
}
{noformat}

*Workaround* (the _dropTempView_ must be called on SparkSession accessed from 
dataframe created by streaming):
{noformat}
streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  val view = s"tempView_$batchId"
  batchDF.createOrReplaceTempView(view)
  ...
  batchDF.sparkSession.catalog.dropTempView(view)
 }
{noformat}

Example heapdump:
 !1.png! 







  was:
If the user registers a temporary view from a dataframe created by Structured 
Streaming and tries to drop the temporary view via their original SparkSession, 
memory will leak.

The reason is that Structured Streaming has its own SparkSession (a clone of the 
original SparkSession; for details see 
https://issues.apache.org/jira/browse/SPARK-26586 and 
https://github.com/apache/spark/blob/branch-3.4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L193-L194):
the created temporary view belongs to the cloned SparkSession, so the temporary 
view must also be dropped via the cloned SparkSession.

Example for the *memory leak*:

{noformat}
streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  val view = s"tempView_$batchId"
  batchDF.createOrReplaceTempView(view)
  ...
  spark.catalog.dropTempView(view)
}
{noformat}

*Workaround* (the _dropTempView_ must be called on SparkSession accessed from 
dataframe created by streaming):
{noformat}
streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  val view = s"tempView_$batchId"
  batchDF.createOrReplaceTempView(view)
  ...
  batchDF.sparkSession.catalog.dropTempView(view)
 }
{noformat}










> Potential memory leak when temp views created from DF created by structured 
> streaming
> -
>
> Key: SPARK-44253
> URL: https://issues.apache.org/jira/browse/SPARK-44253
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.8, 3.0.3, 3.1.3, 3.2.4, 3.3.2, 3.4.0, 3.4.1
>Reporter: Attila Zsolt Piros
>Priority: Major
> Attachments: 1.png, 2.png
>
>
> If the user registers a temporary view from a dataframe created by Structured 
> Streaming and tries to drop the temporary view via their original SparkSession, 
> memory will leak.
> The reason is that Structured Streaming has its own SparkSession (a clone of the 
> original SparkSession; for details see 
> https://issues.apache.org/jira/browse/SPARK-26586 and 
> https://github.com/apache/spark/blob/branch-3.4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L193-L194):
> the created temporary view belongs to the cloned SparkSession, so the temporary 
> view must also be dropped via the cloned SparkSession.
> Example for the *memory leak*:
> {noformat}
> streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
>   val view = s"tempView_$batchId"
>   batchDF.createOrReplaceTempView(view)
>   ...
>   spark.catalog.dropTempView(view)
> }
> {noformat}
> *Workaround* (the _dropTempView_ must be called on SparkSession accessed from 
> dataframe created by streaming):
> {noformat}
> streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
>   val view = s"tempView_$batchId"
>   batchDF.createOrReplaceTempView(view)
>   ...
>   batchDF.sparkSession.catalog.dropTempView(view)
>  }
> {noformat}
> Example heapdump:
>  !1.png! 






[jira] [Created] (SPARK-44253) Potential memory leak when temp views created from DF created by structured streaming

2023-06-29 Thread Attila Zsolt Piros (Jira)
Attila Zsolt Piros created SPARK-44253:
--

 Summary: Potential memory leak when temp views created from DF 
created by structured streaming
 Key: SPARK-44253
 URL: https://issues.apache.org/jira/browse/SPARK-44253
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 3.4.1, 3.4.0, 3.3.2, 3.2.4, 3.1.3, 3.0.3, 2.4.8
Reporter: Attila Zsolt Piros
 Attachments: 1.png

If the user registers a temporary view from a dataframe created by Structured 
Streaming and tries to drop the temporary view via their original SparkSession, 
memory will leak.

The reason is that Structured Streaming has its own SparkSession (a clone of the 
original SparkSession; for details see 
https://issues.apache.org/jira/browse/SPARK-26586 and 
https://github.com/apache/spark/blob/branch-3.4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L193-L194):
the created temporary view belongs to the cloned SparkSession, so the temporary 
view must also be dropped via the cloned SparkSession.

Example for the *memory leak*:

{noformat}
streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  val view = s"tempView_$batchId"
  batchDF.createOrReplaceTempView(view)
  ...
  spark.catalog.dropTempView(view)
}
{noformat}

*Workaround* (the _dropTempView_ must be called on SparkSession accessed from 
dataframe created by streaming):
{noformat}
streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  val view = s"tempView_$batchId"
  batchDF.createOrReplaceTempView(view)
  ...
  batchDF.sparkSession.catalog.dropTempView(view)
 }
{noformat}














[jira] [Updated] (SPARK-44253) Potential memory leak when temp views created from DF created by structured streaming

2023-06-29 Thread Attila Zsolt Piros (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-44253:
---
Attachment: 1.png

> Potential memory leak when temp views created from DF created by structured 
> streaming
> -
>
> Key: SPARK-44253
> URL: https://issues.apache.org/jira/browse/SPARK-44253
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.8, 3.0.3, 3.1.3, 3.2.4, 3.3.2, 3.4.0, 3.4.1
>Reporter: Attila Zsolt Piros
>Priority: Major
> Attachments: 1.png
>
>
> If the user registers a temporary view from a dataframe created by Structured 
> Streaming and tries to drop the temporary view via their original SparkSession, 
> memory will leak.
> The reason is that Structured Streaming has its own SparkSession (a clone of the 
> original SparkSession; for details see 
> https://issues.apache.org/jira/browse/SPARK-26586 and 
> https://github.com/apache/spark/blob/branch-3.4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L193-L194):
> the created temporary view belongs to the cloned SparkSession, so the temporary 
> view must also be dropped via the cloned SparkSession.
> Example for the *memory leak*:
> {noformat}
> streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
>   val view = s"tempView_$batchId"
>   batchDF.createOrReplaceTempView(view)
>   ...
>   spark.catalog.dropTempView(view)
> }
> {noformat}
> *Workaround* (the _dropTempView_ must be called on SparkSession accessed from 
> dataframe created by streaming):
> {noformat}
> streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
>   val view = s"tempView_$batchId"
>   batchDF.createOrReplaceTempView(view)
>   ...
>   batchDF.sparkSession.catalog.dropTempView(view)
>  }
> {noformat}






[jira] [Updated] (SPARK-44252) Add error class for the case when loading state from DFS fails

2023-06-29 Thread Lucy Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lucy Yao updated SPARK-44252:
-
Description: 
This is part of https://github.com/apache/spark/pull/41705.

Wrap the exception thrown while loading state so that an error class can be 
assigned properly. By assigning error classes we can categorize the errors, 
which helps us determine which errors customers struggle with the most. 
StateStoreProvider.getStore() and StateStoreProvider.getReadStore() are the 
entry points.

This ticket also covers failedToReadDeltaFileError and 
failedToReadSnapshotFileError from 
https://issues.apache.org/jira/browse/SPARK-36305.

  was:
Wrap the exception thrown while loading state so that an error class can be 
assigned properly. By assigning error classes we can categorize the errors, 
which helps us determine which errors customers struggle with the most. 
StateStoreProvider.getStore() and StateStoreProvider.getReadStore() are the 
entry points.

This ticket also covers failedToReadDeltaFileError and 
failedToReadSnapshotFileError from 
https://issues.apache.org/jira/browse/SPARK-36305.


> Add error class for the case when loading state from DFS fails
> --
>
> Key: SPARK-44252
> URL: https://issues.apache.org/jira/browse/SPARK-44252
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Lucy Yao
>Priority: Major
>
> This is part of https://github.com/apache/spark/pull/41705.
> Wrap the exception thrown while loading state so that an error class can be 
> assigned properly. By assigning error classes we can categorize the errors, 
> which helps us determine which errors customers struggle with the most. 
> StateStoreProvider.getStore() and StateStoreProvider.getReadStore() are the 
> entry points.
> This ticket also covers failedToReadDeltaFileError and 
> failedToReadSnapshotFileError from 
> https://issues.apache.org/jira/browse/SPARK-36305.






[jira] [Created] (SPARK-44252) Add error class for the case when loading state from DFS fails

2023-06-29 Thread Lucy Yao (Jira)
Lucy Yao created SPARK-44252:


 Summary: Add error class for the case when loading state from DFS 
fails
 Key: SPARK-44252
 URL: https://issues.apache.org/jira/browse/SPARK-44252
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core, SQL
Affects Versions: 3.2.0
Reporter: Lucy Yao


Wrap the exception thrown while loading state so that an error class can be 
assigned properly. By assigning error classes we can categorize the errors, 
which helps us determine which errors customers struggle with the most. 
StateStoreProvider.getStore() and StateStoreProvider.getReadStore() are the 
entry points.

This ticket also covers failedToReadDeltaFileError and 
failedToReadSnapshotFileError from 
https://issues.apache.org/jira/browse/SPARK-36305.






[jira] [Commented] (SPARK-44251) Potential for incorrect results or NPE when full outer USING join has null key value

2023-06-29 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17738762#comment-17738762
 ] 

Bruce Robbins commented on SPARK-44251:
---

This is similar to, but not quite the same as SPARK-43718, and the fix will be 
similar too.

I will make a PR shortly.
 

> Potential for incorrect results or NPE when full outer USING join has null 
> key value
> 
>
> Key: SPARK-44251
> URL: https://issues.apache.org/jira/browse/SPARK-44251
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>
> The following query produces incorrect results:
> {noformat}
> create or replace temp view v1 as values (1, 2), (null, 7) as (c1, c2);
> create or replace temp view v2 as values (2, 3) as (c1, c2);
> select explode(array(c1)) as x
> from v1
> full outer join v2
> using (c1);
> -1   <== should be null
> 1
> 2
> {noformat}
> The following query fails with a {{NullPointerException}}:
> {noformat}
> create or replace temp view v1 as values ('1', 2), (null, 7) as (c1, c2);
> create or replace temp view v2 as values ('2', 3) as (c1, c2);
> select explode(array(c1)) as x
> from v1
> full outer join v2
> using (c1);
> 23/06/25 17:06:39 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 11)
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.generate_doConsume_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.smj_consumeFullOuterJoinRow_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.wholestagecodegen_findNextJoinRows_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
> ...
> {noformat}






[jira] [Updated] (SPARK-44251) Potential for incorrect results or NPE when full outer USING join has null key value

2023-06-29 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-44251:
--
Summary: Potential for incorrect results or NPE when full outer USING join 
has null key value  (was: Potentially incorrect results or NPE when full outer 
USING join has null key value)

> Potential for incorrect results or NPE when full outer USING join has null 
> key value
> 
>
> Key: SPARK-44251
> URL: https://issues.apache.org/jira/browse/SPARK-44251
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>
> The following query produces incorrect results:
> {noformat}
> create or replace temp view v1 as values (1, 2), (null, 7) as (c1, c2);
> create or replace temp view v2 as values (2, 3) as (c1, c2);
> select explode(array(c1)) as x
> from v1
> full outer join v2
> using (c1);
> -1   <== should be null
> 1
> 2
> {noformat}
> The following query fails with a {{NullPointerException}}:
> {noformat}
> create or replace temp view v1 as values ('1', 2), (null, 7) as (c1, c2);
> create or replace temp view v2 as values ('2', 3) as (c1, c2);
> select explode(array(c1)) as x
> from v1
> full outer join v2
> using (c1);
> 23/06/25 17:06:39 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 11)
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.generate_doConsume_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.smj_consumeFullOuterJoinRow_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.wholestagecodegen_findNextJoinRows_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
> ...
> {noformat}






[jira] [Created] (SPARK-44251) Potentially incorrect results or NPE when full outer USING join has null key value

2023-06-29 Thread Bruce Robbins (Jira)
Bruce Robbins created SPARK-44251:
-

 Summary: Potentially incorrect results or NPE when full outer 
USING join has null key value
 Key: SPARK-44251
 URL: https://issues.apache.org/jira/browse/SPARK-44251
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0
Reporter: Bruce Robbins


The following query produces incorrect results:
{noformat}
create or replace temp view v1 as values (1, 2), (null, 7) as (c1, c2);
create or replace temp view v2 as values (2, 3) as (c1, c2);

select explode(array(c1)) as x
from v1
full outer join v2
using (c1);

-1   <== should be null
1
2
{noformat}
The following query fails with a {{NullPointerException}}:
{noformat}
create or replace temp view v1 as values ('1', 2), (null, 7) as (c1, c2);
create or replace temp view v2 as values ('2', 3) as (c1, c2);

select explode(array(c1)) as x
from v1
full outer join v2
using (c1);

23/06/25 17:06:39 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 11)
java.lang.NullPointerException
at 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.generate_doConsume_0$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.smj_consumeFullOuterJoinRow_0$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.wholestagecodegen_findNextJoinRows_0$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
...
{noformat}







[jira] [Created] (SPARK-44250) Implement classification evaluator

2023-06-29 Thread Weichen Xu (Jira)
Weichen Xu created SPARK-44250:
--

 Summary: Implement classification evaluator
 Key: SPARK-44250
 URL: https://issues.apache.org/jira/browse/SPARK-44250
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, ML
Affects Versions: 3.5.0
Reporter: Weichen Xu


Implement classification evaluator






[jira] [Assigned] (SPARK-44250) Implement classification evaluator

2023-06-29 Thread Weichen Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu reassigned SPARK-44250:
--

Assignee: Weichen Xu

> Implement classification evaluator
> --
>
> Key: SPARK-44250
> URL: https://issues.apache.org/jira/browse/SPARK-44250
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, ML
>Affects Versions: 3.5.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Major
>
> Implement classification evaluator






[jira] [Created] (SPARK-44249) Refactor PythonUDTFRunner to send its return type separately

2023-06-29 Thread Takuya Ueshin (Jira)
Takuya Ueshin created SPARK-44249:
-

 Summary: Refactor PythonUDTFRunner to send its return type 
separately
 Key: SPARK-44249
 URL: https://issues.apache.org/jira/browse/SPARK-44249
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Takuya Ueshin









[jira] [Assigned] (SPARK-44150) Explicit Arrow casting for mismatched return type in Arrow Python UDF

2023-06-29 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng reassigned SPARK-44150:


Assignee: Xinrong Meng

> Explicit Arrow casting for mismatched return type in Arrow Python UDF
> -
>
> Key: SPARK-44150
> URL: https://issues.apache.org/jira/browse/SPARK-44150
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>







[jira] [Resolved] (SPARK-44150) Explicit Arrow casting for mismatched return type in Arrow Python UDF

2023-06-29 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng resolved SPARK-44150.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41503
[https://github.com/apache/spark/pull/41503]

> Explicit Arrow casting for mismatched return type in Arrow Python UDF
> -
>
> Key: SPARK-44150
> URL: https://issues.apache.org/jira/browse/SPARK-44150
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.5.0
>
>







[jira] [Created] (SPARK-44248) Kafka Source v2 should return preferred locations

2023-06-29 Thread Siying Dong (Jira)
Siying Dong created SPARK-44248:
---

 Summary: Kafka Source v2 should return preferred locations
 Key: SPARK-44248
 URL: https://issues.apache.org/jira/browse/SPARK-44248
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 3.5.0
Reporter: Siying Dong


The DSv2 Kafka streaming source seems to miss setting the preferred location, 
which defeats the purpose of the cache for the Kafka consumer (connection) and 
the fetched data.

For DSv1, we set the preferred location in the RDD.

For DSv2, the information should be provided in the input partition, but we do 
not add it to KafkaBatchInputPartition.






[jira] [Comment Edited] (SPARK-40176) Enhance collapse window optimization to work in case partition or order by keys are expressions

2023-06-29 Thread Ayushi Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17738718#comment-17738718
 ] 

Ayushi Agarwal edited comment on SPARK-40176 at 6/29/23 6:10 PM:
-

This issue is partially solved by 
https://issues.apache.org/jira/browse/SPARK-41805. The remaining cases are being 
solved in https://issues.apache.org/jira/browse/SPARK-42588.


was (Author: ayaga):
This solves this issue partially 
https://issues.apache.org/jira/browse/SPARK-41805, there still remains few cases

> Enhance collapse window optimization to work in case partition or order by 
> keys are expressions
> ---
>
> Key: SPARK-40176
> URL: https://issues.apache.org/jira/browse/SPARK-40176
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1, 3.3.0
>Reporter: Ayushi Agarwal
>Priority: Major
>
> In a window operator with multiple window functions, if any expression is 
> present in the partition-by or sort-order columns, the windows are not collapsed 
> even when the partition and order-by expressions are the same for all of those 
> window functions.
> E.g. query:
> val w = Window.partitionBy("key").orderBy(lower(col("value")))
> df.select(lead("key", 1).over(w), lead("value", 1).over(w))
> Current Plan:
> - Window(lead(value,1), key, _w1)            -- W1
>   - Sort(key, _w1)
>     - Project(lower("value") as _w1)         -- P1
>       - Window(lead(key,1), key, _w0)        -- W2
>         - Sort(key, _w0)
>           - Exchange(key)
>             - Project(lower("value") as _w0) -- P2
>               - Scan
>  
> W1 and W2 can be merged into a single window.
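
Under the current behavior, a workaround sketch in Scala is to pre-project the 
order-by expression into a concrete column so both window specifications refer to 
the same attribute; the column name value_lower below is just an illustration.
{code:java}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lead, lower}

// Workaround sketch: materialize the order-by expression once, then order by the
// resulting column so both lead() calls share an identical window specification.
val withLower = df.withColumn("value_lower", lower(col("value")))
val w = Window.partitionBy("key").orderBy(col("value_lower"))
val result = withLower.select(lead("key", 1).over(w), lead("value", 1).over(w))
{code}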



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40176) Enhance collapse window optimization to work in case partition or order by keys are expressions

2023-06-29 Thread Ayushi Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17738718#comment-17738718
 ] 

Ayushi Agarwal commented on SPARK-40176:


This issue is partially solved by 
https://issues.apache.org/jira/browse/SPARK-41805; a few cases still remain.

> Enhance collapse window optimization to work in case partition or order by 
> keys are expressions
> ---
>
> Key: SPARK-40176
> URL: https://issues.apache.org/jira/browse/SPARK-40176
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1, 3.3.0
>Reporter: Ayushi Agarwal
>Priority: Major
>
> In a window operator with multiple window functions, if any expression is 
> present in the partition-by or sort-order columns, the windows are not collapsed 
> even when the partition and order-by expressions are the same for all of those 
> window functions.
> E.g. query:
> val w = Window.partitionBy("key").orderBy(lower(col("value")))
> df.select(lead("key", 1).over(w), lead("value", 1).over(w))
> Current Plan:
> - Window(lead(value,1), key, _w1)            -- W1
>   - Sort(key, _w1)
>     - Project(lower("value") as _w1)         -- P1
>       - Window(lead(key,1), key, _w0)        -- W2
>         - Sort(key, _w0)
>           - Exchange(key)
>             - Project(lower("value") as _w0) -- P2
>               - Scan
>  
> W1 and W2 can be merged into a single window.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44247) Upgrade Arrow to 13.0.0

2023-06-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-44247:
--
Description: 
According to Apache Arrow release history, 13.0.0 is expected to be released on 
August 2023.
 - [https://arrow.apache.org/release/]
 -- [12.0.0 (2 May 2023)|https://arrow.apache.org/release/12.0.0.html]
 -- [11.0.0 (26 January 2023)|https://arrow.apache.org/release/11.0.0.html]
 -- [10.0.0 (26 October 2022)|https://arrow.apache.org/release/10.0.0.html]
 -- [9.0.0 (3 August 2022)|https://arrow.apache.org/release/9.0.0.html]
 -- [8.0.0 (6 May 2022)|https://arrow.apache.org/release/8.0.0.html]

  was:
According to Apache Arrow release history, 13.0.0 is expected to be released on 
August 2023.
 - [https://arrow.apache.org/release/]

 * 
 ** [12.0.0 (2 May 2023)|https://arrow.apache.org/release/12.0.0.html]
 ** [11.0.0 (26 January 2023)|https://arrow.apache.org/release/11.0.0.html]
 ** [10.0.0 (26 October 2022)|https://arrow.apache.org/release/10.0.0.html]
 ** [9.0.0 (3 August 2022)|https://arrow.apache.org/release/9.0.0.html]
 ** [8.0.0 (6 May 2022)|https://arrow.apache.org/release/8.0.0.html]


> Upgrade Arrow to 13.0.0
> ---
>
> Key: SPARK-44247
> URL: https://issues.apache.org/jira/browse/SPARK-44247
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> According to Apache Arrow release history, 13.0.0 is expected to be released 
> on August 2023.
>  - [https://arrow.apache.org/release/]
>  -- [12.0.0 (2 May 2023)|https://arrow.apache.org/release/12.0.0.html]
>  -- [11.0.0 (26 January 2023)|https://arrow.apache.org/release/11.0.0.html]
>  -- [10.0.0 (26 October 2022)|https://arrow.apache.org/release/10.0.0.html]
>  -- [9.0.0 (3 August 2022)|https://arrow.apache.org/release/9.0.0.html]
>  -- [8.0.0 (6 May 2022)|https://arrow.apache.org/release/8.0.0.html]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44247) Upgrade Arrow to 13.0.0

2023-06-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-44247:
--
Description: 
According to Apache Arrow release history, 13.0.0 is expected to be released on 
August 2023.
 - [https://arrow.apache.org/release/]

 * 
 ** [12.0.0 (2 May 2023)|https://arrow.apache.org/release/12.0.0.html]
 ** [11.0.0 (26 January 2023)|https://arrow.apache.org/release/11.0.0.html]
 ** [10.0.0 (26 October 2022)|https://arrow.apache.org/release/10.0.0.html]
 ** [9.0.0 (3 August 2022)|https://arrow.apache.org/release/9.0.0.html]
 ** [8.0.0 (6 May 2022)|https://arrow.apache.org/release/8.0.0.html]

  was:
According to Apache Arrow release history, 13.0.0 is expected to be released on 
August 2023.

- [https://arrow.apache.org/release/]


> Upgrade Arrow to 13.0.0
> ---
>
> Key: SPARK-44247
> URL: https://issues.apache.org/jira/browse/SPARK-44247
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> According to Apache Arrow release history, 13.0.0 is expected to be released 
> on August 2023.
>  - [https://arrow.apache.org/release/]
>  * 
>  ** [12.0.0 (2 May 2023)|https://arrow.apache.org/release/12.0.0.html]
>  ** [11.0.0 (26 January 2023)|https://arrow.apache.org/release/11.0.0.html]
>  ** [10.0.0 (26 October 2022)|https://arrow.apache.org/release/10.0.0.html]
>  ** [9.0.0 (3 August 2022)|https://arrow.apache.org/release/9.0.0.html]
>  ** [8.0.0 (6 May 2022)|https://arrow.apache.org/release/8.0.0.html]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44247) Upgrade Arrow to 13.0.0

2023-06-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-44247:
--
Description: 
According to Apache Arrow release history, 13.0.0 is expected to be released on 
August 2023.

- [https://arrow.apache.org/release/]

> Upgrade Arrow to 13.0.0
> ---
>
> Key: SPARK-44247
> URL: https://issues.apache.org/jira/browse/SPARK-44247
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> According to Apache Arrow release history, 13.0.0 is expected to be released 
> on August 2023.
> - [https://arrow.apache.org/release/]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44247) Upgrade Arrow to 13.0.0

2023-06-29 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-44247:
-

 Summary: Upgrade Arrow to 13.0.0
 Key: SPARK-44247
 URL: https://issues.apache.org/jira/browse/SPARK-44247
 Project: Spark
  Issue Type: Sub-task
  Components: Build
Affects Versions: 3.5.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44199) CacheManager refreshes the fileIndex unnecessarily

2023-06-29 Thread Ignite TC Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17738673#comment-17738673
 ] 

Ignite TC Bot commented on SPARK-44199:
---

User 'vihangk1' has created a pull request for this issue:
https://github.com/apache/spark/pull/41749

> CacheManager refreshes the fileIndex unnecessarily
> --
>
> Key: SPARK-44199
> URL: https://issues.apache.org/jira/browse/SPARK-44199
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.1
>Reporter: Vihang Karajgaonkar
>Priority: Major
>
> The CacheManager on this line 
> [https://github.com/apache/spark/blob/680ca2e56f2c8fc759743ad6755f6e3b1a19c629/sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala#L372]
>  uses prefix-based matching to decide which file index needs to be refreshed. 
> However, that can be incorrect if users have paths that are not subdirectories 
> of each other but share a prefix. For example, in the function below:
>  
> {code:java}
>   private def refreshFileIndexIfNecessary(
>       fileIndex: FileIndex,
>       fs: FileSystem,
>       qualifiedPath: Path): Boolean = {
>     val prefixToInvalidate = qualifiedPath.toString
>     val needToRefresh = fileIndex.rootPaths
>       .map(_.makeQualified(fs.getUri, fs.getWorkingDirectory).toString)
>       .exists(_.startsWith(prefixToInvalidate))
>     if (needToRefresh) fileIndex.refresh()
>     needToRefresh
>   } {code}
> If the prefixToInvalidate is s3://bucket/mypath/table_dir and the file index 
> has one of its root paths as s3://bucket/mypath/table_dir_2/part=1, then 
> needToRefresh will be true and the file index gets refreshed unnecessarily. 
> This is not just wasted CPU cycles; it can also cause query failures if there 
> are access restrictions on the path being refreshed.
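
One possible direction, sketched under the assumption that a root path should only 
be refreshed when it equals or is a descendant of the invalidated path (whole path 
components, not raw string prefixes); this is an illustration, not the actual fix.
{code:java}
import org.apache.hadoop.fs.Path

// Sketch: walk up the ancestors of a root path and compare whole paths, so that
// .../table_dir_2 is never treated as being under .../table_dir.
def isSameOrDescendant(rootPath: Path, invalidatedPath: Path): Boolean = {
  var current: Path = rootPath
  while (current != null) {
    if (current == invalidatedPath) return true
    current = current.getParent
  }
  false
}
{code}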



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44165) Exception when reading parquet file with TIME fields

2023-06-29 Thread Ignite TC Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17738669#comment-17738669
 ] 

Ignite TC Bot commented on SPARK-44165:
---

User 'ramon-garcia' has created a pull request for this issue:
https://github.com/apache/spark/pull/41717

> Exception when reading parquet file with TIME fields
> 
>
> Key: SPARK-44165
> URL: https://issues.apache.org/jira/browse/SPARK-44165
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0, 3.4.1
> Environment: Spark 3.4.0 downloaded from apache.spark.org
> Also reproduced with latest build.
>Reporter: Ramón García Fernández
>Priority: Major
> Attachments: timeonly.parquet
>
>
> When one reads a parquet file containing TIME fields (either with INT32 or 
> INT64 storage), an exception is thrown. From the Spark shell:
>  
> > val df = spark.read.parquet("timeonly.parquet")
> 23/06/24 13:24:54 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> org.apache.spark.sql.AnalysisException: Illegal Parquet type: INT32 (TIME(MILLIS,true)).
>     at org.apache.spark.sql.errors.QueryCompilationErrors$.illegalParquetTypeError(QueryCompilationErrors.scala:1762)
>     at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.illegalType$1(ParquetSchemaConverter.scala:206)
>     at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.$anonfun$convertPrimitiveField$2(ParquetSchemaConverter.scala:252)
>     at scala.Option.getOrElse(Option.scala:189)
>     at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertPrimitiveField(ParquetSchemaConverter.scala:224)
>     at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertField(ParquetSchemaConverter.scala:187)
>     at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.$anonfun$convertInternal$3(ParquetSchemaConverter.scala:147)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44131) Add call_function and deprecate call_udf for Scala API

2023-06-29 Thread Ignite TC Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17738672#comment-17738672
 ] 

Ignite TC Bot commented on SPARK-44131:
---

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/41687

> Add call_function and deprecate call_udf for Scala API
> --
>
> Key: SPARK-44131
> URL: https://issues.apache.org/jira/browse/SPARK-44131
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: jiaan.geng
>Priority: Major
>
> The Scala SQL API has a method call_udf that is used to call user-defined 
> functions.
> In fact, call_udf can also call built-in functions.
> This behavior is confusing for users.
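
For example (a sketch of the confusion described above, assuming a DataFrame df 
with a string column named value): call_udf resolves the built-in upper function 
even though its name suggests it is only for UDFs; the proposed call_function 
would make that intent explicit.
{code:java}
import org.apache.spark.sql.functions.{call_udf, col}

// Despite its name, call_udf resolves any registered function by name,
// including built-ins such as upper.
val upperCased = df.select(call_udf("upper", col("value")))
{code}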



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44200) Support TABLE argument parser rule for TableValuedFunction

2023-06-29 Thread Ignite TC Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17738671#comment-17738671
 ] 

Ignite TC Bot commented on SPARK-44200:
---

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/41750

> Support TABLE argument parser rule for TableValuedFunction
> --
>
> Key: SPARK-44200
> URL: https://issues.apache.org/jira/browse/SPARK-44200
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Takuya Ueshin
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44195) Add JobTag APIs to SparkR SparkContext

2023-06-29 Thread Ignite TC Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17738668#comment-17738668
 ] 

Ignite TC Bot commented on SPARK-44195:
---

User 'juliuszsompolski' has created a pull request for this issue:
https://github.com/apache/spark/pull/41742

> Add JobTag APIs to SparkR SparkContext
> --
>
> Key: SPARK-44195
> URL: https://issues.apache.org/jira/browse/SPARK-44195
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Affects Versions: 3.5.0
>Reporter: Juliusz Sompolski
>Priority: Major
>
> Add APIs added in https://issues.apache.org/jira/browse/SPARK-43952 to SparkR:
>  * {{SparkContext.addJobTag(tag: String): Unit}}
>  * {{SparkContext.removeJobTag(tag: String): Unit}}
>  * {{SparkContext.getJobTags(): Set[String]}}
>  * {{SparkContext.clearJobTags(): Unit}}
>  * {{SparkContext.cancelJobsWithTag(tag: String): Unit}}
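
For reference, a short Scala sketch of the corresponding SparkContext calls added 
in SPARK-43952, which the proposed SparkR wrappers would mirror; the tag string is 
just an example and `sc` is assumed to be an active SparkContext.
{code:java}
// Attach a tag to jobs submitted from this thread, inspect it, and clean up.
sc.addJobTag("nightly-report")
val tags: Set[String] = sc.getJobTags()   // tags currently attached to this thread
sc.cancelJobsWithTag("nightly-report")    // cancel all jobs carrying the tag
sc.removeJobTag("nightly-report")
sc.clearJobTags()
{code}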



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35564) Support subexpression elimination for non-common branches of conditional expressions

2023-06-29 Thread Ignite TC Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17738670#comment-17738670
 ] 

Ignite TC Bot commented on SPARK-35564:
---

User 'peter-toth' has created a pull request for this issue:
https://github.com/apache/spark/pull/41677

> Support subexpression elimination for non-common branches of conditional 
> expressions
> 
>
> Key: SPARK-35564
> URL: https://issues.apache.org/jira/browse/SPARK-35564
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Adam Binford
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-7 added support for pulling 
> subexpressions out of branches of conditional expressions for expressions 
> present in all branches. We should be able to take this a step further and 
> pull out subexpressions for any branch, as long as that expression will 
> definitely be evaluated at least once.
> Consider a common data validation example:
> {code:java}
> from pyspark.sql.functions import *
> df = spark.createDataFrame([['word'], ['1234']])
> col = regexp_replace('_1', r'\d', '')
> df = df.withColumn('numbers_removed', when(length(col) > 0, col)){code}
> We only want to keep the value if it's non-empty with numbers removed, 
> otherwise we want it to be null. 
> Because we have no otherwise value, `col` is not a candidate for 
> subexpression elimination (you can see two regular expression replacements in 
> the codegen). But whenever the length is greater than 0, we will have to 
> execute the regular expression replacement twice. Since we know we will 
> always calculate `col` at least once, it makes sense to consider that as a 
> subexpression since we might need it again in the branch value. So we can 
> update the logic from:
> Create a subexpression if an expression will always be evaluated at least 
> twice
> To:
> Create a subexpression if an expression will always be evaluated at least 
> once AND will either always or conditionally be evaluated at least twice.
> The trade off is potentially another subexpression function call (for split 
> subexpressions) if the second evaluation doesn't happen, but this seems like 
> it would be worth it for when it is evaluated the second time.
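
For illustration, here is the same validation example rendered in Scala (a sketch 
only, mirroring the Python snippet above and assuming the same two-row DataFrame 
df with a string column _1); `cleaned` is the expression that would newly qualify 
as a subexpression under the proposed rule.
{code:java}
import org.apache.spark.sql.functions.{col, length, regexp_replace, when}

// `cleaned` is always evaluated once (in the predicate) and conditionally a second
// time (as the branch value), so today the regexp replacement runs twice per row
// whenever the predicate is true.
val cleaned = regexp_replace(col("_1"), "\\d", "")
val withNumbersRemoved =
  df.withColumn("numbers_removed", when(length(cleaned) > 0, cleaned))
{code}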



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43474) Add support to create DataFrame Reference in Spark connect

2023-06-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-43474:


Assignee: Raghu Angadi

> Add support to create DataFrame Reference in Spark connect
> --
>
> Key: SPARK-43474
> URL: https://issues.apache.org/jira/browse/SPARK-43474
> Project: Spark
>  Issue Type: Task
>  Components: Connect, Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Peng Zhong
>Assignee: Raghu Angadi
>Priority: Major
> Fix For: 3.5.0
>
>
> Add support in Spark Connect to cache a DataFrame on the server side. From the 
> client side, a reference to that DataFrame can be created given the cache key.
>  
> This function will be used in streaming foreachBatch(): the server needs to call 
> the user function for every batch, and that function takes a DataFrame as its 
> argument. With the new function, we can simply cache the DataFrame on the server 
> and pass its id back to the client, which creates the DataFrame reference. The 
> server replaces the reference with the cached DataFrame when transforming the plan.
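
A minimal sketch of the caching idea; the names DataFrameCache, cacheDataFrame and 
resolveReference are hypothetical illustrations, not the actual Spark Connect 
internals.
{code:java}
import java.util.concurrent.ConcurrentHashMap
import org.apache.spark.sql.DataFrame

// Hypothetical server-side registry: the server caches a DataFrame under an id,
// returns the id to the client, and later swaps the client's reference relation
// back to the cached DataFrame while transforming the plan.
object DataFrameCache {
  private val cache = new ConcurrentHashMap[String, DataFrame]()

  def cacheDataFrame(id: String, df: DataFrame): Unit = cache.put(id, df)

  def resolveReference(id: String): Option[DataFrame] = Option(cache.get(id))
}
{code}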



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43474) Add support to create DataFrame Reference in Spark connect

2023-06-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-43474.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41580
[https://github.com/apache/spark/pull/41580]

> Add support to create DataFrame Reference in Spark connect
> --
>
> Key: SPARK-43474
> URL: https://issues.apache.org/jira/browse/SPARK-43474
> Project: Spark
>  Issue Type: Task
>  Components: Connect, Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Peng Zhong
>Priority: Major
> Fix For: 3.5.0
>
>
> Add support in Spark Connect to cache a DataFrame on the server side. From the 
> client side, a reference to that DataFrame can be created given the cache key.
>  
> This function will be used in streaming foreachBatch(): the server needs to call 
> the user function for every batch, and that function takes a DataFrame as its 
> argument. With the new function, we can simply cache the DataFrame on the server 
> and pass its id back to the client, which creates the DataFrame reference. The 
> server replaces the reference with the cached DataFrame when transforming the plan.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44240) Setting the topKSortFallbackThreshold value may lead to inaccurate results

2023-06-29 Thread dzcxzl (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dzcxzl updated SPARK-44240:
---
Description: 
 
{code:java}
set spark.sql.execution.topKSortFallbackThreshold=1;
SELECT min(id) FROM ( SELECT id FROM range(9) ORDER BY id LIMIT 1) 
a; {code}
 

If GlobalLimitExec is not the final operator and has a sort operator, shuffle 
read does not guarantee the order, which leads to the limit read data that may 
be random.

TakeOrderedAndProjectExec has ordering, so there is no such problem.

 

!topKSortFallbackThreshold.png!
{code:java}
set spark.sql.execution.topKSortFallbackThreshold=1;
select min(id) from (select  id  from range(9) order by id desc limit 
1) a; {code}
!topKSortFallbackThresholdDesc.png!

 

  was:
 
{code:java}
set spark.sql.execution.topKSortFallbackThreshold=1;
SELECT min(id) FROM ( SELECT id FROM range(9) ORDER BY id LIMIT 1) 
a; {code}
 

If GlobalLimitExec is not the final operator and has a sort operator, shuffle 
read does not guarantee the order, which leads to the limit read data that may 
be random.

TakeOrderedAndProjectExec has ordering, so there is no such problem.

 

!topKSortFallbackThreshold.png!

 

 


> Setting the topKSortFallbackThreshold value may lead to inaccurate results
> --
>
> Key: SPARK-44240
> URL: https://issues.apache.org/jira/browse/SPARK-44240
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0, 3.1.0, 3.2.0, 3.3.0, 3.4.0
>Reporter: dzcxzl
>Priority: Minor
> Attachments: topKSortFallbackThreshold.png, 
> topKSortFallbackThresholdDesc.png
>
>
>  
> {code:java}
> set spark.sql.execution.topKSortFallbackThreshold=1;
> SELECT min(id) FROM ( SELECT id FROM range(9) ORDER BY id LIMIT 
> 1) a; {code}
>  
> If GlobalLimitExec is not the final operator and is preceded by a sort operator, 
> the shuffle read does not guarantee ordering, so the rows read by the limit may 
> be effectively random.
> TakeOrderedAndProjectExec preserves ordering, so it does not have this problem.
>  
> !topKSortFallbackThreshold.png!
> {code:java}
> set spark.sql.execution.topKSortFallbackThreshold=1;
> select min(id) from (select  id  from range(9) order by id desc limit 
> 1) a; {code}
> !topKSortFallbackThresholdDesc.png!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44240) Setting the topKSortFallbackThreshold value may lead to inaccurate results

2023-06-29 Thread dzcxzl (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dzcxzl updated SPARK-44240:
---
Description: 
 
{code:java}
set spark.sql.execution.topKSortFallbackThreshold=1;
SELECT min(id) FROM ( SELECT id FROM range(9) ORDER BY id LIMIT 1) 
a; {code}
 

If GlobalLimitExec is not the final operator and has a sort operator, shuffle 
read does not guarantee the order, which leads to the limit read data that may 
be random.

TakeOrderedAndProjectExec has ordering, so there is no such problem.

 

!topKSortFallbackThreshold.png!

 

 

  was:
 
{code:java}
set spark.sql.execution.topKSortFallbackThreshold=1;
SELECT min(id) FROM ( SELECT id FROM range(9) ORDER BY id LIMIT 1) 
a; {code}
 

If GlobalLimitExec is not the final operator and has a sort operator, shuffle 
read does not guarantee the order, which leads to the limit read data that may 
be random.

TakeOrderedAndProjectExec has ordering, so there is no such problem.

 

!topKSortFallbackThreshold.png!

 
{code:java}
set spark.sql.execution.topKSortFallbackThreshold=1;
select min(id) from (select  id  from range(9) order by id desc limit 
1) a; {code}
!topKSortFallbackThresholdDesc.png!


> Setting the topKSortFallbackThreshold value may lead to inaccurate results
> --
>
> Key: SPARK-44240
> URL: https://issues.apache.org/jira/browse/SPARK-44240
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0, 3.1.0, 3.2.0, 3.3.0, 3.4.0
>Reporter: dzcxzl
>Priority: Minor
> Attachments: topKSortFallbackThreshold.png, 
> topKSortFallbackThresholdDesc.png
>
>
>  
> {code:java}
> set spark.sql.execution.topKSortFallbackThreshold=1;
> SELECT min(id) FROM ( SELECT id FROM range(9) ORDER BY id LIMIT 
> 1) a; {code}
>  
> If GlobalLimitExec is not the final operator and is preceded by a sort operator, 
> the shuffle read does not guarantee ordering, so the rows read by the limit may 
> be effectively random.
> TakeOrderedAndProjectExec preserves ordering, so it does not have this problem.
>  
> !topKSortFallbackThreshold.png!
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44240) Setting the topKSortFallbackThreshold value may lead to inaccurate results

2023-06-29 Thread dzcxzl (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dzcxzl updated SPARK-44240:
---
Description: 
 
{code:java}
set spark.sql.execution.topKSortFallbackThreshold=1;
SELECT min(id) FROM ( SELECT id FROM range(9) ORDER BY id LIMIT 1) 
a; {code}
 

If GlobalLimitExec is not the final operator and has a sort operator, shuffle 
read does not guarantee the order, which leads to the limit read data that may 
be random.

TakeOrderedAndProjectExec has ordering, so there is no such problem.

 

!topKSortFallbackThreshold.png!

 
{code:java}
set spark.sql.execution.topKSortFallbackThreshold=1;
select min(id) from (select  id  from range(9) order by id desc limit 
1) a; {code}
!topKSortFallbackThresholdDesc.png!

  was:
 
{code:java}
set spark.sql.execution.topKSortFallbackThreshold=1;
SELECT min(id) FROM ( SELECT id FROM range(9) ORDER BY id LIMIT 1) 
a; {code}
 

If GlobalLimitExec is not the final operator and has a sort operator, shuffle 
read does not guarantee the order, which leads to the limit read data that may 
be random.

TakeOrderedAndProjectExec has ordering, so there is no such problem.

 

!topKSortFallbackThreshold.png!

 

 


> Setting the topKSortFallbackThreshold value may lead to inaccurate results
> --
>
> Key: SPARK-44240
> URL: https://issues.apache.org/jira/browse/SPARK-44240
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0, 3.1.0, 3.2.0, 3.3.0, 3.4.0
>Reporter: dzcxzl
>Priority: Minor
> Attachments: topKSortFallbackThreshold.png, 
> topKSortFallbackThresholdDesc.png
>
>
>  
> {code:java}
> set spark.sql.execution.topKSortFallbackThreshold=1;
> SELECT min(id) FROM ( SELECT id FROM range(9) ORDER BY id LIMIT 
> 1) a; {code}
>  
> If GlobalLimitExec is not the final operator and is preceded by a sort operator, 
> the shuffle read does not guarantee ordering, so the rows read by the limit may 
> be effectively random.
> TakeOrderedAndProjectExec preserves ordering, so it does not have this problem.
>  
> !topKSortFallbackThreshold.png!
>  
> {code:java}
> set spark.sql.execution.topKSortFallbackThreshold=1;
> select min(id) from (select  id  from range(9) order by id desc limit 
> 1) a; {code}
> !topKSortFallbackThresholdDesc.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44240) Setting the topKSortFallbackThreshold value may lead to inaccurate results

2023-06-29 Thread dzcxzl (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dzcxzl updated SPARK-44240:
---
Attachment: topKSortFallbackThresholdDesc.png

> Setting the topKSortFallbackThreshold value may lead to inaccurate results
> --
>
> Key: SPARK-44240
> URL: https://issues.apache.org/jira/browse/SPARK-44240
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0, 3.1.0, 3.2.0, 3.3.0, 3.4.0
>Reporter: dzcxzl
>Priority: Minor
> Attachments: topKSortFallbackThreshold.png, 
> topKSortFallbackThresholdDesc.png
>
>
>  
> {code:java}
> set spark.sql.execution.topKSortFallbackThreshold=1;
> SELECT min(id) FROM ( SELECT id FROM range(9) ORDER BY id LIMIT 
> 1) a; {code}
>  
> If GlobalLimitExec is not the final operator and is preceded by a sort operator, 
> the shuffle read does not guarantee ordering, so the rows read by the limit may 
> be effectively random.
> TakeOrderedAndProjectExec preserves ordering, so it does not have this problem.
>  
> !topKSortFallbackThreshold.png!
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44227) Extract SchemaUtils from StructField

2023-06-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-44227.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41772
[https://github.com/apache/spark/pull/41772]

> Extract SchemaUtils from StructField
> 
>
> Key: SPARK-44227
> URL: https://issues.apache.org/jira/browse/SPARK-44227
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44208) assign clear error class names for some logic that directly uses exceptions

2023-06-29 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-44208.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41740
[https://github.com/apache/spark/pull/41740]

> assign clear error class names for some logic that directly uses exceptions
> ---
>
> Key: SPARK-44208
> URL: https://issues.apache.org/jira/browse/SPARK-44208
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 3.5.0
>
>
> include:
>  * ALL_PARTITION_COLUMNS_NOT_ALLOWED
>  * INVALID_HIVE_COLUMN_NAME
>  * SPECIFY_BUCKETING_IS_NOT_ALLOWED
>  * SPECIFY_PARTITION_IS_NOT_ALLOWED
>  * UNSUPPORTED_ADD_FILE.DIRECTORY
>  * UNSUPPORTED_ADD_FILE.LOCAL_DIRECTORY



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44208) assign clear error class names for some logic that directly uses exceptions

2023-06-29 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-44208:


Assignee: BingKun Pan

> assign clear error class names for some logic that directly uses exceptions
> ---
>
> Key: SPARK-44208
> URL: https://issues.apache.org/jira/browse/SPARK-44208
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>
> include:
>  * ALL_PARTITION_COLUMNS_NOT_ALLOWED
>  * INVALID_HIVE_COLUMN_NAME
>  * SPECIFY_BUCKETING_IS_NOT_ALLOWED
>  * SPECIFY_PARTITION_IS_NOT_ALLOWED
>  * UNSUPPORTED_ADD_FILE.DIRECTORY
>  * UNSUPPORTED_ADD_FILE.LOCAL_DIRECTORY



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44246) Follow-ups for Jar/Classfile Isolation

2023-06-29 Thread Venkata Sai Akhil Gudesa (Jira)
Venkata Sai Akhil Gudesa created SPARK-44246:


 Summary: Follow-ups for Jar/Classfile Isolation
 Key: SPARK-44246
 URL: https://issues.apache.org/jira/browse/SPARK-44246
 Project: Spark
  Issue Type: Improvement
  Components: Connect
Affects Versions: 3.5.0
Reporter: Venkata Sai Akhil Gudesa


Related to https://issues.apache.org/jira/browse/SPARK-44146 
([PR|https://github.com/apache/spark/pull/41701]), this ticket is for the 
general follow-ups mentioned by [~hvanhovell] 
[here.|https://github.com/apache/spark/pull/41701#issuecomment-1608577372]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44146) Isolate Spark Connect session/artifacts

2023-06-29 Thread Venkata Sai Akhil Gudesa (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venkata Sai Akhil Gudesa updated SPARK-44146:
-
Epic Link: SPARK-42554

> Isolate Spark Connect session/artifacts
> ---
>
> Key: SPARK-44146
> URL: https://issues.apache.org/jira/browse/SPARK-44146
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Venkata Sai Akhil Gudesa
>Assignee: Venkata Sai Akhil Gudesa
>Priority: Major
> Fix For: 3.5.0
>
>
> Following up on https://issues.apache.org/jira/browse/SPARK-44078, with the 
> support for classloader isolation implemented, we can now utilise it to 
> isolate Spark Connect sessions from each other. Here, isolation means keeping 
> the artifacts of each Spark Connect session separate from those of other 
> sessions, which enables multi-user UDFs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44208) assign clear error class names for some logic that directly uses exceptions

2023-06-29 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-44208:

Description: 
include:
 * ALL_PARTITION_COLUMNS_NOT_ALLOWED
 * INVALID_HIVE_COLUMN_NAME
 * SPECIFY_BUCKETING_IS_NOT_ALLOWED
 * SPECIFY_PARTITION_IS_NOT_ALLOWED
 * UNSUPPORTED_ADD_FILE.DIRECTORY
 * UNSUPPORTED_ADD_FILE.LOCAL_DIRECTORY

  was:
include:
 * ALL_FOR_PARTITION_COLUMNS_IS_NOT_ALLOWED
 * INVALID_COLUMN_NAME
 * SPECIFY_BUCKETING_IS_NOT_ALLOWED
 * SPECIFY_PARTITION_IS_NOT_ALLOWED
 * UNSUPPORTED_ADD_FILE.DIRECTORY
 * UNSUPPORTED_ADD_FILE.LOCAL_DIRECTORY


> assign clear error class names for some logic that directly uses exceptions
> ---
>
> Key: SPARK-44208
> URL: https://issues.apache.org/jira/browse/SPARK-44208
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Priority: Minor
>
> include:
>  * ALL_PARTITION_COLUMNS_NOT_ALLOWED
>  * INVALID_HIVE_COLUMN_NAME
>  * SPECIFY_BUCKETING_IS_NOT_ALLOWED
>  * SPECIFY_PARTITION_IS_NOT_ALLOWED
>  * UNSUPPORTED_ADD_FILE.DIRECTORY
>  * UNSUPPORTED_ADD_FILE.LOCAL_DIRECTORY



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44079) Json reader crashes when a different schema is present

2023-06-29 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-44079:
-
Fix Version/s: 3.4.2

> Json reader crashes when a different schema is present
> --
>
> Key: SPARK-44079
> URL: https://issues.apache.org/jira/browse/SPARK-44079
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: charlotte van der scheun
>Assignee: Jia Fan
>Priority: Major
> Fix For: 3.5.0, 3.4.2
>
>
> When using pyspark 3.4, we noticed that when reading a json file with a 
> corrupted record the reader crashes. In pyspark 3.3 this worked fine.
> {*}Code{*}:
> {code:java}
> from pyspark.sql.types import StructType, StructField, IntegerType, StringType
> import json
> data = """[{"a": "incorrect", "b": "correct"}]"""
> schema = StructType([StructField('a', IntegerType(), True), StructField('b', 
> StringType(), True), StructField('_corrupt_record', StringType(), True)])
> spark.read.option("mode", 
> "PERMISSIVE").option("multiline","true").schema(schema).json(spark.sparkContext.parallelize([data])).show(truncate=False){code}
> *Used packages:*
>  * Pyspark==3.4.0
>  * python==3.10.0
>  * delta-spark==2.4.0
>  
> spark_jars=(
>   "org.apache.spark:spark-avro_2.12:3.4.0"
>   ",io.delta:delta-core_2.12:2.4.0"
>   ",com.databricks:spark-xml_2.12:0.16.0"
> )
>  
> {*}Expected behaviour{*}:
> |a|b|_corrupt_record|
> |null|null|[\\{"a": "incorrect", "b": "correct"}]|
>  
> {*}Actual behaviour{*}:
> {code:java}
>  
> *** py4j.protocol.Py4JJavaError: An error occurred while calling 
> o104.showString.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 
> in stage 2.0 failed 1 times, most recent failure: Lost task 4.0 in stage 2.0 
> (TID 9) (charlottesmbp2.home executor driver): 
> java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 1
>         at 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow.genericGet(rows.scala:201)
>         at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getAs(rows.scala:35)
>         at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.get(rows.scala:37)
>         at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.get$(rows.scala:37)
>         at 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow.get(rows.scala:195)
>         at 
> org.apache.spark.sql.catalyst.util.FailureSafeParser.$anonfun$toResultRow$2(FailureSafeParser.scala:47)
>         at scala.Option.map(Option.scala:230)
>         at 
> org.apache.spark.sql.catalyst.util.FailureSafeParser.$anonfun$toResultRow$1(FailureSafeParser.scala:47)
>         at 
> org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:64)
>         at 
> org.apache.spark.sql.DataFrameReader.$anonfun$json$10(DataFrameReader.scala:431)
>         at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>         at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
>         at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:888)
>         at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:888)
>         at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
>         at 
> org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
>         at org.apache.spark.scheduler.Task.run(Task.scala:139)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
>         at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
>         at java.base/java.lang.Thread.run(Thread.java:1589)
> Driver stacktrace:
>         at 
> 

[jira] [Created] (SPARK-44245) pyspark.sql.dataframe doctests can behave differently

2023-06-29 Thread Alice Sayutina (Jira)
Alice Sayutina created SPARK-44245:
--

 Summary: pyspark.sql.dataframe doctests can behave differently
 Key: SPARK-44245
 URL: https://issues.apache.org/jira/browse/SPARK-44245
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.4.1
Reporter: Alice Sayutina






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44244) Assign names to the error class _LEGACY_ERROR_TEMP_[2305-2309]

2023-06-29 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-44244:
--

 Summary: Assign names to the error class 
_LEGACY_ERROR_TEMP_[2305-2309]
 Key: SPARK-44244
 URL: https://issues.apache.org/jira/browse/SPARK-44244
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.5.0
Reporter: jiaan.geng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44169) Assign names to the error class _LEGACY_ERROR_TEMP_[2300-2304]

2023-06-29 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-44169:


Assignee: jiaan.geng

> Assign names to the error class _LEGACY_ERROR_TEMP_[2300-2304]
> --
>
> Key: SPARK-44169
> URL: https://issues.apache.org/jira/browse/SPARK-44169
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44169) Assign names to the error class _LEGACY_ERROR_TEMP_[2300-2304]

2023-06-29 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-44169.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41719
[https://github.com/apache/spark/pull/41719]

> Assign names to the error class _LEGACY_ERROR_TEMP_[2300-2304]
> --
>
> Key: SPARK-44169
> URL: https://issues.apache.org/jira/browse/SPARK-44169
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44241) Set io.connectionTimeout/connectionCreationTimeout to zero or negative will cause executor incessantes cons/destructions

2023-06-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17738451#comment-17738451
 ] 

ASF GitHub Bot commented on SPARK-44241:


User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/41785

> Set io.connectionTimeout/connectionCreationTimeout to zero or negative will 
> cause executor incessantes cons/destructions
> 
>
> Key: SPARK-44241
> URL: https://issues.apache.org/jira/browse/SPARK-44241
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.2, 3.4.1, 3.5.0
>Reporter: Kent Yao
>Priority: Major
>
> {code:java}
> 2023-06-28 14:57:23 CST Bootstrap WARN - Failed to set channel option 
> 'CONNECT_TIMEOUT_MILLIS' with value '-1000' for channel '[id: 0xf4b54a73]'
> java.lang.IllegalArgumentException: connectTimeoutMillis : -1000 (expected: 
> >= 0)
>   at 
> io.netty.util.internal.ObjectUtil.checkPositiveOrZero(ObjectUtil.java:144) 
> ~[netty-common-4.1.74.Final.jar:4.1.74.Final] {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44243) Add a parameter to determine the locality of local shuffle reader

2023-06-29 Thread Wan Kun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wan Kun updated SPARK-44243:

Summary: Add a parameter to determine the locality of local shuffle reader  
(was: Local shuffle reader should not respect SHUFFLE_REDUCE_LOCALITY_ENABLE)

> Add a parameter to determine the locality of local shuffle reader
> -
>
> Key: SPARK-44243
> URL: https://issues.apache.org/jira/browse/SPARK-44243
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Wan Kun
>Priority: Major
>
> Local shuffle reader can achieve better performance with preferred locations. 
> If we disable SHUFFLE_REDUCE_LOCALITY_ENABLE in queries that include reduce 
> shuffles and local shuffles, local shuffle readers can not get preferred 
> locations.
> Add new parameter LOCAL_SHUFFLE_LOCALITY_ENABLE to determine whether to get 
> the preferred locations of the current partitionSpec.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44243) Local shuffle reader should not respect SHUFFLE_REDUCE_LOCALITY_ENABLE

2023-06-29 Thread Wan Kun (Jira)
Wan Kun created SPARK-44243:
---

 Summary: Local shuffle reader should not respect 
SHUFFLE_REDUCE_LOCALITY_ENABLE
 Key: SPARK-44243
 URL: https://issues.apache.org/jira/browse/SPARK-44243
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.5.0
Reporter: Wan Kun


Local shuffle reader can achieve better performance with preferred locations. 
If we disable SHUFFLE_REDUCE_LOCALITY_ENABLE in queries that include reduce 
shuffles and local shuffles, local shuffle readers can not get preferred 
locations.

Add new parameter LOCAL_SHUFFLE_LOCALITY_ENABLE to determine whether to get the 
preferred locations of the current partitionSpec.
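
A small sketch of the configuration interplay described above: the reduce-locality 
flag shown is the existing spark.shuffle.reduceLocality.enabled key, while the 
proposed LOCAL_SHUFFLE_LOCALITY_ENABLE has no final key name yet, so the second 
key below is purely hypothetical.
{code:java}
import org.apache.spark.sql.SparkSession

// Sketch only: today, disabling reduce locality also strips locality hints from
// local shuffle readers; the commented-out key is a placeholder for the proposal.
val spark = SparkSession.builder()
  .appName("local-shuffle-locality-sketch")
  .config("spark.shuffle.reduceLocality.enabled", "false")
  // .config("spark.shuffle.localShuffleReader.localityEnabled", "true") // hypothetical
  .getOrCreate()
{code}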



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44242) Spark job submission failed because Xmx string is available on one parameter provided into spark.driver.extraJavaOptions

2023-06-29 Thread Nicolas Fraison (Jira)
Nicolas Fraison created SPARK-44242:
---

 Summary: Spark job submission failed because Xmx string is 
available on one parameter provided into spark.driver.extraJavaOptions
 Key: SPARK-44242
 URL: https://issues.apache.org/jira/browse/SPARK-44242
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 3.4.1, 3.3.2
Reporter: Nicolas Fraison


The spark-submit command fails if the string "Xmx" is found in any parameter 
provided to spark.driver.extraJavaOptions.

For ex. running this spark-submit command line
{code:java}
./bin/spark-submit --class org.apache.spark.examples.SparkPi --conf 
"spark.driver.extraJavaOptions=-Dtest=Xmx"  
examples/jars/spark-examples_2.12-3.4.1.jar 100{code}
failed due to
{code:java}
Error: Not allowed to specify max heap(Xmx) memory settings through java 
options (was -Dtest=Xmx). Use the corresponding --driver-memory or 
spark.driver.memory configuration instead.{code}
The check performed in 
[https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L314]
 seems too broad.
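
As an illustration of a narrower check (an assumption, not the actual patch), the 
detection could match -Xmx only when it starts an option token rather than 
appearing anywhere in the string:
{code:java}
// Sketch: flag a max-heap setting only when an option token starts with -Xmx,
// instead of rejecting any value that merely contains the substring "Xmx".
val xmxOption = raw"(^|\s)-Xmx\S*".r

def specifiesMaxHeap(driverJavaOptions: String): Boolean =
  xmxOption.findFirstIn(driverJavaOptions).isDefined

specifiesMaxHeap("-Dtest=Xmx")        // false: not a heap setting
specifiesMaxHeap("-Xmx4g -Dfoo=bar")  // true: explicit max heap option
{code}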



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44235) Add --batch to gpg command

2023-06-29 Thread Yikun Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yikun Jiang resolved SPARK-44235.
-
Fix Version/s: 3.5.0
   Resolution: Fixed

https://github.com/apache/spark-docker/pull/51

> Add --batch to gpg command
> --
>
> Key: SPARK-44235
> URL: https://issues.apache.org/jira/browse/SPARK-44235
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Docker
>Affects Versions: 3.5.0
>Reporter: Yikun Jiang
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44241) Set io.connectionTimeout/connectionCreationTimeout to zero or negative will cause executor incessantes cons/destructions

2023-06-29 Thread Kent Yao (Jira)
Kent Yao created SPARK-44241:


 Summary: Set io.connectionTimeout/connectionCreationTimeout to 
zero or negative will cause executor incessantes cons/destructions
 Key: SPARK-44241
 URL: https://issues.apache.org/jira/browse/SPARK-44241
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.4.1, 3.3.2, 3.5.0
Reporter: Kent Yao


{code:java}
2023-06-28 14:57:23 CST Bootstrap WARN - Failed to set channel option 
'CONNECT_TIMEOUT_MILLIS' with value '-1000' for channel '[id: 0xf4b54a73]'
java.lang.IllegalArgumentException: connectTimeoutMillis : -1000 (expected: >= 
0)
at 
io.netty.util.internal.ObjectUtil.checkPositiveOrZero(ObjectUtil.java:144) 
~[netty-common-4.1.74.Final.jar:4.1.74.Final] {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44079) Json reader crashes when a different schema is present

2023-06-29 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-44079:
-
Component/s: SQL
 (was: python)

> Json reader crashes when a different schema is present
> --
>
> Key: SPARK-44079
> URL: https://issues.apache.org/jira/browse/SPARK-44079
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: charlotte van der scheun
>Assignee: Jia Fan
>Priority: Major
> Fix For: 3.5.0
>
>
> When using pyspark 3.4, we noticed that when reading a json file with a 
> corrupted record the reader crashes. In pyspark 3.3 this worked fine.
> {*}Code{*}:
> {code:java}
> from pyspark.sql.types import StructType, StructField, IntegerType, StringType
> import json
> data = """[{"a": "incorrect", "b": "correct"}]"""
> schema = StructType([StructField('a', IntegerType(), True), StructField('b', 
> StringType(), True), StructField('_corrupt_record', StringType(), True)])
> spark.read.option("mode", 
> "PERMISSIVE").option("multiline","true").schema(schema).json(spark.sparkContext.parallelize([data])).show(truncate=False){code}
> *Used packages:*
>  * Pyspark==3.4.0
>  * python==3.10.0
>  * delta-spark==2.4.0
>  
> spark_jars=(
>   "org.apache.spark:spark-avro_2.12:3.4.0"
>   ",io.delta:delta-core_2.12:2.4.0"
>   ",com.databricks:spark-xml_2.12:0.16.0"
> )
>  
> {*}Expected behaviour{*}:
> |a|b|_corrupt_record|
> |null|null|[\\{"a": "incorrect", "b": "correct"}]|
>  
> {*}Actual behaviour{*}:
> {code:java}
>  
> *** py4j.protocol.Py4JJavaError: An error occurred while calling 
> o104.showString.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 
> in stage 2.0 failed 1 times, most recent failure: Lost task 4.0 in stage 2.0 
> (TID 9) (charlottesmbp2.home executor driver): 
> java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 1
>         at 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow.genericGet(rows.scala:201)
>         at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getAs(rows.scala:35)
>         at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.get(rows.scala:37)
>         at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.get$(rows.scala:37)
>         at 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow.get(rows.scala:195)
>         at 
> org.apache.spark.sql.catalyst.util.FailureSafeParser.$anonfun$toResultRow$2(FailureSafeParser.scala:47)
>         at scala.Option.map(Option.scala:230)
>         at 
> org.apache.spark.sql.catalyst.util.FailureSafeParser.$anonfun$toResultRow$1(FailureSafeParser.scala:47)
>         at 
> org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:64)
>         at 
> org.apache.spark.sql.DataFrameReader.$anonfun$json$10(DataFrameReader.scala:431)
>         at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>         at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
>         at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:888)
>         at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:888)
>         at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
>         at 
> org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
>         at org.apache.spark.scheduler.Task.run(Task.scala:139)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
>         at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
>         at java.base/java.lang.Thread.run(Thread.java:1589)
> Driver stacktrace:
>         at 
> 
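
For anyone trying to reproduce this locally, below is a self-contained sketch of the report's snippet. It assumes a local PySpark 3.4.0 installation; the builder configuration, the app name, and the use of spark.jars.packages to pull in the jars listed above are illustrative assumptions, not part of the original report.

{code:python}
# Hypothetical standalone reproduction of the report above. The builder
# settings are assumptions; only the read itself comes from the ticket.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = (
    SparkSession.builder
    .appName("spark-44079-repro")  # illustrative name
    # Optional: the extra packages listed in the report.
    .config(
        "spark.jars.packages",
        "org.apache.spark:spark-avro_2.12:3.4.0,"
        "io.delta:delta-core_2.12:2.4.0,"
        "com.databricks:spark-xml_2.12:0.16.0",
    )
    .getOrCreate()
)

data = """[{"a": "incorrect", "b": "correct"}]"""
schema = StructType([
    StructField("a", IntegerType(), True),
    StructField("b", StringType(), True),
    StructField("_corrupt_record", StringType(), True),
])

df = (
    spark.read
    .option("mode", "PERMISSIVE")
    .option("multiline", "true")
    .schema(schema)
    .json(spark.sparkContext.parallelize([data]))
)

# On 3.4.0 this show() triggers the ArrayIndexOutOfBoundsException reported
# above; on 3.3.x it prints the row with the corrupt record preserved.
df.show(truncate=False)
{code}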

[jira] [Resolved] (SPARK-44079) Json reader crashes when a different schema is present

2023-06-29 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-44079.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41662
[https://github.com/apache/spark/pull/41662]

> Json reader crashes when a different schema is present
> --
>
> Key: SPARK-44079
> URL: https://issues.apache.org/jira/browse/SPARK-44079
> Project: Spark
>  Issue Type: Bug
>  Components: python
>Affects Versions: 3.4.0
>Reporter: charlotte van der scheun
>Assignee: Jia Fan
>Priority: Major
> Fix For: 3.5.0
>
>
> When using PySpark 3.4, we noticed that the reader crashes when reading a
> JSON file that contains a corrupted record. In PySpark 3.3 this worked fine.
> {*}Code{*}:
> {code:python}
> from pyspark.sql.types import StructType, StructField, IntegerType, StringType
> import json
>
> data = """[{"a": "incorrect", "b": "correct"}]"""
> schema = StructType([
>     StructField('a', IntegerType(), True),
>     StructField('b', StringType(), True),
>     StructField('_corrupt_record', StringType(), True)
> ])
> spark.read \
>     .option("mode", "PERMISSIVE") \
>     .option("multiline", "true") \
>     .schema(schema) \
>     .json(spark.sparkContext.parallelize([data])) \
>     .show(truncate=False)
> {code}
> *Used packages:*
>  * Pyspark==3.4.0
>  * python==3.10.0
>  * delta-spark==2.4.0
>  
> spark_jars=(
>   "org.apache.spark:spark-avro_2.12:3.4.0"
>   ",io.delta:delta-core_2.12:2.4.0"
>   ",com.databricks:spark-xml_2.12:0.16.0"
> )
>  
> {*}Expected behaviour{*}:
> |a|b|_corrupt_record|
> |null|null|[\\{"a": "incorrect", "b": "correct"}]|
>  
> {*}Actual behaviour{*}:
> {code:java}
>  
> *** py4j.protocol.Py4JJavaError: An error occurred while calling 
> o104.showString.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 
> in stage 2.0 failed 1 times, most recent failure: Lost task 4.0 in stage 2.0 
> (TID 9) (charlottesmbp2.home executor driver): 
> java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 1
>         at 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow.genericGet(rows.scala:201)
>         at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getAs(rows.scala:35)
>         at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.get(rows.scala:37)
>         at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.get$(rows.scala:37)
>         at 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow.get(rows.scala:195)
>         at 
> org.apache.spark.sql.catalyst.util.FailureSafeParser.$anonfun$toResultRow$2(FailureSafeParser.scala:47)
>         at scala.Option.map(Option.scala:230)
>         at 
> org.apache.spark.sql.catalyst.util.FailureSafeParser.$anonfun$toResultRow$1(FailureSafeParser.scala:47)
>         at 
> org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:64)
>         at 
> org.apache.spark.sql.DataFrameReader.$anonfun$json$10(DataFrameReader.scala:431)
>         at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>         at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
>         at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:888)
>         at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:888)
>         at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
>         at 
> org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
>         at org.apache.spark.scheduler.Task.run(Task.scala:139)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
>         at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
>         at 
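
Once the fix lands (Fix Version 3.5.0 above), a quick way to confirm the expected behaviour from the report is to assert on the collected row. This is an illustrative check that assumes an existing `spark` session on an unaffected release; it only asserts that the typed fields are null and that the raw record is retained, rather than checking the exact corrupt-record text shown in the "Expected behaviour" table.

{code:python}
# Illustrative post-fix sanity check, mirroring the "Expected behaviour" table.
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

data = """[{"a": "incorrect", "b": "correct"}]"""
schema = StructType([
    StructField("a", IntegerType(), True),
    StructField("b", StringType(), True),
    StructField("_corrupt_record", StringType(), True),
])

row = (
    spark.read
    .option("mode", "PERMISSIVE")
    .option("multiline", "true")
    .schema(schema)
    .json(spark.sparkContext.parallelize([data]))
    .collect()[0]
)

# Fields that could not be parsed into the schema come back as null,
# while the raw record text is kept in the corrupt-record column.
assert row["a"] is None and row["b"] is None
assert row["_corrupt_record"] is not None
{code}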

[jira] [Assigned] (SPARK-44079) Json reader crashes when a different schema is present

2023-06-29 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-44079:


Assignee: Jia Fan

> Json reader crashes when a different schema is present
> --
>
> Key: SPARK-44079
> URL: https://issues.apache.org/jira/browse/SPARK-44079
> Project: Spark
>  Issue Type: Bug
>  Components: python
>Affects Versions: 3.4.0
>Reporter: charlotte van der scheun
>Assignee: Jia Fan
>Priority: Major
>
> When using PySpark 3.4, we noticed that the reader crashes when reading a
> JSON file that contains a corrupted record. In PySpark 3.3 this worked fine.
> {*}Code{*}:
> {code:python}
> from pyspark.sql.types import StructType, StructField, IntegerType, StringType
> import json
>
> data = """[{"a": "incorrect", "b": "correct"}]"""
> schema = StructType([
>     StructField('a', IntegerType(), True),
>     StructField('b', StringType(), True),
>     StructField('_corrupt_record', StringType(), True)
> ])
> spark.read \
>     .option("mode", "PERMISSIVE") \
>     .option("multiline", "true") \
>     .schema(schema) \
>     .json(spark.sparkContext.parallelize([data])) \
>     .show(truncate=False)
> {code}
> *Used packages:*
>  * Pyspark==3.4.0
>  * python==3.10.0
>  * delta-spark==2.4.0
>  
> spark_jars=(
>   "org.apache.spark:spark-avro_2.12:3.4.0"
>   ",io.delta:delta-core_2.12:2.4.0"
>   ",com.databricks:spark-xml_2.12:0.16.0"
> )
>  
> {*}Expected behaviour{*}:
> |a|b|_corrupt_record|
> |null|null|[\\{"a": "incorrect", "b": "correct"}]|
>  
> {*}Actual behaviour{*}:
> {code:java}
>  
> *** py4j.protocol.Py4JJavaError: An error occurred while calling 
> o104.showString.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 
> in stage 2.0 failed 1 times, most recent failure: Lost task 4.0 in stage 2.0 
> (TID 9) (charlottesmbp2.home executor driver): 
> java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 1
>         at 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow.genericGet(rows.scala:201)
>         at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getAs(rows.scala:35)
>         at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.get(rows.scala:37)
>         at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.get$(rows.scala:37)
>         at 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow.get(rows.scala:195)
>         at 
> org.apache.spark.sql.catalyst.util.FailureSafeParser.$anonfun$toResultRow$2(FailureSafeParser.scala:47)
>         at scala.Option.map(Option.scala:230)
>         at 
> org.apache.spark.sql.catalyst.util.FailureSafeParser.$anonfun$toResultRow$1(FailureSafeParser.scala:47)
>         at 
> org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:64)
>         at 
> org.apache.spark.sql.DataFrameReader.$anonfun$json$10(DataFrameReader.scala:431)
>         at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>         at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
>         at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:888)
>         at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:888)
>         at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
>         at 
> org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
>         at org.apache.spark.scheduler.Task.run(Task.scala:139)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
>         at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
>         at java.base/java.lang.Thread.run(Thread.java:1589)
> Driver stacktrace:
>         at 
> org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2785)
>
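
For context on the mode option used in the snippet above, here is a brief, illustrative sketch of the three JSON parse modes supported by the DataFrameReader. It assumes an existing `spark` session on a release not affected by this bug; the data is the same single malformed record from the report.

{code:python}
# Illustrative comparison of JSON parse modes; not taken from the ticket.
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

data = """[{"a": "incorrect", "b": "correct"}]"""
rdd = spark.sparkContext.parallelize([data])

plain_fields = [
    StructField("a", IntegerType(), True),
    StructField("b", StringType(), True),
]
plain_schema = StructType(plain_fields)
# PERMISSIVE can additionally capture the raw text of malformed records.
permissive_schema = StructType(
    plain_fields + [StructField("_corrupt_record", StringType(), True)]
)

# PERMISSIVE (default): nulls out fields it cannot parse and keeps the raw
# record in the column named by columnNameOfCorruptRecord.
spark.read.option("mode", "PERMISSIVE") \
    .option("multiline", "true") \
    .option("columnNameOfCorruptRecord", "_corrupt_record") \
    .schema(permissive_schema).json(rdd).show(truncate=False)

# DROPMALFORMED: silently drops records that do not match the schema,
# so this shows an empty result for the malformed input above.
spark.read.option("mode", "DROPMALFORMED") \
    .option("multiline", "true") \
    .schema(plain_schema).json(rdd).show(truncate=False)

# FAILFAST: any action on this DataFrame (e.g. show()) raises an exception
# as soon as the malformed record is encountered.
failfast_df = spark.read.option("mode", "FAILFAST") \
    .option("multiline", "true") \
    .schema(plain_schema).json(rdd)
{code}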