[jira] [Created] (SPARK-43533) Enable MultiIndex test for IndexesTests.test_difference

2023-05-16 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-43533:
---

 Summary: Enable MultiIndex test for IndexesTests.test_difference
 Key: SPARK-43533
 URL: https://issues.apache.org/jira/browse/SPARK-43533
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark
Affects Versions: 3.5.0
Reporter: Haejoon Lee


Enable MultiIndex test for IndexesTests.test_difference






[jira] [Assigned] (SPARK-43532) Upgrade `jdbc` related test dependencies

2023-05-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-43532:
-

Assignee: BingKun Pan

> Upgrade `jdbc` related test dependencies
> 
>
> Key: SPARK-43532
> URL: https://issues.apache.org/jira/browse/SPARK-43532
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>







[jira] [Resolved] (SPARK-43532) Upgrade `jdbc` related test dependencies

2023-05-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-43532.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41194
[https://github.com/apache/spark/pull/41194]

> Upgrade `jdbc` related test dependencies
> 
>
> Key: SPARK-43532
> URL: https://issues.apache.org/jira/browse/SPARK-43532
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 3.5.0
>
>







[jira] [Commented] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates

2023-05-16 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17723338#comment-17723338
 ] 

Yuming Wang commented on SPARK-43526:
-

Why do you prefer shuffle hash join?

> when shuffle hash join is enabled, q95 performance deteriorates
> ---
>
> Key: SPARK-43526
> URL: https://issues.apache.org/jira/browse/SPARK-43526
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: caican
>Priority: Major
> Attachments: image-2023-05-16-21-21-35-493.png, 
> image-2023-05-16-21-22-16-170.png, image-2023-05-16-21-23-35-237.png, 
> image-2023-05-16-21-24-09-182.png, image-2023-05-16-21-28-11-514.png, 
> image-2023-05-16-21-28-44-163.png
>
>
> Testing with a 5TB dataset, the performance of q95 in TPC-DS deteriorates when 
> shuffle hash join is enabled, while performance is better when sortMergeJoin 
> is used.
>  
> Performance difference: from 3.9 min (sortMergeJoin) to 
> 8.1 min (shuffledHashJoin)
>  
> 1. enable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-28-44-163.png|width=935,height=64!
> !image-2023-05-16-21-21-35-493.png|width=924,height=502!
> 2. disable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-28-11-514.png|width=922,height=67!
> !image-2023-05-16-21-22-16-170.png|width=934,height=477!
>  
> And when shuffledHashJoin is enabled, GC overhead is very severe,
> !image-2023-05-16-21-23-35-237.png|width=929,height=570!
>  
> but sortMergeJoin executes without this problem.
> !image-2023-05-16-21-24-09-182.png|width=931,height=573!
>  
> Any suggestions on how to solve it? Thanks!
>  
>  
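For reference, the shuffled hash join preference in this comparison is controlled by
standard SQL configs. A minimal sketch of how the two runs can be toggled (assuming an
existing SparkSession `spark` and the TPC-DS q95 text in a variable `q95Sql`, both
placeholders, not values taken from the report):

{code:scala}
// Allow the planner to choose shuffled hash join instead of sort-merge join.
spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")
// Optionally disable broadcast joins so the hash-vs-sort-merge comparison is not masked.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
spark.sql(q95Sql).collect()

// Revert to the default (sort-merge preferred) and re-run for comparison.
spark.conf.set("spark.sql.join.preferSortMergeJoin", "true")
spark.sql(q95Sql).collect()
{code}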






[jira] [Commented] (SPARK-43509) Support creating multiple sessions for Spark Connect in PySpark

2023-05-16 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17723336#comment-17723336
 ] 

Snoot.io commented on SPARK-43509:
--

User 'grundprinzip' has created a pull request for this issue:
https://github.com/apache/spark/pull/41013

> Support creating multiple sessions for Spark Connect in PySpark
> ---
>
> Key: SPARK-43509
> URL: https://issues.apache.org/jira/browse/SPARK-43509
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Martin Grund
>Priority: Major
>







[jira] [Updated] (SPARK-43461) Skip compiling useless files when making distribution

2023-05-16 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-43461:

Fix Version/s: 3.5.0

> Skip compiling useless files when making distribution
> -
>
> Key: SPARK-43461
> URL: https://issues.apache.org/jira/browse/SPARK-43461
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.5.0
>
>
> -Dmaven.javadoc.skip=true to skip java doc
> -Dskip=true to skip scala doc. Please see: 
> https://davidb.github.io/scala-maven-plugin/doc-jar-mojo.html#skip
> -Dmaven.source.skip to skip build sources.jar
> -Dmaven.test.skip to skip build test-jar
> -Dcyclonedx.skip=true to skip making bom. Please see: 
> https://cyclonedx.github.io/cyclonedx-maven-plugin/makeBom-mojo.html#skip
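Taken together, a distribution build that skips all of these steps would look roughly
like the following (illustrative invocation only; the distribution name and Maven
profiles are placeholders, not part of this ticket):

{code:none}
./dev/make-distribution.sh --name custom --tgz \
  -Phive -Phive-thriftserver -Pyarn \
  -Dmaven.javadoc.skip=true -Dskip=true \
  -Dmaven.source.skip -Dmaven.test.skip \
  -Dcyclonedx.skip=true
{code}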






[jira] [Resolved] (SPARK-43461) Skip compiling useless files when making distribution

2023-05-16 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie resolved SPARK-43461.
--
Resolution: Fixed

Issue resolved by pull request 41141

https://github.com/apache/spark/pull/41141

> Skip compiling useless files when making distribution
> -
>
> Key: SPARK-43461
> URL: https://issues.apache.org/jira/browse/SPARK-43461
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> -Dmaven.javadoc.skip=true to skip java doc
> -Dskip=true to skip scala doc. Please see: 
> https://davidb.github.io/scala-maven-plugin/doc-jar-mojo.html#skip
> -Dmaven.source.skip to skip build sources.jar
> -Dmaven.test.skip to skip build test-jar
> -Dcyclonedx.skip=true to skip making bom. Please see: 
> https://cyclonedx.github.io/cyclonedx-maven-plugin/makeBom-mojo.html#skip






[jira] [Assigned] (SPARK-43461) Skip compiling useless files when making distribution

2023-05-16 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie reassigned SPARK-43461:


Assignee: Yuming Wang

> Skip compiling useless files when making distribution
> -
>
> Key: SPARK-43461
> URL: https://issues.apache.org/jira/browse/SPARK-43461
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> -Dmaven.javadoc.skip=true to skip java doc
> -Dskip=true to skip scala doc. Please see: 
> https://davidb.github.io/scala-maven-plugin/doc-jar-mojo.html#skip
> -Dmaven.source.skip to skip build sources.jar
> -Dmaven.test.skip to skip build test-jar
> -Dcyclonedx.skip=true to skip making bom. Please see: 
> https://cyclonedx.github.io/cyclonedx-maven-plugin/makeBom-mojo.html#skip






[jira] [Assigned] (SPARK-43531) Enable more parity tests for Pandas UDFs.

2023-05-16 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-43531:
-

Assignee: Takuya Ueshin

> Enable more parity tests for Pandas UDFs.
> -
>
> Key: SPARK-43531
> URL: https://issues.apache.org/jira/browse/SPARK-43531
> Project: Spark
>  Issue Type: Test
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
>







[jira] [Resolved] (SPARK-43531) Enable more parity tests for Pandas UDFs.

2023-05-16 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-43531.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41193
[https://github.com/apache/spark/pull/41193]

> Enable more parity tests for Pandas UDFs.
> -
>
> Key: SPARK-43531
> URL: https://issues.apache.org/jira/browse/SPARK-43531
> Project: Spark
>  Issue Type: Test
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.5.0
>
>







[jira] [Updated] (SPARK-43488) bitmap function

2023-05-16 Thread yiku123 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yiku123 updated SPARK-43488:

Description: 
Maybe Spark needs some bitmap functions, for example bitmapBuild, bitmapAnd, and 
bitmapAndCardinality as in ClickHouse or other OLAP engines.

This is often used in user-profiling applications, but I can't find it in Spark.


  was:
Maybe Spark needs some bitmap functions, for example bitmapBuild, bitmapAnd, and 
bitmapAndCardinality as in ClickHouse or other OLAP engines.

This is often used in user-profiling applications, but I can't find it in Spark.


> bitmap function
> ---
>
> Key: SPARK-43488
> URL: https://issues.apache.org/jira/browse/SPARK-43488
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: yiku123
>Priority: Major
>
> Maybe Spark needs some bitmap functions, for example bitmapBuild, bitmapAnd, 
> and bitmapAndCardinality as in ClickHouse or other OLAP engines.
> This is often used in user-profiling applications, but I can't find it in Spark.
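As a stopgap until such functions exist, an AND-cardinality over two ID sets can be
approximated with the array functions Spark already ships, though without the
compactness of a real bitmap type. A rough sketch (the data and column names are made
up for illustration, and an existing SparkSession named `spark` is assumed):

{code:scala}
import org.apache.spark.sql.functions.{array_intersect, size}
import spark.implicits._

// Two per-segment ID lists, e.g. previously built with collect_set().
val segments = Seq(
  (Array(1L, 2L, 3L), Array(2L, 3L, 4L))
).toDF("segmentA", "segmentB")

// Rough analogue of ClickHouse's bitmapAndCardinality(segmentA, segmentB).
segments.select(
  size(array_intersect($"segmentA", $"segmentB")).as("and_cardinality")
).show()
{code}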






[jira] [Created] (SPARK-43532) Upgrade `jdbc` related test dependencies

2023-05-16 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-43532:
---

 Summary: Upgrade `jdbc` related test dependencies
 Key: SPARK-43532
 URL: https://issues.apache.org/jira/browse/SPARK-43532
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.5.0
Reporter: BingKun Pan









[jira] [Resolved] (SPARK-43524) Memory leak in Spark UI

2023-05-16 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-43524.
-
Resolution: Duplicate

> Memory leak in Spark UI
> ---
>
> Key: SPARK-43524
> URL: https://issues.apache.org/jira/browse/SPARK-43524
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.4
>Reporter: Amine Bagdouri
>Priority: Major
>
> We have a distributed Spark application running on Azure HDInsight using 
> Spark version 2.4.4.
> After a few days of active processing on our application, we have noticed 
> that the GC CPU time ratio of the driver is close to 100%. We suspected a 
> memory leak. Thus, we have produced a heap dump and analyzed it using Eclipse 
> Memory Analyzer.
> Here is some interesting data from the driver's heap dump (heap size is 8 GB):
>  * The estimated retained heap size of String objects (~5M instances) is 3.3 
> GB. It seems that most of these instances correspond to spark events.
>  * Spark UI's AppStatusListener instance estimated retained size is 1.1 GB.
>  * The number of LiveJob objects with status "RUNNING" is 18K, knowing that 
> there shouldn't be more than 16 live running jobs since we use a fixed size 
> thread pool of 16 threads to run spark queries.
>  * The number of LiveTask objects is 485K.
>  * The AsyncEventQueue instance associated with the AppStatusListener has a 
> value of 854 for the dropped events count and a value of 10001 for the total events 
> count, knowing that the dropped events counter is reset every minute and that 
> the queue's default capacity is 10,000.
> We think that there is a memory leak in Spark UI. Here is our analysis of the 
> root cause of this leak:
>  * AppStatusListener is notified of Spark events using a bounded queue in 
> AsyncEventQueue.
>  * AppStatusListener updates its state (kvstore, liveTasks, liveStages, 
> liveJobs, ...) based on the received events. For example, onTaskStart adds a 
> task to liveTasks map and onTaskEnd removes the task from liveTasks map.
>  * When the rate of events is very high, the bounded queue in AsyncEventQueue 
> is full, some events are dropped and don't make it to AppStatusListener.
>  * Dropped events that signal the end of a processing unit prevent the state 
> of AppStatusListener from being cleaned. For example, a dropped onTaskEnd 
> event, will prevent the task from being removed from liveTasks map, and the 
> task will remain in the heap until the driver's JVM is stopped.
> We were able to confirm our analysis by reducing the capacity of the 
> AsyncEventQueue (spark.scheduler.listenerbus.eventqueue.capacity=10). After 
> having launched many spark queries using this config, we observed that the 
> number of active jobs in Spark UI increased rapidly and remained high even 
> though all submitted queries have completed. We have also noticed that some 
> executor task counters in Spark UI were negative, which confirms that 
> AppStatusListener state does not accurately reflect the reality and that it 
> can be a victim of event drops.
> Suggested fix:
> There are some limits today on the number of "dead" objects in 
> AppStatusListener's maps (for example: spark.ui.retainedJobs). We suggest 
> enforcing another configurable limit on the total number of objects in 
> AppStatusListener's maps and kvstore. This should limit the leak in the case 
> of a high event rate, but AppStatusListener stats will remain inaccurate.
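For reference, the reproduction described above amounts to starting the driver with a
deliberately small listener-bus queue so that event drops show up quickly. A minimal
sketch (application name and workload are placeholders):

{code:scala}
import org.apache.spark.sql.SparkSession

// The default capacity is 10000; shrinking it forces AsyncEventQueue to drop events,
// which leaves stale LiveJob/LiveTask entries behind in AppStatusListener.
val spark = SparkSession.builder()
  .appName("listener-drop-repro")
  .config("spark.scheduler.listenerbus.eventqueue.capacity", "10")
  .getOrCreate()

// Submit many short queries, then check the Jobs page of the UI for "running" jobs
// that never finish and for negative executor task counters.
{code}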






[jira] [Updated] (SPARK-43521) Support CREATE TABLE LIKE FILE for PARQUET

2023-05-16 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-43521:

Issue Type: New Feature  (was: Bug)

> Support CREATE TABLE LIKE FILE for PARQUET
> --
>
> Key: SPARK-43521
> URL: https://issues.apache.org/jira/browse/SPARK-43521
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: melin
>Priority: Major
>
> ref: https://issues.apache.org/jira/browse/HIVE-26395






[jira] [Created] (SPARK-43531) Enable more parity tests for Pandas UDFs.

2023-05-16 Thread Takuya Ueshin (Jira)
Takuya Ueshin created SPARK-43531:
-

 Summary: Enable more parity tests for Pandas UDFs.
 Key: SPARK-43531
 URL: https://issues.apache.org/jira/browse/SPARK-43531
 Project: Spark
  Issue Type: Test
  Components: Connect, PySpark
Affects Versions: 3.5.0
Reporter: Takuya Ueshin









[jira] [Assigned] (SPARK-43525) Enhance ImportOrderChecker rules for `group.scala`

2023-05-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-43525:


Assignee: BingKun Pan

> Enhance ImportOrderChecker rules for `group.scala`
> --
>
> Key: SPARK-43525
> URL: https://issues.apache.org/jira/browse/SPARK-43525
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>







[jira] [Resolved] (SPARK-43525) Enhance ImportOrderChecker rules for `group.scala`

2023-05-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-43525.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41185
[https://github.com/apache/spark/pull/41185]

> Enhance ImportOrderChecker rules for `group.scala`
> --
>
> Key: SPARK-43525
> URL: https://issues.apache.org/jira/browse/SPARK-43525
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 3.5.0
>
>







[jira] [Resolved] (SPARK-43528) Support duplicated field names in createDataFrame with pandas DataFrame.

2023-05-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-43528.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41190
[https://github.com/apache/spark/pull/41190]

> Support duplicated field names in createDataFrame with pandas DataFrame.
> 
>
> Key: SPARK-43528
> URL: https://issues.apache.org/jira/browse/SPARK-43528
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.5.0
>
>







[jira] [Assigned] (SPARK-43528) Support duplicated field names in createDataFrame with pandas DataFrame.

2023-05-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-43528:


Assignee: Takuya Ueshin

> Support duplicated field names in createDataFrame with pandas DataFrame.
> 
>
> Key: SPARK-43528
> URL: https://issues.apache.org/jira/browse/SPARK-43528
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
>







[jira] [Resolved] (SPARK-43527) Fix catalog.listCatalogs in PySpark

2023-05-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-43527.
--
Fix Version/s: 3.5.0
   3.4.1
   Resolution: Fixed

Issue resolved by pull request 41186
[https://github.com/apache/spark/pull/41186]

> Fix catalog.listCatalogs in PySpark
> ---
>
> Key: SPARK-43527
> URL: https://issues.apache.org/jira/browse/SPARK-43527
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.4.0, 3.4.1, 3.5.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Critical
> Fix For: 3.5.0, 3.4.1
>
>







[jira] [Assigned] (SPARK-43527) Fix catalog.listCatalogs in PySpark

2023-05-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-43527:


Assignee: Ruifeng Zheng

> Fix catalog.listCatalogs in PySpark
> ---
>
> Key: SPARK-43527
> URL: https://issues.apache.org/jira/browse/SPARK-43527
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.4.0, 3.4.1, 3.5.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Critical
>







[jira] [Resolved] (SPARK-43360) Scala Connect: Add StreamingQueryManager API

2023-05-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-43360.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41039
[https://github.com/apache/spark/pull/41039]

> Scala Connect: Add StreamingQueryManager API
> 
>
> Key: SPARK-43360
> URL: https://issues.apache.org/jira/browse/SPARK-43360
> Project: Spark
>  Issue Type: Task
>  Components: Connect, Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Wei Liu
>Assignee: Wei Liu
>Priority: Major
> Fix For: 3.5.0
>
>







[jira] [Assigned] (SPARK-43360) Scala Connect: Add StreamingQueryManager API

2023-05-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-43360:


Assignee: Wei Liu

> Scala Connect: Add StreamingQueryManager API
> 
>
> Key: SPARK-43360
> URL: https://issues.apache.org/jira/browse/SPARK-43360
> Project: Spark
>  Issue Type: Task
>  Components: Connect, Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Wei Liu
>Assignee: Wei Liu
>Priority: Major
>







[jira] [Created] (SPARK-43530) Protobuf: Read descriptor file only once at the compile time

2023-05-16 Thread Raghu Angadi (Jira)
Raghu Angadi created SPARK-43530:


 Summary: Protobuf: Read descriptor file only once at the compile 
time
 Key: SPARK-43530
 URL: https://issues.apache.org/jira/browse/SPARK-43530
 Project: Spark
  Issue Type: Task
  Components: Protobuf
Affects Versions: 3.5.0
Reporter: Raghu Angadi
 Fix For: 3.5.0


Protobuf functions read from the descriptor file many times (e.g. at each 
executor). This is unnecessary and error prone (e.g. what if the contents change a 
couple of days after the streaming query starts?).

 

It only needs to be read once. 
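For context, this concerns the descriptor-file variants of the protobuf functions, where
a file path is passed in and currently re-read wherever the function is evaluated. A
minimal usage sketch (assuming an existing SparkSession `spark`; the topic, message name,
and descriptor path are hypothetical):

{code:scala}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.protobuf.functions.from_protobuf

// The proposal is to read the bytes behind "/path/to/events.desc" once, when the query
// is compiled on the driver, instead of re-reading the file on each executor.
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()
  .select(from_protobuf(col("value"), "Event", "/path/to/events.desc").as("event"))
{code}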






[jira] [Created] (SPARK-43529) Support general expressions as OPTIONS values

2023-05-16 Thread Daniel (Jira)
Daniel created SPARK-43529:
--

 Summary: Support general expressions as OPTIONS values 
 Key: SPARK-43529
 URL: https://issues.apache.org/jira/browse/SPARK-43529
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0
Reporter: Daniel









[jira] [Created] (SPARK-43528) Support duplicated field names in createDataFrame with pandas DataFrame.

2023-05-16 Thread Takuya Ueshin (Jira)
Takuya Ueshin created SPARK-43528:
-

 Summary: Support duplicated field names in createDataFrame with 
pandas DataFrame.
 Key: SPARK-43528
 URL: https://issues.apache.org/jira/browse/SPARK-43528
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Takuya Ueshin









[jira] [Resolved] (SPARK-42958) Refactor `CheckConnectJvmClientCompatibility` to compare client and avro

2023-05-16 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-42958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hövell resolved SPARK-42958.
---
Fix Version/s: 3.5.0
 Assignee: Yang Jie
   Resolution: Fixed

> Refactor `CheckConnectJvmClientCompatibility` to compare client and avro
> 
>
> Key: SPARK-42958
> URL: https://issues.apache.org/jira/browse/SPARK-42958
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
> Fix For: 3.5.0
>
>







[jira] [Updated] (SPARK-43514) Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML features caused by certain SQL functions

2023-05-16 Thread Svyatoslav Semenyuk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Svyatoslav Semenyuk updated SPARK-43514:

Environment: 
Scala version: 2.12.17

Test examples were executed inside Zeppelin 0.10.1 with Spark 3.4.0.

Spark 3.3.2 deployed on cluster was used to check the issue on real data.

  was:
Scala version: 2.12.17

Test examples were executed inside Zeppelin 0.10.1 with Spark 3.4.0.

Spark 3.3.1 deployed on cluster was used to check the issue on real data.


> Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML 
> features caused by certain SQL functions
> --
>
> Key: SPARK-43514
> URL: https://issues.apache.org/jira/browse/SPARK-43514
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SQL
>Affects Versions: 3.3.2, 3.4.0
> Environment: Scala version: 2.12.17
> Test examples were executed inside Zeppelin 0.10.1 with Spark 3.4.0.
> Spark 3.3.2 deployed on cluster was used to check the issue on real data.
>Reporter: Svyatoslav Semenyuk
>Priority: Major
>  Labels: ml, sql
>
> We designed a function that joins two DFs on a common column using some 
> similarity measure. All of the following code is in Scala 2.12.
> I've added {{show}} calls for demonstration purposes.
> {code:scala}
> import org.apache.spark.ml.Pipeline
> import org.apache.spark.ml.feature.{HashingTF, MinHashLSH, NGram, 
> RegexTokenizer, MinHashLSHModel}
> import org.apache.spark.sql.{DataFrame, Column}
> import org.apache.spark.sql.functions.col
> /**
>  * Joins two data frames on a string column using LSH algorithm
>  * for similarity computation.
>  *
>  * If input data frames have columns with identical names,
>  * the resulting dataframe will have columns from them both
>  * with prefixes `datasetA` and `datasetB` respectively.
>  *
>  * For example, if both dataframes have a column with name `myColumn`,
>  * then the result will have columns `datasetAMyColumn` and 
> `datasetBMyColumn`.
>  */
> def similarityJoin(
> df: DataFrame,
> anotherDf: DataFrame,
> joinExpr: String,
> threshold: Double = 0.8,
> ): DataFrame = {
> df.show(false)
> anotherDf.show(false)
> val pipeline = new Pipeline().setStages(Array(
> new RegexTokenizer()
> .setPattern("")
> .setMinTokenLength(1)
> .setInputCol(joinExpr)
> .setOutputCol("tokens"),
> new NGram().setN(3).setInputCol("tokens").setOutputCol("ngrams"),
> new HashingTF().setInputCol("ngrams").setOutputCol("vectors"),
> new MinHashLSH().setInputCol("vectors").setOutputCol("lsh"),
> )
> )
> val model = pipeline.fit(df)
> val storedHashed = model.transform(df)
> val landedHashed = model.transform(anotherDf)
> val commonColumns = df.columns.toSet & anotherDf.columns.toSet
> /**
>  * Converts column name from a data frame to the column of resulting 
> dataset.
>  */
> def convertColumn(datasetName: String)(columnName: String): Column = {
> val newName =
> if (commonColumns.contains(columnName)) 
> s"$datasetName${columnName.capitalize}"
> else columnName
> col(s"$datasetName.$columnName") as newName
> }
> val columnsToSelect = df.columns.map(convertColumn("datasetA")) ++
>   anotherDf.columns.map(convertColumn("datasetB"))
> val result = model
> .stages
> .last
> .asInstanceOf[MinHashLSHModel]
> .approxSimilarityJoin(storedHashed, landedHashed, threshold, 
> "confidence")
> .select(columnsToSelect.toSeq: _*)
> result.show(false)
> result
> }
> {code}
> Now consider such simple example:
> {code:scala}
> val inputDF1 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df1"
> val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2"
> similarityJoin(inputDF1, inputDF2, "name", 0.6)
> {code}
> This example runs with no errors and outputs 3 empty DFs. Let's add 
> {{distinct}} method to one data frame:
> {code:scala}
> val inputDF1 = Seq("", null).toDF("name").distinct().filter(length($"name") > 
> 2) as "df1"
> val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2"
> similarityJoin(inputDF1, inputDF2, "name", 0.6)
> {code}
> This example outputs two empty DFs and then fails at {{result.show(false)}}. 
> Error:
> {code:none}
> org.apache.spark.SparkException: [FAILED_EXECUTE_UDF] Failed to execute user 
> defined function (LSHModel$$Lambda$3769/0x000101804840: 
> (struct<type:tinyint,size:int,indices:array<int>,values:array<double>>) => 
> array<struct<type:tinyint,size:int,indices:array<int>,values:array<double>>>).
>   ... many elided
> Caused by: java.lang.IllegalArgumentException: requirement failed: Must have 
> at least 1 non zero entry.

[jira] [Resolved] (SPARK-43043) Improve the performance of MapOutputTracker.updateMapOutput

2023-05-16 Thread Xingbo Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xingbo Jiang resolved SPARK-43043.
--
Fix Version/s: 3.4.1
   Resolution: Done

> Improve the performance of MapOutputTracker.updateMapOutput
> ---
>
> Key: SPARK-43043
> URL: https://issues.apache.org/jira/browse/SPARK-43043
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.2
>Reporter: Xingbo Jiang
>Assignee: Xingbo Jiang
>Priority: Major
> Fix For: 3.4.1
>
>
> Inside of MapOutputTracker, there is a line of code which does a linear find 
> through a mapStatuses collection: 
> https://github.com/apache/spark/blob/cb48c0e48eeff2b7b51176d0241491300e5aad6f/core/src/main/scala/org/apache/spark/MapOutputTracker.scala#L167
>   (plus a similar search a few lines down at 
> https://github.com/apache/spark/blob/cb48c0e48eeff2b7b51176d0241491300e5aad6f/core/src/main/scala/org/apache/spark/MapOutputTracker.scala#L174)
> This scan is necessary because we only know the mapId of the updated status 
> and not its mapPartitionId.
> We perform this scan once per migrated block, so if a large proportion of all 
> blocks in the map are migrated then we get O(n^2) total runtime across all of 
> the calls.
> I think we might be able to fix this by extending ShuffleStatus to have an 
> OpenHashMap mapping from mapId to mapPartitionId. 
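A rough sketch of that idea (names are illustrative, not the actual ShuffleStatus
fields): maintain a secondary index from mapId to the map partition index, so the update
becomes a hash lookup instead of a linear scan over mapStatuses.

{code:scala}
import scala.collection.mutable

// Illustrative only: a side index kept in sync with the mapStatuses array.
class MapIdIndex {
  private val mapIdToIndex = new mutable.HashMap[Long, Int]()

  // Call whenever a MapStatus is registered or replaced for a partition.
  def record(mapId: Long, mapPartitionId: Int): Unit =
    mapIdToIndex.put(mapId, mapPartitionId)

  // O(1) replacement for scanning mapStatuses to find the entry with a given mapId.
  def lookup(mapId: Long): Option[Int] = mapIdToIndex.get(mapId)
}
{code}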






[jira] [Assigned] (SPARK-43359) DELETE from Hive table result in INTERNAL error

2023-05-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-43359:
-

Assignee: BingKun Pan

> DELETE from Hive table result in INTERNAL error
> ---
>
> Key: SPARK-43359
> URL: https://issues.apache.org/jira/browse/SPARK-43359
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Serge Rielau
>Assignee: BingKun Pan
>Priority: Minor
>
> spark-sql (default)> CREATE TABLE T1(c1 INT);
> spark-sql (default)> DELETE FROM T1 WHERE c1 = 1;
> [INTERNAL_ERROR] Unexpected table relation: HiveTableRelation 
> [`spark_catalog`.`default`.`t1`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#3], 
> Partition Cols: []]
> org.apache.spark.SparkException: [INTERNAL_ERROR] Unexpected table relation: 
> HiveTableRelation [`spark_catalog`.`default`.`t1`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#3], 
> Partition Cols: []]
>   at 
> org.apache.spark.SparkException$.internalError(SparkException.scala:77)
>   at 
> org.apache.spark.SparkException$.internalError(SparkException.scala:81)
>   at 
> org.apache.spark.sql.execution.datasources.v2.DataSourceV2Strategy.apply(DataSourceV2Strategy.scala:310)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$1(QueryPlanner.scala:63)
>   at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
>   at 
> org.apache.spark.sql.execution.SparkStrategies.plan(SparkStrategies.scala:70)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$3(QueryPlanner.scala:78)






[jira] [Resolved] (SPARK-43359) DELETE from Hive table result in INTERNAL error

2023-05-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-43359.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41172
[https://github.com/apache/spark/pull/41172]

> DELETE from Hive table result in INTERNAL error
> ---
>
> Key: SPARK-43359
> URL: https://issues.apache.org/jira/browse/SPARK-43359
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Serge Rielau
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 3.5.0
>
>
> spark-sql (default)> CREATE TABLE T1(c1 INT);
> spark-sql (default)> DELETE FROM T1 WHERE c1 = 1;
> [INTERNAL_ERROR] Unexpected table relation: HiveTableRelation 
> [`spark_catalog`.`default`.`t1`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#3], 
> Partition Cols: []]
> org.apache.spark.SparkException: [INTERNAL_ERROR] Unexpected table relation: 
> HiveTableRelation [`spark_catalog`.`default`.`t1`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#3], 
> Partition Cols: []]
>   at 
> org.apache.spark.SparkException$.internalError(SparkException.scala:77)
>   at 
> org.apache.spark.SparkException$.internalError(SparkException.scala:81)
>   at 
> org.apache.spark.sql.execution.datasources.v2.DataSourceV2Strategy.apply(DataSourceV2Strategy.scala:310)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$1(QueryPlanner.scala:63)
>   at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
>   at 
> org.apache.spark.sql.execution.SparkStrategies.plan(SparkStrategies.scala:70)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$3(QueryPlanner.scala:78)






[jira] [Commented] (SPARK-43514) Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML features caused by certain SQL functions

2023-05-16 Thread Svyatoslav Semenyuk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17723216#comment-17723216
 ] 

Svyatoslav Semenyuk commented on SPARK-43514:
-

We applied the "current workaround" to the application code, and it does not solve 
the issue.

> Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML 
> features caused by certain SQL functions
> --
>
> Key: SPARK-43514
> URL: https://issues.apache.org/jira/browse/SPARK-43514
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SQL
>Affects Versions: 3.3.2, 3.4.0
> Environment: Scala version: 2.12.17
> Test examples were executed inside Zeppelin 0.10.1 with Spark 3.4.0.
> Spark 3.3.1 deployed on cluster was used to check the issue on real data.
>Reporter: Svyatoslav Semenyuk
>Priority: Major
>  Labels: ml, sql
>
> We designed a function that joins two DFs on a common column using some 
> similarity measure. All of the following code is in Scala 2.12.
> I've added {{show}} calls for demonstration purposes.
> {code:scala}
> import org.apache.spark.ml.Pipeline
> import org.apache.spark.ml.feature.{HashingTF, MinHashLSH, NGram, 
> RegexTokenizer, MinHashLSHModel}
> import org.apache.spark.sql.{DataFrame, Column}
> import org.apache.spark.sql.functions.col
> /**
>  * Joins two data frames on a string column using LSH algorithm
>  * for similarity computation.
>  *
>  * If input data frames have columns with identical names,
>  * the resulting dataframe will have columns from them both
>  * with prefixes `datasetA` and `datasetB` respectively.
>  *
>  * For example, if both dataframes have a column with name `myColumn`,
>  * then the result will have columns `datasetAMyColumn` and 
> `datasetBMyColumn`.
>  */
> def similarityJoin(
> df: DataFrame,
> anotherDf: DataFrame,
> joinExpr: String,
> threshold: Double = 0.8,
> ): DataFrame = {
> df.show(false)
> anotherDf.show(false)
> val pipeline = new Pipeline().setStages(Array(
> new RegexTokenizer()
> .setPattern("")
> .setMinTokenLength(1)
> .setInputCol(joinExpr)
> .setOutputCol("tokens"),
> new NGram().setN(3).setInputCol("tokens").setOutputCol("ngrams"),
> new HashingTF().setInputCol("ngrams").setOutputCol("vectors"),
> new MinHashLSH().setInputCol("vectors").setOutputCol("lsh"),
> )
> )
> val model = pipeline.fit(df)
> val storedHashed = model.transform(df)
> val landedHashed = model.transform(anotherDf)
> val commonColumns = df.columns.toSet & anotherDf.columns.toSet
> /**
>  * Converts column name from a data frame to the column of resulting 
> dataset.
>  */
> def convertColumn(datasetName: String)(columnName: String): Column = {
> val newName =
> if (commonColumns.contains(columnName)) 
> s"$datasetName${columnName.capitalize}"
> else columnName
> col(s"$datasetName.$columnName") as newName
> }
> val columnsToSelect = df.columns.map(convertColumn("datasetA")) ++
>   anotherDf.columns.map(convertColumn("datasetB"))
> val result = model
> .stages
> .last
> .asInstanceOf[MinHashLSHModel]
> .approxSimilarityJoin(storedHashed, landedHashed, threshold, 
> "confidence")
> .select(columnsToSelect.toSeq: _*)
> result.show(false)
> result
> }
> {code}
> Now consider such simple example:
> {code:scala}
> val inputDF1 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df1"
> val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2"
> similarityJoin(inputDF1, inputDF2, "name", 0.6)
> {code}
> This example runs with no errors and outputs 3 empty DFs. Let's add 
> {{distinct}} method to one data frame:
> {code:scala}
> val inputDF1 = Seq("", null).toDF("name").distinct().filter(length($"name") > 
> 2) as "df1"
> val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2"
> similarityJoin(inputDF1, inputDF2, "name", 0.6)
> {code}
> This example outputs two empty DFs and then fails at {{result.show(false)}}. 
> Error:
> {code:none}
> org.apache.spark.SparkException: [FAILED_EXECUTE_UDF] Failed to execute user 
> defined function (LSHModel$$Lambda$3769/0x000101804840: 
> (struct<type:tinyint,size:int,indices:array<int>,values:array<double>>) => 
> array<struct<type:tinyint,size:int,indices:array<int>,values:array<double>>>).
>   ... many elided
> Caused by: java.lang.IllegalArgumentException: requirement failed: Must have 
> at least 1 non zero entry.
>   at scala.Predef$.require(Predef.scala:281)
>   at 
> org.apache.spark.ml.feature.MinHashLSHModel.hashFunction(MinHashLSH.scala:61)
>   at org.apache.spark.ml.feature.LSHModel.$anonfun$transform$1(LSH.scala:99)
>   ... many more

[jira] [Updated] (SPARK-43514) Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML features caused by certain SQL functions

2023-05-16 Thread Svyatoslav Semenyuk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Svyatoslav Semenyuk updated SPARK-43514:

Affects Version/s: 3.3.2
   (was: 3.3.1)

> Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML 
> features caused by certain SQL functions
> --
>
> Key: SPARK-43514
> URL: https://issues.apache.org/jira/browse/SPARK-43514
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SQL
>Affects Versions: 3.3.2, 3.4.0
> Environment: Scala version: 2.12.17
> Test examples were executed inside Zeppelin 0.10.1 with Spark 3.4.0.
> Spark 3.3.1 deployed on cluster was used to check the issue on real data.
>Reporter: Svyatoslav Semenyuk
>Priority: Major
>  Labels: ml, sql
>
> We designed a function that joins two DFs on a common column using some 
> similarity measure. All of the following code is in Scala 2.12.
> I've added {{show}} calls for demonstration purposes.
> {code:scala}
> import org.apache.spark.ml.Pipeline
> import org.apache.spark.ml.feature.{HashingTF, MinHashLSH, NGram, 
> RegexTokenizer, MinHashLSHModel}
> import org.apache.spark.sql.{DataFrame, Column}
> import org.apache.spark.sql.functions.col
> /**
>  * Joins two data frames on a string column using LSH algorithm
>  * for similarity computation.
>  *
>  * If input data frames have columns with identical names,
>  * the resulting dataframe will have columns from them both
>  * with prefixes `datasetA` and `datasetB` respectively.
>  *
>  * For example, if both dataframes have a column with name `myColumn`,
>  * then the result will have columns `datasetAMyColumn` and 
> `datasetBMyColumn`.
>  */
> def similarityJoin(
> df: DataFrame,
> anotherDf: DataFrame,
> joinExpr: String,
> threshold: Double = 0.8,
> ): DataFrame = {
> df.show(false)
> anotherDf.show(false)
> val pipeline = new Pipeline().setStages(Array(
> new RegexTokenizer()
> .setPattern("")
> .setMinTokenLength(1)
> .setInputCol(joinExpr)
> .setOutputCol("tokens"),
> new NGram().setN(3).setInputCol("tokens").setOutputCol("ngrams"),
> new HashingTF().setInputCol("ngrams").setOutputCol("vectors"),
> new MinHashLSH().setInputCol("vectors").setOutputCol("lsh"),
> )
> )
> val model = pipeline.fit(df)
> val storedHashed = model.transform(df)
> val landedHashed = model.transform(anotherDf)
> val commonColumns = df.columns.toSet & anotherDf.columns.toSet
> /**
>  * Converts column name from a data frame to the column of resulting 
> dataset.
>  */
> def convertColumn(datasetName: String)(columnName: String): Column = {
> val newName =
> if (commonColumns.contains(columnName)) 
> s"$datasetName${columnName.capitalize}"
> else columnName
> col(s"$datasetName.$columnName") as newName
> }
> val columnsToSelect = df.columns.map(convertColumn("datasetA")) ++
>   anotherDf.columns.map(convertColumn("datasetB"))
> val result = model
> .stages
> .last
> .asInstanceOf[MinHashLSHModel]
> .approxSimilarityJoin(storedHashed, landedHashed, threshold, 
> "confidence")
> .select(columnsToSelect.toSeq: _*)
> result.show(false)
> result
> }
> {code}
> Now consider such simple example:
> {code:scala}
> val inputDF1 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df1"
> val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2"
> similarityJoin(inputDF1, inputDF2, "name", 0.6)
> {code}
> This example runs with no errors and outputs 3 empty DFs. Let's add 
> {{distinct}} method to one data frame:
> {code:scala}
> val inputDF1 = Seq("", null).toDF("name").distinct().filter(length($"name") > 
> 2) as "df1"
> val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2"
> similarityJoin(inputDF1, inputDF2, "name", 0.6)
> {code}
> This example outputs two empty DFs and then fails at {{result.show(false)}}. 
> Error:
> {code:none}
> org.apache.spark.SparkException: [FAILED_EXECUTE_UDF] Failed to execute user 
> defined function (LSHModel$$Lambda$3769/0x000101804840: 
> (struct<type:tinyint,size:int,indices:array<int>,values:array<double>>) => 
> array<struct<type:tinyint,size:int,indices:array<int>,values:array<double>>>).
>   ... many elided
> Caused by: java.lang.IllegalArgumentException: requirement failed: Must have 
> at least 1 non zero entry.
>   at scala.Predef$.require(Predef.scala:281)
>   at 
> org.apache.spark.ml.feature.MinHashLSHModel.hashFunction(MinHashLSH.scala:61)
>   at org.apache.spark.ml.feature.LSHModel.$anonfun$transform$1(LSH.scala:99)
>   ... many more
> {code}
> 
> Now let's take a look at the example 

[jira] [Resolved] (SPARK-43520) Upgrade mysql-connector-java from 8.0.32 to 8.0.33

2023-05-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-43520.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41182
[https://github.com/apache/spark/pull/41182]

> Upgrade mysql-connector-java from 8.0.32 to 8.0.33
> --
>
> Key: SPARK-43520
> URL: https://issues.apache.org/jira/browse/SPARK-43520
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 3.5.0
>
>







[jira] [Assigned] (SPARK-43520) Upgrade mysql-connector-java from 8.0.32 to 8.0.33

2023-05-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-43520:
-

Assignee: BingKun Pan

> Upgrade mysql-connector-java from 8.0.32 to 8.0.33
> --
>
> Key: SPARK-43520
> URL: https://issues.apache.org/jira/browse/SPARK-43520
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>







[jira] [Assigned] (SPARK-38469) Use error classes in org.apache.spark.network

2023-05-16 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-38469:


Assignee: Bo Zhang

> Use error classes in org.apache.spark.network
> -
>
> Key: SPARK-38469
> URL: https://issues.apache.org/jira/browse/SPARK-38469
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Bo Zhang
>Assignee: Bo Zhang
>Priority: Major
>







[jira] [Resolved] (SPARK-38469) Use error classes in org.apache.spark.network

2023-05-16 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-38469.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41140
[https://github.com/apache/spark/pull/41140]

> Use error classes in org.apache.spark.network
> -
>
> Key: SPARK-38469
> URL: https://issues.apache.org/jira/browse/SPARK-38469
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Bo Zhang
>Assignee: Bo Zhang
>Priority: Major
> Fix For: 3.5.0
>
>







[jira] [Resolved] (SPARK-43512) Update stateStoreOperationsBenchmark to allow rocksdb jni upgrade

2023-05-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-43512.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41175
[https://github.com/apache/spark/pull/41175]

> Update stateStoreOperationsBenchmark to allow rocksdb jni upgrade
> -
>
> Key: SPARK-43512
> URL: https://issues.apache.org/jira/browse/SPARK-43512
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Anish Shrigondekar
>Assignee: Anish Shrigondekar
>Priority: Major
> Fix For: 3.5.0
>
>
> Update stateStoreOperationsBenchmark to allow rocksdb jni upgrade






[jira] [Assigned] (SPARK-43512) Update stateStoreOperationsBenchmark to allow rocksdb jni upgrade

2023-05-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-43512:
-

Assignee: Anish Shrigondekar

> Update stateStoreOperationsBenchmark to allow rocksdb jni upgrade
> -
>
> Key: SPARK-43512
> URL: https://issues.apache.org/jira/browse/SPARK-43512
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Anish Shrigondekar
>Assignee: Anish Shrigondekar
>Priority: Major
>
> Update stateStoreOperationsBenchmark to allow rocksdb jni upgrade






[jira] [Updated] (SPARK-43512) Update stateStoreOperationsBenchmark to allow rocksdb jni upgrade

2023-05-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-43512:
--
Issue Type: Test  (was: Task)

> Update stateStoreOperationsBenchmark to allow rocksdb jni upgrade
> -
>
> Key: SPARK-43512
> URL: https://issues.apache.org/jira/browse/SPARK-43512
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Anish Shrigondekar
>Priority: Major
>
> Update stateStoreOperationsBenchmark to allow rocksdb jni upgrade






[jira] [Updated] (SPARK-43512) Update stateStoreOperationsBenchmark to allow rocksdb jni upgrade

2023-05-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-43512:
--
Affects Version/s: 3.5.0
   (was: 3.4.0)

> Update stateStoreOperationsBenchmark to allow rocksdb jni upgrade
> -
>
> Key: SPARK-43512
> URL: https://issues.apache.org/jira/browse/SPARK-43512
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Anish Shrigondekar
>Priority: Major
>
> Update stateStoreOperationsBenchmark to allow rocksdb jni upgrade






[jira] [Commented] (SPARK-43522) Creating struct column occurs error 'org.apache.spark.sql.AnalysisException [DATATYPE_MISMATCH.CREATE_NAMED_STRUCT_WITHOUT_FOLDABLE_STRING]'

2023-05-16 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17723158#comment-17723158
 ] 

Jia Fan commented on SPARK-43522:
-

https://github.com/apache/spark/pull/41187

> Creating struct column occurs  error 'org.apache.spark.sql.AnalysisException 
> [DATATYPE_MISMATCH.CREATE_NAMED_STRUCT_WITHOUT_FOLDABLE_STRING]'
> -
>
> Key: SPARK-43522
> URL: https://issues.apache.org/jira/browse/SPARK-43522
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Heedo Lee
>Priority: Minor
>
> When creating a struct column in a DataFrame, code that ran without 
> problems in version 3.3.1 does not work in version 3.4.0.
>  
> Example
> {code:java}
> val testDF = Seq("a=b,c=d,d=f").toDF.withColumn("key_value", split('value, 
> ",")).withColumn("map_entry", transform(col("key_value"), x => 
> struct(split(x, "=").getItem(0), split(x, "=").getItem(1) ) )){code}
>  
> In 3.3.1
>  
> {code:java}
>  
> testDF.show()
> +-----------+---------------+--------------------+ 
> |      value|      key_value|           map_entry| 
> +-----------+---------------+--------------------+ 
> |a=b,c=d,d=f|[a=b, c=d, d=f]|[{a, b}, {c, d}, ...| 
> +-----------+---------------+--------------------+
>  
> testDF.printSchema()
> root
>  |-- value: string (nullable = true)
>  |-- key_value: array (nullable = true)
>  |    |-- element: string (containsNull = false)
>  |-- map_entry: array (nullable = true)
>  |    |-- element: struct (containsNull = false)
>  |    |    |-- col1: string (nullable = true)
>  |    |    |-- col2: string (nullable = true)
> {code}
>  
>  
> In 3.4.0
>  
> {code:java}
> org.apache.spark.sql.AnalysisException: 
> [DATATYPE_MISMATCH.CREATE_NAMED_STRUCT_WITHOUT_FOLDABLE_STRING] Cannot 
> resolve "struct(split(namedlambdavariable(), =, -1)[0], 
> split(namedlambdavariable(), =, -1)[1])" due to data type mismatch: Only 
> foldable `STRING` expressions are allowed to appear at odd position, but they 
> are ["0", "1"].;
> 'Project [value#41, key_value#45, transform(key_value#45, 
> lambdafunction(struct(0, split(lambda x_3#49, =, -1)[0], 1, split(lambda 
> x_3#49, =, -1)[1]), lambda x_3#49, false)) AS map_entry#48]
> +- Project [value#41, split(value#41, ,, -1) AS key_value#45]
>    +- LocalRelation [value#41]  at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.dataTypeMismatch(package.scala:73)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5(CheckAnalysis.scala:269)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5$adapted(CheckAnalysis.scala:256)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:295)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:294)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:294)
>   at scala.collection.Iterator.foreach(Iterator.scala:943)
>   at scala.collection.Iterator.foreach$(Iterator.scala:943)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:294)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:294)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:294)
>   at scala.collection.Iterator.foreach(Iterator.scala:943)
>   at scala.collection.Iterator.foreach$(Iterator.scala:943)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
> 
>  
> {code}
>  
> However, if you add an alias to the struct elements, you get the same result 
> as in the previous version.
>  
> {code:java}
> val testDF = Seq("a=b,c=d,d=f").toDF.withColumn("key_value", split('value, 
> ",")).withColumn("map_entry", transform(col("key_value"), x => 
> struct(split(x, "=").getItem(0).as("col1") , split(x, 
> "=").getItem(1).as("col2") ) )){code}
>  
>  
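
A minimal sketch of an alternative workaround (not from the original report): if the "k=v" pairs only need to end up as a key/value map rather than an array of structs, the built-in str_to_map SQL function parses the string directly and sidesteps named_struct entirely. The session setup and column names below are illustrative assumptions.

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("str_to_map sketch").getOrCreate()
import spark.implicits._

// Same sample data as in the report: comma-separated "key=value" pairs.
val testDF = Seq("a=b,c=d,d=f").toDF("value")

// Parse the pairs straight into a MapType column with the built-in
// str_to_map function instead of building structs inside a lambda.
val parsed = testDF.withColumn("kv_map", expr("str_to_map(value, ',', '=')"))

parsed.show(truncate = false)
parsed.printSchema()  // kv_map is a map with string keys and values
{code}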



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates

2023-05-16 Thread caican (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caican updated SPARK-43526:
---
Attachment: (was: image-2023-05-16-21-23-33-611.png)

> when shuffle hash join is enabled, q95 performance deteriorates
> ---
>
> Key: SPARK-43526
> URL: https://issues.apache.org/jira/browse/SPARK-43526
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: caican
>Priority: Major
> Attachments: image-2023-05-16-21-21-35-493.png, 
> image-2023-05-16-21-22-16-170.png, image-2023-05-16-21-23-35-237.png, 
> image-2023-05-16-21-24-09-182.png, image-2023-05-16-21-28-11-514.png, 
> image-2023-05-16-21-28-44-163.png
>
>
> Testing with a 5TB dataset, the performance of q95 in TPC-DS deteriorates 
> when shuffle hash join is enabled; performance is better when sortMergeJoin 
> is used.
>  
> Performance difference: from 3.9min (sortMergeJoin) to 
> 8.1min (shuffledHashJoin)
>  
> 1. With shuffledHashJoin enabled, the execution plan is as follows:
> !image-2023-05-16-21-28-44-163.png|width=935,height=64!
> !image-2023-05-16-21-21-35-493.png|width=924,height=502!
> 2. With shuffledHashJoin disabled, the execution plan is as follows:
> !image-2023-05-16-21-28-11-514.png|width=922,height=67!
> !image-2023-05-16-21-22-16-170.png|width=934,height=477!
>  
> And when shuffledHashJoin is enabled, GC pressure is very severe,
> !image-2023-05-16-21-23-35-237.png|width=929,height=570!
>  
> but sortMergeJoin executes without this problem.
> !image-2023-05-16-21-24-09-182.png|width=931,height=573!
>  
> Any suggestions on how to solve it? Thanks!
>  
>  
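
A minimal sketch of how one might keep q95 on sort-merge join without giving up shuffled hash join globally. The config and hint names are standard Spark SQL ones; the TPC-DS table and column names match q95, but the session setup, DataFrame variables, and partition count are illustrative assumptions, and whether they resolve the GC pressure has not been verified here.

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("q95 join strategy sketch").getOrCreate()

// Prefer sort-merge join whenever both strategies are eligible
// (true is the default; the regression shows up when it is turned off).
spark.conf.set("spark.sql.join.preferSortMergeJoin", "true")

// Assumption: more, smaller shuffle partitions shrink each hash-map build
// side and can ease GC pressure if shuffled hash join is kept.
spark.conf.set("spark.sql.shuffle.partitions", "2000")

// Or pin the strategy for just this join with a hint, leaving other joins
// free to use shuffled hash join (web_sales/web_returns must already exist).
val webSales   = spark.table("web_sales")
val webReturns = spark.table("web_returns")
val joined = webSales.hint("merge")
  .join(webReturns, webSales("ws_order_number") === webReturns("wr_order_number"))

// Equivalent SQL hint:
spark.sql(
  """SELECT /*+ MERGE(ws) */ ws.ws_order_number
    |FROM web_sales ws JOIN web_returns wr
    |  ON ws.ws_order_number = wr.wr_order_number""".stripMargin)
{code}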



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates

2023-05-16 Thread caican (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caican updated SPARK-43526:
---
Attachment: (was: image-2023-05-16-21-22-44-532.png)

> when shuffle hash join is enabled, q95 performance deteriorates
> ---
>
> Key: SPARK-43526
> URL: https://issues.apache.org/jira/browse/SPARK-43526
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: caican
>Priority: Major
> Attachments: image-2023-05-16-21-21-35-493.png, 
> image-2023-05-16-21-22-16-170.png, image-2023-05-16-21-23-35-237.png, 
> image-2023-05-16-21-24-09-182.png, image-2023-05-16-21-28-11-514.png, 
> image-2023-05-16-21-28-44-163.png
>
>
> Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when 
> shuffle hash join is enabled and the performance is better when sortMergeJoin 
> is used.
>  
> Performance difference:  from 3.9min(sortMergeJoin) to 
> 8.1min(shuffledHashJoin)
>  
> 1. enable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-28-44-163.png|width=935,height=64!
> !image-2023-05-16-21-21-35-493.png|width=924,height=502!
> 2. disable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-28-11-514.png|width=922,height=67!
> !image-2023-05-16-21-22-16-170.png|width=934,height=477!
>  
> And when shuffledHashJoin is enabled, gc is very serious,
> !image-2023-05-16-21-23-35-237.png|width=929,height=570!
>  
> but sortMergeJoin executes without this problem.
> !image-2023-05-16-21-24-09-182.png|width=931,height=573!
>  
> Any suggestions on how to solve it?Thanks!
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates

2023-05-16 Thread caican (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caican updated SPARK-43526:
---
Attachment: (was: image-2023-05-16-21-20-18-727.png)

> when shuffle hash join is enabled, q95 performance deteriorates
> ---
>
> Key: SPARK-43526
> URL: https://issues.apache.org/jira/browse/SPARK-43526
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: caican
>Priority: Major
> Attachments: image-2023-05-16-21-21-35-493.png, 
> image-2023-05-16-21-22-16-170.png, image-2023-05-16-21-22-44-532.png, 
> image-2023-05-16-21-23-33-611.png, image-2023-05-16-21-23-35-237.png, 
> image-2023-05-16-21-24-09-182.png, image-2023-05-16-21-28-11-514.png, 
> image-2023-05-16-21-28-44-163.png
>
>
> Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when 
> shuffle hash join is enabled and the performance is better when sortMergeJoin 
> is used.
>  
> Performance difference:  from 3.9min(sortMergeJoin) to 
> 8.1min(shuffledHashJoin)
>  
> 1. enable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-28-44-163.png|width=935,height=64!
> !image-2023-05-16-21-21-35-493.png|width=924,height=502!
> 2. disable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-28-11-514.png|width=922,height=67!
> !image-2023-05-16-21-22-16-170.png|width=934,height=477!
>  
> And when shuffledHashJoin is enabled, gc is very serious,
> !image-2023-05-16-21-23-35-237.png|width=929,height=570!
>  
> but sortMergeJoin executes without this problem.
> !image-2023-05-16-21-24-09-182.png|width=931,height=573!
>  
> Any suggestions on how to solve it?Thanks!
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates

2023-05-16 Thread caican (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caican updated SPARK-43526:
---
Attachment: (was: application_1684208757063_0028_90.html)

> when shuffle hash join is enabled, q95 performance deteriorates
> ---
>
> Key: SPARK-43526
> URL: https://issues.apache.org/jira/browse/SPARK-43526
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: caican
>Priority: Major
> Attachments: image-2023-05-16-21-21-35-493.png, 
> image-2023-05-16-21-22-16-170.png, image-2023-05-16-21-22-44-532.png, 
> image-2023-05-16-21-23-33-611.png, image-2023-05-16-21-23-35-237.png, 
> image-2023-05-16-21-24-09-182.png, image-2023-05-16-21-28-11-514.png, 
> image-2023-05-16-21-28-44-163.png
>
>
> Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when 
> shuffle hash join is enabled and the performance is better when sortMergeJoin 
> is used.
>  
> Performance difference:  from 3.9min(sortMergeJoin) to 
> 8.1min(shuffledHashJoin)
>  
> 1. enable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-28-44-163.png|width=935,height=64!
> !image-2023-05-16-21-21-35-493.png|width=924,height=502!
> 2. disable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-28-11-514.png|width=922,height=67!
> !image-2023-05-16-21-22-16-170.png|width=934,height=477!
>  
> And when shuffledHashJoin is enabled, gc is very serious,
> !image-2023-05-16-21-23-35-237.png|width=929,height=570!
>  
> but sortMergeJoin executes without this problem.
> !image-2023-05-16-21-24-09-182.png|width=931,height=573!
>  
> Any suggestions on how to solve it?Thanks!
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates

2023-05-16 Thread caican (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caican updated SPARK-43526:
---
Attachment: application_1684208757063_0028_90.html

> when shuffle hash join is enabled, q95 performance deteriorates
> ---
>
> Key: SPARK-43526
> URL: https://issues.apache.org/jira/browse/SPARK-43526
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: caican
>Priority: Major
> Attachments: application_1684208757063_0028_90.html, 
> image-2023-05-16-21-20-18-727.png, image-2023-05-16-21-21-35-493.png, 
> image-2023-05-16-21-22-16-170.png, image-2023-05-16-21-22-44-532.png, 
> image-2023-05-16-21-23-33-611.png, image-2023-05-16-21-23-35-237.png, 
> image-2023-05-16-21-24-09-182.png, image-2023-05-16-21-28-11-514.png, 
> image-2023-05-16-21-28-44-163.png
>
>
> Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when 
> shuffle hash join is enabled and the performance is better when sortMergeJoin 
> is used.
>  
> Performance difference:  from 3.9min(sortMergeJoin) to 
> 8.1min(shuffledHashJoin)
>  
> 1. enable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-28-44-163.png|width=935,height=64!
> !image-2023-05-16-21-21-35-493.png|width=924,height=502!
> 2. disable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-28-11-514.png|width=922,height=67!
> !image-2023-05-16-21-22-16-170.png|width=934,height=477!
>  
> And when shuffledHashJoin is enabled, gc is very serious,
> !image-2023-05-16-21-23-35-237.png|width=929,height=570!
>  
> but sortMergeJoin executes without this problem.
> !image-2023-05-16-21-24-09-182.png|width=931,height=573!
>  
> Any suggestions on how to solve it?Thanks!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates

2023-05-16 Thread caican (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caican updated SPARK-43526:
---
Description: 
Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when 
shuffle hash join is enabled and the performance is better when sortMergeJoin 
is used.

 
Performance difference:  from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin)
 

1. enable shuffledHashJoin, the execution plan is as follows:

!image-2023-05-16-21-28-44-163.png|width=935,height=64!

!image-2023-05-16-21-21-35-493.png|width=924,height=502!

2. disable shuffledHashJoin, the execution plan is as follows:

!image-2023-05-16-21-28-11-514.png|width=922,height=67!

!image-2023-05-16-21-22-16-170.png|width=934,height=477!

 

And when shuffledHashJoin is enabled, gc is very serious,

!image-2023-05-16-21-23-35-237.png|width=929,height=570!

 

but sortMergeJoin executes without this problem.

!image-2023-05-16-21-24-09-182.png|width=931,height=573!

 

Any suggestions on how to solve it?Thanks!

 

 

  was:
Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when 
shuffle hash join is enabled and the performance is better when sortMergeJoin 
is used.

 
Performance difference:  from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin)
 

1. enable shuffledHashJoin, the execution plan is as follows:

!image-2023-05-16-21-28-44-163.png|width=935,height=64!

!image-2023-05-16-21-21-35-493.png|width=924,height=502!

2. disable shuffledHashJoin, the execution plan is as follows:

!image-2023-05-16-21-28-11-514.png|width=922,height=67!

!image-2023-05-16-21-22-16-170.png|width=934,height=477!

 

And when shuffledHashJoin is enabled, gc is very serious,

!image-2023-05-16-21-23-35-237.png|width=929,height=570!

 

but sortMergeJoin executes without this problem.

!image-2023-05-16-21-24-09-182.png|width=931,height=573!

 

Any suggestions on how to solve it?Thanks!


> when shuffle hash join is enabled, q95 performance deteriorates
> ---
>
> Key: SPARK-43526
> URL: https://issues.apache.org/jira/browse/SPARK-43526
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: caican
>Priority: Major
> Attachments: application_1684208757063_0028_90.html, 
> image-2023-05-16-21-20-18-727.png, image-2023-05-16-21-21-35-493.png, 
> image-2023-05-16-21-22-16-170.png, image-2023-05-16-21-22-44-532.png, 
> image-2023-05-16-21-23-33-611.png, image-2023-05-16-21-23-35-237.png, 
> image-2023-05-16-21-24-09-182.png, image-2023-05-16-21-28-11-514.png, 
> image-2023-05-16-21-28-44-163.png
>
>
> Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when 
> shuffle hash join is enabled and the performance is better when sortMergeJoin 
> is used.
>  
> Performance difference:  from 3.9min(sortMergeJoin) to 
> 8.1min(shuffledHashJoin)
>  
> 1. enable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-28-44-163.png|width=935,height=64!
> !image-2023-05-16-21-21-35-493.png|width=924,height=502!
> 2. disable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-28-11-514.png|width=922,height=67!
> !image-2023-05-16-21-22-16-170.png|width=934,height=477!
>  
> And when shuffledHashJoin is enabled, gc is very serious,
> !image-2023-05-16-21-23-35-237.png|width=929,height=570!
>  
> but sortMergeJoin executes without this problem.
> !image-2023-05-16-21-24-09-182.png|width=931,height=573!
>  
> Any suggestions on how to solve it?Thanks!
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates

2023-05-16 Thread caican (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caican updated SPARK-43526:
---
Description: 
Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when 
shuffle hash join is enabled and the performance is better when sortMergeJoin 
is used.

 
Performance difference:  from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin)
 

1. enable shuffledHashJoin, the execution plan is as follows:

!image-2023-05-16-21-28-44-163.png|width=935,height=64!

!image-2023-05-16-21-21-35-493.png|width=924,height=502!

2. disable shuffledHashJoin, the execution plan is as follows:

!image-2023-05-16-21-28-11-514.png|width=922,height=67!

!image-2023-05-16-21-22-16-170.png|width=934,height=477!

 

And when shuffledHashJoin is enabled, gc is very serious,

!image-2023-05-16-21-23-35-237.png|width=929,height=570!

 

but sortMergeJoin executes without this problem.

!image-2023-05-16-21-24-09-182.png|width=931,height=573!

 

Any suggestions on how to solve it?Thanks!

  was:
Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when 
shuffle hash join is enabled and the performance is better when sortMergeJoin 
is used.

 
Performance difference:  from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin)
 

1. enable shuffledHashJoin, the execution plan is as follows:

!image-2023-05-16-21-20-18-727.png|width=990,height=68!

!image-2023-05-16-21-21-35-493.png|width=924,height=502!

2. disable shuffledHashJoin, the execution plan is as follows:

!image-2023-05-16-21-22-44-532.png|width=1114,height=73!

!image-2023-05-16-21-22-16-170.png|width=934,height=477!

 

And when shuffledHashJoin is enabled, gc is very serious,

!image-2023-05-16-21-23-35-237.png|width=929,height=570!

 

but sortMergeJoin executes without this problem.

!image-2023-05-16-21-24-09-182.png|width=931,height=573!

 

Any suggestions on how to solve it?Thanks!


> when shuffle hash join is enabled, q95 performance deteriorates
> ---
>
> Key: SPARK-43526
> URL: https://issues.apache.org/jira/browse/SPARK-43526
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: caican
>Priority: Major
> Attachments: image-2023-05-16-21-20-18-727.png, 
> image-2023-05-16-21-21-35-493.png, image-2023-05-16-21-22-16-170.png, 
> image-2023-05-16-21-22-44-532.png, image-2023-05-16-21-23-33-611.png, 
> image-2023-05-16-21-23-35-237.png, image-2023-05-16-21-24-09-182.png, 
> image-2023-05-16-21-28-11-514.png, image-2023-05-16-21-28-44-163.png
>
>
> Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when 
> shuffle hash join is enabled and the performance is better when sortMergeJoin 
> is used.
>  
> Performance difference:  from 3.9min(sortMergeJoin) to 
> 8.1min(shuffledHashJoin)
>  
> 1. enable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-28-44-163.png|width=935,height=64!
> !image-2023-05-16-21-21-35-493.png|width=924,height=502!
> 2. disable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-28-11-514.png|width=922,height=67!
> !image-2023-05-16-21-22-16-170.png|width=934,height=477!
>  
> And when shuffledHashJoin is enabled, gc is very serious,
> !image-2023-05-16-21-23-35-237.png|width=929,height=570!
>  
> but sortMergeJoin executes without this problem.
> !image-2023-05-16-21-24-09-182.png|width=931,height=573!
>  
> Any suggestions on how to solve it?Thanks!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43527) Fix catalog.listCatalogs in PySpark

2023-05-16 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng updated SPARK-43527:
--
Summary: Fix catalog.listCatalogs in PySpark  (was: Fix 
catalog.listCatalogs)

> Fix catalog.listCatalogs in PySpark
> ---
>
> Key: SPARK-43527
> URL: https://issues.apache.org/jira/browse/SPARK-43527
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.4.0, 3.4.1, 3.5.0
>Reporter: Ruifeng Zheng
>Priority: Critical
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates

2023-05-16 Thread caican (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caican updated SPARK-43526:
---
Attachment: image-2023-05-16-21-28-11-514.png

> when shuffle hash join is enabled, q95 performance deteriorates
> ---
>
> Key: SPARK-43526
> URL: https://issues.apache.org/jira/browse/SPARK-43526
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: caican
>Priority: Major
> Attachments: image-2023-05-16-21-20-18-727.png, 
> image-2023-05-16-21-21-35-493.png, image-2023-05-16-21-22-16-170.png, 
> image-2023-05-16-21-22-44-532.png, image-2023-05-16-21-23-33-611.png, 
> image-2023-05-16-21-23-35-237.png, image-2023-05-16-21-24-09-182.png, 
> image-2023-05-16-21-28-11-514.png, image-2023-05-16-21-28-44-163.png
>
>
> Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when 
> shuffle hash join is enabled and the performance is better when sortMergeJoin 
> is used.
>  
> Performance difference:  from 3.9min(sortMergeJoin) to 
> 8.1min(shuffledHashJoin)
>  
> 1. enable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-20-18-727.png|width=990,height=68!
> !image-2023-05-16-21-21-35-493.png|width=924,height=502!
> 2. disable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-22-44-532.png|width=1114,height=73!
> !image-2023-05-16-21-22-16-170.png|width=934,height=477!
>  
> And when shuffledHashJoin is enabled, gc is very serious,
> !image-2023-05-16-21-23-35-237.png|width=929,height=570!
>  
> but sortMergeJoin executes without this problem.
> !image-2023-05-16-21-24-09-182.png|width=931,height=573!
>  
> Any suggestions on how to solve it?Thanks!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates

2023-05-16 Thread caican (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caican updated SPARK-43526:
---
Attachment: image-2023-05-16-21-28-44-163.png

> when shuffle hash join is enabled, q95 performance deteriorates
> ---
>
> Key: SPARK-43526
> URL: https://issues.apache.org/jira/browse/SPARK-43526
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: caican
>Priority: Major
> Attachments: image-2023-05-16-21-20-18-727.png, 
> image-2023-05-16-21-21-35-493.png, image-2023-05-16-21-22-16-170.png, 
> image-2023-05-16-21-22-44-532.png, image-2023-05-16-21-23-33-611.png, 
> image-2023-05-16-21-23-35-237.png, image-2023-05-16-21-24-09-182.png, 
> image-2023-05-16-21-28-11-514.png, image-2023-05-16-21-28-44-163.png
>
>
> Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when 
> shuffle hash join is enabled and the performance is better when sortMergeJoin 
> is used.
>  
> Performance difference:  from 3.9min(sortMergeJoin) to 
> 8.1min(shuffledHashJoin)
>  
> 1. enable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-20-18-727.png|width=990,height=68!
> !image-2023-05-16-21-21-35-493.png|width=924,height=502!
> 2. disable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-22-44-532.png|width=1114,height=73!
> !image-2023-05-16-21-22-16-170.png|width=934,height=477!
>  
> And when shuffledHashJoin is enabled, gc is very serious,
> !image-2023-05-16-21-23-35-237.png|width=929,height=570!
>  
> but sortMergeJoin executes without this problem.
> !image-2023-05-16-21-24-09-182.png|width=931,height=573!
>  
> Any suggestions on how to solve it?Thanks!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates

2023-05-16 Thread caican (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caican updated SPARK-43526:
---
Description: 
Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when 
shuffle hash join is enabled and the performance is better when sortMergeJoin 
is used.

 
Performance difference:  from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin)
 

1. enable shuffledHashJoin, the execution plan is as follows:

!image-2023-05-16-21-20-18-727.png|width=990,height=68!

!image-2023-05-16-21-21-35-493.png|width=924,height=502!

2. disable shuffledHashJoin, the execution plan is as follows:

!image-2023-05-16-21-22-44-532.png|width=1114,height=73!

!image-2023-05-16-21-22-16-170.png|width=934,height=477!

 

And when shuffledHashJoin is enabled, gc is very serious,

!image-2023-05-16-21-23-35-237.png|width=929,height=570!

 

but sortMergeJoin executes without this problem.

!image-2023-05-16-21-24-09-182.png|width=931,height=573!

 

Any suggestions on how to solve it?Thanks!

  was:
Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when 
shuffle hash join is enabled and the performance is better when sortMergeJoin 
is used.

 
Performance difference:  from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin)
 

1. enable shuffledHashJoin, the execution plan is as follows:

!image-2023-05-16-21-20-18-727.png|width=990,height=68!

!image-2023-05-16-21-21-35-493.png|width=924,height=502!

2. disable shuffledHashJoin, the execution plan is as follows:

!image-2023-05-16-21-22-44-532.png|width=1114,height=73!

!image-2023-05-16-21-22-16-170.png|width=934,height=477!

 

And when shuffledHashJoin is enabled, gc is very serious,

!image-2023-05-16-21-23-35-237.png|width=929,height=570!

but sortMergeJoin executes without this problem.

!image-2023-05-16-21-24-09-182.png|width=931,height=573!

 

Any suggestions on how to solve it?Thanks!


> when shuffle hash join is enabled, q95 performance deteriorates
> ---
>
> Key: SPARK-43526
> URL: https://issues.apache.org/jira/browse/SPARK-43526
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: caican
>Priority: Major
> Attachments: image-2023-05-16-21-20-18-727.png, 
> image-2023-05-16-21-21-35-493.png, image-2023-05-16-21-22-16-170.png, 
> image-2023-05-16-21-22-44-532.png, image-2023-05-16-21-23-33-611.png, 
> image-2023-05-16-21-23-35-237.png, image-2023-05-16-21-24-09-182.png
>
>
> Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when 
> shuffle hash join is enabled and the performance is better when sortMergeJoin 
> is used.
>  
> Performance difference:  from 3.9min(sortMergeJoin) to 
> 8.1min(shuffledHashJoin)
>  
> 1. enable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-20-18-727.png|width=990,height=68!
> !image-2023-05-16-21-21-35-493.png|width=924,height=502!
> 2. disable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-22-44-532.png|width=1114,height=73!
> !image-2023-05-16-21-22-16-170.png|width=934,height=477!
>  
> And when shuffledHashJoin is enabled, gc is very serious,
> !image-2023-05-16-21-23-35-237.png|width=929,height=570!
>  
> but sortMergeJoin executes without this problem.
> !image-2023-05-16-21-24-09-182.png|width=931,height=573!
>  
> Any suggestions on how to solve it?Thanks!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43527) Fix catalog.listCatalogs

2023-05-16 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-43527:
-

 Summary: Fix catalog.listCatalogs
 Key: SPARK-43527
 URL: https://issues.apache.org/jira/browse/SPARK-43527
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.4.0, 3.4.1, 3.5.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates

2023-05-16 Thread caican (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caican updated SPARK-43526:
---
Description: 
Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when 
shuffle hash join is enabled and the performance is better when sortMergeJoin 
is used.

 
Performance difference:  from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin)
 

1. enable shuffledHashJoin, the execution plan is as follows:

!image-2023-05-16-21-20-18-727.png|width=990,height=68!

!image-2023-05-16-21-21-35-493.png|width=924,height=502!

2. disable shuffledHashJoin, the execution plan is as follows:

!image-2023-05-16-21-22-44-532.png|width=1114,height=73!

!image-2023-05-16-21-22-16-170.png|width=934,height=477!

 

And when shuffledHashJoin is enabled, gc is very serious,

!image-2023-05-16-21-23-35-237.png|width=929,height=570!

but sortMergeJoin executes without this problem.

!image-2023-05-16-21-24-09-182.png|width=931,height=573!

 

Any suggestions on how to solve it?Thanks!

  was:
Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when 
shuffle hash join is enabled and the performance is better when sortMergeJoin 
is used.

 
Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin)
 

1. enable shuffledHashJoin, the execution plan is as follows:

!image-2023-05-16-21-20-18-727.png|width=990,height=68!

!image-2023-05-16-21-21-35-493.png|width=924,height=502!

2. disable shuffledHashJoin, the execution plan is as follows:

!image-2023-05-16-21-22-44-532.png|width=1190,height=78!

!image-2023-05-16-21-22-16-170.png|width=934,height=477!

 

And when shuffledHashJoin is enabled, gc is very serious,

!image-2023-05-16-21-23-35-237.png|width=929,height=570!

but sortMergeJoin executes without this problem.

!image-2023-05-16-21-24-09-182.png|width=931,height=573!

 

Any suggestions on how to solve it?Thanks!


> when shuffle hash join is enabled, q95 performance deteriorates
> ---
>
> Key: SPARK-43526
> URL: https://issues.apache.org/jira/browse/SPARK-43526
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: caican
>Priority: Major
> Attachments: image-2023-05-16-21-20-18-727.png, 
> image-2023-05-16-21-21-35-493.png, image-2023-05-16-21-22-16-170.png, 
> image-2023-05-16-21-22-44-532.png, image-2023-05-16-21-23-33-611.png, 
> image-2023-05-16-21-23-35-237.png, image-2023-05-16-21-24-09-182.png
>
>
> Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when 
> shuffle hash join is enabled and the performance is better when sortMergeJoin 
> is used.
>  
> Performance difference:  from 3.9min(sortMergeJoin) to 
> 8.1min(shuffledHashJoin)
>  
> 1. enable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-20-18-727.png|width=990,height=68!
> !image-2023-05-16-21-21-35-493.png|width=924,height=502!
> 2. disable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-22-44-532.png|width=1114,height=73!
> !image-2023-05-16-21-22-16-170.png|width=934,height=477!
>  
> And when shuffledHashJoin is enabled, gc is very serious,
> !image-2023-05-16-21-23-35-237.png|width=929,height=570!
> but sortMergeJoin executes without this problem.
> !image-2023-05-16-21-24-09-182.png|width=931,height=573!
>  
> Any suggestions on how to solve it?Thanks!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates

2023-05-16 Thread caican (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caican updated SPARK-43526:
---
Description: 
Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when 
shuffle hash join is enabled and the performance is better when sortMergeJoin 
is used.

 
Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin)
 

1. enable shuffledHashJoin, the execution plan is as follows:

!image-2023-05-16-21-20-18-727.png|width=990,height=68!

!image-2023-05-16-21-21-35-493.png|width=924,height=502!

2. disable shuffledHashJoin, the execution plan is as follows:

!image-2023-05-16-21-22-44-532.png|width=1190,height=78!

!image-2023-05-16-21-22-16-170.png|width=934,height=477!

 

And when shuffledHashJoin is enabled, gc is very serious,

!image-2023-05-16-21-23-35-237.png|width=929,height=570!

but sortMergeJoin executes without this problem.

!image-2023-05-16-21-24-09-182.png|width=931,height=573!

 

Any suggestions on how to solve it?Thanks!

  was:
Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when 
shuffle hash join is enabled and the performance is better when sortMergeJoin 
is used.

 
Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin)
 

1. enable shuffledHashJoin, the execution plan is as follows:

!image-2023-05-16-21-20-18-727.png|width=990,height=68!

!image-2023-05-16-21-21-35-493.png|width=924,height=502!

2. disable shuffledHashJoin, the execution plan is as follows:

!image-2023-05-16-21-22-44-532.png!

!image-2023-05-16-21-22-16-170.png!

 

And when shuffledHashJoin is enabled, gc is very serious,

!image-2023-05-16-21-23-35-237.png!

but sortMergeJoin executes without this problem.

!image-2023-05-16-21-24-09-182.png!

 

Any suggestions on how to solve it?Thanks!


> when shuffle hash join is enabled, q95 performance deteriorates
> ---
>
> Key: SPARK-43526
> URL: https://issues.apache.org/jira/browse/SPARK-43526
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: caican
>Priority: Major
> Attachments: image-2023-05-16-21-20-18-727.png, 
> image-2023-05-16-21-21-35-493.png, image-2023-05-16-21-22-16-170.png, 
> image-2023-05-16-21-22-44-532.png, image-2023-05-16-21-23-33-611.png, 
> image-2023-05-16-21-23-35-237.png, image-2023-05-16-21-24-09-182.png
>
>
> Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when 
> shuffle hash join is enabled and the performance is better when sortMergeJoin 
> is used.
>  
> Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin)
>  
> 1. enable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-20-18-727.png|width=990,height=68!
> !image-2023-05-16-21-21-35-493.png|width=924,height=502!
> 2. disable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-22-44-532.png|width=1190,height=78!
> !image-2023-05-16-21-22-16-170.png|width=934,height=477!
>  
> And when shuffledHashJoin is enabled, gc is very serious,
> !image-2023-05-16-21-23-35-237.png|width=929,height=570!
> but sortMergeJoin executes without this problem.
> !image-2023-05-16-21-24-09-182.png|width=931,height=573!
>  
> Any suggestions on how to solve it?Thanks!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates

2023-05-16 Thread caican (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caican updated SPARK-43526:
---
Description: 
Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when 
shuffle hash join is enabled and the performance is better when sortMergeJoin 
is used.

 
Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin)
 

1. enable shuffledHashJoin, the execution plan is as follows:

!image-2023-05-16-21-20-18-727.png!

!image-2023-05-16-21-21-35-493.png!

2. disable shuffledHashJoin, the execution plan is as follows:

!image-2023-05-16-21-22-44-532.png!

!image-2023-05-16-21-22-16-170.png!

 

And when shuffledHashJoin is enabled, gc is very serious,

!image-2023-05-16-21-23-35-237.png!

but sortMergeJoin executes without this problem.

!image-2023-05-16-21-24-09-182.png!

 

Any suggestions on how to solve it?Thanks!

  was:
Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when 
shuffle hash join is enabled and the performance is better when sortMergeJoin 
is used.

 
Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin)
 

1. enable shuffledHashJoin, the execution plan is as follows:

!image-2023-05-16-21-01-53-423.png!

!image-2023-05-16-21-16-37-376.png!

2. disable shuffledHashJoin, the execution plan is as follows:

!image-2023-05-16-21-05-45-361.png!

!image-2023-05-16-21-16-13-128.png!

 

and When shuffledHashJoin is enabled, gc is very serious.

!image-2023-05-16-21-12-24-618.png!

but sortMergeJoin executes without this problem.

!image-2023-05-16-21-15-21-047.png!

 

Any suggestions on how to solve it?Thanks!


> when shuffle hash join is enabled, q95 performance deteriorates
> ---
>
> Key: SPARK-43526
> URL: https://issues.apache.org/jira/browse/SPARK-43526
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: caican
>Priority: Major
> Attachments: image-2023-05-16-21-20-18-727.png, 
> image-2023-05-16-21-21-35-493.png, image-2023-05-16-21-22-16-170.png, 
> image-2023-05-16-21-22-44-532.png, image-2023-05-16-21-23-33-611.png, 
> image-2023-05-16-21-23-35-237.png, image-2023-05-16-21-24-09-182.png
>
>
> Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when 
> shuffle hash join is enabled and the performance is better when sortMergeJoin 
> is used.
>  
> Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin)
>  
> 1. enable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-20-18-727.png!
> !image-2023-05-16-21-21-35-493.png!
> 2. disable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-22-44-532.png!
> !image-2023-05-16-21-22-16-170.png!
>  
> And when shuffledHashJoin is enabled, gc is very serious,
> !image-2023-05-16-21-23-35-237.png!
> but sortMergeJoin executes without this problem.
> !image-2023-05-16-21-24-09-182.png!
>  
> Any suggestions on how to solve it?Thanks!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates

2023-05-16 Thread caican (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caican updated SPARK-43526:
---
Description: 
Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when 
shuffle hash join is enabled and the performance is better when sortMergeJoin 
is used.

 
Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin)
 

1. enable shuffledHashJoin, the execution plan is as follows:

!image-2023-05-16-21-20-18-727.png|width=1340,height=92!

!image-2023-05-16-21-21-35-493.png!

2. disable shuffledHashJoin, the execution plan is as follows:

!image-2023-05-16-21-22-44-532.png!

!image-2023-05-16-21-22-16-170.png!

 

And when shuffledHashJoin is enabled, gc is very serious,

!image-2023-05-16-21-23-35-237.png!

but sortMergeJoin executes without this problem.

!image-2023-05-16-21-24-09-182.png!

 

Any suggestions on how to solve it?Thanks!

  was:
Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when 
shuffle hash join is enabled and the performance is better when sortMergeJoin 
is used.

 
Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin)
 

1. enable shuffledHashJoin, the execution plan is as follows:

!image-2023-05-16-21-20-18-727.png!

!image-2023-05-16-21-21-35-493.png!

2. disable shuffledHashJoin, the execution plan is as follows:

!image-2023-05-16-21-22-44-532.png!

!image-2023-05-16-21-22-16-170.png!

 

And when shuffledHashJoin is enabled, gc is very serious,

!image-2023-05-16-21-23-35-237.png!

but sortMergeJoin executes without this problem.

!image-2023-05-16-21-24-09-182.png!

 

Any suggestions on how to solve it?Thanks!


> when shuffle hash join is enabled, q95 performance deteriorates
> ---
>
> Key: SPARK-43526
> URL: https://issues.apache.org/jira/browse/SPARK-43526
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: caican
>Priority: Major
> Attachments: image-2023-05-16-21-20-18-727.png, 
> image-2023-05-16-21-21-35-493.png, image-2023-05-16-21-22-16-170.png, 
> image-2023-05-16-21-22-44-532.png, image-2023-05-16-21-23-33-611.png, 
> image-2023-05-16-21-23-35-237.png, image-2023-05-16-21-24-09-182.png
>
>
> Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when 
> shuffle hash join is enabled and the performance is better when sortMergeJoin 
> is used.
>  
> Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin)
>  
> 1. enable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-20-18-727.png|width=1340,height=92!
> !image-2023-05-16-21-21-35-493.png!
> 2. disable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-22-44-532.png!
> !image-2023-05-16-21-22-16-170.png!
>  
> And when shuffledHashJoin is enabled, gc is very serious,
> !image-2023-05-16-21-23-35-237.png!
> but sortMergeJoin executes without this problem.
> !image-2023-05-16-21-24-09-182.png!
>  
> Any suggestions on how to solve it?Thanks!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates

2023-05-16 Thread caican (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caican updated SPARK-43526:
---
Attachment: image-2023-05-16-21-24-09-182.png

> when shuffle hash join is enabled, q95 performance deteriorates
> ---
>
> Key: SPARK-43526
> URL: https://issues.apache.org/jira/browse/SPARK-43526
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: caican
>Priority: Major
> Attachments: image-2023-05-16-21-20-18-727.png, 
> image-2023-05-16-21-21-35-493.png, image-2023-05-16-21-22-16-170.png, 
> image-2023-05-16-21-22-44-532.png, image-2023-05-16-21-23-33-611.png, 
> image-2023-05-16-21-23-35-237.png, image-2023-05-16-21-24-09-182.png
>
>
> Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when 
> shuffle hash join is enabled and the performance is better when sortMergeJoin 
> is used.
>  
> Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin)
>  
> 1. enable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-01-53-423.png!
> !image-2023-05-16-21-16-37-376.png!
> 2. disable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-05-45-361.png!
> !image-2023-05-16-21-16-13-128.png!
>  
> and When shuffledHashJoin is enabled, gc is very serious.
> !image-2023-05-16-21-12-24-618.png!
> but sortMergeJoin executes without this problem.
> !image-2023-05-16-21-15-21-047.png!
>  
> Any suggestions on how to solve it?Thanks!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates

2023-05-16 Thread caican (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caican updated SPARK-43526:
---
Description: 
Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when 
shuffle hash join is enabled and the performance is better when sortMergeJoin 
is used.

 
Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin)
 

1. enable shuffledHashJoin, the execution plan is as follows:

!image-2023-05-16-21-20-18-727.png|width=990,height=68!

!image-2023-05-16-21-21-35-493.png|width=924,height=502!

2. disable shuffledHashJoin, the execution plan is as follows:

!image-2023-05-16-21-22-44-532.png!

!image-2023-05-16-21-22-16-170.png!

 

And when shuffledHashJoin is enabled, gc is very serious,

!image-2023-05-16-21-23-35-237.png!

but sortMergeJoin executes without this problem.

!image-2023-05-16-21-24-09-182.png!

 

Any suggestions on how to solve it?Thanks!

  was:
Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when 
shuffle hash join is enabled and the performance is better when sortMergeJoin 
is used.

 
Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin)
 

1. enable shuffledHashJoin, the execution plan is as follows:

!image-2023-05-16-21-20-18-727.png|width=1340,height=92!

!image-2023-05-16-21-21-35-493.png!

2. disable shuffledHashJoin, the execution plan is as follows:

!image-2023-05-16-21-22-44-532.png!

!image-2023-05-16-21-22-16-170.png!

 

And when shuffledHashJoin is enabled, gc is very serious,

!image-2023-05-16-21-23-35-237.png!

but sortMergeJoin executes without this problem.

!image-2023-05-16-21-24-09-182.png!

 

Any suggestions on how to solve it?Thanks!


> when shuffle hash join is enabled, q95 performance deteriorates
> ---
>
> Key: SPARK-43526
> URL: https://issues.apache.org/jira/browse/SPARK-43526
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: caican
>Priority: Major
> Attachments: image-2023-05-16-21-20-18-727.png, 
> image-2023-05-16-21-21-35-493.png, image-2023-05-16-21-22-16-170.png, 
> image-2023-05-16-21-22-44-532.png, image-2023-05-16-21-23-33-611.png, 
> image-2023-05-16-21-23-35-237.png, image-2023-05-16-21-24-09-182.png
>
>
> Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when 
> shuffle hash join is enabled and the performance is better when sortMergeJoin 
> is used.
>  
> Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin)
>  
> 1. enable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-20-18-727.png|width=990,height=68!
> !image-2023-05-16-21-21-35-493.png|width=924,height=502!
> 2. disable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-22-44-532.png!
> !image-2023-05-16-21-22-16-170.png!
>  
> And when shuffledHashJoin is enabled, gc is very serious,
> !image-2023-05-16-21-23-35-237.png!
> but sortMergeJoin executes without this problem.
> !image-2023-05-16-21-24-09-182.png!
>  
> Any suggestions on how to solve it?Thanks!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates

2023-05-16 Thread caican (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caican updated SPARK-43526:
---
Attachment: image-2023-05-16-21-23-35-237.png

> when shuffle hash join is enabled, q95 performance deteriorates
> ---
>
> Key: SPARK-43526
> URL: https://issues.apache.org/jira/browse/SPARK-43526
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: caican
>Priority: Major
> Attachments: image-2023-05-16-21-20-18-727.png, 
> image-2023-05-16-21-21-35-493.png, image-2023-05-16-21-22-16-170.png, 
> image-2023-05-16-21-22-44-532.png, image-2023-05-16-21-23-33-611.png, 
> image-2023-05-16-21-23-35-237.png
>
>
> Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when 
> shuffle hash join is enabled and the performance is better when sortMergeJoin 
> is used.
>  
> Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin)
>  
> 1. enable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-01-53-423.png!
> !image-2023-05-16-21-16-37-376.png!
> 2. disable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-05-45-361.png!
> !image-2023-05-16-21-16-13-128.png!
>  
> and When shuffledHashJoin is enabled, gc is very serious.
> !image-2023-05-16-21-12-24-618.png!
> but sortMergeJoin executes without this problem.
> !image-2023-05-16-21-15-21-047.png!
>  
> Any suggestions on how to solve it?Thanks!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates

2023-05-16 Thread caican (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caican updated SPARK-43526:
---
Attachment: image-2023-05-16-21-23-33-611.png

> when shuffle hash join is enabled, q95 performance deteriorates
> ---
>
> Key: SPARK-43526
> URL: https://issues.apache.org/jira/browse/SPARK-43526
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: caican
>Priority: Major
> Attachments: image-2023-05-16-21-20-18-727.png, 
> image-2023-05-16-21-21-35-493.png, image-2023-05-16-21-22-16-170.png, 
> image-2023-05-16-21-22-44-532.png, image-2023-05-16-21-23-33-611.png, 
> image-2023-05-16-21-23-35-237.png
>
>
> Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when 
> shuffle hash join is enabled and the performance is better when sortMergeJoin 
> is used.
>  
> Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin)
>  
> 1. enable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-01-53-423.png!
> !image-2023-05-16-21-16-37-376.png!
> 2. disable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-05-45-361.png!
> !image-2023-05-16-21-16-13-128.png!
>  
> and When shuffledHashJoin is enabled, gc is very serious.
> !image-2023-05-16-21-12-24-618.png!
> but sortMergeJoin executes without this problem.
> !image-2023-05-16-21-15-21-047.png!
>  
> Any suggestions on how to solve it?Thanks!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates

2023-05-16 Thread caican (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caican updated SPARK-43526:
---
Attachment: image-2023-05-16-21-22-16-170.png

> when shuffle hash join is enabled, q95 performance deteriorates
> ---
>
> Key: SPARK-43526
> URL: https://issues.apache.org/jira/browse/SPARK-43526
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: caican
>Priority: Major
> Attachments: image-2023-05-16-21-20-18-727.png, 
> image-2023-05-16-21-21-35-493.png, image-2023-05-16-21-22-16-170.png, 
> image-2023-05-16-21-22-44-532.png
>
>
> Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when 
> shuffle hash join is enabled and the performance is better when sortMergeJoin 
> is used.
>  
> Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin)
>  
> 1. enable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-01-53-423.png!
> !image-2023-05-16-21-16-37-376.png!
> 2. disable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-05-45-361.png!
> !image-2023-05-16-21-16-13-128.png!
>  
> and When shuffledHashJoin is enabled, gc is very serious.
> !image-2023-05-16-21-12-24-618.png!
> but sortMergeJoin executes without this problem.
> !image-2023-05-16-21-15-21-047.png!
>  
> Any suggestions on how to solve it?Thanks!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates

2023-05-16 Thread caican (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caican updated SPARK-43526:
---
Attachment: image-2023-05-16-21-22-44-532.png

> when shuffle hash join is enabled, q95 performance deteriorates
> ---
>
> Key: SPARK-43526
> URL: https://issues.apache.org/jira/browse/SPARK-43526
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: caican
>Priority: Major
> Attachments: image-2023-05-16-21-20-18-727.png, 
> image-2023-05-16-21-21-35-493.png, image-2023-05-16-21-22-16-170.png, 
> image-2023-05-16-21-22-44-532.png
>
>
> Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when 
> shuffle hash join is enabled and the performance is better when sortMergeJoin 
> is used.
>  
> Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin)
>  
> 1. enable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-01-53-423.png!
> !image-2023-05-16-21-16-37-376.png!
> 2. disable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-05-45-361.png!
> !image-2023-05-16-21-16-13-128.png!
>  
> and When shuffledHashJoin is enabled, gc is very serious.
> !image-2023-05-16-21-12-24-618.png!
> but sortMergeJoin executes without this problem.
> !image-2023-05-16-21-15-21-047.png!
>  
> Any suggestions on how to solve it?Thanks!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates

2023-05-16 Thread caican (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caican updated SPARK-43526:
---
Attachment: image-2023-05-16-21-21-35-493.png

> when shuffle hash join is enabled, q95 performance deteriorates
> ---
>
> Key: SPARK-43526
> URL: https://issues.apache.org/jira/browse/SPARK-43526
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: caican
>Priority: Major
> Attachments: image-2023-05-16-21-20-18-727.png, 
> image-2023-05-16-21-21-35-493.png
>
>
> Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when 
> shuffle hash join is enabled and the performance is better when sortMergeJoin 
> is used.
>  
> Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin)
>  
> 1. enable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-01-53-423.png!
> !image-2023-05-16-21-16-37-376.png!
> 2. disable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-05-45-361.png!
> !image-2023-05-16-21-16-13-128.png!
>  
> and when shuffledHashJoin is enabled, GC overhead is very high.
> !image-2023-05-16-21-12-24-618.png!
> but sortMergeJoin runs without this problem.
> !image-2023-05-16-21-15-21-047.png!
>  
> Any suggestions on how to solve it? Thanks!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates

2023-05-16 Thread caican (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caican updated SPARK-43526:
---
Attachment: image-2023-05-16-21-20-18-727.png

> when shuffle hash join is enabled, q95 performance deteriorates
> ---
>
> Key: SPARK-43526
> URL: https://issues.apache.org/jira/browse/SPARK-43526
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: caican
>Priority: Major
> Attachments: image-2023-05-16-21-20-18-727.png
>
>
> Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when 
> shuffle hash join is enabled and the performance is better when sortMergeJoin 
> is used.
>  
> Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin)
>  
> 1. enable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-01-53-423.png!
> !image-2023-05-16-21-16-37-376.png!
> 2. disable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-05-45-361.png!
> !image-2023-05-16-21-16-13-128.png!
>  
> and when shuffledHashJoin is enabled, GC overhead is very high.
> !image-2023-05-16-21-12-24-618.png!
> but sortMergeJoin runs without this problem.
> !image-2023-05-16-21-15-21-047.png!
>  
> Any suggestions on how to solve it? Thanks!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates

2023-05-16 Thread caican (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caican updated SPARK-43526:
---
Description: 
Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when 
shuffle hash join is enabled and the performance is better when sortMergeJoin 
is used.

 
Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin)
 

1. enable shuffledHashJoin, the execution plan is as follows:

!image-2023-05-16-21-01-53-423.png!

!image-2023-05-16-21-16-37-376.png!

2. disable shuffledHashJoin, the execution plan is as follows:

!image-2023-05-16-21-05-45-361.png!

!image-2023-05-16-21-16-13-128.png!

 

and when shuffledHashJoin is enabled, GC overhead is very high.

!image-2023-05-16-21-12-24-618.png!

but sortMergeJoin runs without this problem.

!image-2023-05-16-21-15-21-047.png!

 

Any suggestions on how to solve it? Thanks!

  was:
Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when 
shuffle hash join is enabled and the performance is better when sortMergeJoin 
is used.

 

From 8.1 min (shuffledHashJoin) to 3.9 min (sortMergeJoin).

enable shuffledHashJoin, the execution plan is as follows:

!image-2023-05-16-21-01-53-423.png!

!image-2023-05-16-21-16-37-376.png!

disable shuffledHashJoin, the execution plan is as follows:

!image-2023-05-16-21-05-45-361.png!

!image-2023-05-16-21-16-13-128.png!

And when shuffledHashJoin is enabled, GC overhead is very high.

!image-2023-05-16-21-12-24-618.png!

But sortMergeJoin runs without this problem.

!image-2023-05-16-21-15-21-047.png!


> when shuffle hash join is enabled, q95 performance deteriorates
> ---
>
> Key: SPARK-43526
> URL: https://issues.apache.org/jira/browse/SPARK-43526
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: caican
>Priority: Major
>
> Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when 
> shuffle hash join is enabled and the performance is better when sortMergeJoin 
> is used.
>  
> Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin)
>  
> 1. enable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-01-53-423.png!
> !image-2023-05-16-21-16-37-376.png!
> 2. disable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-05-45-361.png!
> !image-2023-05-16-21-16-13-128.png!
>  
> and when shuffledHashJoin is enabled, GC overhead is very high.
> !image-2023-05-16-21-12-24-618.png!
> but sortMergeJoin runs without this problem.
> !image-2023-05-16-21-15-21-047.png!
>  
> Any suggestions on how to solve it? Thanks!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates

2023-05-16 Thread caican (Jira)
caican created SPARK-43526:
--

 Summary: when shuffle hash join is enabled, q95 performance 
deteriorates
 Key: SPARK-43526
 URL: https://issues.apache.org/jira/browse/SPARK-43526
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0, 3.1.2
Reporter: caican


Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when 
shuffle hash join is enabled and the performance is better when sortMergeJoin 
is used.

 

From 8.1 min (shuffledHashJoin) to 3.9 min (sortMergeJoin).

enable shuffledHashJoin, the execution plan is as follows:

!image-2023-05-16-21-01-53-423.png!

!image-2023-05-16-21-16-37-376.png!

disable shuffledHashJoin, the execution plan is as follows:

!image-2023-05-16-21-05-45-361.png!

!image-2023-05-16-21-16-13-128.png!

And when shuffledHashJoin is enabled, GC overhead is very high.

!image-2023-05-16-21-12-24-618.png!

But sortMergeJoin runs without this problem.

!image-2023-05-16-21-15-21-047.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39281) Speed up Timestamp type inference of legacy format in JSON/CSV data source

2023-05-16 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-39281:


Assignee: Jia Fan

> Speed up Timestamp type inference of legacy format in JSON/CSV data source
> --
>
> Key: SPARK-39281
> URL: https://issues.apache.org/jira/browse/SPARK-39281
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Jia Fan
>Priority: Major
>
> The optimization of {{DefaultTimestampFormatter}} was implemented in 
> [#36562|https://github.com/apache/spark/pull/36562]; this ticket adds the same 
> optimization for the legacy format. The basic idea is to avoid having the 
> formatter throw exceptions and then relying on catch blocks to determine 
> whether parsing succeeded.
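For illustration only, a hedged sketch of the pattern described above, using plain java.text APIs rather than Spark's internal TimestampFormatter; the class and method names below (LegacyTimestampParser, parseOptional) are made up, not the real Spark internals. Failures are detected through ParsePosition instead of thrown exceptions, so the schema-inference hot path avoids the cost of exception handling.

{code:java}
// Illustrative only: names are invented, not Spark internals.
import java.text.{ParsePosition, SimpleDateFormat}
import java.util.TimeZone
import scala.util.control.NonFatal

class LegacyTimestampParser(pattern: String) {
  private val format = {
    val f = new SimpleDateFormat(pattern)
    f.setTimeZone(TimeZone.getTimeZone("UTC"))
    f.setLenient(false)
    f
  }

  // Exception-based detection: simple, but throwing and catching per value is
  // expensive when inferring a schema over millions of strings.
  def parsesWithCatch(s: String): Boolean =
    try { format.parse(s); true } catch { case NonFatal(_) => false }

  // Non-throwing variant: ParsePosition reports failure by position, so the
  // hot path never pays for exception construction.
  def parseOptional(s: String): Option[Long] = {
    val pos = new ParsePosition(0)
    val date = format.parse(s, pos)
    if (date != null && pos.getIndex == s.length) Some(date.getTime) else None
  }
}
{code}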



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39281) Speed up Timestamp type inference of legacy format in JSON/CSV data source

2023-05-16 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-39281.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41091
[https://github.com/apache/spark/pull/41091]

> Speed up Timestamp type inference of legacy format in JSON/CSV data source
> --
>
> Key: SPARK-39281
> URL: https://issues.apache.org/jira/browse/SPARK-39281
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Jia Fan
>Priority: Major
> Fix For: 3.5.0
>
>
> The optimization of {{DefaultTimestampFormatter}} was implemented in 
> [#36562|https://github.com/apache/spark/pull/36562]; this ticket adds the same 
> optimization for the legacy format. The basic idea is to avoid having the 
> formatter throw exceptions and then relying on catch blocks to determine 
> whether parsing succeeded.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43504) [K8S] Mounts the hadoop config map on the executor pod

2023-05-16 Thread Nikita Awasthi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17723109#comment-17723109
 ] 

Nikita Awasthi commented on SPARK-43504:


User 'turboFei' has created a pull request for this issue:
https://github.com/apache/spark/pull/41181

> [K8S] Mounts the hadoop config map on the executor pod
> --
>
> Key: SPARK-43504
> URL: https://issues.apache.org/jira/browse/SPARK-43504
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Fei Wang
>Priority: Major
>
> Since SPARK-25815 ([https://github.com/apache/spark/pull/22911]), the hadoop 
> config map is no longer mounted on the executor pods.
> Per the [https://github.com/apache/spark/pull/22911] description:
> {code:java}
> The main two things that don't need to happen in executors anymore are:
> 1. adding the Hadoop config to the executor pods: this is not needed
> since the Spark driver will serialize the Hadoop config and send
> it to executors when running tasks. {code}
> But in fact, the executor still needs the hadoop configuration.
>  
> !https://user-images.githubusercontent.com/6757692/238268640-8ff41144-5812-4232-b572-2de2408348ed.png!
>  
> As shown in the picture above, the driver can resolve `hdfs://zeus`, but the 
> executor cannot.
> So we still need to mount the hadoop config map on the executor side.
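A hedged diagnostic sketch, not the fix in the linked PR: it only checks, from inside executor tasks, whether a nameservice URI such as hdfs://zeus can be resolved with the Hadoop configuration visible in the executor JVM. The URI follows the ticket; the app name and everything else is illustrative.

{code:java}
// Diagnostic sketch only; "hdfs://zeus/tmp" follows the ticket, the rest is illustrative.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hdfs-resolution-check").getOrCreate()
val sc = spark.sparkContext
val uri = "hdfs://zeus/tmp"

// Driver side: uses the Hadoop configuration loaded by spark-submit.
val driverFs = FileSystem.get(new Path(uri).toUri, sc.hadoopConfiguration)
println(s"driver resolved $uri: ${driverFs.getUri}")

// Executor side: a fresh Configuration() only sees core-site.xml/hdfs-site.xml
// on the executor's classpath, i.e. what a mounted config map would provide.
// If the nameservice cannot be resolved there, tasks fail exactly as reported.
val results = sc.parallelize(1 to 4, 4).map { _ =>
  try { FileSystem.get(new Path(uri).toUri, new Configuration()); "resolved" }
  catch { case e: Exception => s"failed: ${e.getMessage}" }
}.collect()
results.foreach(println)
{code}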



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43524) Memory leak in Spark UI

2023-05-16 Thread Amine Bagdouri (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amine Bagdouri updated SPARK-43524:
---
Description: 
We have a distributed Spark application running on Azure HDInsight using Spark 
version 2.4.4.

After a few days of active processing on our application, we have noticed that 
the GC CPU time ratio of the driver is close to 100%. We suspected a memory 
leak. Thus, we have produced a heap dump and analyzed it using Eclipse Memory 
Analyzer.

Here is some interesting data from the driver's heap dump (heap size is 8 GB):
 * The estimated retained heap size of String objects (~5M instances) is 3.3 
GB. It seems that most of these instances correspond to spark events.
 * Spark UI's AppStatusListener instance estimated retained size is 1.1 GB.
 * The number of LiveJob objects with status "RUNNING" is 18K, knowing that 
there shouldn't be more than 16 live running jobs since we use a fixed size 
thread pool of 16 threads to run spark queries.
 * The number of LiveTask objects is 485K.
 * The AsyncEventQueue instance associated with the AppStatusListener has a value 
of 854 for the dropped events count and a value of 10001 for the total events count, 
knowing that the dropped events counter is reset every minute and that the 
queue's default capacity is 10000.

We think that there is a memory leak in Spark UI. Here is our analysis of the 
root cause of this leak:
 * AppStatusListener is notified of Spark events using a bounded queue in 
AsyncEventQueue.
 * AppStatusListener updates its state (kvstore, liveTasks, liveStages, 
liveJobs, ...) based on the received events. For example, onTaskStart adds a 
task to liveTasks map and onTaskEnd removes the task from liveTasks map.
 * When the rate of events is very high, the bounded queue in AsyncEventQueue 
is full, some events are dropped and don't make it to AppStatusListener.
 * Dropped events that signal the end of a processing unit prevent the state of 
AppStatusListener from being cleaned. For example, a dropped onTaskEnd event 
will prevent the task from being removed from the liveTasks map, and the task will 
remain in the heap until the driver's JVM is stopped.

We were able to confirm our analysis by reducing the capacity of the 
AsyncEventQueue (spark.scheduler.listenerbus.eventqueue.capacity=10). After 
having launched many spark queries using this config, we observed that the 
number of active jobs in Spark UI increased rapidly and remained high even 
though all submitted queries have completed. We have also noticed that some 
executor task counters in Spark UI were negative, which confirms that 
AppStatusListener state does not accurately reflect reality and that it can 
fall victim to event drops.

Suggested fix:
There are some limits today on the number of "dead" objects in 
AppStatusListener's maps (for example: spark.ui.retainedJobs). We suggest 
enforcing another configurable limit on the number of total objects in 
AppStatusListener's maps and kvstore. This should limit the leak in the case of 
a high event rate, but AppStatusListener stats will remain inaccurate.

  was:
We have a distributed Spark application running on Azure HDInsight using Spark 
version 2.4.4.

After a few days of active processing on our application, we have noticed that 
the GC CPU time ratio of the driver is close to 100%. We suspected a memory 
leak. Thus, we have produced a heap dump and analyzed it using Eclipse Memory 
Analyzer.

Here is some interesting data from the driver's heap dump (heap size is 8 GB):
 * The estimated retained heap size of String objects (~5M instances) is 3.3 
GB. It seems that most of these instances correspond to spark events.
 * Spark UI's AppStatusListener instance estimated retained size is 1.1 GB.
 * The number of LiveJob objects with status "RUNNING" is 18K, knowing that 
there shouldn't be more than 16 live running jobs since we use a fixed thread 
pool of 16 threads to run spark queries.
 * The number of LiveTask objects is 485K.
 * The AsyncEventQueue instance associated with the AppStatusListener has a value 
of 854 for the dropped events count and a value of 10001 for the total events count, 
knowing that the dropped events counter is reset every minute and that the 
queue's default capacity is 10000.

We think that there is a memory leak in Spark UI. Here is our analysis of the 
root cause of this leak:
 * AppStatusListener is notified of Spark events using a bounded queue in 
AsyncEventQueue.
 * AppStatusListener updates its state (kvstore, liveTasks, liveStages, 
liveJobs, ...) based on the received events. For example, onTaskStart adds a 
task to liveTasks map and onTaskEnd removes the task from liveTasks map.
 * When the rate of events is very high, the bounded queue in AsyncEventQueue 
is full, some events are dropped and don't make it to AppStatusListener.
 * Dropped events that signal the end of a processing unit prevent the state of 
AppStatusListener from being cleaned.

[jira] [Created] (SPARK-43525) Enhance ImportOrderChecker rules for `group.scala`

2023-05-16 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-43525:
---

 Summary: Enhance ImportOrderChecker rules for `group.scala`
 Key: SPARK-43525
 URL: https://issues.apache.org/jira/browse/SPARK-43525
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.5.0
Reporter: BingKun Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43518) Convert `_LEGACY_ERROR_TEMP_2029` to INTERNAL_ERROR

2023-05-16 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-43518:


Assignee: BingKun Pan

> Convert `_LEGACY_ERROR_TEMP_2029` to INTERNAL_ERROR
> ---
>
> Key: SPARK-43518
> URL: https://issues.apache.org/jira/browse/SPARK-43518
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43518) Convert `_LEGACY_ERROR_TEMP_2029` to INTERNAL_ERROR

2023-05-16 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-43518.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41179
[https://github.com/apache/spark/pull/41179]

> Convert `_LEGACY_ERROR_TEMP_2029` to INTERNAL_ERROR
> ---
>
> Key: SPARK-43518
> URL: https://issues.apache.org/jira/browse/SPARK-43518
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43524) Memory leak in Spark UI

2023-05-16 Thread Amine Bagdouri (Jira)
Amine Bagdouri created SPARK-43524:
--

 Summary: Memory leak in Spark UI
 Key: SPARK-43524
 URL: https://issues.apache.org/jira/browse/SPARK-43524
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.4.4
Reporter: Amine Bagdouri


We have a distributed Spark application running on Azure HDInsight using Spark 
version 2.4.4.

After a few days of active processing on our application, we have noticed that 
the GC CPU time ratio of the driver is close to 100%. We suspected a memory 
leak. Thus, we have produced a heap dump and analyzed it using Eclipse Memory 
Analyzer.

Here is some interesting data from the driver's heap dump (heap size is 8 GB):
 * The estimated retained heap size of String objects (~5M instances) is 3.3 
GB. It seems that most of these instances correspond to spark events.
 * Spark UI's AppStatusListener instance estimated retained size is 1.1 GB.
 * The number of LiveJob objects with status "RUNNING" is 18K, knowing that 
there shouldn't be more than 16 live running jobs since we use a fixed thread 
pool of 16 threads to run spark queries.
 * The number of LiveTask objects is 485K.
 * The AsyncEventQueue instance associated with the AppStatusListener has a value 
of 854 for the dropped events count and a value of 10001 for the total events count, 
knowing that the dropped events counter is reset every minute and that the 
queue's default capacity is 10000.

We think that there is a memory leak in Spark UI. Here is our analysis of the 
root cause of this leak:
 * AppStatusListener is notified of Spark events using a bounded queue in 
AsyncEventQueue.
 * AppStatusListener updates its state (kvstore, liveTasks, liveStages, 
liveJobs, ...) based on the received events. For example, onTaskStart adds a 
task to liveTasks map and onTaskEnd removes the task from liveTasks map.
 * When the rate of events is very high, the bounded queue in AsyncEventQueue 
is full, some events are dropped and don't make it to AppStatusListener.
 * Dropped events that signal the end of a processing unit prevent the state of 
AppStatusListener from being cleaned. For example, a dropped onTaskEnd event 
will prevent the task from being removed from the liveTasks map, and the task will 
remain in the heap until the driver's JVM is stopped.

We were able to confirm our analysis by reducing the capacity of the 
AsyncEventQueue (spark.scheduler.listenerbus.eventqueue.capacity=10). After 
having launched many spark queries using this config, we observed that the 
number of active jobs in Spark UI increased rapidly and remained high even 
though all submitted queries have completed. We have also noticed that some 
executor task counters in Spark UI were negative, which confirms that 
AppStatusListener state does not accurately reflect reality and that it can 
fall victim to event drops.

Suggested fix:
There are some limits today on the number of "dead" objects in 
AppStatusListener's maps (for example: spark.ui.retainedJobs). We suggest 
enforcing another configurable limit on the number of total objects in 
AppStatusListener's maps and kvstore. This should limit the leak in the case of 
a high event rate, but AppStatusListener stats will remain inaccurate.
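As a stopgap while such a limit does not exist, a hedged sketch of the existing knobs that bound this state. The values are illustrative, not recommendations; a larger event queue reduces drops, while the spark.ui.retained* settings only cap completed ("dead") objects, so they do not fully remove the leak described above.

{code:java}
// Illustrative values only, not recommendations.
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  // A larger queue means fewer dropped events, hence fewer "stuck" live objects.
  .set("spark.scheduler.listenerbus.eventqueue.capacity", "30000")
  // Lower retention caps the memory spent on completed jobs/stages/tasks.
  .set("spark.ui.retainedJobs", "200")
  .set("spark.ui.retainedStages", "200")
  .set("spark.ui.retainedTasks", "20000")

val spark = SparkSession.builder().config(conf).getOrCreate()
{code}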



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43523) Memory leak in Spark UI

2023-05-16 Thread Amine Bagdouri (Jira)
Amine Bagdouri created SPARK-43523:
--

 Summary: Memory leak in Spark UI
 Key: SPARK-43523
 URL: https://issues.apache.org/jira/browse/SPARK-43523
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.4.4
Reporter: Amine Bagdouri


We have a distributed Spark application running on Azure HDInsight using Spark 
version 2.4.4.

After a few days of active processing on our application, we have noticed that 
the GC CPU time ratio of the driver is close to 100%. We suspected a memory 
leak. Thus, we have produced a heap dump and analyzed it using Eclipse Memory 
Analyzer.

Here is some interesting data from the driver's heap dump (heap size is 8 GB):
 * The estimated retained heap size of String objects (~5M instances) is 3.3 
GB. It seems that most of these instances correspond to spark events.
 * Spark UI's AppStatusListener instance estimated retained size is 1.1 GB.
 * The number of LiveJob objects with status "RUNNING" is 18K, knowing that 
there shouldn't be more than 16 live running jobs since we use a fixed thread 
pool of 16 threads to run spark queries.
 * The number of LiveTask objects is 485K.
 * The AsyncEventQueue instance associated with the AppStatusListener has a value 
of 854 for the dropped events count and a value of 10001 for the total events count, 
knowing that the dropped events counter is reset every minute and that the 
queue's default capacity is 10000.

We think that there is a memory leak in Spark UI. Here is our analysis of the 
root cause of this leak:
 * AppStatusListener is notified of Spark events using a bounded queue in 
AsyncEventQueue.
 * AppStatusListener updates its state (kvstore, liveTasks, liveStages, 
liveJobs, ...) based on the received events. For example, onTaskStart adds a 
task to liveTasks map and onTaskEnd removes the task from liveTasks map.
 * When the rate of events is very high, the bounded queue in AsyncEventQueue 
is full, some events are dropped and don't make it to AppStatusListener.
 * Dropped events that signal the end of a processing unit prevent the state of 
AppStatusListener from being cleaned. For example, a dropped onTaskEnd event 
will prevent the task from being removed from the liveTasks map, and the task will 
remain in the heap until the driver's JVM is stopped.

We were able to confirm our analysis by reducing the capacity of the 
AsyncEventQueue (spark.scheduler.listenerbus.eventqueue.capacity=10). After 
having launched many spark queries using this config, we observed that the 
number of active jobs in Spark UI increased rapidly and remained high even 
though all submitted queries have completed. We have also noticed that some 
executor task counters in Spark UI were negative, which confirms that 
AppStatusListener state does not accurately reflect reality and that it can 
fall victim to event drops.

Suggested fix:
There are some limits today on the number of "dead" objects in 
AppStatusListener's maps (for example: spark.ui.retainedJobs). We suggest 
enforcing another configurable limit on the number of total objects in 
AppStatusListener's maps and kvstore. This should limit the leak in the case of 
a high event rate, but AppStatusListener stats will remain inaccurate.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43302) Make Python UDAF an AggregateFunction

2023-05-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17723061#comment-17723061
 ] 

ASF GitHub Bot commented on SPARK-43302:


User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/41142

> Make Python UDAF an AggregateFunction
> -
>
> Key: SPARK-43302
> URL: https://issues.apache.org/jira/browse/SPARK-43302
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43518) Convert `_LEGACY_ERROR_TEMP_2029` to INTERNAL_ERROR

2023-05-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17723059#comment-17723059
 ] 

ASF GitHub Bot commented on SPARK-43518:


User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/41179

> Convert `_LEGACY_ERROR_TEMP_2029` to INTERNAL_ERROR
> ---
>
> Key: SPARK-43518
> URL: https://issues.apache.org/jira/browse/SPARK-43518
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43457) [PYTHON][CONNECT] user agent should include the OS and Python versions

2023-05-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-43457:


Assignee: Niranjan Jayakar

> [PYTHON][CONNECT] user agent should include the OS and Python versions
> --
>
> Key: SPARK-43457
> URL: https://issues.apache.org/jira/browse/SPARK-43457
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Niranjan Jayakar
>Assignee: Niranjan Jayakar
>Priority: Major
>
> Including OS and Python versions in the user agent improves tracking to see 
> how Spark Connect is used across Python versions and the different platforms 
> it's used from



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43457) [PYTHON][CONNECT] user agent should include the OS and Python versions

2023-05-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-43457.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41138
[https://github.com/apache/spark/pull/41138]

> [PYTHON][CONNECT] user agent should include the OS and Python versions
> --
>
> Key: SPARK-43457
> URL: https://issues.apache.org/jira/browse/SPARK-43457
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Niranjan Jayakar
>Assignee: Niranjan Jayakar
>Priority: Major
> Fix For: 3.5.0
>
>
> Including OS and Python versions in the user agent improves tracking to see 
> how Spark Connect is used across Python versions and the different platforms 
> it's used from



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43522) Creating struct column occurs error 'org.apache.spark.sql.AnalysisException [DATATYPE_MISMATCH.CREATE_NAMED_STRUCT_WITHOUT_FOLDABLE_STRING]'

2023-05-16 Thread Heedo Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Heedo Lee updated SPARK-43522:
--
Description: 
When creating a struct column in Dataframe, the code that ran without problems 
in version 3.3.1 does not work in version 3.4.0.

 

Example
{code:java}
val testDF = Seq("a=b,c=d,d=f").toDF.withColumn("key_value", split('value, 
",")).withColumn("map_entry", transform(col("key_value"), x => struct(split(x, 
"=").getItem(0), split(x, "=").getItem(1) ) )){code}
 

In 3.3.1

 
{code:java}
 
testDF.show()
+-----------+---------------+--------------------+
|      value|      key_value|           map_entry|
+-----------+---------------+--------------------+
|a=b,c=d,d=f|[a=b, c=d, d=f]|[{a, b}, {c, d}, ...|
+-----------+---------------+--------------------+
 
testDF.printSchema()
root
 |-- value: string (nullable = true)
 |-- key_value: array (nullable = true)
 |    |-- element: string (containsNull = false)
 |-- map_entry: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- col1: string (nullable = true)
 |    |    |-- col2: string (nullable = true)
{code}
 

 

In 3.4.0

 
{code:java}
org.apache.spark.sql.AnalysisException: 
[DATATYPE_MISMATCH.CREATE_NAMED_STRUCT_WITHOUT_FOLDABLE_STRING] Cannot resolve 
"struct(split(namedlambdavariable(), =, -1)[0], split(namedlambdavariable(), =, 
-1)[1])" due to data type mismatch: Only foldable `STRING` expressions are 
allowed to appear at odd position, but they are ["0", "1"].;
'Project [value#41, key_value#45, transform(key_value#45, 
lambdafunction(struct(0, split(lambda x_3#49, =, -1)[0], 1, split(lambda 
x_3#49, =, -1)[1]), lambda x_3#49, false)) AS map_entry#48]
+- Project [value#41, split(value#41, ,, -1) AS key_value#45]
   +- LocalRelation [value#41]  at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.dataTypeMismatch(package.scala:73)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5(CheckAnalysis.scala:269)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5$adapted(CheckAnalysis.scala:256)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:295)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:294)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:294)
  at scala.collection.Iterator.foreach(Iterator.scala:943)
  at scala.collection.Iterator.foreach$(Iterator.scala:943)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
  at scala.collection.IterableLike.foreach(IterableLike.scala:74)
  at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
  at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:294)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:294)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:294)
  at scala.collection.Iterator.foreach(Iterator.scala:943)
  at scala.collection.Iterator.foreach$(Iterator.scala:943)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)

 
{code}
 

However, if you alias the struct elements, you get the same result as in 
the previous version.

 
{code:java}
val testDF = Seq("a=b,c=d,d=f").toDF.withColumn("key_value", split('value, 
",")).withColumn("map_entry", transform(col("key_value"), x => struct(split(x, 
"=").getItem(0).as("col1") , split(x, "=").getItem(1).as("col2") ) )){code}
 

 

  was:
When creating a struct column in Dataframe, the code that ran without problems 
in version 3.3.1 does not work in version 3.4.0.

 

Example
{code:java}
val testDF = Seq("a=b,c=d,d=f").toDF.withColumn("key_value", split('value, 
",")).withColumn("map_entry", transform(col("key_value"), x => struct(split(x, 
"=").getItem(0), split(x, "=").getItem(1) ) )){code}
 

In 3.3.1

 
{code:java}
 
testDF.show()
+-----------+---------------+--------------------+
|      value|      key_value|           map_entry|
+-----------+---------------+--------------------+
|a=b,c=d,d=f|[a=b, c=d, d=f]|[{a, b}, {c, d}, ...|
+-----------+---------------+--------------------+
 
root
 |-- value: string (nullable = true)
 |-- key_value: array (nullable = true)
 |    |-- element: string (containsNull = false)
 |-- map_entry: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- col1: string (nullable = true)
 |    |    |-- col2: string (nullable = true)
{code}
 

 

In 3.4.0

 
{code:java}
org.apache.spark.sql.AnalysisException: 
[DATATYPE_MISMATCH.CREATE_NAMED_STRUCT_WITHOUT_FOLDABLE_STRING] Cannot resolve 
"struct(split(namedlambdavariable(), =, -1)[0], split(namedlambdavariable(), =, 
-1)[1])" due to data type mismatch: Only foldable `STRING` expressions are 
allowed 

[jira] [Updated] (SPARK-43522) Creating struct column occurs error 'org.apache.spark.sql.AnalysisException [DATATYPE_MISMATCH.CREATE_NAMED_STRUCT_WITHOUT_FOLDABLE_STRING]'

2023-05-16 Thread Heedo Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Heedo Lee updated SPARK-43522:
--
Description: 
When creating a struct column in Dataframe, the code that ran without problems 
in version 3.3.1 does not work in version 3.4.0.

 

Example
{code:java}
val testDF = Seq("a=b,c=d,d=f").toDF.withColumn("key_value", split('value, 
",")).withColumn("map_entry", transform(col("key_value"), x => struct(split(x, 
"=").getItem(0), split(x, "=").getItem(1) ) )){code}
 

In 3.3.1

 
{code:java}
 
testDF.show()
+-----------+---------------+--------------------+
|      value|      key_value|           map_entry|
+-----------+---------------+--------------------+
|a=b,c=d,d=f|[a=b, c=d, d=f]|[{a, b}, {c, d}, ...|
+-----------+---------------+--------------------+
 
root
 |-- value: string (nullable = true)
 |-- key_value: array (nullable = true)
 |    |-- element: string (containsNull = false)
 |-- map_entry: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- col1: string (nullable = true)
 |    |    |-- col2: string (nullable = true)
{code}
 

 

In 3.4.0

 
{code:java}
org.apache.spark.sql.AnalysisException: 
[DATATYPE_MISMATCH.CREATE_NAMED_STRUCT_WITHOUT_FOLDABLE_STRING] Cannot resolve 
"struct(split(namedlambdavariable(), =, -1)[0], split(namedlambdavariable(), =, 
-1)[1])" due to data type mismatch: Only foldable `STRING` expressions are 
allowed to appear at odd position, but they are ["0", "1"].;
'Project [value#41, key_value#45, transform(key_value#45, 
lambdafunction(struct(0, split(lambda x_3#49, =, -1)[0], 1, split(lambda 
x_3#49, =, -1)[1]), lambda x_3#49, false)) AS map_entry#48]
+- Project [value#41, split(value#41, ,, -1) AS key_value#45]
   +- LocalRelation [value#41]  at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.dataTypeMismatch(package.scala:73)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5(CheckAnalysis.scala:269)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5$adapted(CheckAnalysis.scala:256)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:295)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:294)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:294)
  at scala.collection.Iterator.foreach(Iterator.scala:943)
  at scala.collection.Iterator.foreach$(Iterator.scala:943)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
  at scala.collection.IterableLike.foreach(IterableLike.scala:74)
  at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
  at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:294)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:294)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:294)
  at scala.collection.Iterator.foreach(Iterator.scala:943)
  at scala.collection.Iterator.foreach$(Iterator.scala:943)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)

 
{code}
 

However, if you alias the struct elements, you get the same result as in 
the previous version.

 
{code:java}
val testDF = Seq("a=b,c=d,d=f").toDF.withColumn("key_value", split('value, 
",")).withColumn("map_entry", transform(col("key_value"), x => struct(split(x, 
"=").getItem(0).as("col1") , split(x, "=").getItem(1).as("col2") ) )){code}
 

 

  was:
When creating a struct column in Dataframe, the code that ran without problems 
in version 3.3.1 does not work in version 3.4.0.

 

Example
{code:java}
val testDF = Seq("a=b,c=d,d=f").toDF.withColumn("key_value", split('value, 
",")).withColumn("map_entry", transform(col("key_value"), x => struct(split(x, 
"=").getItem(0), split(x, "=").getItem(1) ) )){code}
 

In 3.3.1

 
{code:java}
 
testDF.show()
+-----------+---------------+--------------------+
|      value|      key_value|           map_entry|
+-----------+---------------+--------------------+
|a=b,c=d,d=f|[a=b, c=d, d=f]|[{a, b}, {c, d}, ...|
+-----------+---------------+--------------------+
 
root
 |-- value: string (nullable = true)
 |-- key_value: array (nullable = true)
 |    |-- element: string (containsNull = false)
 |-- map_entry: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- col1: string (nullable = true)
 |    |    |-- col2: string (nullable = true)
 |-- aaa: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true) {code}
 

 

In 3.4.0

 
{code:java}
org.apache.spark.sql.AnalysisException: 
[DATATYPE_MISMATCH.CREATE_NAMED_STRUCT_WITHOUT_FOLDABLE_STRING] Cannot resolve 
"struct(split(namedlambdavariable(), =, -1)[0], split(namedlambdavariable(), =, 

[jira] [Updated] (SPARK-43522) Creating struct column occurs error 'org.apache.spark.sql.AnalysisException [DATATYPE_MISMATCH.CREATE_NAMED_STRUCT_WITHOUT_FOLDABLE_STRING]'

2023-05-16 Thread Heedo Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Heedo Lee updated SPARK-43522:
--
Description: 
When creating a struct column in Dataframe, the code that ran without problems 
in version 3.3.1 does not work in version 3.4.0.

 

Example
{code:java}
val testDF = Seq("a=b,c=d,d=f").toDF.withColumn("key_value", split('value, 
",")).withColumn("map_entry", transform(col("key_value"), x => struct(split(x, 
"=").getItem(0), split(x, "=").getItem(1) ) )){code}
 

In 3.3.1

 
{code:java}
 
testDF.show()
+-----------+---------------+--------------------+
|      value|      key_value|           map_entry|
+-----------+---------------+--------------------+
|a=b,c=d,d=f|[a=b, c=d, d=f]|[{a, b}, {c, d}, ...|
+-----------+---------------+--------------------+
 
root
 |-- value: string (nullable = true)
 |-- key_value: array (nullable = true)
 |    |-- element: string (containsNull = false)
 |-- map_entry: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- col1: string (nullable = true)
 |    |    |-- col2: string (nullable = true)
 |-- aaa: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true) {code}
 

 

In 3.4.0

 
{code:java}
org.apache.spark.sql.AnalysisException: 
[DATATYPE_MISMATCH.CREATE_NAMED_STRUCT_WITHOUT_FOLDABLE_STRING] Cannot resolve 
"struct(split(namedlambdavariable(), =, -1)[0], split(namedlambdavariable(), =, 
-1)[1])" due to data type mismatch: Only foldable `STRING` expressions are 
allowed to appear at odd position, but they are ["0", "1"].;
'Project [value#41, key_value#45, transform(key_value#45, 
lambdafunction(struct(0, split(lambda x_3#49, =, -1)[0], 1, split(lambda 
x_3#49, =, -1)[1]), lambda x_3#49, false)) AS map_entry#48]
+- Project [value#41, split(value#41, ,, -1) AS key_value#45]
   +- LocalRelation [value#41]  at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.dataTypeMismatch(package.scala:73)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5(CheckAnalysis.scala:269)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5$adapted(CheckAnalysis.scala:256)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:295)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:294)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:294)
  at scala.collection.Iterator.foreach(Iterator.scala:943)
  at scala.collection.Iterator.foreach$(Iterator.scala:943)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
  at scala.collection.IterableLike.foreach(IterableLike.scala:74)
  at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
  at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:294)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:294)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:294)
  at scala.collection.Iterator.foreach(Iterator.scala:943)
  at scala.collection.Iterator.foreach$(Iterator.scala:943)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)

 
{code}
 

However, if you alias the struct elements, you get the same result as in 
the previous version.

 
{code:java}
val testDF = Seq("a=b,c=d,d=f").toDF.withColumn("key_value", split('value, 
",")).withColumn("map_entry", transform(col("key_value"), x => struct(split(x, 
"=").getItem(0).as("col1") , split(x, "=").getItem(1).as("col2") ) )){code}
 

 

  was:
When creating a struct column in Dataframe, the code that ran without problems 
in version 3.3.1 does not work in version 3.4.0.

 

Example
{code:java}
val testDF = Seq("a=b,c=d,d=f").toDF.withColumn("key_value", split('value, 
",")).withColumn("map_entry", transform(col("key_value"), x => struct(split(x, 
"=").getItem(0), split(x, "=").getItem(1) ) )){code}
 

In 3.3.1

 
{code:java}
 
testDF.show()
+-----------+---------------+--------------------+
|      value|      key_value|           map_entry|
+-----------+---------------+--------------------+
|a=b,c=d,d=f|[a=b, c=d, d=f]|[{a, b}, {c, d}, ...|
+-----------+---------------+--------------------+

testDF.printSchema
root
 |-- value: string (nullable = true)
 |-- key_value: array (nullable = true)
 |    |-- element: string (containsNull = false)
 |-- map_entry: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- col1: string (nullable = true)
 |    |    |-- col2: string (nullable = true)
{code}
 

 

In 3.4.0

 
{code:java}
org.apache.spark.sql.AnalysisException: 
[DATATYPE_MISMATCH.CREATE_NAMED_STRUCT_WITHOUT_FOLDABLE_STRING] Cannot resolve 
"struct(split(namedlambdavariable(), =, -1)[0], 

[jira] [Created] (SPARK-43522) Creating struct column occurs error 'org.apache.spark.sql.AnalysisException [DATATYPE_MISMATCH.CREATE_NAMED_STRUCT_WITHOUT_FOLDABLE_STRING]'

2023-05-16 Thread Heedo Lee (Jira)
Heedo Lee created SPARK-43522:
-

 Summary: Creating struct column occurs  error 
'org.apache.spark.sql.AnalysisException 
[DATATYPE_MISMATCH.CREATE_NAMED_STRUCT_WITHOUT_FOLDABLE_STRING]'
 Key: SPARK-43522
 URL: https://issues.apache.org/jira/browse/SPARK-43522
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.0
Reporter: Heedo Lee


When creating a struct column in Dataframe, the code that ran without problems 
in version 3.3.1 does not work in version 3.4.0.

 

Example
{code:java}
val testDF = Seq("a=b,c=d,d=f").toDF.withColumn("key_value", split('value, 
",")).withColumn("map_entry", transform(col("key_value"), x => struct(split(x, 
"=").getItem(0), split(x, "=").getItem(1) ) )){code}
 

In 3.3.1

 
{code:java}
 
testDF.show()
+-----------+---------------+--------------------+
|      value|      key_value|           map_entry|
+-----------+---------------+--------------------+
|a=b,c=d,d=f|[a=b, c=d, d=f]|[{a, b}, {c, d}, ...|
+-----------+---------------+--------------------+

testDF.printSchema
root
 |-- value: string (nullable = true)
 |-- key_value: array (nullable = true)
 |    |-- element: string (containsNull = false)
 |-- map_entry: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- col1: string (nullable = true)
 |    |    |-- col2: string (nullable = true)
{code}
 

 

In 3.4.0

 
{code:java}
org.apache.spark.sql.AnalysisException: 
[DATATYPE_MISMATCH.CREATE_NAMED_STRUCT_WITHOUT_FOLDABLE_STRING] Cannot resolve 
"struct(split(namedlambdavariable(), =, -1)[0], split(namedlambdavariable(), =, 
-1)[1])" due to data type mismatch: Only foldable `STRING` expressions are 
allowed to appear at odd position, but they are ["0", "1"].;
'Project [value#41, key_value#45, transform(key_value#45, 
lambdafunction(struct(0, split(lambda x_3#49, =, -1)[0], 1, split(lambda 
x_3#49, =, -1)[1]), lambda x_3#49, false)) AS map_entry#48]
+- Project [value#41, split(value#41, ,, -1) AS key_value#45]
   +- LocalRelation [value#41]  at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.dataTypeMismatch(package.scala:73)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5(CheckAnalysis.scala:269)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5$adapted(CheckAnalysis.scala:256)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:295)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:294)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:294)
  at scala.collection.Iterator.foreach(Iterator.scala:943)
  at scala.collection.Iterator.foreach$(Iterator.scala:943)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
  at scala.collection.IterableLike.foreach(IterableLike.scala:74)
  at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
  at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:294)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:294)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:294)
  at scala.collection.Iterator.foreach(Iterator.scala:943)
  at scala.collection.Iterator.foreach$(Iterator.scala:943)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)

 
{code}
 

However, if you alias the struct elements, you get the same result as in 
the previous version.

 
{code:java}
val testDF = Seq("a=b,c=d,d=f").toDF.withColumn("key_value", split('value, 
",")).withColumn("map_entry", transform(col("key_value"), x => struct(split(x, 
"=").getItem(0).as("col1") , split(x, "=").getItem(1).as("col2") ) )){code}
 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43521) Support CREATE TABLE LIKE FILE for PARQUET

2023-05-16 Thread melin (Jira)
melin created SPARK-43521:
-

 Summary: Support CREATE TABLE LIKE FILE for PARQUET
 Key: SPARK-43521
 URL: https://issues.apache.org/jira/browse/SPARK-43521
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0
Reporter: melin


ref: https://issues.apache.org/jira/browse/HIVE-26395



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org