[jira] [Created] (SPARK-43533) Enable MultiIndex test for IndexesTests.test_difference
Haejoon Lee created SPARK-43533: --- Summary: Enable MultiIndex test for IndexesTests.test_difference Key: SPARK-43533 URL: https://issues.apache.org/jira/browse/SPARK-43533 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 3.5.0 Reporter: Haejoon Lee Enable MultiIndex test for IndexesTests.test_difference -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43532) Upgrade `jdbc` related test dependencies
[ https://issues.apache.org/jira/browse/SPARK-43532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-43532: - Assignee: BingKun Pan > Upgrade `jdbc` related test dependencies > > > Key: SPARK-43532 > URL: https://issues.apache.org/jira/browse/SPARK-43532 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43532) Upgrade `jdbc` related test dependencies
[ https://issues.apache.org/jira/browse/SPARK-43532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-43532. --- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41194 [https://github.com/apache/spark/pull/41194] > Upgrade `jdbc` related test dependencies > > > Key: SPARK-43532 > URL: https://issues.apache.org/jira/browse/SPARK-43532 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17723338#comment-17723338 ] Yuming Wang commented on SPARK-43526: - Why do you prefer shuffle hash join? > when shuffle hash join is enabled, q95 performance deteriorates > --- > > Key: SPARK-43526 > URL: https://issues.apache.org/jira/browse/SPARK-43526 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: caican >Priority: Major > Attachments: image-2023-05-16-21-21-35-493.png, > image-2023-05-16-21-22-16-170.png, image-2023-05-16-21-23-35-237.png, > image-2023-05-16-21-24-09-182.png, image-2023-05-16-21-28-11-514.png, > image-2023-05-16-21-28-44-163.png > > > Testing with a 5TB dataset, the performance of q95 in TPC-DS deteriorates when > shuffle hash join is enabled; performance is better when sortMergeJoin > is used. > > Performance difference: from 3.9 min (sortMergeJoin) to > 8.1 min (shuffledHashJoin) > > 1. With shuffledHashJoin enabled, the execution plan is as follows: > !image-2023-05-16-21-28-44-163.png|width=935,height=64! > !image-2023-05-16-21-21-35-493.png|width=924,height=502! > 2. With shuffledHashJoin disabled, the execution plan is as follows: > !image-2023-05-16-21-28-11-514.png|width=922,height=67! > !image-2023-05-16-21-22-16-170.png|width=934,height=477! > > Also, when shuffledHashJoin is enabled, GC is very severe, > !image-2023-05-16-21-23-35-237.png|width=929,height=570! > > but sortMergeJoin runs without this problem. > !image-2023-05-16-21-24-09-182.png|width=931,height=573! > > Any suggestions on how to solve it? Thanks!
[jira] [Commented] (SPARK-43509) Support creating multiple sessions for Spark Connect in PySpark
[ https://issues.apache.org/jira/browse/SPARK-43509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17723336#comment-17723336 ] Snoot.io commented on SPARK-43509: -- User 'grundprinzip' has created a pull request for this issue: https://github.com/apache/spark/pull/41013 > Support creating multiple sessions for Spark Connect in PySpark > --- > > Key: SPARK-43509 > URL: https://issues.apache.org/jira/browse/SPARK-43509 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.5.0 >Reporter: Martin Grund >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43461) Skip compiling useless files when making distribution
[ https://issues.apache.org/jira/browse/SPARK-43461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-43461: Fix Version/s: 3.5.0 > Skip compiling useless files when making distribution > - > > Key: SPARK-43461 > URL: https://issues.apache.org/jira/browse/SPARK-43461 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.5.0 > > > -Dmaven.javadoc.skip=true to skip java doc > -Dskip=true to skip scala doc. Please see: > https://davidb.github.io/scala-maven-plugin/doc-jar-mojo.html#skip > -Dmaven.source.skip to skip build sources.jar > -Dmaven.test.skip to skip build test-jar > -Dcyclonedx.skip=true to skip making bom. Please see: > https://cyclonedx.github.io/cyclonedx-maven-plugin/makeBom-mojo.html#skip -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
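The skip properties listed in the description can be combined into a single build invocation. This is only a sketch under assumptions: the goal (`package`), profiles, and whether each flag applies depend on the actual distribution build being run.

```shell
# Hypothetical combined invocation; the individual flags are taken from the
# description above, the target and any profiles are assumptions.
./build/mvn package \
  -Dmaven.javadoc.skip=true \
  -Dskip=true \
  -Dmaven.source.skip \
  -Dmaven.test.skip \
  -Dcyclonedx.skip=true
```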
[jira] [Resolved] (SPARK-43461) Skip compiling useless files when making distribution
[ https://issues.apache.org/jira/browse/SPARK-43461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie resolved SPARK-43461. -- Resolution: Fixed Issue resolved by pull request 41141 https://github.com/apache/spark/pull/41141 > Skip compiling useless files when making distribution > - > > Key: SPARK-43461 > URL: https://issues.apache.org/jira/browse/SPARK-43461 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > > -Dmaven.javadoc.skip=true to skip java doc > -Dskip=true to skip scala doc. Please see: > https://davidb.github.io/scala-maven-plugin/doc-jar-mojo.html#skip > -Dmaven.source.skip to skip build sources.jar > -Dmaven.test.skip to skip build test-jar > -Dcyclonedx.skip=true to skip making bom. Please see: > https://cyclonedx.github.io/cyclonedx-maven-plugin/makeBom-mojo.html#skip -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43461) Skip compiling useless files when making distribution
[ https://issues.apache.org/jira/browse/SPARK-43461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie reassigned SPARK-43461: Assignee: Yuming Wang > Skip compiling useless files when making distribution > - > > Key: SPARK-43461 > URL: https://issues.apache.org/jira/browse/SPARK-43461 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > > -Dmaven.javadoc.skip=true to skip java doc > -Dskip=true to skip scala doc. Please see: > https://davidb.github.io/scala-maven-plugin/doc-jar-mojo.html#skip > -Dmaven.source.skip to skip build sources.jar > -Dmaven.test.skip to skip build test-jar > -Dcyclonedx.skip=true to skip making bom. Please see: > https://cyclonedx.github.io/cyclonedx-maven-plugin/makeBom-mojo.html#skip -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43531) Enable more parity tests for Pandas UDFs.
[ https://issues.apache.org/jira/browse/SPARK-43531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-43531: - Assignee: Takuya Ueshin > Enable more parity tests for Pandas UDFs. > - > > Key: SPARK-43531 > URL: https://issues.apache.org/jira/browse/SPARK-43531 > Project: Spark > Issue Type: Test > Components: Connect, PySpark >Affects Versions: 3.5.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43531) Enable more parity tests for Pandas UDFs.
[ https://issues.apache.org/jira/browse/SPARK-43531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-43531. --- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41193 [https://github.com/apache/spark/pull/41193] > Enable more parity tests for Pandas UDFs. > - > > Key: SPARK-43531 > URL: https://issues.apache.org/jira/browse/SPARK-43531 > Project: Spark > Issue Type: Test > Components: Connect, PySpark >Affects Versions: 3.5.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43488) bitmap function
[ https://issues.apache.org/jira/browse/SPARK-43488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yiku123 updated SPARK-43488: Description: Maybe Spark needs some bitmap functions, for example bitmapBuild, bitmapAnd, and bitmapAndCardinality as found in ClickHouse and other OLAP engines. These are often used in user-profiling applications, but I can't find them in Spark. was: Maybe Spark needs some bitmap functions, for example bitmapBuild, bitmapAnd, and bitmapAndCardinality as found in ClickHouse and other OLAP engines. These are often used in user-profiling applications, but I can't find them in Spark. > bitmap function > --- > > Key: SPARK-43488 > URL: https://issues.apache.org/jira/browse/SPARK-43488 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: yiku123 >Priority: Major > > Maybe Spark needs some bitmap functions, for example bitmapBuild, bitmapAnd, and bitmapAndCardinality as found in ClickHouse and other OLAP engines. These are often used in user-profiling applications, but I can't find them in Spark.
[jira] [Created] (SPARK-43532) Upgrade `jdbc` related test dependencies
BingKun Pan created SPARK-43532: --- Summary: Upgrade `jdbc` related test dependencies Key: SPARK-43532 URL: https://issues.apache.org/jira/browse/SPARK-43532 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.5.0 Reporter: BingKun Pan
[jira] [Resolved] (SPARK-43524) Memory leak in Spark UI
[ https://issues.apache.org/jira/browse/SPARK-43524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-43524. - Resolution: Duplicate > Memory leak in Spark UI > --- > > Key: SPARK-43524 > URL: https://issues.apache.org/jira/browse/SPARK-43524 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.4.4 >Reporter: Amine Bagdouri >Priority: Major > > We have a distributed Spark application running on Azure HDInsight using > Spark version 2.4.4. > After a few days of active processing on our application, we have noticed > that the GC CPU time ratio of the driver is close to 100%. We suspected a > memory leak. Thus, we have produced a heap dump and analyzed it using Eclipse > Memory Analyzer. > Here is some interesting data from the driver's heap dump (heap size is 8 GB): > * The estimated retained heap size of String objects (~5M instances) is 3.3 > GB. It seems that most of these instances correspond to spark events. > * Spark UI's AppStatusListener instance estimated retained size is 1.1 GB. > * The number of LiveJob objects with status "RUNNING" is 18K, knowing that > there shouldn't be more than 16 live running jobs since we use a fixed size > thread pool of 16 threads to run spark queries. > * The number of LiveTask objects is 485K. > * The AsyncEventQueue instance associated to the AppStatusListener has a > value of 854 for dropped events count and a value of 10001 for total events > count, knowing that the dropped events counter is reset every minute and that > the queue's default capacity is 1. > We think that there is a memory leak in Spark UI. Here is our analysis of the > root cause of this leak: > * AppStatusListener is notified of Spark events using a bounded queue in > AsyncEventQueue. > * AppStatusListener updates its state (kvstore, liveTasks, liveStages, > liveJobs, ...) based on the received events. 
For example, onTaskStart adds a > task to liveTasks map and onTaskEnd removes the task from liveTasks map. > * When the rate of events is very high, the bounded queue in AsyncEventQueue > is full, some events are dropped and don't make it to AppStatusListener. > * Dropped events that signal the end of a processing unit prevent the state > of AppStatusListener from being cleaned. For example, a dropped onTaskEnd > event, will prevent the task from being removed from liveTasks map, and the > task will remain in the heap until the driver's JVM is stopped. > We were able to confirm our analysis by reducing the capacity of the > AsyncEventQueue (spark.scheduler.listenerbus.eventqueue.capacity=10). After > having launched many spark queries using this config, we observed that the > number of active jobs in Spark UI increased rapidly and remained high even > though all submitted queries have completed. We have also noticed that some > executor task counters in Spark UI were negative, which confirms that > AppStatusListener state does not accurately reflect the reality and that it > can be a victim of event drops. > Suggested fix: > There are some limits today on the number of "dead" objects in > AppStatusListener's maps (for example: spark.ui.retainedJobs). We suggest > enforcing another configurable limit on the number of total objects in > AppStatusListener's maps and kvstore. This should limit the leak in the case > of high events rate, but AppStatusListener stats will remain inaccurate. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
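The failure mode analyzed above (a dropped end event leaving its task in the liveTasks map forever) can be illustrated with a toy model. This is a minimal sketch, not Spark's actual AsyncEventQueue/AppStatusListener code; the class and field names are made up, and it only mimics the drop-on-full queue behavior and the start/end bookkeeping.

```python
from collections import deque

class ListenerSketch:
    """Toy model of a bounded event queue feeding a live-task map.

    Not Spark's real classes: it only mimics AsyncEventQueue dropping
    events when full, and AppStatusListener's start/end bookkeeping.
    """

    def __init__(self, queue_capacity):
        self.capacity = queue_capacity
        self.queue = deque()
        self.live_tasks = {}   # analogous to AppStatusListener's liveTasks map
        self.dropped = 0

    def post(self, event):
        # A full bounded queue silently drops the event.
        if len(self.queue) >= self.capacity:
            self.dropped += 1
        else:
            self.queue.append(event)

    def drain(self):
        # The listener thread processes whatever survived the queue.
        while self.queue:
            kind, task_id = self.queue.popleft()
            if kind == "start":
                self.live_tasks[task_id] = {"status": "RUNNING"}
            else:  # "end": if the end event was dropped, this never runs
                self.live_tasks.pop(task_id, None)

listener = ListenerSketch(queue_capacity=10)
for i in range(10):
    listener.post(("start", i))   # fills the queue exactly to capacity
for i in range(10):
    listener.post(("end", i))     # queue still full: every end event dropped
listener.drain()
# All 10 tasks stay "RUNNING" forever, mirroring the leaked LiveTask objects.
print(len(listener.live_tasks), listener.dropped)  # -> 10 10
```

In the real system the draining thread runs concurrently with producers; the sketch serializes the drain only to make the drop deterministic, which is the same effect the reporter achieved by shrinking the queue capacity to 10.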
[jira] [Updated] (SPARK-43521) Support CREATE TABLE LIKE FILE for PARQUET
[ https://issues.apache.org/jira/browse/SPARK-43521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-43521: Issue Type: New Feature (was: Bug) > Support CREATE TABLE LIKE FILE for PARQUET > -- > > Key: SPARK-43521 > URL: https://issues.apache.org/jira/browse/SPARK-43521 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.5.0 >Reporter: melin >Priority: Major > > ref: https://issues.apache.org/jira/browse/HIVE-26395 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43531) Enable more parity tests for Pandas UDFs.
Takuya Ueshin created SPARK-43531: - Summary: Enable more parity tests for Pandas UDFs. Key: SPARK-43531 URL: https://issues.apache.org/jira/browse/SPARK-43531 Project: Spark Issue Type: Test Components: Connect, PySpark Affects Versions: 3.5.0 Reporter: Takuya Ueshin
[jira] [Assigned] (SPARK-43525) Enhance ImportOrderChecker rules for `group.scala`
[ https://issues.apache.org/jira/browse/SPARK-43525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-43525: Assignee: BingKun Pan > Enhance ImportOrderChecker rules for `group.scala` > -- > > Key: SPARK-43525 > URL: https://issues.apache.org/jira/browse/SPARK-43525 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43525) Enhance ImportOrderChecker rules for `group.scala`
[ https://issues.apache.org/jira/browse/SPARK-43525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-43525. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41185 [https://github.com/apache/spark/pull/41185] > Enhance ImportOrderChecker rules for `group.scala` > -- > > Key: SPARK-43525 > URL: https://issues.apache.org/jira/browse/SPARK-43525 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43528) Support duplicated field names in createDataFrame with pandas DataFrame.
[ https://issues.apache.org/jira/browse/SPARK-43528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-43528. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41190 [https://github.com/apache/spark/pull/41190] > Support duplicated field names in createDataFrame with pandas DataFrame. > > > Key: SPARK-43528 > URL: https://issues.apache.org/jira/browse/SPARK-43528 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43528) Support duplicated field names in createDataFrame with pandas DataFrame.
[ https://issues.apache.org/jira/browse/SPARK-43528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-43528: Assignee: Takuya Ueshin > Support duplicated field names in createDataFrame with pandas DataFrame. > > > Key: SPARK-43528 > URL: https://issues.apache.org/jira/browse/SPARK-43528 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43527) Fix catalog.listCatalogs in PySpark
[ https://issues.apache.org/jira/browse/SPARK-43527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-43527. -- Fix Version/s: 3.5.0 3.4.1 Resolution: Fixed Issue resolved by pull request 41186 [https://github.com/apache/spark/pull/41186] > Fix catalog.listCatalogs in PySpark > --- > > Key: SPARK-43527 > URL: https://issues.apache.org/jira/browse/SPARK-43527 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.4.0, 3.4.1, 3.5.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Critical > Fix For: 3.5.0, 3.4.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43527) Fix catalog.listCatalogs in PySpark
[ https://issues.apache.org/jira/browse/SPARK-43527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-43527: Assignee: Ruifeng Zheng > Fix catalog.listCatalogs in PySpark > --- > > Key: SPARK-43527 > URL: https://issues.apache.org/jira/browse/SPARK-43527 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.4.0, 3.4.1, 3.5.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Critical > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43360) Scala Connect: Add StreamingQueryManager API
[ https://issues.apache.org/jira/browse/SPARK-43360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-43360. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41039 [https://github.com/apache/spark/pull/41039] > Scala Connect: Add StreamingQueryManager API > > > Key: SPARK-43360 > URL: https://issues.apache.org/jira/browse/SPARK-43360 > Project: Spark > Issue Type: Task > Components: Connect, Structured Streaming >Affects Versions: 3.5.0 >Reporter: Wei Liu >Assignee: Wei Liu >Priority: Major > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43360) Scala Connect: Add StreamingQueryManager API
[ https://issues.apache.org/jira/browse/SPARK-43360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-43360: Assignee: Wei Liu > Scala Connect: Add StreamingQueryManager API > > > Key: SPARK-43360 > URL: https://issues.apache.org/jira/browse/SPARK-43360 > Project: Spark > Issue Type: Task > Components: Connect, Structured Streaming >Affects Versions: 3.5.0 >Reporter: Wei Liu >Assignee: Wei Liu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43530) Protobuf: Read descriptor file only once at the compile time
Raghu Angadi created SPARK-43530: Summary: Protobuf: Read descriptor file only once at the compile time Key: SPARK-43530 URL: https://issues.apache.org/jira/browse/SPARK-43530 Project: Spark Issue Type: Task Components: Protobuf Affects Versions: 3.5.0 Reporter: Raghu Angadi Fix For: 3.5.0 Protobuf functions read the descriptor file many times (e.g. at each executor). This is unnecessary and error-prone (e.g. what if the contents change a couple of days after the streaming query starts?). The file only needs to be read once.
[jira] [Created] (SPARK-43529) Support general expressions as OPTIONS values
Daniel created SPARK-43529: -- Summary: Support general expressions as OPTIONS values Key: SPARK-43529 URL: https://issues.apache.org/jira/browse/SPARK-43529 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.0 Reporter: Daniel
[jira] [Created] (SPARK-43528) Support duplicated field names in createDataFrame with pandas DataFrame.
Takuya Ueshin created SPARK-43528: - Summary: Support duplicated field names in createDataFrame with pandas DataFrame. Key: SPARK-43528 URL: https://issues.apache.org/jira/browse/SPARK-43528 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.5.0 Reporter: Takuya Ueshin
[jira] [Resolved] (SPARK-42958) Refactor `CheckConnectJvmClientCompatibility` to compare client and avro
[ https://issues.apache.org/jira/browse/SPARK-42958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell resolved SPARK-42958. --- Fix Version/s: 3.5.0 Assignee: Yang Jie Resolution: Fixed > Refactor `CheckConnectJvmClientCompatibility` to compare client and avro > > > Key: SPARK-42958 > URL: https://issues.apache.org/jira/browse/SPARK-42958 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43514) Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML features caused by certain SQL functions
[ https://issues.apache.org/jira/browse/SPARK-43514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Svyatoslav Semenyuk updated SPARK-43514: Environment: Scala version: 2.12.17 Test examples were executed inside Zeppelin 0.10.1 with Spark 3.4.0. Spark 3.3.2 deployed on cluster was used to check the issue on real data. was: Scala version: 2.12.17 Test examples were executed inside Zeppelin 0.10.1 with Spark 3.4.0. Spark 3.3.1 deployed on cluster was used to check the issue on real data. > Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML > features caused by certain SQL functions > -- > > Key: SPARK-43514 > URL: https://issues.apache.org/jira/browse/SPARK-43514 > Project: Spark > Issue Type: Bug > Components: ML, SQL >Affects Versions: 3.3.2, 3.4.0 > Environment: Scala version: 2.12.17 > Test examples were executed inside Zeppelin 0.10.1 with Spark 3.4.0. > Spark 3.3.2 deployed on cluster was used to check the issue on real data. >Reporter: Svyatoslav Semenyuk >Priority: Major > Labels: ml, sql > > We designed a function that joins two DFs on common column with some > similarity. All next code will be on Scala 2.12. > I've added {{show}} calls for demonstration purposes. > {code:scala} > import org.apache.spark.ml.Pipeline > import org.apache.spark.ml.feature.{HashingTF, MinHashLSH, NGram, > RegexTokenizer, MinHashLSHModel} > import org.apache.spark.sql.{DataFrame, Column} > /** > * Joins two data frames on a string column using LSH algorithm > * for similarity computation. > * > * If input data frames have columns with identical names, > * the resulting dataframe will have columns from them both > * with prefixes `datasetA` and `datasetB` respectively. > * > * For example, if both dataframes have a column with name `myColumn`, > * then the result will have columns `datasetAMyColumn` and > `datasetBMyColumn`. 
> */ > def similarityJoin( > df: DataFrame, > anotherDf: DataFrame, > joinExpr: String, > threshold: Double = 0.8, > ): DataFrame = { > df.show(false) > anotherDf.show(false) > val pipeline = new Pipeline().setStages(Array( > new RegexTokenizer() > .setPattern("") > .setMinTokenLength(1) > .setInputCol(joinExpr) > .setOutputCol("tokens"), > new NGram().setN(3).setInputCol("tokens").setOutputCol("ngrams"), > new HashingTF().setInputCol("ngrams").setOutputCol("vectors"), > new MinHashLSH().setInputCol("vectors").setOutputCol("lsh"), > ) > ) > val model = pipeline.fit(df) > val storedHashed = model.transform(df) > val landedHashed = model.transform(anotherDf) > val commonColumns = df.columns.toSet & anotherDf.columns.toSet > /** > * Converts column name from a data frame to the column of resulting > dataset. > */ > def convertColumn(datasetName: String)(columnName: String): Column = { > val newName = > if (commonColumns.contains(columnName)) > s"$datasetName${columnName.capitalize}" > else columnName > col(s"$datasetName.$columnName") as newName > } > val columnsToSelect = df.columns.map(convertColumn("datasetA")) ++ > anotherDf.columns.map(convertColumn("datasetB")) > val result = model > .stages > .last > .asInstanceOf[MinHashLSHModel] > .approxSimilarityJoin(storedHashed, landedHashed, threshold, > "confidence") > .select(columnsToSelect.toSeq: _*) > result.show(false) > result > } > {code} > Now consider such simple example: > {code:scala} > val inputDF1 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df1" > val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2" > similarityJoin(inputDF1, inputDF2, "name", 0.6) > {code} > This example runs with no errors and outputs 3 empty DFs. 
Let's add > {{distinct}} method to one data frame: > {code:scala} > val inputDF1 = Seq("", null).toDF("name").distinct().filter(length($"name") > > 2) as "df1" > val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2" > similarityJoin(inputDF1, inputDF2, "name", 0.6) > {code} > This example outputs two empty DFs and then fails at {{result.show(false)}}. > Error: > {code:none} > org.apache.spark.SparkException: [FAILED_EXECUTE_UDF] Failed to execute user > defined function (LSHModel$$Lambda$3769/0x000101804840: > (struct,values:array>) => > array,values:array>>). > ... many elided > Caused by: java.lang.IllegalArgumentException: requirement failed: Must have > at least 1 non zero
[jira] [Resolved] (SPARK-43043) Improve the performance of MapOutputTracker.updateMapOutput
[ https://issues.apache.org/jira/browse/SPARK-43043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xingbo Jiang resolved SPARK-43043. -- Fix Version/s: 3.4.1 Resolution: Done > Improve the performance of MapOutputTracker.updateMapOutput > --- > > Key: SPARK-43043 > URL: https://issues.apache.org/jira/browse/SPARK-43043 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.2 >Reporter: Xingbo Jiang >Assignee: Xingbo Jiang >Priority: Major > Fix For: 3.4.1 > > > Inside of MapOutputTracker, there is a line of code which does a linear find > through a mapStatuses collection: > https://github.com/apache/spark/blob/cb48c0e48eeff2b7b51176d0241491300e5aad6f/core/src/main/scala/org/apache/spark/MapOutputTracker.scala#L167 > (plus a similar search a few lines down at > https://github.com/apache/spark/blob/cb48c0e48eeff2b7b51176d0241491300e5aad6f/core/src/main/scala/org/apache/spark/MapOutputTracker.scala#L174) > This scan is necessary because we only know the mapId of the updated status > and not its mapPartitionId. > We perform this scan once per migrated block, so if a large proportion of all > blocks in the map are migrated then we get O(n^2) total runtime across all of > the calls. > I think we might be able to fix this by extending ShuffleStatus to have an > OpenHashMap mapping from mapId to mapPartitionId. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
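The fix suggested in the ticket above can be sketched outside Spark. This is a hypothetical Python model (the real code is Scala and uses an OpenHashMap inside ShuffleStatus); it shows how an auxiliary mapId-to-mapPartitionId index turns each per-migrated-block lookup from an O(n) scan into an O(1) hash lookup.

```python
class ShuffleStatusSketch:
    """Hypothetical stand-in for ShuffleStatus; a 'status' is just a mapId here."""

    def __init__(self, map_ids):
        # map_statuses[i] holds the status for map partition i.
        self.map_statuses = list(map_ids)
        # One O(n) pass builds the index the ticket proposes,
        # analogous to an OpenHashMap[mapId, mapPartitionId].
        self.index = {map_id: i for i, map_id in enumerate(self.map_statuses)}

    def find_partition_scan(self, map_id):
        # Current behavior: linear scan, repeated once per migrated block,
        # hence O(n^2) when most blocks migrate.
        for i, candidate in enumerate(self.map_statuses):
            if candidate == map_id:
                return i
        return None

    def find_partition_indexed(self, map_id):
        # Proposed behavior: constant-time lookup in the auxiliary map.
        return self.index.get(map_id)

status = ShuffleStatusSketch([1000, 1001, 1002, 1003])
print(status.find_partition_scan(1002))     # -> 2
print(status.find_partition_indexed(1002))  # -> 2
```

One design point the sketch glosses over: in real code the index must be kept in sync when a map status is replaced or invalidated, so the maintenance cost of the extra map trades against the scan it eliminates.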
[jira] [Assigned] (SPARK-43359) DELETE from Hive table result in INTERNAL error
[ https://issues.apache.org/jira/browse/SPARK-43359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-43359: - Assignee: BingKun Pan > DELETE from Hive table result in INTERNAL error > --- > > Key: SPARK-43359 > URL: https://issues.apache.org/jira/browse/SPARK-43359 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Serge Rielau >Assignee: BingKun Pan >Priority: Minor > > spark-sql (default)> CREATE TABLE T1(c1 INT); > spark-sql (default)> DELETE FROM T1 WHERE c1 = 1; > [INTERNAL_ERROR] Unexpected table relation: HiveTableRelation > [`spark_catalog`.`default`.`t1`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#3], > Partition Cols: []] > org.apache.spark.SparkException: [INTERNAL_ERROR] Unexpected table relation: > HiveTableRelation [`spark_catalog`.`default`.`t1`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#3], > Partition Cols: []] > at > org.apache.spark.SparkException$.internalError(SparkException.scala:77) > at > org.apache.spark.SparkException$.internalError(SparkException.scala:81) > at > org.apache.spark.sql.execution.datasources.v2.DataSourceV2Strategy.apply(DataSourceV2Strategy.scala:310) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$1(QueryPlanner.scala:63) > at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93) > at > org.apache.spark.sql.execution.SparkStrategies.plan(SparkStrategies.scala:70) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$3(QueryPlanner.scala:78) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: 
issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43359) DELETE from Hive table result in INTERNAL error
[ https://issues.apache.org/jira/browse/SPARK-43359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-43359. --- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41172 [https://github.com/apache/spark/pull/41172] > DELETE from Hive table result in INTERNAL error > --- > > Key: SPARK-43359 > URL: https://issues.apache.org/jira/browse/SPARK-43359 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Serge Rielau >Assignee: BingKun Pan >Priority: Minor > Fix For: 3.5.0 > > > spark-sql (default)> CREATE TABLE T1(c1 INT); > spark-sql (default)> DELETE FROM T1 WHERE c1 = 1; > [INTERNAL_ERROR] Unexpected table relation: HiveTableRelation > [`spark_catalog`.`default`.`t1`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#3], > Partition Cols: []] > org.apache.spark.SparkException: [INTERNAL_ERROR] Unexpected table relation: > HiveTableRelation [`spark_catalog`.`default`.`t1`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#3], > Partition Cols: []] > at > org.apache.spark.SparkException$.internalError(SparkException.scala:77) > at > org.apache.spark.SparkException$.internalError(SparkException.scala:81) > at > org.apache.spark.sql.execution.datasources.v2.DataSourceV2Strategy.apply(DataSourceV2Strategy.scala:310) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$1(QueryPlanner.scala:63) > at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93) > at > org.apache.spark.sql.execution.SparkStrategies.plan(SparkStrategies.scala:70) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$3(QueryPlanner.scala:78) -- This message was sent by Atlassian Jira 
(v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43514) Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML features caused by certain SQL functions
[ https://issues.apache.org/jira/browse/SPARK-43514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17723216#comment-17723216 ] Svyatoslav Semenyuk commented on SPARK-43514: - We applied "current workaround" to application code and this does not solve the issue. > Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML > features caused by certain SQL functions > -- > > Key: SPARK-43514 > URL: https://issues.apache.org/jira/browse/SPARK-43514 > Project: Spark > Issue Type: Bug > Components: ML, SQL >Affects Versions: 3.3.2, 3.4.0 > Environment: Scala version: 2.12.17 > Test examples were executed inside Zeppelin 0.10.1 with Spark 3.4.0. > Spark 3.3.1 deployed on cluster was used to check the issue on real data. >Reporter: Svyatoslav Semenyuk >Priority: Major > Labels: ml, sql > > We designed a function that joins two DFs on common column with some > similarity. All next code will be on Scala 2.12. > I've added {{show}} calls for demonstration purposes. > {code:scala} > import org.apache.spark.ml.Pipeline > import org.apache.spark.ml.feature.{HashingTF, MinHashLSH, NGram, > RegexTokenizer, MinHashLSHModel} > import org.apache.spark.sql.{DataFrame, Column} > /** > * Joins two data frames on a string column using LSH algorithm > * for similarity computation. > * > * If input data frames have columns with identical names, > * the resulting dataframe will have columns from them both > * with prefixes `datasetA` and `datasetB` respectively. > * > * For example, if both dataframes have a column with name `myColumn`, > * then the result will have columns `datasetAMyColumn` and > `datasetBMyColumn`. 
> */ > def similarityJoin( > df: DataFrame, > anotherDf: DataFrame, > joinExpr: String, > threshold: Double = 0.8, > ): DataFrame = { > df.show(false) > anotherDf.show(false) > val pipeline = new Pipeline().setStages(Array( > new RegexTokenizer() > .setPattern("") > .setMinTokenLength(1) > .setInputCol(joinExpr) > .setOutputCol("tokens"), > new NGram().setN(3).setInputCol("tokens").setOutputCol("ngrams"), > new HashingTF().setInputCol("ngrams").setOutputCol("vectors"), > new MinHashLSH().setInputCol("vectors").setOutputCol("lsh"), > ) > ) > val model = pipeline.fit(df) > val storedHashed = model.transform(df) > val landedHashed = model.transform(anotherDf) > val commonColumns = df.columns.toSet & anotherDf.columns.toSet > /** > * Converts column name from a data frame to the column of resulting > dataset. > */ > def convertColumn(datasetName: String)(columnName: String): Column = { > val newName = > if (commonColumns.contains(columnName)) > s"$datasetName${columnName.capitalize}" > else columnName > col(s"$datasetName.$columnName") as newName > } > val columnsToSelect = df.columns.map(convertColumn("datasetA")) ++ > anotherDf.columns.map(convertColumn("datasetB")) > val result = model > .stages > .last > .asInstanceOf[MinHashLSHModel] > .approxSimilarityJoin(storedHashed, landedHashed, threshold, > "confidence") > .select(columnsToSelect.toSeq: _*) > result.show(false) > result > } > {code} > Now consider such simple example: > {code:scala} > val inputDF1 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df1" > val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2" > similarityJoin(inputDF1, inputDF2, "name", 0.6) > {code} > This example runs with no errors and outputs 3 empty DFs. 
Let's add > {{distinct}} method to one data frame: > {code:scala} > val inputDF1 = Seq("", null).toDF("name").distinct().filter(length($"name") > > 2) as "df1" > val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2" > similarityJoin(inputDF1, inputDF2, "name", 0.6) > {code} > This example outputs two empty DFs and then fails at {{result.show(false)}}. > Error: > {code:none} > org.apache.spark.SparkException: [FAILED_EXECUTE_UDF] Failed to execute user > defined function (LSHModel$$Lambda$3769/0x000101804840: > (struct,values:array>) => > array,values:array>>). > ... many elided > Caused by: java.lang.IllegalArgumentException: requirement failed: Must have > at least 1 non zero entry. > at scala.Predef$.require(Predef.scala:281) > at > org.apache.spark.ml.feature.MinHashLSHModel.hashFunction(MinHashLSH.scala:61) > at org.apache.spark.ml.feature.LSHModel.$anonfun$transform$1(LSH.scala:99) > ... many
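The stack trace above points at a precondition inside MinHashLSHModel.hashFunction that rejects vectors with no non-zero entries: an empty string tokenizes to no n-grams, so HashingTF produces an all-zero vector, which the hash function refuses. The following is a minimal self-contained sketch of that precondition only; SparseVec and minHashSketch are hypothetical names for illustration, not the Spark API:

```scala
// Hypothetical sparse vector: `indices` holds the positions of non-zero entries.
final case class SparseVec(size: Int, indices: Array[Int])

// Sketch of the precondition that fails in MinHashLSHModel.hashFunction.
// A real MinHash implementation would apply a family of (a*i + b) % prime
// hash functions here; this stub only illustrates the zero-vector rejection.
def minHashSketch(v: SparseVec): Array[Long] = {
  require(v.indices.nonEmpty, "Must have at least 1 non zero entry.")
  v.indices.map(i => (2L * i + 1) % 2038074743L)
}
```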
[jira] [Updated] (SPARK-43514) Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML features caused by certain SQL functions
[ https://issues.apache.org/jira/browse/SPARK-43514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Svyatoslav Semenyuk updated SPARK-43514: Affects Version/s: 3.3.2 (was: 3.3.1) > Unexpected NullPointerException or IllegalArgumentException inside UDFs of ML > features caused by certain SQL functions > -- > > Key: SPARK-43514 > URL: https://issues.apache.org/jira/browse/SPARK-43514 > Project: Spark > Issue Type: Bug > Components: ML, SQL >Affects Versions: 3.3.2, 3.4.0 > Environment: Scala version: 2.12.17 > Test examples were executed inside Zeppelin 0.10.1 with Spark 3.4.0. > Spark 3.3.1 deployed on cluster was used to check the issue on real data. >Reporter: Svyatoslav Semenyuk >Priority: Major > Labels: ml, sql > > We designed a function that joins two DFs on common column with some > similarity. All next code will be on Scala 2.12. > I've added {{show}} calls for demonstration purposes. > {code:scala} > import org.apache.spark.ml.Pipeline > import org.apache.spark.ml.feature.{HashingTF, MinHashLSH, NGram, > RegexTokenizer, MinHashLSHModel} > import org.apache.spark.sql.{DataFrame, Column} > /** > * Joins two data frames on a string column using LSH algorithm > * for similarity computation. > * > * If input data frames have columns with identical names, > * the resulting dataframe will have columns from them both > * with prefixes `datasetA` and `datasetB` respectively. > * > * For example, if both dataframes have a column with name `myColumn`, > * then the result will have columns `datasetAMyColumn` and > `datasetBMyColumn`. 
> */ > def similarityJoin( > df: DataFrame, > anotherDf: DataFrame, > joinExpr: String, > threshold: Double = 0.8, > ): DataFrame = { > df.show(false) > anotherDf.show(false) > val pipeline = new Pipeline().setStages(Array( > new RegexTokenizer() > .setPattern("") > .setMinTokenLength(1) > .setInputCol(joinExpr) > .setOutputCol("tokens"), > new NGram().setN(3).setInputCol("tokens").setOutputCol("ngrams"), > new HashingTF().setInputCol("ngrams").setOutputCol("vectors"), > new MinHashLSH().setInputCol("vectors").setOutputCol("lsh"), > ) > ) > val model = pipeline.fit(df) > val storedHashed = model.transform(df) > val landedHashed = model.transform(anotherDf) > val commonColumns = df.columns.toSet & anotherDf.columns.toSet > /** > * Converts column name from a data frame to the column of resulting > dataset. > */ > def convertColumn(datasetName: String)(columnName: String): Column = { > val newName = > if (commonColumns.contains(columnName)) > s"$datasetName${columnName.capitalize}" > else columnName > col(s"$datasetName.$columnName") as newName > } > val columnsToSelect = df.columns.map(convertColumn("datasetA")) ++ > anotherDf.columns.map(convertColumn("datasetB")) > val result = model > .stages > .last > .asInstanceOf[MinHashLSHModel] > .approxSimilarityJoin(storedHashed, landedHashed, threshold, > "confidence") > .select(columnsToSelect.toSeq: _*) > result.show(false) > result > } > {code} > Now consider such simple example: > {code:scala} > val inputDF1 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df1" > val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2" > similarityJoin(inputDF1, inputDF2, "name", 0.6) > {code} > This example runs with no errors and outputs 3 empty DFs. 
Let's add > {{distinct}} method to one data frame: > {code:scala} > val inputDF1 = Seq("", null).toDF("name").distinct().filter(length($"name") > > 2) as "df1" > val inputDF2 = Seq("", null).toDF("name").filter(length($"name") > 2) as "df2" > similarityJoin(inputDF1, inputDF2, "name", 0.6) > {code} > This example outputs two empty DFs and then fails at {{result.show(false)}}. > Error: > {code:none} > org.apache.spark.SparkException: [FAILED_EXECUTE_UDF] Failed to execute user > defined function (LSHModel$$Lambda$3769/0x000101804840: > (struct,values:array>) => > array,values:array>>). > ... many elided > Caused by: java.lang.IllegalArgumentException: requirement failed: Must have > at least 1 non zero entry. > at scala.Predef$.require(Predef.scala:281) > at > org.apache.spark.ml.feature.MinHashLSHModel.hashFunction(MinHashLSH.scala:61) > at org.apache.spark.ml.feature.LSHModel.$anonfun$transform$1(LSH.scala:99) > ... many more > {code} > > Now let's take a look on the example
[jira] [Resolved] (SPARK-43520) Upgrade mysql-connector-java from 8.0.32 to 8.0.33
[ https://issues.apache.org/jira/browse/SPARK-43520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-43520. --- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41182 [https://github.com/apache/spark/pull/41182] > Upgrade mysql-connector-java from 8.0.32 to 8.0.33 > -- > > Key: SPARK-43520 > URL: https://issues.apache.org/jira/browse/SPARK-43520 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43520) Upgrade mysql-connector-java from 8.0.32 to 8.0.33
[ https://issues.apache.org/jira/browse/SPARK-43520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-43520: - Assignee: BingKun Pan > Upgrade mysql-connector-java from 8.0.32 to 8.0.33 > -- > > Key: SPARK-43520 > URL: https://issues.apache.org/jira/browse/SPARK-43520 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38469) Use error classes in org.apache.spark.network
[ https://issues.apache.org/jira/browse/SPARK-38469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-38469: Assignee: Bo Zhang > Use error classes in org.apache.spark.network > - > > Key: SPARK-38469 > URL: https://issues.apache.org/jira/browse/SPARK-38469 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Bo Zhang >Assignee: Bo Zhang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38469) Use error classes in org.apache.spark.network
[ https://issues.apache.org/jira/browse/SPARK-38469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-38469. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41140 [https://github.com/apache/spark/pull/41140] > Use error classes in org.apache.spark.network > - > > Key: SPARK-38469 > URL: https://issues.apache.org/jira/browse/SPARK-38469 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Bo Zhang >Assignee: Bo Zhang >Priority: Major > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43512) Update stateStoreOperationsBenchmark to allow rocksdb jni upgrade
[ https://issues.apache.org/jira/browse/SPARK-43512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-43512. --- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41175 [https://github.com/apache/spark/pull/41175] > Update stateStoreOperationsBenchmark to allow rocksdb jni upgrade > - > > Key: SPARK-43512 > URL: https://issues.apache.org/jira/browse/SPARK-43512 > Project: Spark > Issue Type: Test > Components: Structured Streaming >Affects Versions: 3.5.0 >Reporter: Anish Shrigondekar >Assignee: Anish Shrigondekar >Priority: Major > Fix For: 3.5.0 > > > Update stateStoreOperationsBenchmark to allow rocksdb jni upgrade -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43512) Update stateStoreOperationsBenchmark to allow rocksdb jni upgrade
[ https://issues.apache.org/jira/browse/SPARK-43512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-43512: - Assignee: Anish Shrigondekar > Update stateStoreOperationsBenchmark to allow rocksdb jni upgrade > - > > Key: SPARK-43512 > URL: https://issues.apache.org/jira/browse/SPARK-43512 > Project: Spark > Issue Type: Test > Components: Structured Streaming >Affects Versions: 3.5.0 >Reporter: Anish Shrigondekar >Assignee: Anish Shrigondekar >Priority: Major > > Update stateStoreOperationsBenchmark to allow rocksdb jni upgrade -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43512) Update stateStoreOperationsBenchmark to allow rocksdb jni upgrade
[ https://issues.apache.org/jira/browse/SPARK-43512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-43512: -- Issue Type: Test (was: Task) > Update stateStoreOperationsBenchmark to allow rocksdb jni upgrade > - > > Key: SPARK-43512 > URL: https://issues.apache.org/jira/browse/SPARK-43512 > Project: Spark > Issue Type: Test > Components: Structured Streaming >Affects Versions: 3.5.0 >Reporter: Anish Shrigondekar >Priority: Major > > Update stateStoreOperationsBenchmark to allow rocksdb jni upgrade -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43512) Update stateStoreOperationsBenchmark to allow rocksdb jni upgrade
[ https://issues.apache.org/jira/browse/SPARK-43512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-43512: -- Affects Version/s: 3.5.0 (was: 3.4.0) > Update stateStoreOperationsBenchmark to allow rocksdb jni upgrade > - > > Key: SPARK-43512 > URL: https://issues.apache.org/jira/browse/SPARK-43512 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 3.5.0 >Reporter: Anish Shrigondekar >Priority: Major > > Update stateStoreOperationsBenchmark to allow rocksdb jni upgrade -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43522) Creating struct column occurs error 'org.apache.spark.sql.AnalysisException [DATATYPE_MISMATCH.CREATE_NAMED_STRUCT_WITHOUT_FOLDABLE_STRING]'
[ https://issues.apache.org/jira/browse/SPARK-43522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17723158#comment-17723158 ] Jia Fan commented on SPARK-43522: - https://github.com/apache/spark/pull/41187 > Creating struct column occurs error 'org.apache.spark.sql.AnalysisException > [DATATYPE_MISMATCH.CREATE_NAMED_STRUCT_WITHOUT_FOLDABLE_STRING]' > - > > Key: SPARK-43522 > URL: https://issues.apache.org/jira/browse/SPARK-43522 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Heedo Lee >Priority: Minor > > When creating a struct column in Dataframe, the code that ran without > problems in version 3.3.1 does not work in version 3.4.0. > > Example > {code:java} > val testDF = Seq("a=b,c=d,d=f").toDF.withColumn("key_value", split('value, > ",")).withColumn("map_entry", transform(col("key_value"), x => > struct(split(x, "=").getItem(0), split(x, "=").getItem(1) ) )){code} > > In 3.3.1 > > {code:java} > > testDF.show() > +---+---++ > | value| key_value| map_entry| > +---+---++ > |a=b,c=d,d=f|[a=b, c=d, d=f]|[{a, b}, {c, d}, ...| > +---+---++ > > testDF.printSchema() > root > |-- value: string (nullable = true) > |-- key_value: array (nullable = true) > | |-- element: string (containsNull = false) > |-- map_entry: array (nullable = true) > | |-- element: struct (containsNull = false) > | | |-- col1: string (nullable = true) > | | |-- col2: string (nullable = true) > {code} > > > In 3.4.0 > > {code:java} > org.apache.spark.sql.AnalysisException: > [DATATYPE_MISMATCH.CREATE_NAMED_STRUCT_WITHOUT_FOLDABLE_STRING] Cannot > resolve "struct(split(namedlambdavariable(), =, -1)[0], > split(namedlambdavariable(), =, -1)[1])" due to data type mismatch: Only > foldable `STRING` expressions are allowed to appear at odd position, but they > are ["0", "1"].; > 'Project [value#41, key_value#45, transform(key_value#45, > lambdafunction(struct(0, split(lambda x_3#49, =, -1)[0], 1, split(lambda > x_3#49, =, -1)[1]), lambda 
x_3#49, false)) AS map_entry#48] > +- Project [value#41, split(value#41, ,, -1) AS key_value#45] > +- LocalRelation [value#41] at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.dataTypeMismatch(package.scala:73) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5(CheckAnalysis.scala:269) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5$adapted(CheckAnalysis.scala:256) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:295) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:294) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:294) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at scala.collection.Iterator.foreach$(Iterator.scala:943) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:294) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:294) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:294) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at scala.collection.Iterator.foreach$(Iterator.scala:943) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > > > {code} > > However, if you do an alias to struct elements, you can get the same result > as the previous version. 
> > {code:java} > val testDF = Seq("a=b,c=d,d=f").toDF.withColumn("key_value", split('value, > ",")).withColumn("map_entry", transform(col("key_value"), x => > struct(split(x, "=").getItem(0).as("col1") , split(x, > "=").getItem(1).as("col2") ) )){code} > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
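The error message reflects an analyzer check on named_struct: the field names (odd positions, counting arguments from 1) must be foldable STRING literals, and in 3.4.0 the names auto-derived inside the lambda are not foldable, while an explicit `.as("col1")` supplies a literal name. A toy sketch of the shape of that check; Expr, StrLiteral, and NonFoldable are hypothetical types for illustration, not Catalyst classes:

```scala
// Hypothetical mini expression tree, just enough to model "foldable string".
sealed trait Expr
final case class StrLiteral(value: String) extends Expr   // foldable name
final case class NonFoldable(desc: String) extends Expr   // e.g. split(x, "=")[0]

// named_struct(name1, value1, name2, value2, ...): every name slot
// (0-based even index = 1-based odd position) must be a string literal.
def checkNamedStruct(args: Seq[Expr]): Either[String, Unit] = {
  val badPositions = args.zipWithIndex.collect {
    case (e, i) if i % 2 == 0 && !e.isInstanceOf[StrLiteral] => i + 1
  }
  if (badPositions.isEmpty) Right(())
  else Left(s"Only foldable STRING expressions are allowed to appear at odd position, but they are [${badPositions.mkString(", ")}].")
}
```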
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Attachment: (was: image-2023-05-16-21-23-33-611.png) > when shuffle hash join is enabled, q95 performance deteriorates > --- > > Key: SPARK-43526 > URL: https://issues.apache.org/jira/browse/SPARK-43526 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: caican >Priority: Major > Attachments: image-2023-05-16-21-21-35-493.png, > image-2023-05-16-21-22-16-170.png, image-2023-05-16-21-23-35-237.png, > image-2023-05-16-21-24-09-182.png, image-2023-05-16-21-28-11-514.png, > image-2023-05-16-21-28-44-163.png > > > Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when > shuffle hash join is enabled and the performance is better when sortMergeJoin > is used. > > Performance difference: from 3.9min(sortMergeJoin) to > 8.1min(shuffledHashJoin) > > 1. enable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-28-44-163.png|width=935,height=64! > !image-2023-05-16-21-21-35-493.png|width=924,height=502! > 2. disable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-28-11-514.png|width=922,height=67! > !image-2023-05-16-21-22-16-170.png|width=934,height=477! > > And when shuffledHashJoin is enabled, gc is very serious, > !image-2023-05-16-21-23-35-237.png|width=929,height=570! > > but sortMergeJoin executes without this problem. > !image-2023-05-16-21-24-09-182.png|width=931,height=573! > > Any suggestions on how to solve it?Thanks! > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Attachment: (was: image-2023-05-16-21-22-44-532.png) > when shuffle hash join is enabled, q95 performance deteriorates > --- > > Key: SPARK-43526 > URL: https://issues.apache.org/jira/browse/SPARK-43526 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: caican >Priority: Major > Attachments: image-2023-05-16-21-21-35-493.png, > image-2023-05-16-21-22-16-170.png, image-2023-05-16-21-23-35-237.png, > image-2023-05-16-21-24-09-182.png, image-2023-05-16-21-28-11-514.png, > image-2023-05-16-21-28-44-163.png > > > Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when > shuffle hash join is enabled and the performance is better when sortMergeJoin > is used. > > Performance difference: from 3.9min(sortMergeJoin) to > 8.1min(shuffledHashJoin) > > 1. enable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-28-44-163.png|width=935,height=64! > !image-2023-05-16-21-21-35-493.png|width=924,height=502! > 2. disable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-28-11-514.png|width=922,height=67! > !image-2023-05-16-21-22-16-170.png|width=934,height=477! > > And when shuffledHashJoin is enabled, gc is very serious, > !image-2023-05-16-21-23-35-237.png|width=929,height=570! > > but sortMergeJoin executes without this problem. > !image-2023-05-16-21-24-09-182.png|width=931,height=573! > > Any suggestions on how to solve it?Thanks! > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Attachment: (was: image-2023-05-16-21-20-18-727.png) > when shuffle hash join is enabled, q95 performance deteriorates > --- > > Key: SPARK-43526 > URL: https://issues.apache.org/jira/browse/SPARK-43526 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: caican >Priority: Major > Attachments: image-2023-05-16-21-21-35-493.png, > image-2023-05-16-21-22-16-170.png, image-2023-05-16-21-22-44-532.png, > image-2023-05-16-21-23-33-611.png, image-2023-05-16-21-23-35-237.png, > image-2023-05-16-21-24-09-182.png, image-2023-05-16-21-28-11-514.png, > image-2023-05-16-21-28-44-163.png > > > Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when > shuffle hash join is enabled and the performance is better when sortMergeJoin > is used. > > Performance difference: from 3.9min(sortMergeJoin) to > 8.1min(shuffledHashJoin) > > 1. enable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-28-44-163.png|width=935,height=64! > !image-2023-05-16-21-21-35-493.png|width=924,height=502! > 2. disable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-28-11-514.png|width=922,height=67! > !image-2023-05-16-21-22-16-170.png|width=934,height=477! > > And when shuffledHashJoin is enabled, gc is very serious, > !image-2023-05-16-21-23-35-237.png|width=929,height=570! > > but sortMergeJoin executes without this problem. > !image-2023-05-16-21-24-09-182.png|width=931,height=573! > > Any suggestions on how to solve it?Thanks! > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Attachment: (was: application_1684208757063_0028_90.html) > when shuffle hash join is enabled, q95 performance deteriorates > --- > > Key: SPARK-43526 > URL: https://issues.apache.org/jira/browse/SPARK-43526 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: caican >Priority: Major > Attachments: image-2023-05-16-21-21-35-493.png, > image-2023-05-16-21-22-16-170.png, image-2023-05-16-21-22-44-532.png, > image-2023-05-16-21-23-33-611.png, image-2023-05-16-21-23-35-237.png, > image-2023-05-16-21-24-09-182.png, image-2023-05-16-21-28-11-514.png, > image-2023-05-16-21-28-44-163.png > > > Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when > shuffle hash join is enabled and the performance is better when sortMergeJoin > is used. > > Performance difference: from 3.9min(sortMergeJoin) to > 8.1min(shuffledHashJoin) > > 1. enable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-28-44-163.png|width=935,height=64! > !image-2023-05-16-21-21-35-493.png|width=924,height=502! > 2. disable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-28-11-514.png|width=922,height=67! > !image-2023-05-16-21-22-16-170.png|width=934,height=477! > > And when shuffledHashJoin is enabled, gc is very serious, > !image-2023-05-16-21-23-35-237.png|width=929,height=570! > > but sortMergeJoin executes without this problem. > !image-2023-05-16-21-24-09-182.png|width=931,height=573! > > Any suggestions on how to solve it?Thanks! > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Attachment: application_1684208757063_0028_90.html > when shuffle hash join is enabled, q95 performance deteriorates > --- > > Key: SPARK-43526 > URL: https://issues.apache.org/jira/browse/SPARK-43526 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: caican >Priority: Major > Attachments: application_1684208757063_0028_90.html, > image-2023-05-16-21-20-18-727.png, image-2023-05-16-21-21-35-493.png, > image-2023-05-16-21-22-16-170.png, image-2023-05-16-21-22-44-532.png, > image-2023-05-16-21-23-33-611.png, image-2023-05-16-21-23-35-237.png, > image-2023-05-16-21-24-09-182.png, image-2023-05-16-21-28-11-514.png, > image-2023-05-16-21-28-44-163.png > > > Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when > shuffle hash join is enabled and the performance is better when sortMergeJoin > is used. > > Performance difference: from 3.9min(sortMergeJoin) to > 8.1min(shuffledHashJoin) > > 1. enable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-28-44-163.png|width=935,height=64! > !image-2023-05-16-21-21-35-493.png|width=924,height=502! > 2. disable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-28-11-514.png|width=922,height=67! > !image-2023-05-16-21-22-16-170.png|width=934,height=477! > > And when shuffledHashJoin is enabled, gc is very serious, > !image-2023-05-16-21-23-35-237.png|width=929,height=570! > > but sortMergeJoin executes without this problem. > !image-2023-05-16-21-24-09-182.png|width=931,height=573! > > Any suggestions on how to solve it?Thanks! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Description: Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when shuffle hash join is enabled and the performance is better when sortMergeJoin is used. Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) 1. enable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-28-44-163.png|width=935,height=64! !image-2023-05-16-21-21-35-493.png|width=924,height=502! 2. disable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-28-11-514.png|width=922,height=67! !image-2023-05-16-21-22-16-170.png|width=934,height=477! And when shuffledHashJoin is enabled, gc is very serious, !image-2023-05-16-21-23-35-237.png|width=929,height=570! but sortMergeJoin executes without this problem. !image-2023-05-16-21-24-09-182.png|width=931,height=573! Any suggestions on how to solve it?Thanks! was: Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when shuffle hash join is enabled and the performance is better when sortMergeJoin is used. Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) 1. enable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-28-44-163.png|width=935,height=64! !image-2023-05-16-21-21-35-493.png|width=924,height=502! 2. disable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-28-11-514.png|width=922,height=67! !image-2023-05-16-21-22-16-170.png|width=934,height=477! And when shuffledHashJoin is enabled, gc is very serious, !image-2023-05-16-21-23-35-237.png|width=929,height=570! but sortMergeJoin executes without this problem. !image-2023-05-16-21-24-09-182.png|width=931,height=573! Any suggestions on how to solve it?Thanks! 
> when shuffle hash join is enabled, q95 performance deteriorates > --- > > Key: SPARK-43526 > URL: https://issues.apache.org/jira/browse/SPARK-43526 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: caican >Priority: Major > Attachments: application_1684208757063_0028_90.html, > image-2023-05-16-21-20-18-727.png, image-2023-05-16-21-21-35-493.png, > image-2023-05-16-21-22-16-170.png, image-2023-05-16-21-22-44-532.png, > image-2023-05-16-21-23-33-611.png, image-2023-05-16-21-23-35-237.png, > image-2023-05-16-21-24-09-182.png, image-2023-05-16-21-28-11-514.png, > image-2023-05-16-21-28-44-163.png > > > Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when > shuffle hash join is enabled and the performance is better when sortMergeJoin > is used. > > Performance difference: from 3.9min(sortMergeJoin) to > 8.1min(shuffledHashJoin) > > 1. enable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-28-44-163.png|width=935,height=64! > !image-2023-05-16-21-21-35-493.png|width=924,height=502! > 2. disable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-28-11-514.png|width=922,height=67! > !image-2023-05-16-21-22-16-170.png|width=934,height=477! > > And when shuffledHashJoin is enabled, gc is very serious, > !image-2023-05-16-21-23-35-237.png|width=929,height=570! > > but sortMergeJoin executes without this problem. > !image-2023-05-16-21-24-09-182.png|width=931,height=573! > > Any suggestions on how to solve it?Thanks! > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Description: Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when shuffle hash join is enabled and the performance is better when sortMergeJoin is used. Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) 1. enable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-28-44-163.png|width=935,height=64! !image-2023-05-16-21-21-35-493.png|width=924,height=502! 2. disable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-28-11-514.png|width=922,height=67! !image-2023-05-16-21-22-16-170.png|width=934,height=477! And when shuffledHashJoin is enabled, gc is very serious, !image-2023-05-16-21-23-35-237.png|width=929,height=570! but sortMergeJoin executes without this problem. !image-2023-05-16-21-24-09-182.png|width=931,height=573! Any suggestions on how to solve it?Thanks! was: Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when shuffle hash join is enabled and the performance is better when sortMergeJoin is used. Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) 1. enable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-20-18-727.png|width=990,height=68! !image-2023-05-16-21-21-35-493.png|width=924,height=502! 2. disable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-22-44-532.png|width=1114,height=73! !image-2023-05-16-21-22-16-170.png|width=934,height=477! And when shuffledHashJoin is enabled, gc is very serious, !image-2023-05-16-21-23-35-237.png|width=929,height=570! but sortMergeJoin executes without this problem. !image-2023-05-16-21-24-09-182.png|width=931,height=573! Any suggestions on how to solve it?Thanks! 
> when shuffle hash join is enabled, q95 performance deteriorates > --- > > Key: SPARK-43526 > URL: https://issues.apache.org/jira/browse/SPARK-43526 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: caican >Priority: Major > Attachments: image-2023-05-16-21-20-18-727.png, > image-2023-05-16-21-21-35-493.png, image-2023-05-16-21-22-16-170.png, > image-2023-05-16-21-22-44-532.png, image-2023-05-16-21-23-33-611.png, > image-2023-05-16-21-23-35-237.png, image-2023-05-16-21-24-09-182.png, > image-2023-05-16-21-28-11-514.png, image-2023-05-16-21-28-44-163.png > > > Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when > shuffle hash join is enabled and the performance is better when sortMergeJoin > is used. > > Performance difference: from 3.9min(sortMergeJoin) to > 8.1min(shuffledHashJoin) > > 1. enable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-28-44-163.png|width=935,height=64! > !image-2023-05-16-21-21-35-493.png|width=924,height=502! > 2. disable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-28-11-514.png|width=922,height=67! > !image-2023-05-16-21-22-16-170.png|width=934,height=477! > > And when shuffledHashJoin is enabled, gc is very serious, > !image-2023-05-16-21-23-35-237.png|width=929,height=570! > > but sortMergeJoin executes without this problem. > !image-2023-05-16-21-24-09-182.png|width=931,height=573! > > Any suggestions on how to solve it?Thanks! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43527) Fix catalog.listCatalogs in PySpark
[ https://issues.apache.org/jira/browse/SPARK-43527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng updated SPARK-43527: -- Summary: Fix catalog.listCatalogs in PySpark (was: Fix catalog.listCatalogs) > Fix catalog.listCatalogs in PySpark > --- > > Key: SPARK-43527 > URL: https://issues.apache.org/jira/browse/SPARK-43527 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.4.0, 3.4.1, 3.5.0 >Reporter: Ruifeng Zheng >Priority: Critical > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Attachment: image-2023-05-16-21-28-11-514.png > when shuffle hash join is enabled, q95 performance deteriorates > --- > > Key: SPARK-43526 > URL: https://issues.apache.org/jira/browse/SPARK-43526 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: caican >Priority: Major > Attachments: image-2023-05-16-21-20-18-727.png, > image-2023-05-16-21-21-35-493.png, image-2023-05-16-21-22-16-170.png, > image-2023-05-16-21-22-44-532.png, image-2023-05-16-21-23-33-611.png, > image-2023-05-16-21-23-35-237.png, image-2023-05-16-21-24-09-182.png, > image-2023-05-16-21-28-11-514.png, image-2023-05-16-21-28-44-163.png > > > Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when > shuffle hash join is enabled and the performance is better when sortMergeJoin > is used. > > Performance difference: from 3.9min(sortMergeJoin) to > 8.1min(shuffledHashJoin) > > 1. enable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-20-18-727.png|width=990,height=68! > !image-2023-05-16-21-21-35-493.png|width=924,height=502! > 2. disable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-22-44-532.png|width=1114,height=73! > !image-2023-05-16-21-22-16-170.png|width=934,height=477! > > And when shuffledHashJoin is enabled, gc is very serious, > !image-2023-05-16-21-23-35-237.png|width=929,height=570! > > but sortMergeJoin executes without this problem. > !image-2023-05-16-21-24-09-182.png|width=931,height=573! > > Any suggestions on how to solve it?Thanks! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Attachment: image-2023-05-16-21-28-44-163.png > when shuffle hash join is enabled, q95 performance deteriorates > --- > > Key: SPARK-43526 > URL: https://issues.apache.org/jira/browse/SPARK-43526 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: caican >Priority: Major > Attachments: image-2023-05-16-21-20-18-727.png, > image-2023-05-16-21-21-35-493.png, image-2023-05-16-21-22-16-170.png, > image-2023-05-16-21-22-44-532.png, image-2023-05-16-21-23-33-611.png, > image-2023-05-16-21-23-35-237.png, image-2023-05-16-21-24-09-182.png, > image-2023-05-16-21-28-11-514.png, image-2023-05-16-21-28-44-163.png > > > Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when > shuffle hash join is enabled and the performance is better when sortMergeJoin > is used. > > Performance difference: from 3.9min(sortMergeJoin) to > 8.1min(shuffledHashJoin) > > 1. enable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-20-18-727.png|width=990,height=68! > !image-2023-05-16-21-21-35-493.png|width=924,height=502! > 2. disable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-22-44-532.png|width=1114,height=73! > !image-2023-05-16-21-22-16-170.png|width=934,height=477! > > And when shuffledHashJoin is enabled, gc is very serious, > !image-2023-05-16-21-23-35-237.png|width=929,height=570! > > but sortMergeJoin executes without this problem. > !image-2023-05-16-21-24-09-182.png|width=931,height=573! > > Any suggestions on how to solve it?Thanks! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Description: Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when shuffle hash join is enabled and the performance is better when sortMergeJoin is used. Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) 1. enable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-20-18-727.png|width=990,height=68! !image-2023-05-16-21-21-35-493.png|width=924,height=502! 2. disable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-22-44-532.png|width=1114,height=73! !image-2023-05-16-21-22-16-170.png|width=934,height=477! And when shuffledHashJoin is enabled, gc is very serious, !image-2023-05-16-21-23-35-237.png|width=929,height=570! but sortMergeJoin executes without this problem. !image-2023-05-16-21-24-09-182.png|width=931,height=573! Any suggestions on how to solve it?Thanks! was: Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when shuffle hash join is enabled and the performance is better when sortMergeJoin is used. Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) 1. enable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-20-18-727.png|width=990,height=68! !image-2023-05-16-21-21-35-493.png|width=924,height=502! 2. disable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-22-44-532.png|width=1114,height=73! !image-2023-05-16-21-22-16-170.png|width=934,height=477! And when shuffledHashJoin is enabled, gc is very serious, !image-2023-05-16-21-23-35-237.png|width=929,height=570! but sortMergeJoin executes without this problem. !image-2023-05-16-21-24-09-182.png|width=931,height=573! Any suggestions on how to solve it?Thanks! 
> when shuffle hash join is enabled, q95 performance deteriorates > --- > > Key: SPARK-43526 > URL: https://issues.apache.org/jira/browse/SPARK-43526 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: caican >Priority: Major > Attachments: image-2023-05-16-21-20-18-727.png, > image-2023-05-16-21-21-35-493.png, image-2023-05-16-21-22-16-170.png, > image-2023-05-16-21-22-44-532.png, image-2023-05-16-21-23-33-611.png, > image-2023-05-16-21-23-35-237.png, image-2023-05-16-21-24-09-182.png > > > Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when > shuffle hash join is enabled and the performance is better when sortMergeJoin > is used. > > Performance difference: from 3.9min(sortMergeJoin) to > 8.1min(shuffledHashJoin) > > 1. enable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-20-18-727.png|width=990,height=68! > !image-2023-05-16-21-21-35-493.png|width=924,height=502! > 2. disable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-22-44-532.png|width=1114,height=73! > !image-2023-05-16-21-22-16-170.png|width=934,height=477! > > And when shuffledHashJoin is enabled, gc is very serious, > !image-2023-05-16-21-23-35-237.png|width=929,height=570! > > but sortMergeJoin executes without this problem. > !image-2023-05-16-21-24-09-182.png|width=931,height=573! > > Any suggestions on how to solve it?Thanks! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43527) Fix catalog.listCatalogs
Ruifeng Zheng created SPARK-43527: - Summary: Fix catalog.listCatalogs Key: SPARK-43527 URL: https://issues.apache.org/jira/browse/SPARK-43527 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.4.0, 3.4.1, 3.5.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Description: Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when shuffle hash join is enabled and the performance is better when sortMergeJoin is used. Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) 1. enable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-20-18-727.png|width=990,height=68! !image-2023-05-16-21-21-35-493.png|width=924,height=502! 2. disable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-22-44-532.png|width=1114,height=73! !image-2023-05-16-21-22-16-170.png|width=934,height=477! And when shuffledHashJoin is enabled, gc is very serious, !image-2023-05-16-21-23-35-237.png|width=929,height=570! but sortMergeJoin executes without this problem. !image-2023-05-16-21-24-09-182.png|width=931,height=573! Any suggestions on how to solve it?Thanks! was: Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when shuffle hash join is enabled and the performance is better when sortMergeJoin is used. Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) 1. enable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-20-18-727.png|width=990,height=68! !image-2023-05-16-21-21-35-493.png|width=924,height=502! 2. disable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-22-44-532.png|width=1190,height=78! !image-2023-05-16-21-22-16-170.png|width=934,height=477! And when shuffledHashJoin is enabled, gc is very serious, !image-2023-05-16-21-23-35-237.png|width=929,height=570! but sortMergeJoin executes without this problem. !image-2023-05-16-21-24-09-182.png|width=931,height=573! Any suggestions on how to solve it?Thanks! 
> when shuffle hash join is enabled, q95 performance deteriorates > --- > > Key: SPARK-43526 > URL: https://issues.apache.org/jira/browse/SPARK-43526 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: caican >Priority: Major > Attachments: image-2023-05-16-21-20-18-727.png, > image-2023-05-16-21-21-35-493.png, image-2023-05-16-21-22-16-170.png, > image-2023-05-16-21-22-44-532.png, image-2023-05-16-21-23-33-611.png, > image-2023-05-16-21-23-35-237.png, image-2023-05-16-21-24-09-182.png > > > Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when > shuffle hash join is enabled and the performance is better when sortMergeJoin > is used. > > Performance difference: from 3.9min(sortMergeJoin) to > 8.1min(shuffledHashJoin) > > 1. enable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-20-18-727.png|width=990,height=68! > !image-2023-05-16-21-21-35-493.png|width=924,height=502! > 2. disable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-22-44-532.png|width=1114,height=73! > !image-2023-05-16-21-22-16-170.png|width=934,height=477! > > And when shuffledHashJoin is enabled, gc is very serious, > !image-2023-05-16-21-23-35-237.png|width=929,height=570! > but sortMergeJoin executes without this problem. > !image-2023-05-16-21-24-09-182.png|width=931,height=573! > > Any suggestions on how to solve it?Thanks! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Description: Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when shuffle hash join is enabled and the performance is better when sortMergeJoin is used. Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) 1. enable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-20-18-727.png|width=990,height=68! !image-2023-05-16-21-21-35-493.png|width=924,height=502! 2. disable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-22-44-532.png|width=1190,height=78! !image-2023-05-16-21-22-16-170.png|width=934,height=477! And when shuffledHashJoin is enabled, gc is very serious, !image-2023-05-16-21-23-35-237.png|width=929,height=570! but sortMergeJoin executes without this problem. !image-2023-05-16-21-24-09-182.png|width=931,height=573! Any suggestions on how to solve it?Thanks! was: Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when shuffle hash join is enabled and the performance is better when sortMergeJoin is used. Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) 1. enable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-20-18-727.png|width=990,height=68! !image-2023-05-16-21-21-35-493.png|width=924,height=502! 2. disable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-22-44-532.png! !image-2023-05-16-21-22-16-170.png! And when shuffledHashJoin is enabled, gc is very serious, !image-2023-05-16-21-23-35-237.png! but sortMergeJoin executes without this problem. !image-2023-05-16-21-24-09-182.png! Any suggestions on how to solve it?Thanks! 
> when shuffle hash join is enabled, q95 performance deteriorates > --- > > Key: SPARK-43526 > URL: https://issues.apache.org/jira/browse/SPARK-43526 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: caican >Priority: Major > Attachments: image-2023-05-16-21-20-18-727.png, > image-2023-05-16-21-21-35-493.png, image-2023-05-16-21-22-16-170.png, > image-2023-05-16-21-22-44-532.png, image-2023-05-16-21-23-33-611.png, > image-2023-05-16-21-23-35-237.png, image-2023-05-16-21-24-09-182.png > > > Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when > shuffle hash join is enabled and the performance is better when sortMergeJoin > is used. > > Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) > > 1. enable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-20-18-727.png|width=990,height=68! > !image-2023-05-16-21-21-35-493.png|width=924,height=502! > 2. disable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-22-44-532.png|width=1190,height=78! > !image-2023-05-16-21-22-16-170.png|width=934,height=477! > > And when shuffledHashJoin is enabled, gc is very serious, > !image-2023-05-16-21-23-35-237.png|width=929,height=570! > but sortMergeJoin executes without this problem. > !image-2023-05-16-21-24-09-182.png|width=931,height=573! > > Any suggestions on how to solve it?Thanks! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Description: Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when shuffle hash join is enabled and the performance is better when sortMergeJoin is used. Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) 1. enable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-20-18-727.png! !image-2023-05-16-21-21-35-493.png! 2. disable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-22-44-532.png! !image-2023-05-16-21-22-16-170.png! And when shuffledHashJoin is enabled, gc is very serious, !image-2023-05-16-21-23-35-237.png! but sortMergeJoin executes without this problem. !image-2023-05-16-21-24-09-182.png! Any suggestions on how to solve it?Thanks! was: Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when shuffle hash join is enabled and the performance is better when sortMergeJoin is used. Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) 1. enable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-01-53-423.png! !image-2023-05-16-21-16-37-376.png! 2. disable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-05-45-361.png! !image-2023-05-16-21-16-13-128.png! and When shuffledHashJoin is enabled, gc is very serious. !image-2023-05-16-21-12-24-618.png! but sortMergeJoin executes without this problem. !image-2023-05-16-21-15-21-047.png! Any suggestions on how to solve it?Thanks! 
> when shuffle hash join is enabled, q95 performance deteriorates > --- > > Key: SPARK-43526 > URL: https://issues.apache.org/jira/browse/SPARK-43526 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: caican >Priority: Major > Attachments: image-2023-05-16-21-20-18-727.png, > image-2023-05-16-21-21-35-493.png, image-2023-05-16-21-22-16-170.png, > image-2023-05-16-21-22-44-532.png, image-2023-05-16-21-23-33-611.png, > image-2023-05-16-21-23-35-237.png, image-2023-05-16-21-24-09-182.png > > > Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when > shuffle hash join is enabled and the performance is better when sortMergeJoin > is used. > > Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) > > 1. enable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-20-18-727.png! > !image-2023-05-16-21-21-35-493.png! > 2. disable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-22-44-532.png! > !image-2023-05-16-21-22-16-170.png! > > And when shuffledHashJoin is enabled, gc is very serious, > !image-2023-05-16-21-23-35-237.png! > but sortMergeJoin executes without this problem. > !image-2023-05-16-21-24-09-182.png! > > Any suggestions on how to solve it?Thanks! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Description: Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when shuffle hash join is enabled and the performance is better when sortMergeJoin is used. Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) 1. enable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-20-18-727.png|width=1340,height=92! !image-2023-05-16-21-21-35-493.png! 2. disable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-22-44-532.png! !image-2023-05-16-21-22-16-170.png! And when shuffledHashJoin is enabled, gc is very serious, !image-2023-05-16-21-23-35-237.png! but sortMergeJoin executes without this problem. !image-2023-05-16-21-24-09-182.png! Any suggestions on how to solve it?Thanks! was: Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when shuffle hash join is enabled and the performance is better when sortMergeJoin is used. Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) 1. enable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-20-18-727.png! !image-2023-05-16-21-21-35-493.png! 2. disable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-22-44-532.png! !image-2023-05-16-21-22-16-170.png! And when shuffledHashJoin is enabled, gc is very serious, !image-2023-05-16-21-23-35-237.png! but sortMergeJoin executes without this problem. !image-2023-05-16-21-24-09-182.png! Any suggestions on how to solve it?Thanks! 
> when shuffle hash join is enabled, q95 performance deteriorates > --- > > Key: SPARK-43526 > URL: https://issues.apache.org/jira/browse/SPARK-43526 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: caican >Priority: Major > Attachments: image-2023-05-16-21-20-18-727.png, > image-2023-05-16-21-21-35-493.png, image-2023-05-16-21-22-16-170.png, > image-2023-05-16-21-22-44-532.png, image-2023-05-16-21-23-33-611.png, > image-2023-05-16-21-23-35-237.png, image-2023-05-16-21-24-09-182.png > > > Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when > shuffle hash join is enabled and the performance is better when sortMergeJoin > is used. > > Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) > > 1. enable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-20-18-727.png|width=1340,height=92! > !image-2023-05-16-21-21-35-493.png! > 2. disable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-22-44-532.png! > !image-2023-05-16-21-22-16-170.png! > > And when shuffledHashJoin is enabled, gc is very serious, > !image-2023-05-16-21-23-35-237.png! > but sortMergeJoin executes without this problem. > !image-2023-05-16-21-24-09-182.png! > > Any suggestions on how to solve it?Thanks! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Attachment: image-2023-05-16-21-24-09-182.png
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Description: Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when shuffle hash join is enabled and the performance is better when sortMergeJoin is used. Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) 1. enable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-20-18-727.png|width=990,height=68! !image-2023-05-16-21-21-35-493.png|width=924,height=502! 2. disable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-22-44-532.png! !image-2023-05-16-21-22-16-170.png! And when shuffledHashJoin is enabled, gc is very serious, !image-2023-05-16-21-23-35-237.png! but sortMergeJoin executes without this problem. !image-2023-05-16-21-24-09-182.png! Any suggestions on how to solve it?Thanks! was: Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when shuffle hash join is enabled and the performance is better when sortMergeJoin is used. Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) 1. enable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-20-18-727.png|width=1340,height=92! !image-2023-05-16-21-21-35-493.png! 2. disable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-22-44-532.png! !image-2023-05-16-21-22-16-170.png! And when shuffledHashJoin is enabled, gc is very serious, !image-2023-05-16-21-23-35-237.png! but sortMergeJoin executes without this problem. !image-2023-05-16-21-24-09-182.png! Any suggestions on how to solve it?Thanks! 
> when shuffle hash join is enabled, q95 performance deteriorates > --- > > Key: SPARK-43526 > URL: https://issues.apache.org/jira/browse/SPARK-43526 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: caican >Priority: Major > Attachments: image-2023-05-16-21-20-18-727.png, > image-2023-05-16-21-21-35-493.png, image-2023-05-16-21-22-16-170.png, > image-2023-05-16-21-22-44-532.png, image-2023-05-16-21-23-33-611.png, > image-2023-05-16-21-23-35-237.png, image-2023-05-16-21-24-09-182.png > > > Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when > shuffle hash join is enabled and the performance is better when sortMergeJoin > is used. > > Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) > > 1. enable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-20-18-727.png|width=990,height=68! > !image-2023-05-16-21-21-35-493.png|width=924,height=502! > 2. disable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-22-44-532.png! > !image-2023-05-16-21-22-16-170.png! > > And when shuffledHashJoin is enabled, gc is very serious, > !image-2023-05-16-21-23-35-237.png! > but sortMergeJoin executes without this problem. > !image-2023-05-16-21-24-09-182.png! > > Any suggestions on how to solve it?Thanks! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Attachment: image-2023-05-16-21-23-35-237.png
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Attachment: image-2023-05-16-21-23-33-611.png
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Attachment: image-2023-05-16-21-22-16-170.png
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Attachment: image-2023-05-16-21-22-44-532.png
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Attachment: image-2023-05-16-21-21-35-493.png
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Attachment: image-2023-05-16-21-20-18-727.png
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Description: Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when shuffle hash join is enabled and the performance is better when sortMergeJoin is used. Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) 1. enable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-01-53-423.png! !image-2023-05-16-21-16-37-376.png! 2. disable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-05-45-361.png! !image-2023-05-16-21-16-13-128.png! and When shuffledHashJoin is enabled, gc is very serious. !image-2023-05-16-21-12-24-618.png! but sortMergeJoin executes without this problem. !image-2023-05-16-21-15-21-047.png! Any suggestions on how to solve it?Thanks! was: Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when shuffle hash join is enabled and the performance is better when sortMergeJoin is used. >From 8.1min(shuffledHashJoin) to 3.9min(sortMergeJoin). enable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-01-53-423.png! !image-2023-05-16-21-16-37-376.png! disable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-05-45-361.png! !image-2023-05-16-21-16-13-128.png! And When shuffledHashJoin is enabled, gc is very serious !image-2023-05-16-21-12-24-618.png! But sortMergeJoin executes without this problem !image-2023-05-16-21-15-21-047.png! 
> when shuffle hash join is enabled, q95 performance deteriorates > --- > > Key: SPARK-43526 > URL: https://issues.apache.org/jira/browse/SPARK-43526 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: caican >Priority: Major > > Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when > shuffle hash join is enabled and the performance is better when sortMergeJoin > is used. > > Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) > > 1. enable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-01-53-423.png! > !image-2023-05-16-21-16-37-376.png! > 2. disable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-05-45-361.png! > !image-2023-05-16-21-16-13-128.png! > > and When shuffledHashJoin is enabled, gc is very serious. > !image-2023-05-16-21-12-24-618.png! > but sortMergeJoin executes without this problem. > !image-2023-05-16-21-15-21-047.png! > > Any suggestions on how to solve it?Thanks! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
caican created SPARK-43526: -- Summary: when shuffle hash join is enabled, q95 performance deteriorates Key: SPARK-43526 URL: https://issues.apache.org/jira/browse/SPARK-43526 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0, 3.1.2 Reporter: caican Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when shuffle hash join is enabled and the performance is better when sortMergeJoin is used. >From 8.1min(shuffledHashJoin) to 3.9min(sortMergeJoin). enable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-01-53-423.png! !image-2023-05-16-21-16-37-376.png! disable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-05-45-361.png! !image-2023-05-16-21-16-13-128.png! And When shuffledHashJoin is enabled, gc is very serious !image-2023-05-16-21-12-24-618.png! But sortMergeJoin executes without this problem !image-2023-05-16-21-15-21-047.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
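The SPARK-43526 reports above compare q95 runs with shuffledHashJoin enabled versus sortMergeJoin. As a minimal sketch of how that comparison is typically driven, the property below is the Spark SQL knob that steers the planner between the two strategies; whether flipping it helps this particular workload is exactly what the ticket questions (values shown are illustrative, set in spark-defaults.conf or via --conf):

```properties
# Let the planner pick shuffled hash join where it would otherwise
# prefer sort merge join (the default for this property is true).
spark.sql.join.preferSortMergeJoin=false

# Revert to the default to force sort merge join back:
# spark.sql.join.preferSortMergeJoin=true
```

The reporter's GC observations are consistent with this trade-off: shuffled hash join builds an in-memory hash table per partition, so at 5TB scale it puts far more pressure on executor heaps than sort merge join does.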
[jira] [Assigned] (SPARK-39281) Speed up Timestamp type inference of legacy format in JSON/CSV data source
[ https://issues.apache.org/jira/browse/SPARK-39281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-39281: Assignee: Jia Fan > Speed up Timestamp type inference of legacy format in JSON/CSV data source > -- > > Key: SPARK-39281 > URL: https://issues.apache.org/jira/browse/SPARK-39281 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Jia Fan >Priority: Major > > The optimization of {{DefaultTimestampFormatter}} has been implemented in > [#36562|https://github.com/apache/spark/pull/36562] , this ticket adds the > optimization of legacy format. The basic logic is to prevent the formatter > from throwing exceptions, and then use catch to determine whether the parsing > is successful. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39281) Speed up Timestamp type inference of legacy format in JSON/CSV data source
[ https://issues.apache.org/jira/browse/SPARK-39281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-39281. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41091 [https://github.com/apache/spark/pull/41091] > Speed up Timestamp type inference of legacy format in JSON/CSV data source > -- > > Key: SPARK-39281 > URL: https://issues.apache.org/jira/browse/SPARK-39281 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Jia Fan >Priority: Major > Fix For: 3.5.0 > > > The optimization of {{DefaultTimestampFormatter}} has been implemented in > [#36562|https://github.com/apache/spark/pull/36562] , this ticket adds the > optimization of legacy format. The basic logic is to prevent the formatter > from throwing exceptions, and then use catch to determine whether the parsing > is successful. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
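The SPARK-39281 description is terse: the speed-up comes from attempting the parse and using the catch path as the "does not match" signal. A minimal, self-contained Python sketch of that idea follows; it is not Spark's actual TimestampFormatter code, and the format string and function names are illustrative assumptions:

```python
from datetime import datetime

# Stand-in for one legacy timestamp pattern; Spark's real inference
# tries its configured patterns, this sketch uses a single one.
LEGACY_FORMAT = "%Y-%m-%d %H:%M:%S"

def parses_as_timestamp(field: str, fmt: str = LEGACY_FORMAT) -> bool:
    """Attempt the parse; the caught exception itself tells us the
    field does not match, as the ticket describes."""
    try:
        datetime.strptime(field, fmt)
        return True
    except ValueError:
        return False

def infer_column_type(values) -> str:
    """Infer 'timestamp' only if every sampled value parses; otherwise
    fall back to 'string', mirroring schema inference's conservatism."""
    if values and all(parses_as_timestamp(v) for v in values):
        return "timestamp"
    return "string"
```

The point of the actual optimization is to make the failure path cheap: constructing and throwing rich exceptions per candidate value is what made legacy-format inference slow, so the hot path avoids expensive exception machinery and only the match/no-match outcome is kept.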
[jira] [Commented] (SPARK-43504) [K8S] Mounts the hadoop config map on the executor pod
[ https://issues.apache.org/jira/browse/SPARK-43504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17723109#comment-17723109 ] Nikita Awasthi commented on SPARK-43504: User 'turboFei' has created a pull request for this issue: https://github.com/apache/spark/pull/41181 > [K8S] Mounts the hadoop config map on the executor pod > -- > > Key: SPARK-43504 > URL: https://issues.apache.org/jira/browse/SPARK-43504 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.4.0 >Reporter: Fei Wang >Priority: Major > > Since SPARK-25815 ([https://github.com/apache/spark/pull/22911]), the hadoop > config map is no longer mounted on the executor pod. > Per the [https://github.com/apache/spark/pull/22911] description: > {code:java} > The main two things that don't need to happen in executors anymore are: > 1. adding the Hadoop config to the executor pods: this is not needed > since the Spark driver will serialize the Hadoop config and send > it to executors when running tasks. {code} > But in fact, the executor still needs the hadoop configuration. > > !https://user-images.githubusercontent.com/6757692/238268640-8ff41144-5812-4232-b572-2de2408348ed.png! > > As shown in the picture above, the driver can resolve `hdfs://zeus`, but the > executor cannot. > So we still need to mount the hadoop config map on the executor side. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43524) Memory leak in Spark UI
[ https://issues.apache.org/jira/browse/SPARK-43524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amine Bagdouri updated SPARK-43524: --- Description: We have a distributed Spark application running on Azure HDInsight using Spark version 2.4.4. After a few days of active processing on our application, we have noticed that the GC CPU time ratio of the driver is close to 100%. We suspected a memory leak. Thus, we have produced a heap dump and analyzed it using Eclipse Memory Analyzer. Here is some interesting data from the driver's heap dump (heap size is 8 GB): * The estimated retained heap size of String objects (~5M instances) is 3.3 GB. It seems that most of these instances correspond to spark events. * Spark UI's AppStatusListener instance estimated retained size is 1.1 GB. * The number of LiveJob objects with status "RUNNING" is 18K, knowing that there shouldn't be more than 16 live running jobs since we use a fixed size thread pool of 16 threads to run spark queries. * The number of LiveTask objects is 485K. * The AsyncEventQueue instance associated to the AppStatusListener has a value of 854 for dropped events count and a value of 10001 for total events count, knowing that the dropped events counter is reset every minute and that the queue's default capacity is 1. We think that there is a memory leak in Spark UI. Here is our analysis of the root cause of this leak: * AppStatusListener is notified of Spark events using a bounded queue in AsyncEventQueue. * AppStatusListener updates its state (kvstore, liveTasks, liveStages, liveJobs, ...) based on the received events. For example, onTaskStart adds a task to liveTasks map and onTaskEnd removes the task from liveTasks map. * When the rate of events is very high, the bounded queue in AsyncEventQueue is full, some events are dropped and don't make it to AppStatusListener. * Dropped events that signal the end of a processing unit prevent the state of AppStatusListener from being cleaned. 
For example, a dropped onTaskEnd event, will prevent the task from being removed from liveTasks map, and the task will remain in the heap until the driver's JVM is stopped. We were able to confirm our analysis by reducing the capacity of the AsyncEventQueue (spark.scheduler.listenerbus.eventqueue.capacity=10). After having launched many spark queries using this config, we observed that the number of active jobs in Spark UI increased rapidly and remained high even though all submitted queries have completed. We have also noticed that some executor task counters in Spark UI were negative, which confirms that AppStatusListener state does not accurately reflect the reality and that it can be a victim of event drops. Suggested fix: There are some limits today on the number of "dead" objects in AppStatusListener's maps (for example: spark.ui.retainedJobs). We suggest enforcing another configurable limit on the number of total objects in AppStatusListener's maps and kvstore. This should limit the leak in the case of high events rate, but AppStatusListener stats will remain inaccurate. was: We have a distributed Spark application running on Azure HDInsight using Spark version 2.4.4. After a few days of active processing on our application, we have noticed that the GC CPU time ratio of the driver is close to 100%. We suspected a memory leak. Thus, we have produced a heap dump and analyzed it using Eclipse Memory Analyzer. Here is some interesting data from the driver's heap dump (heap size is 8 GB): * The estimated retained heap size of String objects (~5M instances) is 3.3 GB. It seems that most of these instances correspond to spark events. * Spark UI's AppStatusListener instance estimated retained size is 1.1 GB. * The number of LiveJob objects with status "RUNNING" is 18K, knowing that there shouldn't be more than 16 live running jobs since we use a fixed thread pool of 16 threads to run spark queries. * The number of LiveTask objects is 485K. 
* The AsyncEventQueue instance associated with the AppStatusListener has a value of 854 for dropped events count and a value of 10001 for total events count, knowing that the dropped events counter is reset every minute and that the queue's default capacity is 1. We think that there is a memory leak in Spark UI. Here is our analysis of the root cause of this leak: * AppStatusListener is notified of Spark events using a bounded queue in AsyncEventQueue. * AppStatusListener updates its state (kvstore, liveTasks, liveStages, liveJobs, ...) based on the received events. For example, onTaskStart adds a task to the liveTasks map and onTaskEnd removes the task from the liveTasks map. * When the rate of events is very high, the bounded queue in AsyncEventQueue is full, some events are dropped and don't make it to AppStatusListener. * Dropped events that signal the end of a processing unit prevent the state of AppStatusListener from being cleaned.
[jira] [Created] (SPARK-43525) Enhance ImportOrderChecker rules for `group.scala`
BingKun Pan created SPARK-43525: --- Summary: Enhance ImportOrderChecker rules for `group.scala` Key: SPARK-43525 URL: https://issues.apache.org/jira/browse/SPARK-43525 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.5.0 Reporter: BingKun Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43518) Convert `_LEGACY_ERROR_TEMP_2029` to INTERNAL_ERROR
[ https://issues.apache.org/jira/browse/SPARK-43518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-43518: Assignee: BingKun Pan > Convert `_LEGACY_ERROR_TEMP_2029` to INTERNAL_ERROR > --- > > Key: SPARK-43518 > URL: https://issues.apache.org/jira/browse/SPARK-43518 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43518) Convert `_LEGACY_ERROR_TEMP_2029` to INTERNAL_ERROR
[ https://issues.apache.org/jira/browse/SPARK-43518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-43518. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41179 [https://github.com/apache/spark/pull/41179] > Convert `_LEGACY_ERROR_TEMP_2029` to INTERNAL_ERROR > --- > > Key: SPARK-43518 > URL: https://issues.apache.org/jira/browse/SPARK-43518 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43524) Memory leak in Spark UI
Amine Bagdouri created SPARK-43524: -- Summary: Memory leak in Spark UI Key: SPARK-43524 URL: https://issues.apache.org/jira/browse/SPARK-43524 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.4.4 Reporter: Amine Bagdouri

We have a distributed Spark application running on Azure HDInsight using Spark version 2.4.4. After a few days of active processing, we noticed that the driver's GC CPU time ratio was close to 100%, so we suspected a memory leak. We produced a heap dump and analyzed it using Eclipse Memory Analyzer.

Here is some interesting data from the driver's heap dump (heap size is 8 GB):
* The estimated retained heap size of String objects (~5M instances) is 3.3 GB. Most of these instances appear to correspond to Spark events.
* Spark UI's AppStatusListener instance has an estimated retained size of 1.1 GB.
* There are 18K LiveJob objects with status "RUNNING", even though there should never be more than 16 live running jobs, since we use a fixed thread pool of 16 threads to run Spark queries.
* There are 485K LiveTask objects.
* The AsyncEventQueue instance associated with the AppStatusListener reports 854 dropped events and 10001 total events, keeping in mind that the dropped-events counter is reset every minute and that the queue's default capacity is 10000.

We think there is a memory leak in Spark UI. Here is our analysis of the root cause of this leak:
* AppStatusListener is notified of Spark events through a bounded queue in AsyncEventQueue.
* AppStatusListener updates its state (kvstore, liveTasks, liveStages, liveJobs, ...) based on the received events. For example, onTaskStart adds a task to the liveTasks map and onTaskEnd removes the task from the liveTasks map.
* When the event rate is very high, the bounded queue in AsyncEventQueue fills up, and some events are dropped and never reach AppStatusListener.
* A dropped event that signals the end of a processing unit prevents the corresponding AppStatusListener state from being cleaned up. For example, a dropped onTaskEnd event prevents the task from being removed from the liveTasks map, so the task remains on the heap until the driver's JVM is stopped.

We were able to confirm this analysis by reducing the capacity of the AsyncEventQueue (spark.scheduler.listenerbus.eventqueue.capacity=10). After launching many Spark queries with this config, we observed that the number of active jobs in Spark UI increased rapidly and remained high even though all submitted queries had completed. We also noticed that some executor task counters in Spark UI were negative, which confirms that the AppStatusListener state does not accurately reflect reality and can be a victim of event drops.

Suggested fix: there are already limits on the number of "dead" objects in AppStatusListener's maps (for example, spark.ui.retainedJobs). We suggest enforcing an additional configurable limit on the total number of objects in AppStatusListener's maps and kvstore. This would bound the leak under high event rates, although AppStatusListener stats would remain inaccurate.

-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
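The suggested fix, a hard cap on the live-state maps, can be sketched in plain Scala. The class and method names below are hypothetical illustrations, not Spark's actual API; AppStatusListener itself keeps live objects in uncapped hash maps:

```scala
import scala.collection.mutable

// Minimal sketch: a live-object map with a hard size cap that evicts the
// oldest entry when the cap is reached. If a matching "end" event is dropped,
// the stale entry is eventually evicted instead of leaking forever.
class CappedLiveStore[K, V](maxLive: Int) {
  // LinkedHashMap preserves insertion order, so `head` is the oldest entry.
  private val entries = mutable.LinkedHashMap.empty[K, V]

  def put(key: K, value: V): Unit = {
    if (!entries.contains(key) && entries.size >= maxLive) {
      entries.remove(entries.head._1) // evict the oldest live entry
    }
    entries.update(key, value)
  }

  def remove(key: K): Option[V] = entries.remove(key)
  def size: Int = entries.size
}

object Demo extends App {
  val liveTasks = new CappedLiveStore[Long, String](maxLive = 16)
  // Simulate 1000 onTaskStart events whose matching onTaskEnd was dropped:
  (1L to 1000L).foreach(id => liveTasks.put(id, s"task-$id"))
  println(liveTasks.size) // bounded at the cap instead of growing to 1000
}
```

As the report notes, such a cap trades accuracy for boundedness: evicted entries make the UI counters wrong, but the heap stops growing.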
[jira] [Created] (SPARK-43523) Memory leak in Spark UI
Amine Bagdouri created SPARK-43523: -- Summary: Memory leak in Spark UI Key: SPARK-43523 URL: https://issues.apache.org/jira/browse/SPARK-43523 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.4.4 Reporter: Amine Bagdouri

(Description identical to SPARK-43524 above.)

-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43302) Make Python UDAF an AggregateFunction
[ https://issues.apache.org/jira/browse/SPARK-43302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17723061#comment-17723061 ] ASF GitHub Bot commented on SPARK-43302: User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/41142 > Make Python UDAF an AggregateFunction > - > > Key: SPARK-43302 > URL: https://issues.apache.org/jira/browse/SPARK-43302 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43518) Convert `_LEGACY_ERROR_TEMP_2029` to INTERNAL_ERROR
[ https://issues.apache.org/jira/browse/SPARK-43518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17723059#comment-17723059 ] ASF GitHub Bot commented on SPARK-43518: User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/41179 > Convert `_LEGACY_ERROR_TEMP_2029` to INTERNAL_ERROR > --- > > Key: SPARK-43518 > URL: https://issues.apache.org/jira/browse/SPARK-43518 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43457) [PYTHON][CONNECT] user agent should include the OS and Python versions
[ https://issues.apache.org/jira/browse/SPARK-43457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-43457: Assignee: Niranjan Jayakar > [PYTHON][CONNECT] user agent should include the OS and Python versions > -- > > Key: SPARK-43457 > URL: https://issues.apache.org/jira/browse/SPARK-43457 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0 >Reporter: Niranjan Jayakar >Assignee: Niranjan Jayakar >Priority: Major > > Including OS and Python versions in the user agent improves tracking to see > how Spark Connect is used across Python versions and the different platforms > it's used from -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43457) [PYTHON][CONNECT] user agent should include the OS and Python versions
[ https://issues.apache.org/jira/browse/SPARK-43457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-43457. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41138 [https://github.com/apache/spark/pull/41138] > [PYTHON][CONNECT] user agent should include the OS and Python versions > -- > > Key: SPARK-43457 > URL: https://issues.apache.org/jira/browse/SPARK-43457 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0 >Reporter: Niranjan Jayakar >Assignee: Niranjan Jayakar >Priority: Major > Fix For: 3.5.0 > > > Including OS and Python versions in the user agent improves tracking to see > how Spark Connect is used across Python versions and the different platforms > it's used from -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43522) Creating struct column occurs error 'org.apache.spark.sql.AnalysisException [DATATYPE_MISMATCH.CREATE_NAMED_STRUCT_WITHOUT_FOLDABLE_STRING]'
[ https://issues.apache.org/jira/browse/SPARK-43522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Heedo Lee updated SPARK-43522: -- Description: (revised; the text duplicates the bug report in the SPARK-43522 "Created" notification below)
[jira] [Updated] (SPARK-43522) Creating struct column occurs error 'org.apache.spark.sql.AnalysisException [DATATYPE_MISMATCH.CREATE_NAMED_STRUCT_WITHOUT_FOLDABLE_STRING]'
[ https://issues.apache.org/jira/browse/SPARK-43522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Heedo Lee updated SPARK-43522: -- Description: (revised; the text duplicates the bug report in the SPARK-43522 "Created" notification below)
[jira] [Updated] (SPARK-43522) Creating struct column occurs error 'org.apache.spark.sql.AnalysisException [DATATYPE_MISMATCH.CREATE_NAMED_STRUCT_WITHOUT_FOLDABLE_STRING]'
[ https://issues.apache.org/jira/browse/SPARK-43522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Heedo Lee updated SPARK-43522: -- Description: (revised; the text duplicates the bug report in the SPARK-43522 "Created" notification below)
[jira] [Created] (SPARK-43522) Creating struct column occurs error 'org.apache.spark.sql.AnalysisException [DATATYPE_MISMATCH.CREATE_NAMED_STRUCT_WITHOUT_FOLDABLE_STRING]'
Heedo Lee created SPARK-43522: - Summary: Creating struct column occurs error 'org.apache.spark.sql.AnalysisException [DATATYPE_MISMATCH.CREATE_NAMED_STRUCT_WITHOUT_FOLDABLE_STRING]' Key: SPARK-43522 URL: https://issues.apache.org/jira/browse/SPARK-43522 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.0 Reporter: Heedo Lee

When creating a struct column in a DataFrame, code that ran without problems in version 3.3.1 no longer works in version 3.4.0.

Example:
{code:java}
val testDF = Seq("a=b,c=d,d=f").toDF
  .withColumn("key_value", split('value, ","))
  .withColumn("map_entry", transform(col("key_value"),
    x => struct(split(x, "=").getItem(0), split(x, "=").getItem(1))))
{code}

In 3.3.1:
{code:java}
testDF.show()
+-----------+---------------+--------------------+
|      value|      key_value|           map_entry|
+-----------+---------------+--------------------+
|a=b,c=d,d=f|[a=b, c=d, d=f]|[{a, b}, {c, d}, ...|
+-----------+---------------+--------------------+

testDF.printSchema
root
 |-- value: string (nullable = true)
 |-- key_value: array (nullable = true)
 |    |-- element: string (containsNull = false)
 |-- map_entry: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- col1: string (nullable = true)
 |    |    |-- col2: string (nullable = true)
{code}

In 3.4.0:
{code:java}
org.apache.spark.sql.AnalysisException: [DATATYPE_MISMATCH.CREATE_NAMED_STRUCT_WITHOUT_FOLDABLE_STRING] Cannot resolve "struct(split(namedlambdavariable(), =, -1)[0], split(namedlambdavariable(), =, -1)[1])" due to data type mismatch: Only foldable `STRING` expressions are allowed to appear at odd position, but they are ["0", "1"].;
'Project [value#41, key_value#45, transform(key_value#45, lambdafunction(struct(0, split(lambda x_3#49, =, -1)[0], 1, split(lambda x_3#49, =, -1)[1]), lambda x_3#49, false)) AS map_entry#48]
+- Project [value#41, split(value#41, ,, -1) AS key_value#45]
   +- LocalRelation [value#41]

  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.dataTypeMismatch(package.scala:73)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5(CheckAnalysis.scala:269)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5$adapted(CheckAnalysis.scala:256)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:295)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:294)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:294)
  at scala.collection.Iterator.foreach(Iterator.scala:943)
  at scala.collection.Iterator.foreach$(Iterator.scala:943)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
  at scala.collection.IterableLike.foreach(IterableLike.scala:74)
  at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
  at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:294)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:294)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:294)
  at scala.collection.Iterator.foreach(Iterator.scala:943)
  at scala.collection.Iterator.foreach$(Iterator.scala:943)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
{code}

However, if you alias the struct elements, you get the same result as in the previous version:
{code:java}
val testDF = Seq("a=b,c=d,d=f").toDF
  .withColumn("key_value", split('value, ","))
  .withColumn("map_entry", transform(col("key_value"),
    x => struct(split(x, "=").getItem(0).as("col1"),
                split(x, "=").getItem(1).as("col2"))))
{code}

-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
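Since the column built above is named "map_entry", a plausible next step (an assumption about the reporter's intent, not part of the report) is to collapse the aliased struct entries into a real map column with Spark's `map_from_entries`, which accepts an array of two-field structs. A sketch, assuming a running SparkSession named `spark`:

```scala
// Sketch only: requires a Spark 3.x SparkSession in scope; not runnable standalone.
import org.apache.spark.sql.functions._
import spark.implicits._

val testDF = Seq("a=b,c=d,d=f").toDF
  .withColumn("key_value", split('value, ","))
  .withColumn("map_entry", transform(col("key_value"),
    x => struct(split(x, "=").getItem(0).as("col1"),
                split(x, "=").getItem(1).as("col2"))))
  // Hypothetical extra step: turn array<struct<col1,col2>> into map<string,string>.
  .withColumn("kv_map", map_from_entries(col("map_entry")))
```

Note that the aliasing workaround is what makes this chain analyze cleanly on 3.4.0; the unaliased form fails before `map_from_entries` is ever reached.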
[jira] [Created] (SPARK-43521) Support CREATE TABLE LIKE FILE for PARQUET
melin created SPARK-43521: - Summary: Support CREATE TABLE LIKE FILE for PARQUET Key: SPARK-43521 URL: https://issues.apache.org/jira/browse/SPARK-43521 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.0 Reporter: melin ref: https://issues.apache.org/jira/browse/HIVE-26395 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
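For context, the referenced HIVE-26395 feature derives a new table's schema from an existing data file. A rough sketch of the syntax proposed there (hypothetical for Spark; the path is an illustrative placeholder and Spark SQL does not accept this statement as of this report):

```sql
-- Hypothetical syntax modeled on HIVE-26395: infer the table schema
-- from the Parquet file footer instead of listing columns explicitly.
CREATE TABLE events_from_file LIKE FILE PARQUET '/data/events/part-00000.parquet';
```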