[jira] [Updated] (SPARK-47471) Support order-insensitive lateral column alias
[ https://issues.apache.org/jira/browse/SPARK-47471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-47471:
    Component/s: SQL
                 (was: Block Manager)

> Support order-insensitive lateral column alias
> --
>
> Key: SPARK-47471
> URL: https://issues.apache.org/jira/browse/SPARK-47471
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.3.4
> Reporter: Yuming Wang
> Priority: Major
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47471) Support order-insensitive lateral column alias
Yuming Wang created SPARK-47471:
---
Summary: Support order-insensitive lateral column alias
Key: SPARK-47471
URL: https://issues.apache.org/jira/browse/SPARK-47471
Project: Spark
Issue Type: Improvement
Components: Block Manager
Affects Versions: 3.3.4
Reporter: Yuming Wang
[jira] [Created] (SPARK-47459) Cancel running stage if the result is empty relation
Yuming Wang created SPARK-47459:
---
Summary: Cancel running stage if the result is empty relation
Key: SPARK-47459
URL: https://issues.apache.org/jira/browse/SPARK-47459
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.5.1
Reporter: Yuming Wang
Attachments: task stack trace.png

How to reproduce:
bin/spark-sql --master yarn --conf spark.driver.host=10.211.174.53
{code:sql}
set spark.sql.adaptive.enabled=true;
select a from (select id as a, id as b, id as z from range(1)) t1
join (select id as c, id as d from range(2)) t2 on t1.a = t2.c
join (select id as e, id as f from range(3)) t3 on t2.d = t3.e
where z % 10 < 0
group by 1;
{code}
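Why the repro produces an empty relation: `range()` yields non-negative ids, so `z % 10` is always in `[0, 9]` and the filter `z % 10 < 0` can never hold, which is why cancelling the still-running join stages would save work. A plain-Python sanity check of that claim (ordinary ranges standing in for Spark's `range()`):

```python
# range() produces non-negative ids, so z % 10 < 0 is statically unsatisfiable
# and the whole query reduces to an empty relation.
rows = range(100)  # stand-in for the ids produced by spark's range()
assert all(not (z % 10 < 0) for z in rows)
assert [z for z in rows if z % 10 < 0] == []  # the query's result is empty
```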
[jira] [Updated] (SPARK-47459) Cancel running stage if the result is empty relation
[ https://issues.apache.org/jira/browse/SPARK-47459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-47459:
    Attachment: task stack trace.png

> Cancel running stage if the result is empty relation
> --
>
> Key: SPARK-47459
> URL: https://issues.apache.org/jira/browse/SPARK-47459
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.5.1
> Reporter: Yuming Wang
> Priority: Major
> Attachments: task stack trace.png
>
> How to reproduce:
> bin/spark-sql --master yarn --conf spark.driver.host=10.211.174.53
> {code:sql}
> set spark.sql.adaptive.enabled=true;
> select a from (select id as a, id as b, id as z from range(1)) t1
> join (select id as c, id as d from range(2)) t2 on t1.a = t2.c
> join (select id as e, id as f from range(3)) t3 on t2.d = t3.e
> where z % 10 < 0
> group by 1;
> {code}
[jira] [Created] (SPARK-47441) Do not add log link for unmanaged AM in Spark UI
Yuming Wang created SPARK-47441:
---
Summary: Do not add log link for unmanaged AM in Spark UI
Key: SPARK-47441
URL: https://issues.apache.org/jira/browse/SPARK-47441
Project: Spark
Issue Type: Bug
Components: YARN
Affects Versions: 3.5.1, 3.5.0
Reporter: Yuming Wang

{noformat}
24/03/18 04:58:25,022 ERROR [spark-listener-group-appStatus] scheduler.AsyncEventQueue:97 : Listener AppStatusListener threw an exception
java.lang.NumberFormatException: For input string: "null"
	at java.lang.NumberFormatException.forInputString(NumberFormatException.java:67) ~[?:?]
	at java.lang.Integer.parseInt(Integer.java:668) ~[?:?]
	at java.lang.Integer.parseInt(Integer.java:786) ~[?:?]
	at scala.collection.immutable.StringLike.toInt(StringLike.scala:310) ~[scala-library-2.12.18.jar:?]
	at scala.collection.immutable.StringLike.toInt$(StringLike.scala:310) ~[scala-library-2.12.18.jar:?]
	at scala.collection.immutable.StringOps.toInt(StringOps.scala:33) ~[scala-library-2.12.18.jar:?]
	at org.apache.spark.util.Utils$.parseHostPort(Utils.scala:1105) ~[spark-core_2.12-3.5.1.jar:3.5.1]
	at org.apache.spark.status.ProcessSummaryWrapper.<init>(storeTypes.scala:609) ~[spark-core_2.12-3.5.1.jar:3.5.1]
	at org.apache.spark.status.LiveMiscellaneousProcess.doUpdate(LiveEntity.scala:1045) ~[spark-core_2.12-3.5.1.jar:3.5.1]
	at org.apache.spark.status.LiveEntity.write(LiveEntity.scala:50) ~[spark-core_2.12-3.5.1.jar:3.5.1]
	at org.apache.spark.status.AppStatusListener.update(AppStatusListener.scala:1233) ~[spark-core_2.12-3.5.1.jar:3.5.1]
	at org.apache.spark.status.AppStatusListener.onMiscellaneousProcessAdded(AppStatusListener.scala:1445) ~[spark-core_2.12-3.5.1.jar:3.5.1]
	at org.apache.spark.status.AppStatusListener.onOtherEvent(AppStatusListener.scala:113) ~[spark-core_2.12-3.5.1.jar:3.5.1]
	at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:100) ~[spark-core_2.12-3.5.1.jar:3.5.1]
	at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28) ~[spark-core_2.12-3.5.1.jar:3.5.1]
	at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) ~[spark-core_2.12-3.5.1.jar:3.5.1]
	at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) ~[spark-core_2.12-3.5.1.jar:3.5.1]
	at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117) ~[spark-core_2.12-3.5.1.jar:3.5.1]
	at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101) ~[spark-core_2.12-3.5.1.jar:3.5.1]
	at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105) ~[spark-core_2.12-3.5.1.jar:3.5.1]
	at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105) ~[spark-core_2.12-3.5.1.jar:3.5.1]
	at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23) ~[scala-library-2.12.18.jar:?]
	at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62) ~[scala-library-2.12.18.jar:?]
	at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100) ~[spark-core_2.12-3.5.1.jar:3.5.1]
	at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96) ~[spark-core_2.12-3.5.1.jar:3.5.1]
	at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1356) [spark-core_2.12-3.5.1.jar:3.5.1]
	at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96) [spark-core_2.12-3.5.1.jar:3.5.1]
{noformat}
[jira] [Commented] (SPARK-47222) fileCompressionFactor should be applied to the size of the table
[ https://issues.apache.org/jira/browse/SPARK-47222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17821923#comment-17821923 ]

Yuming Wang commented on SPARK-47222:
-
https://github.com/apache/spark/pull/45329

> fileCompressionFactor should be applied to the size of the table
> --
>
> Key: SPARK-47222
> URL: https://issues.apache.org/jira/browse/SPARK-47222
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Yuming Wang
> Priority: Major
> Labels: pull-request-available
>
[jira] [Created] (SPARK-47222) fileCompressionFactor should be applied to the size of the table
Yuming Wang created SPARK-47222:
---
Summary: fileCompressionFactor should be applied to the size of the table
Key: SPARK-47222
URL: https://issues.apache.org/jira/browse/SPARK-47222
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 4.0.0
Reporter: Yuming Wang
[jira] [Created] (SPARK-46885) Push down filters through TypedFilter
Yuming Wang created SPARK-46885:
---
Summary: Push down filters through TypedFilter
Key: SPARK-46885
URL: https://issues.apache.org/jira/browse/SPARK-46885
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 4.0.0
Reporter: Yuming Wang
[jira] [Resolved] (SPARK-40609) Casts types according to bucket info for Equality expression
[ https://issues.apache.org/jira/browse/SPARK-40609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang resolved SPARK-40609.
-
Fix Version/s: 4.0.0
Resolution: Duplicate

Issue fixed by SPARK-46219.

> Casts types according to bucket info for Equality expression
> --
>
> Key: SPARK-40609
> URL: https://issues.apache.org/jira/browse/SPARK-40609
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Yuming Wang
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
[jira] [Updated] (SPARK-46219) Unwrap cast in join predicates
[ https://issues.apache.org/jira/browse/SPARK-46219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-46219:
    Summary: Unwrap cast in join predicates (was: Unwrapp cast in join predicates)

> Unwrap cast in join predicates
> --
>
> Key: SPARK-46219
> URL: https://issues.apache.org/jira/browse/SPARK-46219
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Yuming Wang
> Priority: Major
>
[jira] [Created] (SPARK-46219) Unwrapp cast in join predicates
Yuming Wang created SPARK-46219:
---
Summary: Unwrapp cast in join predicates
Key: SPARK-46219
URL: https://issues.apache.org/jira/browse/SPARK-46219
Project: Spark
Issue Type: New Feature
Components: SQL
Affects Versions: 4.0.0
Reporter: Yuming Wang
[jira] [Resolved] (SPARK-46069) Support unwrap timestamp type to date type
[ https://issues.apache.org/jira/browse/SPARK-46069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang resolved SPARK-46069.
-
Fix Version/s: 4.0.0
Resolution: Fixed

Issue resolved by pull request 43982
[https://github.com/apache/spark/pull/43982]

> Support unwrap timestamp type to date type
> --
>
> Key: SPARK-46069
> URL: https://issues.apache.org/jira/browse/SPARK-46069
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Wan Kun
> Assignee: Wan Kun
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
[jira] [Assigned] (SPARK-46069) Support unwrap timestamp type to date type
[ https://issues.apache.org/jira/browse/SPARK-46069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang reassigned SPARK-46069:
---
Assignee: Wan Kun

> Support unwrap timestamp type to date type
> --
>
> Key: SPARK-46069
> URL: https://issues.apache.org/jira/browse/SPARK-46069
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Wan Kun
> Assignee: Wan Kun
> Priority: Major
> Labels: pull-request-available
>
[jira] [Resolved] (SPARK-43228) Join keys also match PartitioningCollection
[ https://issues.apache.org/jira/browse/SPARK-43228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang resolved SPARK-43228.
-
Fix Version/s: 4.0.0
Resolution: Fixed

Issue resolved by pull request 44128
[https://github.com/apache/spark/pull/44128]

> Join keys also match PartitioningCollection
> ---
>
> Key: SPARK-43228
> URL: https://issues.apache.org/jira/browse/SPARK-43228
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.5.0
> Reporter: Yuming Wang
> Assignee: Yuming Wang
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
[jira] [Assigned] (SPARK-43228) Join keys also match PartitioningCollection
[ https://issues.apache.org/jira/browse/SPARK-43228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang reassigned SPARK-43228:
---
Assignee: Yuming Wang

> Join keys also match PartitioningCollection
> ---
>
> Key: SPARK-43228
> URL: https://issues.apache.org/jira/browse/SPARK-43228
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.5.0
> Reporter: Yuming Wang
> Assignee: Yuming Wang
> Priority: Major
> Labels: pull-request-available
>
[jira] [Updated] (SPARK-46122) Disable spark.sql.legacy.createHiveTableByDefault by default
[ https://issues.apache.org/jira/browse/SPARK-46122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-46122:
    Summary: Disable spark.sql.legacy.createHiveTableByDefault by default (was: Enable spark.sql.legacy.createHiveTableByDefault by default)

> Disable spark.sql.legacy.createHiveTableByDefault by default
> --
>
> Key: SPARK-46122
> URL: https://issues.apache.org/jira/browse/SPARK-46122
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Yuming Wang
> Priority: Major
>
[jira] [Created] (SPARK-46122) Enable spark.sql.legacy.createHiveTableByDefault by default
Yuming Wang created SPARK-46122:
---
Summary: Enable spark.sql.legacy.createHiveTableByDefault by default
Key: SPARK-46122
URL: https://issues.apache.org/jira/browse/SPARK-46122
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 4.0.0
Reporter: Yuming Wang
[jira] [Created] (SPARK-46119) Override toString method for UnresolvedAlias
Yuming Wang created SPARK-46119:
---
Summary: Override toString method for UnresolvedAlias
Key: SPARK-46119
URL: https://issues.apache.org/jira/browse/SPARK-46119
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 4.0.0
Reporter: Yuming Wang
[jira] [Updated] (SPARK-46102) Prune keys or values from Generate if it is a map type
[ https://issues.apache.org/jira/browse/SPARK-46102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-46102:
    Summary: Prune keys or values from Generate if it is a map type (was: Prune keys or values from Generate if it is a map type.)

> Prune keys or values from Generate if it is a map type
> --
>
> Key: SPARK-46102
> URL: https://issues.apache.org/jira/browse/SPARK-46102
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Yuming Wang
> Priority: Major
>
[jira] [Created] (SPARK-46102) Prune keys or values from Generate if it is a map type.
Yuming Wang created SPARK-46102:
---
Summary: Prune keys or values from Generate if it is a map type.
Key: SPARK-46102
URL: https://issues.apache.org/jira/browse/SPARK-46102
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 4.0.0
Reporter: Yuming Wang
[jira] [Commented] (SPARK-46097) Push down limit 1 though Union and Aggregate
[ https://issues.apache.org/jira/browse/SPARK-46097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17789619#comment-17789619 ]

Yuming Wang commented on SPARK-46097:
-
https://github.com/apache/spark/pull/44009

> Push down limit 1 though Union and Aggregate
> --
>
> Key: SPARK-46097
> URL: https://issues.apache.org/jira/browse/SPARK-46097
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Yuming Wang
> Priority: Major
> Labels: pull-request-available
>
[jira] [Created] (SPARK-46097) Push down limit 1 though Union and Aggregate
Yuming Wang created SPARK-46097:
---
Summary: Push down limit 1 though Union and Aggregate
Key: SPARK-46097
URL: https://issues.apache.org/jira/browse/SPARK-46097
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 4.0.0
Reporter: Yuming Wang
[jira] [Updated] (SPARK-45954) Avoid generating redundant ShuffleExchangeExec node
[ https://issues.apache.org/jira/browse/SPARK-45954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-45954:
    Summary: Avoid generating redundant ShuffleExchangeExec node (was: Remove redundant shuffles)

> Avoid generating redundant ShuffleExchangeExec node
> ---
>
> Key: SPARK-45954
> URL: https://issues.apache.org/jira/browse/SPARK-45954
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Yuming Wang
> Priority: Major
> Labels: pull-request-available
>
[jira] [Created] (SPARK-45954) Remove redundant shuffles
Yuming Wang created SPARK-45954:
---
Summary: Remove redundant shuffles
Key: SPARK-45954
URL: https://issues.apache.org/jira/browse/SPARK-45954
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 4.0.0
Reporter: Yuming Wang
[jira] [Updated] (SPARK-45947) Set a human readable description for Dataset api
[ https://issues.apache.org/jira/browse/SPARK-45947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-45947:
    Description:
We should set the view name to sparkSession.sparkContext.setJobDescription("xxx")
!screenshot-1.png!

  was:
Need to sparkSession.sparkContext.setJobDescription("xxx")
!screenshot-1.png!

> Set a human readable description for Dataset api
> --
>
> Key: SPARK-45947
> URL: https://issues.apache.org/jira/browse/SPARK-45947
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Yuming Wang
> Priority: Major
> Attachments: screenshot-1.png
>
> We should set the view name to
> sparkSession.sparkContext.setJobDescription("xxx")
> !screenshot-1.png!
[jira] [Updated] (SPARK-45947) Set a human readable description for Dataset api
[ https://issues.apache.org/jira/browse/SPARK-45947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-45947:
    Description:
Need to sparkSession.sparkContext.setJobDescription("xxx")
!screenshot-1.png!

> Set a human readable description for Dataset api
> --
>
> Key: SPARK-45947
> URL: https://issues.apache.org/jira/browse/SPARK-45947
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Yuming Wang
> Priority: Major
> Attachments: screenshot-1.png
>
> Need to sparkSession.sparkContext.setJobDescription("xxx")
> !screenshot-1.png!
[jira] [Created] (SPARK-45947) Set a human readable description for Dataset api
Yuming Wang created SPARK-45947:
---
Summary: Set a human readable description for Dataset api
Key: SPARK-45947
URL: https://issues.apache.org/jira/browse/SPARK-45947
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 4.0.0
Reporter: Yuming Wang
Attachments: screenshot-1.png
[jira] [Updated] (SPARK-45947) Set a human readable description for Dataset api
[ https://issues.apache.org/jira/browse/SPARK-45947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-45947:
    Attachment: screenshot-1.png

> Set a human readable description for Dataset api
> --
>
> Key: SPARK-45947
> URL: https://issues.apache.org/jira/browse/SPARK-45947
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Yuming Wang
> Priority: Major
> Attachments: screenshot-1.png
>
[jira] [Updated] (SPARK-45915) Treat decimal(x, 0) the same as IntegralType in PromoteStrings
[ https://issues.apache.org/jira/browse/SPARK-45915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-45915:
    Summary: Treat decimal(x, 0) the same as IntegralType in PromoteStrings (was: Unwrap cast in predicate)

> Treat decimal(x, 0) the same as IntegralType in PromoteStrings
> --
>
> Key: SPARK-45915
> URL: https://issues.apache.org/jira/browse/SPARK-45915
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Yuming Wang
> Priority: Major
> Labels: pull-request-available
>
[jira] [Created] (SPARK-45915) Unwrap cast in predicate
Yuming Wang created SPARK-45915:
---
Summary: Unwrap cast in predicate
Key: SPARK-45915
URL: https://issues.apache.org/jira/browse/SPARK-45915
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 4.0.0
Reporter: Yuming Wang
[jira] [Created] (SPARK-45909) Remove the cast if it can safely up-cast in IsNotNull
Yuming Wang created SPARK-45909:
---
Summary: Remove the cast if it can safely up-cast in IsNotNull
Key: SPARK-45909
URL: https://issues.apache.org/jira/browse/SPARK-45909
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 4.0.0
Reporter: Yuming Wang
[jira] [Updated] (SPARK-45894) hive table level setting hadoop.mapred.max.split.size
[ https://issues.apache.org/jira/browse/SPARK-45894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-45894:
    Target Version/s: (was: 3.5.0)

> hive table level setting hadoop.mapred.max.split.size
> -
>
> Key: SPARK-45894
> URL: https://issues.apache.org/jira/browse/SPARK-45894
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.5.0
> Reporter: guihuawen
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.5.0
>
> When scanning a Hive table, configuring the hadoop.mapred.max.split.size
> parameter increases the parallelism of the scan stage and thereby reduces
> the running time. However, when a large table and a small table appear in
> the same query, a single global hadoop.mapred.max.split.size value makes
> some stages run a very large number of tasks while others run very few.
> Allowing hadoop.mapred.max.split.size to be set separately for each Hive
> table keeps the stages balanced.
[jira] [Created] (SPARK-45895) Combine multiple like to like all
Yuming Wang created SPARK-45895:
---
Summary: Combine multiple like to like all
Key: SPARK-45895
URL: https://issues.apache.org/jira/browse/SPARK-45895
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 4.0.0
Reporter: Yuming Wang

{code:scala}
spark.sql("create table t(a string, b string, c string) using parquet")
spark.sql(
  """
    |select * from t where
    |substr(a, 1, 5) like '%a%' and
    |substr(a, 1, 5) like '%b%'
    |""".stripMargin).explain(true)
{code}
We can optimize the query to:
{code:scala}
spark.sql(
  """
    |select * from t where
    |substr(a, 1, 5) like all('%a%', '%b%')
    |""".stripMargin).explain(true)
{code}
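The proposed rewrite folds a conjunction of LIKE predicates over the same input into one LIKE ALL. A plain-Python sketch of the semantics, with a substring test standing in for each `'%x%'` pattern (an assumption for illustration; Spark's LIKE matcher handles full wildcard patterns):

```python
# `input LIKE ALL (p1, p2, ...)` holds iff every pattern matches the input;
# each '%x%' pattern is modeled here as a plain substring test.
def like_all(value: str, patterns: list) -> bool:
    return all(p in value for p in patterns)

v = "xaby"[:5]  # substr(a, 1, 5)
# The conjunction of the two LIKEs agrees with the single LIKE ALL:
assert (("a" in v) and ("b" in v)) == like_all(v, ["a", "b"])
assert not like_all("xay", ["a", "b"])  # one failing pattern fails the whole
```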
[jira] [Created] (SPARK-45853) Add Iceberg and Hudi to third party projects
Yuming Wang created SPARK-45853:
---
Summary: Add Iceberg and Hudi to third party projects
Key: SPARK-45853
URL: https://issues.apache.org/jira/browse/SPARK-45853
Project: Spark
Issue Type: Improvement
Components: Documentation
Affects Versions: 4.0.0
Reporter: Yuming Wang

{noformat}
Error: org.apache.hive.service.cli.HiveSQLException: Error running query: java.util.concurrent.ExecutionException: org.apache.spark.SparkClassNotFoundException: [DATA_SOURCE_NOT_FOUND] Failed to find the data source: iceberg. Please find packages at `https://spark.apache.org/third-party-projects.html`.
	at org.apache.spark.sql.hive.thriftserver.HiveThriftServerErrors$.runningQueryError(HiveThriftServerErrors.scala:46)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:262)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$2(SparkExecuteStatementOperation.scala:166)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties(SparkOperation.scala:79)
	at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties$(SparkOperation.scala:63)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withLocalProperties(SparkExecuteStatementOperation.scala:41)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:166)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:161)
	at java.base/java.security.AccessController.doPrivileged(AccessController.java:712)
	at java.base/javax.security.auth.Subject.doAs(Subject.java:439)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2.run(SparkExecuteStatementOperation.scala:175)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:833)
{noformat}
[jira] [Updated] (SPARK-45848) spark-build-info.ps1 missing the docroot property
[ https://issues.apache.org/jira/browse/SPARK-45848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-45848:
    Description:
https://github.com/apache/spark/blob/master/build/spark-build-info.ps1#L38-L44
https://github.com/apache/spark/blob/master/build/spark-build-info#L30-L36

  was:
https://github.com/apache/spark/blob/master/build/spark-build-info.ps1#L38-L44

> spark-build-info.ps1 missing the docroot property
> -
>
> Key: SPARK-45848
> URL: https://issues.apache.org/jira/browse/SPARK-45848
> Project: Spark
> Issue Type: Bug
> Components: Build
> Affects Versions: 4.0.0
> Reporter: Yuming Wang
> Priority: Major
>
> https://github.com/apache/spark/blob/master/build/spark-build-info.ps1#L38-L44
> https://github.com/apache/spark/blob/master/build/spark-build-info#L30-L36
[jira] [Created] (SPARK-45848) spark-build-info.ps1 missing the docroot property
Yuming Wang created SPARK-45848:
---
Summary: spark-build-info.ps1 missing the docroot property
Key: SPARK-45848
URL: https://issues.apache.org/jira/browse/SPARK-45848
Project: Spark
Issue Type: Bug
Components: Build
Affects Versions: 4.0.0
Reporter: Yuming Wang

https://github.com/apache/spark/blob/master/build/spark-build-info.ps1#L38-L44
[jira] [Updated] (SPARK-45755) Push down limit through Dataset.isEmpty()
[ https://issues.apache.org/jira/browse/SPARK-45755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-45755:
    Description:
Push down LocalLimit can not optimize the case of distinct.
{code:scala}
def isEmpty: Boolean = withAction("isEmpty",
  withTypedPlan { LocalLimit(Literal(1), select().logicalPlan)
}.queryExecution) { plan =>
  plan.executeTake(1).isEmpty
}
{code}

> Push down limit through Dataset.isEmpty()
> -
>
> Key: SPARK-45755
> URL: https://issues.apache.org/jira/browse/SPARK-45755
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Yuming Wang
> Priority: Major
>
> Push down LocalLimit can not optimize the case of distinct.
> {code:scala}
> def isEmpty: Boolean = withAction("isEmpty",
>   withTypedPlan { LocalLimit(Literal(1), select().logicalPlan)
> }.queryExecution) { plan =>
>   plan.executeTake(1).isEmpty
> }
> {code}
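The quoted implementation answers emptiness via `executeTake(1)` over a limit-1 plan. A plain-Python sketch (lists standing in for datasets, an assumption for illustration) of why taking a single row before deduplicating gives the same isEmpty answer as deduplicating first:

```python
# Emptiness is preserved under deduplication: a collection is empty iff its
# distinct form is empty, so a limit of 1 applied first yields the same
# isEmpty answer without materializing the whole distinct result.
def is_empty_via_limit(rows: list) -> bool:
    return len(rows[:1]) == 0  # LocalLimit(1) followed by executeTake(1)

assert is_empty_via_limit([]) == (len(set([])) == 0)
assert is_empty_via_limit([1, 1, 2]) == (len(set([1, 1, 2])) == 0)
```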
[jira] [Created] (SPARK-45755) Push down limit through Dataset.isEmpty()
Yuming Wang created SPARK-45755: --- Summary: Push down limit through Dataset.isEmpty() Key: SPARK-45755 URL: https://issues.apache.org/jira/browse/SPARK-45755 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 4.0.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45658) Canonicalization of DynamicPruningSubquery is broken
[ https://issues.apache.org/jira/browse/SPARK-45658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-45658: Target Version/s: (was: 3.5.1) > Canonicalization of DynamicPruningSubquery is broken > > > Key: SPARK-45658 > URL: https://issues.apache.org/jira/browse/SPARK-45658 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0, 3.5.1 >Reporter: Asif >Priority: Major > > The canonicalization of buildKeys: Seq[Expression] in the class > DynamicPruningSubquery is broken, as the buildKeys are canonicalized just by > calling buildKeys.map(_.canonicalized). > This results in incorrect canonicalization because it does not normalize the > exprIds relative to the buildQuery output. > The fix is to use the output of buildQuery: LogicalPlan to normalize the > buildKeys expressions, using the standard approach: > buildKeys.map(QueryPlan.normalizeExpressions(_, buildQuery.output)) > Will be filing a PR and a regression test for the same. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45658) Canonicalization of DynamicPruningSubquery is broken
[ https://issues.apache.org/jira/browse/SPARK-45658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-45658: Affects Version/s: (was: 3.5.1) > Canonicalization of DynamicPruningSubquery is broken > > > Key: SPARK-45658 > URL: https://issues.apache.org/jira/browse/SPARK-45658 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Asif >Priority: Major > > The canonicalization of buildKeys: Seq[Expression] in the class > DynamicPruningSubquery is broken, as the buildKeys are canonicalized just by > calling buildKeys.map(_.canonicalized). > This results in incorrect canonicalization because it does not normalize the > exprIds relative to the buildQuery output. > The fix is to use the output of buildQuery: LogicalPlan to normalize the > buildKeys expressions, using the standard approach: > buildKeys.map(QueryPlan.normalizeExpressions(_, buildQuery.output)) > Will be filing a PR and a regression test for the same. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
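The fix described in the report can be sketched as follows (simplified: the real DynamicPruningSubquery carries more fields, and this is an assumption about the shape of the patch, not the merged code):

{code:scala}
import org.apache.spark.sql.catalyst.expressions.ExprId
import org.apache.spark.sql.catalyst.plans.QueryPlan

// Inside DynamicPruningSubquery: normalize each build key against the
// build-side plan's output, so that semantically equivalent subqueries
// canonicalize to the same form regardless of their exprIds.
override lazy val canonicalized: DynamicPruningSubquery = {
  copy(
    buildQuery = buildQuery.canonicalized,
    buildKeys = buildKeys.map(QueryPlan.normalizeExpressions(_, buildQuery.output)),
    exprId = ExprId(0))
}
{code}

QueryPlan.normalizeExpressions rewrites attribute references to positional form relative to the given output, which is the same approach used elsewhere for canonicalizing expressions that reference a child plan.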
[jira] [Commented] (SPARK-43851) Support LCA in grouping expressions
[ https://issues.apache.org/jira/browse/SPARK-43851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1591#comment-1591 ] Yuming Wang commented on SPARK-43851: - The resolution should be unresolved. > Support LCA in grouping expressions > --- > > Key: SPARK-43851 > URL: https://issues.apache.org/jira/browse/SPARK-43851 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Priority: Major > > Teradata supports it: > {code:sql} > create table t1(a int) using parquet; > select a + 1 as a1, a1 + 1 as a2 from t1 group by a1, a2; > {code} > {noformat} > [UNSUPPORTED_FEATURE.LATERAL_COLUMN_ALIAS_IN_GROUP_BY] The feature is not > supported: Referencing a lateral column alias via GROUP BY alias/ALL is not > supported yet. > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
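One way to read the requested feature: resolve each lateral column alias in GROUP BY by inlining its definition. A hedged sketch of the equivalent query that Spark already accepts today (assuming a SparkSession in scope as spark):

{code:scala}
// Inlining the lateral aliases a1 and a2 into the grouping expressions
// yields a query that is already valid without LCA support:
spark.sql("create table t1(a int) using parquet")
spark.sql("""
  select a + 1 as a1, (a + 1) + 1 as a2
  from t1
  group by a + 1, (a + 1) + 1
""").show()
{code}

Supporting LCA in grouping expressions would amount to the analyzer performing this inlining automatically before resolving GROUP BY.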
[jira] [Reopened] (SPARK-43851) Support LCA in grouping expressions
[ https://issues.apache.org/jira/browse/SPARK-43851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reopened SPARK-43851: - Assignee: (was: Jia Fan) > Support LCA in grouping expressions > --- > > Key: SPARK-43851 > URL: https://issues.apache.org/jira/browse/SPARK-43851 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Priority: Major > Fix For: 3.5.0 > > > Teradata supports it: > {code:sql} > create table t1(a int) using parquet; > select a + 1 as a1, a1 + 1 as a2 from t1 group by a1, a2; > {code} > {noformat} > [UNSUPPORTED_FEATURE.LATERAL_COLUMN_ALIAS_IN_GROUP_BY] The feature is not > supported: Referencing a lateral column alias via GROUP BY alias/ALL is not > supported yet. > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43851) Support LCA in grouping expressions
[ https://issues.apache.org/jira/browse/SPARK-43851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-43851: Fix Version/s: (was: 3.5.0) > Support LCA in grouping expressions > --- > > Key: SPARK-43851 > URL: https://issues.apache.org/jira/browse/SPARK-43851 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Priority: Major > > Teradata supports it: > {code:sql} > create table t1(a int) using parquet; > select a + 1 as a1, a1 + 1 as a2 from t1 group by a1, a2; > {code} > {noformat} > [UNSUPPORTED_FEATURE.LATERAL_COLUMN_ALIAS_IN_GROUP_BY] The feature is not > supported: Referencing a lateral column alias via GROUP BY alias/ALL is not > supported yet. > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45454) Set the table's default owner to current_user
[ https://issues.apache.org/jira/browse/SPARK-45454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-45454: Parent: (was: SPARK-30016) Issue Type: Improvement (was: Sub-task) > Set the table's default owner to current_user > - > > Key: SPARK-45454 > URL: https://issues.apache.org/jira/browse/SPARK-45454 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45454) Set the table's default owner to current_user
[ https://issues.apache.org/jira/browse/SPARK-45454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-45454: Summary: Set the table's default owner to current_user (was: Set owner of DS v2 table to CURRENT_USER if it is already set) > Set the table's default owner to current_user > - > > Key: SPARK-45454 > URL: https://issues.apache.org/jira/browse/SPARK-45454 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45454) Set owner of DS v2 table to CURRENT_USER if it is already set
Yuming Wang created SPARK-45454: --- Summary: Set owner of DS v2 table to CURRENT_USER if it is already set Key: SPARK-45454 URL: https://issues.apache.org/jira/browse/SPARK-45454 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45387) Partition key filter cannot be pushed down when using cast
[ https://issues.apache.org/jira/browse/SPARK-45387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-45387: Target Version/s: (was: 3.1.1, 3.3.0) > Partition key filter cannot be pushed down when using cast > -- > > Key: SPARK-45387 > URL: https://issues.apache.org/jira/browse/SPARK-45387 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1, 3.1.2, 3.3.0, 3.4.0 >Reporter: TianyiMa >Priority: Critical > > Suppose we have a partitioned table `table_pt` with partition column `dt`, > which is StringType, and the table metadata is managed by Hive Metastore. > If we filter partitions by dt = '123', the filter can be pushed down to the > data source, but if the filter value is a number, e.g. dt = 123, it cannot be > pushed down. This causes Spark to pull all of the table's partition metadata > to the client, which performs poorly if the table has thousands of partitions > and increases the risk of a Hive Metastore OOM. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
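The behavior described above can be demonstrated directly (a hedged sketch: the table and column names follow the report, and the exact cast inserted by type coercion may vary by Spark version):

{code:scala}
// dt is a string partition column. A string literal keeps the predicate in
// the `attribute = constant` shape that metastore partition pruning
// understands, so only matching partitions are fetched:
spark.sql("select * from table_pt where dt = '123'").show()

// A numeric literal makes type coercion wrap the column in a cast
// (conceptually cast(dt as <numeric>) = 123). The filter is then no longer
// a plain partition-key predicate, so all partition metadata is pulled to
// the client before filtering:
spark.sql("select * from table_pt where dt = 123").show()
{code}

The improvement requested here is to recognize the cast-over-partition-column pattern and still prune on the metastore side when the conversion is safe.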
[jira] [Created] (SPARK-45369) Push down limit through generate
Yuming Wang created SPARK-45369: --- Summary: Push down limit through generate Key: SPARK-45369 URL: https://issues.apache.org/jira/browse/SPARK-45369 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-45282) Join loses records for cached datasets
[ https://issues.apache.org/jira/browse/SPARK-45282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17768399#comment-17768399 ] Yuming Wang commented on SPARK-45282: - cc [~ulysses] [~cloud_fan] > Join loses records for cached datasets > -- > > Key: SPARK-45282 > URL: https://issues.apache.org/jira/browse/SPARK-45282 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1, 3.5.0 > Environment: Spark 3.4.1 on Apache Hadoop 3.3.6 or Kubernetes 1.26 or > Databricks 13.3 >Reporter: koert kuipers >Priority: Major > Labels: CorrectnessBug, correctness > > We observed this issue on Spark 3.4.1, but it is also present on 3.5.0. It is > not present on Spark 3.3.1. > It only shows up in a distributed environment; I cannot replicate it in a unit test. > However, I did get it to show up on a Hadoop cluster, on Kubernetes, and on > Databricks 13.3. > The issue is that records are dropped when two cached DataFrames are joined. > In the Spark 3.4.1 query plan some Exchanges are dropped as an > optimization, while in Spark 3.3.1 these Exchanges are still present. It seems > to be an issue with AQE when canChangeCachedPlanOutputPartitioning=true. > To reproduce on a distributed cluster, these settings are needed: > {code:java} > spark.sql.adaptive.advisoryPartitionSizeInBytes 33554432 > spark.sql.adaptive.coalescePartitions.parallelismFirst false > spark.sql.adaptive.enabled true > spark.sql.optimizer.canChangeCachedPlanOutputPartitioning true {code} > Code using Scala to reproduce: > {code:scala} > import java.util.UUID > import org.apache.spark.sql.functions.col > import spark.implicits._ > val data = (1 to 1000000).toDS().map(i => > UUID.randomUUID().toString).persist() > val left = data.map(k => (k, 1)) > val right = data.map(k => (k, k)) // if i change this to k => (k, 1) it works! 
> println("number of left " + left.count()) > println("number of right " + right.count()) > println("number of (left join right) " + > left.toDF("key", "value1").join(right.toDF("key", "value2"), "key").count() > ) > val left1 = left > .toDF("key", "value1") > .repartition(col("key")) // comment out this line to make it work > .persist() > println("number of left1 " + left1.count()) > val right1 = right > .toDF("key", "value2") > .repartition(col("key")) // comment out this line to make it work > .persist() > println("number of right1 " + right1.count()) > println("number of (left1 join right1) " + left1.join(right1, > "key").count()) // this gives incorrect result{code} > This produces the following output: > {code:java} > number of left 1000000 > number of right 1000000 > number of (left join right) 1000000 > number of left1 1000000 > number of right1 1000000 > number of (left1 join right1) 859531 {code} > Note that the last number (the incorrect one) actually varies depending on > settings, cluster size, etc. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43406) enable spark sql to drop multiple partitions in one call
[ https://issues.apache.org/jira/browse/SPARK-43406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-43406. - Resolution: Duplicate > enable spark sql to drop multiple partitions in one call > > > Key: SPARK-43406 > URL: https://issues.apache.org/jira/browse/SPARK-43406 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1, 3.3.2, 3.4.0 >Reporter: chenruotao >Priority: Major > > Currently Spark SQL cannot drop multiple partitions in one call; this patch fixes that. > With this patch we can drop multiple partitions like this: > alter table test.table_partition drop partition(dt<='2023-04-02', > dt>='2023-03-31') -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43406) enable spark sql to drop multiple partitions in one call
[ https://issues.apache.org/jira/browse/SPARK-43406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-43406: Target Version/s: (was: 4.0.0) > enable spark sql to drop multiple partitions in one call > > > Key: SPARK-43406 > URL: https://issues.apache.org/jira/browse/SPARK-43406 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1, 3.3.2, 3.4.0 >Reporter: chenruotao >Priority: Major > > Currently Spark SQL cannot drop multiple partitions in one call; this patch fixes that. > With this patch we can drop multiple partitions like this: > alter table test.table_partition drop partition(dt<='2023-04-02', > dt>='2023-03-31') -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43406) enable spark sql to drop multiple partitions in one call
[ https://issues.apache.org/jira/browse/SPARK-43406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-43406: Fix Version/s: (was: 3.5.0) > enable spark sql to drop multiple partitions in one call > > > Key: SPARK-43406 > URL: https://issues.apache.org/jira/browse/SPARK-43406 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1, 3.3.2, 3.4.0 >Reporter: chenruotao >Priority: Major > > Currently Spark SQL cannot drop multiple partitions in one call; this patch fixes that. > With this patch we can drop multiple partitions like this: > alter table test.table_partition drop partition(dt<='2023-04-02', > dt>='2023-03-31') -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43406) enable spark sql to drop multiple partitions in one call
[ https://issues.apache.org/jira/browse/SPARK-43406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-43406: Target Version/s: 4.0.0 > enable spark sql to drop multiple partitions in one call > > > Key: SPARK-43406 > URL: https://issues.apache.org/jira/browse/SPARK-43406 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1, 3.3.2, 3.4.0 >Reporter: chenruotao >Priority: Major > > Currently Spark SQL cannot drop multiple partitions in one call; this patch fixes that. > With this patch we can drop multiple partitions like this: > alter table test.table_partition drop partition(dt<='2023-04-02', > dt>='2023-03-31') -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45089) Remove obsolete repo of DB2 JDBC driver
[ https://issues.apache.org/jira/browse/SPARK-45089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-45089. - Fix Version/s: 4.0.0 Assignee: Cheng Pan Resolution: Fixed Issue resolved by pull request 42820 https://github.com/apache/spark/pull/42820 > Remove obsolete repo of DB2 JDBC driver > --- > > Key: SPARK-45089 > URL: https://issues.apache.org/jira/browse/SPARK-45089 > Project: Spark > Issue Type: Test > Components: Build, Tests >Affects Versions: 4.0.0 >Reporter: Cheng Pan >Assignee: Cheng Pan >Priority: Major > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45071) Optimize the processing speed of `BinaryArithmetic#dataType` when processing multi-column data
[ https://issues.apache.org/jira/browse/SPARK-45071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-45071: Fix Version/s: 3.5.1 (was: 3.5.0) > Optimize the processing speed of `BinaryArithmetic#dataType` when processing > multi-column data > -- > > Key: SPARK-45071 > URL: https://issues.apache.org/jira/browse/SPARK-45071 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0, 3.5.0 >Reporter: ming95 >Assignee: ming95 >Priority: Major > Fix For: 3.4.2, 4.0.0, 3.5.1 > > > Since `BinaryArithmetic#dataType` recursively processes the datatype of > each node, the driver will be very slow when many columns are processed. > For example, the following code: > {code:scala} > import org.apache.spark.sql.Row > import spark.implicits._ > import scala.util.Random > import org.apache.spark.sql.functions.{expr, sum} > import org.apache.spark.sql.types.{StructType, StructField, IntegerType} > val N = 30 > val M = 100 > val columns = Seq.fill(N)(Random.alphanumeric.take(8).mkString) > val data = Seq.fill(M)(Seq.fill(N)(Random.nextInt(16) - 5)) > val schema = StructType(columns.map(StructField(_, IntegerType))) > val rdd = spark.sparkContext.parallelize(data.map(Row.fromSeq(_))) > val df = spark.createDataFrame(rdd, schema) > val colExprs = columns.map(sum(_)) > // generate a new column by adding up the other 30 columns > df.withColumn("new_col_sum", expr(columns.mkString(" + "))) > {code} > > This code takes a few minutes for the driver to execute on Spark 3.4, but > only a few seconds on Spark 3.2. > Related issue: SPARK-39316 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45071) Optimize the processing speed of `BinaryArithmetic#dataType` when processing multi-column data
[ https://issues.apache.org/jira/browse/SPARK-45071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-45071. - Fix Version/s: 3.5.0 4.0.0 3.4.2 Resolution: Fixed Issue resolved by pull request 42804 [https://github.com/apache/spark/pull/42804] > Optimize the processing speed of `BinaryArithmetic#dataType` when processing > multi-column data > -- > > Key: SPARK-45071 > URL: https://issues.apache.org/jira/browse/SPARK-45071 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0, 3.5.0 >Reporter: ming95 >Assignee: ming95 >Priority: Major > Fix For: 3.5.0, 4.0.0, 3.4.2 > > > Since `BinaryArithmetic#dataType` recursively processes the datatype of > each node, the driver will be very slow when many columns are processed. > For example, the following code: > {code:scala} > import org.apache.spark.sql.Row > import spark.implicits._ > import scala.util.Random > import org.apache.spark.sql.functions.{expr, sum} > import org.apache.spark.sql.types.{StructType, StructField, IntegerType} > val N = 30 > val M = 100 > val columns = Seq.fill(N)(Random.alphanumeric.take(8).mkString) > val data = Seq.fill(M)(Seq.fill(N)(Random.nextInt(16) - 5)) > val schema = StructType(columns.map(StructField(_, IntegerType))) > val rdd = spark.sparkContext.parallelize(data.map(Row.fromSeq(_))) > val df = spark.createDataFrame(rdd, schema) > val colExprs = columns.map(sum(_)) > // generate a new column by adding up the other 30 columns > df.withColumn("new_col_sum", expr(columns.mkString(" + "))) > {code} > > This code takes a few minutes for the driver to execute on Spark 3.4, but > only a few seconds on Spark 3.2. > Related issue: SPARK-39316 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45071) Optimize the processing speed of `BinaryArithmetic#dataType` when processing multi-column data
[ https://issues.apache.org/jira/browse/SPARK-45071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-45071: --- Assignee: ming95 > Optimize the processing speed of `BinaryArithmetic#dataType` when processing > multi-column data > -- > > Key: SPARK-45071 > URL: https://issues.apache.org/jira/browse/SPARK-45071 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0, 3.5.0 >Reporter: ming95 >Assignee: ming95 >Priority: Major > > Since `BinaryArithmetic#dataType` recursively processes the datatype of > each node, the driver will be very slow when many columns are processed. > For example, the following code: > {code:scala} > import org.apache.spark.sql.Row > import spark.implicits._ > import scala.util.Random > import org.apache.spark.sql.functions.{expr, sum} > import org.apache.spark.sql.types.{StructType, StructField, IntegerType} > val N = 30 > val M = 100 > val columns = Seq.fill(N)(Random.alphanumeric.take(8).mkString) > val data = Seq.fill(M)(Seq.fill(N)(Random.nextInt(16) - 5)) > val schema = StructType(columns.map(StructField(_, IntegerType))) > val rdd = spark.sparkContext.parallelize(data.map(Row.fromSeq(_))) > val df = spark.createDataFrame(rdd, schema) > val colExprs = columns.map(sum(_)) > // generate a new column by adding up the other 30 columns > df.withColumn("new_col_sum", expr(columns.mkString(" + "))) > {code} > > This code takes a few minutes for the driver to execute on Spark 3.4, but > only a few seconds on Spark 3.2. > Related issue: SPARK-39316 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
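The slowdown described above can be modeled without Spark: if dataType is a def that recurses into its children, evaluating it at the root of a chain of N additions revisits each subtree repeatedly, so the total work grows quadratically; caching the result per node makes it linear. A toy sketch (plain Scala with hypothetical types, not Spark's actual classes):

{code:scala}
// Toy expression tree. In the slow version dataType would be a `def`
// recomputed on every call; `lazy val` evaluates it once per node, so a
// left-deep chain c1 + c2 + ... + cN is typed in O(N) instead of O(N^2).
sealed trait Expr { def dataType: String }

final case class Col(name: String, dataType: String) extends Expr

final case class Add(left: Expr, right: Expr) extends Expr {
  // Computed once per node and cached, instead of on every access.
  lazy val dataType: String = {
    require(left.dataType == right.dataType, "type mismatch")
    left.dataType
  }
}

val cols = (1 to 30).map(i => Col(s"c$i", "int"))
val chain = cols.reduceLeft[Expr](Add(_, _)) // c1 + c2 + ... + c30
assert(chain.dataType == "int")
{code}

The same reasoning applies to any recursively defined property of a deep expression tree, which is why memoizing BinaryArithmetic#dataType restores the Spark 3.2 behavior.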
[jira] [Updated] (SPARK-45020) org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'default' not found (state=08S01,code=0)
[ https://issues.apache.org/jira/browse/SPARK-45020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-45020: Fix Version/s: (was: 3.1.0) > org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database > 'default' not found (state=08S01,code=0) > - > > Key: SPARK-45020 > URL: https://issues.apache.org/jira/browse/SPARK-45020 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Sruthi Mooriyathvariam >Priority: Minor > > An alert fires when a Spark 3.1 cluster is created using a metastore shared > with Spark 2.4. The alert says the default database does not exist. This is > misleading, so we need to suppress it. > In the class SessionCatalog.scala, the method requireDbExists() does not > handle the case where the database is the default database. Handling that > case would suppress this misleading alert. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44846) PushFoldableIntoBranches in complex grouping expressions may cause bindReference error
[ https://issues.apache.org/jira/browse/SPARK-44846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-44846: --- Assignee: zhuml > PushFoldableIntoBranches in complex grouping expressions may cause > bindReference error > -- > > Key: SPARK-44846 > URL: https://issues.apache.org/jira/browse/SPARK-44846 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: zhuml >Assignee: zhuml >Priority: Major > > SQL: > {code:java} > select c*2 as d from > (select if(b > 1, 1, b) as c from > (select if(a < 0, 0 ,a) as b from t group by b) t1 > group by c) t2 {code} > ERROR: > {code:java} > Couldn't find _groupingexpression#15 in [if ((_groupingexpression#15 > 1)) 1 > else _groupingexpression#15#16] > java.lang.IllegalStateException: Couldn't find _groupingexpression#15 in [if > ((_groupingexpression#15 > 1)) 1 else _groupingexpression#15#16] > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:461) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:76) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:461) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:466) > at > org.apache.spark.sql.catalyst.trees.BinaryLike.mapChildren(TreeNode.scala:1241) > at > org.apache.spark.sql.catalyst.trees.BinaryLike.mapChildren$(TreeNode.scala:1240) > at > org.apache.spark.sql.catalyst.expressions.BinaryExpression.mapChildren(Expression.scala:653) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:466) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:466) > at > org.apache.spark.sql.catalyst.trees.TernaryLike.mapChildren(TreeNode.scala:1272) > at > org.apache.spark.sql.catalyst.trees.TernaryLike.mapChildren$(TreeNode.scala:1271) > at > org.apache.spark.sql.catalyst.expressions.If.mapChildren(conditionalExpressions.scala:41) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:466) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:466) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1215) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1214) > at > org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:533) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:466) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:437) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:405) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:73) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.$anonfun$bindReferences$1(BoundAttribute.scala:94) > at scala.collection.immutable.List.map(List.scala:293) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReferences(BoundAttribute.scala:94) > at > org.apache.spark.sql.execution.aggregate.HashAggregateExec.generateResultFunction(HashAggregateExec.scala:360) > at > org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduceWithKeys(HashAggregateExec.scala:538) > at > org.apache.spark.sql.execution.aggregate.AggregateCodegenSupport.doProduce(AggregateCodegenSupport.scala:69) > at > org.apache.spark.sql.execution.aggregate.AggregateCodegenSupport.doProduce$(AggregateCodegenSupport.scala:65) > at > 
org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduce(HashAggregateExec.scala:49) > at > org.apache.spark.sql.execution.CodegenSupport.$anonfun$produce$1(WholeStageCodegenExec.scala:97) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:246) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:243) > at > org.apache.spark.sql.execution.CodegenSupport.produce(WholeStageCodegenExec.scala:92) > at > org.apache.spark.sql.execution.CodegenSupport.produce$(WholeStageCodegenExec.scala:92) > at
[jira] [Resolved] (SPARK-44846) PushFoldableIntoBranches in complex grouping expressions may cause bindReference error
[ https://issues.apache.org/jira/browse/SPARK-44846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-44846. - Fix Version/s: 3.5.0 4.0.0 3.4.2 Resolution: Fixed Issue resolved by pull request 42633 [https://github.com/apache/spark/pull/42633] > PushFoldableIntoBranches in complex grouping expressions may cause > bindReference error > -- > > Key: SPARK-44846 > URL: https://issues.apache.org/jira/browse/SPARK-44846 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: zhuml >Assignee: zhuml >Priority: Major > Fix For: 3.5.0, 4.0.0, 3.4.2 > > > SQL: > {code:java} > select c*2 as d from > (select if(b > 1, 1, b) as c from > (select if(a < 0, 0 ,a) as b from t group by b) t1 > group by c) t2 {code} > ERROR: > {code:java} > Couldn't find _groupingexpression#15 in [if ((_groupingexpression#15 > 1)) 1 > else _groupingexpression#15#16] > java.lang.IllegalStateException: Couldn't find _groupingexpression#15 in [if > ((_groupingexpression#15 > 1)) 1 else _groupingexpression#15#16] > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:461) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:76) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:461) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:466) > at > org.apache.spark.sql.catalyst.trees.BinaryLike.mapChildren(TreeNode.scala:1241) > at > org.apache.spark.sql.catalyst.trees.BinaryLike.mapChildren$(TreeNode.scala:1240) > at > org.apache.spark.sql.catalyst.expressions.BinaryExpression.mapChildren(Expression.scala:653) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:466) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:466) > at > org.apache.spark.sql.catalyst.trees.TernaryLike.mapChildren(TreeNode.scala:1272) > at > org.apache.spark.sql.catalyst.trees.TernaryLike.mapChildren$(TreeNode.scala:1271) > at > org.apache.spark.sql.catalyst.expressions.If.mapChildren(conditionalExpressions.scala:41) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:466) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:466) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1215) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1214) > at > org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:533) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:466) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:437) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:405) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:73) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.$anonfun$bindReferences$1(BoundAttribute.scala:94) > at scala.collection.immutable.List.map(List.scala:293) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReferences(BoundAttribute.scala:94) > at > org.apache.spark.sql.execution.aggregate.HashAggregateExec.generateResultFunction(HashAggregateExec.scala:360) > at > org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduceWithKeys(HashAggregateExec.scala:538) > at > org.apache.spark.sql.execution.aggregate.AggregateCodegenSupport.doProduce(AggregateCodegenSupport.scala:69) > at > 
org.apache.spark.sql.execution.aggregate.AggregateCodegenSupport.doProduce$(AggregateCodegenSupport.scala:65) > at > org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduce(HashAggregateExec.scala:49) > at > org.apache.spark.sql.execution.CodegenSupport.$anonfun$produce$1(WholeStageCodegenExec.scala:97) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:246) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:243) >
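The bindReference failure above comes from PushFoldableIntoBranches rewriting the foldable multiplication into the If branches, after which the expression the aggregate tries to bind no longer matches the grouping attribute. A Spark-free toy model of what the rule itself does (tuple encoding and names are illustrative, not Catalyst internals):

```python
# Toy model of PushFoldableIntoBranches: a foldable binary operation applied
# to an If expression is pushed into both branches, so `if(p, 1, b) * 2`
# becomes `if(p, 1*2, b*2)`. Expressions are modeled as nested tuples.
def push_foldable(expr, k):
    """Push `* k` into the branches of an ("if", pred, left, right) tuple."""
    tag, pred, left, right = expr
    assert tag == "if"
    return (tag, pred, left * k, right * k)

# Mirrors `c*2 as d` over `if(b > 1, 1, b) as c` from the quoted SQL:
folded = push_foldable(("if", "b > 1", 1, 7), 2)
```

The bug is not in this rewrite itself but in applying it under an aggregate whose grouping expressions still reference the pre-rewrite form.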
[jira] [Updated] (SPARK-44892) Add official image Dockerfile for Spark 3.3.3
[ https://issues.apache.org/jira/browse/SPARK-44892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44892: Fix Version/s: (was: 4.0.0) > Add official image Dockerfile for Spark 3.3.3 > - > > Key: SPARK-44892 > URL: https://issues.apache.org/jira/browse/SPARK-44892 > Project: Spark > Issue Type: Sub-task > Components: Spark Docker >Affects Versions: 3.3.3 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44892) Add official image Dockerfile for Spark 3.3.3
[ https://issues.apache.org/jira/browse/SPARK-44892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-44892. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 54 [https://github.com/apache/spark-docker/pull/54] > Add official image Dockerfile for Spark 3.3.3 > - > > Key: SPARK-44892 > URL: https://issues.apache.org/jira/browse/SPARK-44892 > Project: Spark > Issue Type: Sub-task > Components: Spark Docker >Affects Versions: 3.3.3 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44892) Add official image Dockerfile for Spark 3.3.3
[ https://issues.apache.org/jira/browse/SPARK-44892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-44892: --- Assignee: Yuming Wang > Add official image Dockerfile for Spark 3.3.3 > - > > Key: SPARK-44892 > URL: https://issues.apache.org/jira/browse/SPARK-44892 > Project: Spark > Issue Type: Sub-task > Components: Spark Docker >Affects Versions: 3.3.3 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44892) Add official image Dockerfile for Spark 3.3.3
Yuming Wang created SPARK-44892: --- Summary: Add official image Dockerfile for Spark 3.3.3 Key: SPARK-44892 URL: https://issues.apache.org/jira/browse/SPARK-44892 Project: Spark Issue Type: Sub-task Components: Spark Docker Affects Versions: 3.3.3 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44813) The JIRA Python misses our assignee when it searches user again
[ https://issues.apache.org/jira/browse/SPARK-44813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44813: Fix Version/s: 3.3.4 (was: 3.3.3) > The JIRA Python misses our assignee when it searches user again > --- > > Key: SPARK-44813 > URL: https://issues.apache.org/jira/browse/SPARK-44813 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.4.2, 3.5.0, 4.0.0, 3.3.4 > > > {code:java} > >>> assignee = asf_jira.user("yao") > >>> "SPARK-44801"'SPARK-44801' > >>> asf_jira.assign_issue(issue.key, assignee.name) > response text = {"errorMessages":[],"errors":{"assignee":"User 'airhot' > cannot be assigned issues."}} {code} > > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
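The quoted failure shows `asf_jira.user("yao")` resolving to the wrong account (`airhot`), which then cannot be assigned issues. A defensive pattern, sketched here without the `jira` client itself (the candidate records below are hypothetical), is to prefer an exact username match over the first fuzzy search hit:

```python
# Pick an assignee from user-search results, preferring an exact username
# match; return None rather than silently taking the first fuzzy hit.
def pick_assignee(candidates, wanted_name):
    exact = [c for c in candidates if c.get("name") == wanted_name]
    return exact[0] if exact else None

# Hypothetical search results, in the shape a JIRA user search might return:
candidates = [
    {"name": "airhot", "displayName": "Some Other User"},
    {"name": "yao", "displayName": "Kent Yao"},
]
chosen = pick_assignee(candidates, "yao")
```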
[jira] [Updated] (SPARK-44857) Fix getBaseURI error in Spark Worker LogPage UI buttons
[ https://issues.apache.org/jira/browse/SPARK-44857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44857: Fix Version/s: 3.3.4 (was: 3.3.3) > Fix getBaseURI error in Spark Worker LogPage UI buttons > --- > > Key: SPARK-44857 > URL: https://issues.apache.org/jira/browse/SPARK-44857 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 3.2.0, 3.2.4, 3.3.2, 3.4.1, 3.5.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.4.2, 3.5.0, 4.0.0, 3.3.4 > > Attachments: Screenshot 2023-08-17 at 2.38.45 PM.png > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44880) Remove unnecessary curly braces at the end of the thread locks info
[ https://issues.apache.org/jira/browse/SPARK-44880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-44880: --- Assignee: Kent Yao > Remove unnecessary curly braces at the end of the thread locks info > --- > > Key: SPARK-44880 > URL: https://issues.apache.org/jira/browse/SPARK-44880 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.3.2, 3.4.1, 3.5.0, 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.5.0, 4.0.0 > > > Remove unnecessary curly braces at the end of the thread locks info -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44880) Remove unnecessary curly braces at the end of the thread locks info
[ https://issues.apache.org/jira/browse/SPARK-44880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-44880. - Fix Version/s: 3.5.0 4.0.0 Resolution: Fixed Issue resolved by pull request 42571 [https://github.com/apache/spark/pull/42571] > Remove unnecessary curly braces at the end of the thread locks info > --- > > Key: SPARK-44880 > URL: https://issues.apache.org/jira/browse/SPARK-44880 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.3.2, 3.4.1, 3.5.0, 4.0.0 >Reporter: Kent Yao >Priority: Major > Fix For: 3.5.0, 4.0.0 > > > Remove unnecessary curly braces at the end of the thread locks info -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44880) Remove unnecessary curly braces at the end of the thread locks info
[ https://issues.apache.org/jira/browse/SPARK-44880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44880: Fix Version/s: 3.5.1 (was: 3.5.0) > Remove unnecessary curly braces at the end of the thread locks info > --- > > Key: SPARK-44880 > URL: https://issues.apache.org/jira/browse/SPARK-44880 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.3.2, 3.4.1, 3.5.0, 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 4.0.0, 3.5.1 > > > Remove unnecessary curly braces at the end of the thread locks info -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44792) Upgrade curator to 5.2.0
[ https://issues.apache.org/jira/browse/SPARK-44792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-44792. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42474 [https://github.com/apache/spark/pull/42474] > Upgrade curator to 5.2.0 > > > Key: SPARK-44792 > URL: https://issues.apache.org/jira/browse/SPARK-44792 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 4.0.0 > > > https://issues.apache.org/jira/browse/HADOOP-17612 > https://issues.apache.org/jira/browse/HADOOP-18515 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44792) Upgrade curator to 5.2.0
[ https://issues.apache.org/jira/browse/SPARK-44792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-44792: --- Assignee: Yuming Wang > Upgrade curator to 5.2.0 > > > Key: SPARK-44792 > URL: https://issues.apache.org/jira/browse/SPARK-44792 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > > https://issues.apache.org/jira/browse/HADOOP-17612 > https://issues.apache.org/jira/browse/HADOOP-18515 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44792) Upgrade curator to 5.2.0
[ https://issues.apache.org/jira/browse/SPARK-44792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44792: Description: https://issues.apache.org/jira/browse/HADOOP-17612 https://issues.apache.org/jira/browse/HADOOP-18515 was:https://issues.apache.org/jira/browse/HADOOP-17612 > Upgrade curator to 5.2.0 > > > Key: SPARK-44792 > URL: https://issues.apache.org/jira/browse/SPARK-44792 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Priority: Major > > https://issues.apache.org/jira/browse/HADOOP-17612 > https://issues.apache.org/jira/browse/HADOOP-18515 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44792) Upgrade curator to 5.2.0
[ https://issues.apache.org/jira/browse/SPARK-44792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44792: Description: https://issues.apache.org/jira/browse/HADOOP-17612 > Upgrade curator to 5.2.0 > > > Key: SPARK-44792 > URL: https://issues.apache.org/jira/browse/SPARK-44792 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Priority: Major > > https://issues.apache.org/jira/browse/HADOOP-17612 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44792) Upgrade curator to 5.2.0
Yuming Wang created SPARK-44792: --- Summary: Upgrade curator to 5.2.0 Key: SPARK-44792 URL: https://issues.apache.org/jira/browse/SPARK-44792 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 4.0.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44700) Rule OptimizeCsvJsonExprs should not be applied to expression like from_json(regexp_replace)
[ https://issues.apache.org/jira/browse/SPARK-44700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44700: Fix Version/s: 3.3.0 > Rule OptimizeCsvJsonExprs should not be applied to expression like > from_json(regexp_replace) > > > Key: SPARK-44700 > URL: https://issues.apache.org/jira/browse/SPARK-44700 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1 >Reporter: jiahong.li >Priority: Minor > Fix For: 3.3.0 > > > _SQL_ like below: > select tmp.* > from > (select > device_id, ads_id, > from_json(regexp_replace(device_personas, '(?<=(\\{|,))"device_', > '"user_device_'), ${device_schema}) as tmp > from input ) > ${device_schema} includes more than 100 fields. > if Rule: OptimizeCsvJsonExprs been applied, the expression, regexp_replace, > will be invoked many times, that costs so much time. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44700) Rule OptimizeCsvJsonExprs should not be applied to expression like from_json(regexp_replace)
[ https://issues.apache.org/jira/browse/SPARK-44700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-44700. - Resolution: Fixed Please upgrade Spark to the latest version to fix this issue. > Rule OptimizeCsvJsonExprs should not be applied to expression like > from_json(regexp_replace) > > > Key: SPARK-44700 > URL: https://issues.apache.org/jira/browse/SPARK-44700 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1 >Reporter: jiahong.li >Priority: Minor > > _SQL_ like below: > select tmp.* > from > (select > device_id, ads_id, > from_json(regexp_replace(device_personas, '(?<=(\\{|,))"device_', > '"user_device_'), ${device_schema}) as tmp > from input ) > ${device_schema} includes more than 100 fields. > if Rule: OptimizeCsvJsonExprs been applied, the expression, regexp_replace, > will be invoked many times, that costs so much time. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44700) Rule OptimizeCsvJsonExprs should not be applied to expression like from_json(regexp_replace)
[ https://issues.apache.org/jira/browse/SPARK-44700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44700: Affects Version/s: 3.1.1 (was: 3.4.0) (was: 3.4.1) > Rule OptimizeCsvJsonExprs should not be applied to expression like > from_json(regexp_replace) > > > Key: SPARK-44700 > URL: https://issues.apache.org/jira/browse/SPARK-44700 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1 >Reporter: jiahong.li >Priority: Minor > > _SQL_ like below: > select tmp.* > from > (select > device_id, ads_id, > from_json(regexp_replace(device_personas, '(?<=(\\{|,))"device_', > '"user_device_'), ${device_schema}) as tmp > from input ) > ${device_schema} includes more than 100 fields. > if Rule: OptimizeCsvJsonExprs been applied, the expression, regexp_replace, > will be invoked many times, that costs so much time. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24087) Avoid shuffle when join keys are a super-set of bucket keys
[ https://issues.apache.org/jira/browse/SPARK-24087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752239#comment-17752239 ] Yuming Wang commented on SPARK-24087: - Fixed by SPARK-35703. > Avoid shuffle when join keys are a super-set of bucket keys > --- > > Key: SPARK-24087 > URL: https://issues.apache.org/jira/browse/SPARK-24087 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: yucai >Priority: Major > Labels: bulk-closed > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
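The condition that SPARK-35703 relaxes can be stated compactly: a bucketed scan's output partitioning can still satisfy the join's required distribution when the join keys cover the bucket keys. A minimal sketch of that check (the function name is illustrative, not a Spark API):

```python
def can_avoid_shuffle(join_keys, bucket_keys):
    # A bucketed table's partitioning still satisfies the join's required
    # distribution when the join keys are a super-set of the bucket keys,
    # so no exchange needs to be inserted for that side.
    return set(bucket_keys).issubset(join_keys)

# Join on (a, b) over a table bucketed by (a): no shuffle needed.
ok = can_avoid_shuffle(["a", "b"], ["a"])
```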
[jira] [Commented] (SPARK-44719) NoClassDefFoundError when using Hive UDF
[ https://issues.apache.org/jira/browse/SPARK-44719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752023#comment-17752023 ] Yuming Wang commented on SPARK-44719: - There are two ways to fix it: 1. Upgrade the built-in hive to 2.3.10 with the following patch. 2. Revert SPARK-43225. https://github.com/apache/hive/pull/4562 https://github.com/apache/hive/pull/4563 https://github.com/apache/hive/pull/4564 > NoClassDefFoundError when using Hive UDF > > > Key: SPARK-44719 > URL: https://issues.apache.org/jira/browse/SPARK-44719 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Priority: Major > Attachments: HiveUDFs-1.0-SNAPSHOT.jar > > > How to reproduce: > {noformat} > spark-sql (default)> add jar > /Users/yumwang/Downloads/HiveUDFs-1.0-SNAPSHOT.jar; > Time taken: 0.413 seconds > spark-sql (default)> CREATE TEMPORARY FUNCTION long_to_ip as > 'net.petrabarus.hiveudfs.LongToIP'; > Time taken: 0.038 seconds > spark-sql (default)> SELECT long_to_ip(2130706433L) FROM range(10); > 23/08/08 20:17:58 ERROR SparkSQLDriver: Failed in [SELECT > long_to_ip(2130706433L) FROM range(10)] > java.lang.NoClassDefFoundError: org/codehaus/jackson/map/type/TypeFactory > at org.apache.hadoop.hive.ql.udf.UDFJson.(UDFJson.java:64) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:348) > ... > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44719) NoClassDefFoundError when using Hive UDF
[ https://issues.apache.org/jira/browse/SPARK-44719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44719: Description: How to reproduce: {noformat} spark-sql (default)> add jar /Users/yumwang/Downloads/HiveUDFs-1.0-SNAPSHOT.jar; Time taken: 0.413 seconds spark-sql (default)> CREATE TEMPORARY FUNCTION long_to_ip as 'net.petrabarus.hiveudfs.LongToIP'; Time taken: 0.038 seconds spark-sql (default)> SELECT long_to_ip(2130706433L) FROM range(10); 23/08/08 20:17:58 ERROR SparkSQLDriver: Failed in [SELECT long_to_ip(2130706433L) FROM range(10)] java.lang.NoClassDefFoundError: org/codehaus/jackson/map/type/TypeFactory at org.apache.hadoop.hive.ql.udf.UDFJson.(UDFJson.java:64) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:348) ... {noformat} was: How to reproduce: ``` spark-sql (default)> add jar /Users/yumwang/Downloads/HiveUDFs-1.0-SNAPSHOT.jar; Time taken: 0.413 seconds spark-sql (default)> CREATE TEMPORARY FUNCTION long_to_ip as 'net.petrabarus.hiveudfs.LongToIP'; Time taken: 0.038 seconds spark-sql (default)> SELECT long_to_ip(2130706433L) FROM range(10); 23/08/08 20:17:58 ERROR SparkSQLDriver: Failed in [SELECT long_to_ip(2130706433L) FROM range(10)] java.lang.NoClassDefFoundError: org/codehaus/jackson/map/type/TypeFactory at org.apache.hadoop.hive.ql.udf.UDFJson.(UDFJson.java:64) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:348) ... 
``` > NoClassDefFoundError when using Hive UDF > > > Key: SPARK-44719 > URL: https://issues.apache.org/jira/browse/SPARK-44719 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Priority: Major > Attachments: HiveUDFs-1.0-SNAPSHOT.jar > > > How to reproduce: > {noformat} > spark-sql (default)> add jar > /Users/yumwang/Downloads/HiveUDFs-1.0-SNAPSHOT.jar; > Time taken: 0.413 seconds > spark-sql (default)> CREATE TEMPORARY FUNCTION long_to_ip as > 'net.petrabarus.hiveudfs.LongToIP'; > Time taken: 0.038 seconds > spark-sql (default)> SELECT long_to_ip(2130706433L) FROM range(10); > 23/08/08 20:17:58 ERROR SparkSQLDriver: Failed in [SELECT > long_to_ip(2130706433L) FROM range(10)] > java.lang.NoClassDefFoundError: org/codehaus/jackson/map/type/TypeFactory > at org.apache.hadoop.hive.ql.udf.UDFJson.(UDFJson.java:64) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:348) > ... > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44719) NoClassDefFoundError when using Hive UDF
[ https://issues.apache.org/jira/browse/SPARK-44719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44719: Attachment: HiveUDFs-1.0-SNAPSHOT.jar > NoClassDefFoundError when using Hive UDF > > > Key: SPARK-44719 > URL: https://issues.apache.org/jira/browse/SPARK-44719 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Priority: Major > Attachments: HiveUDFs-1.0-SNAPSHOT.jar > > > How to reproduce: > ``` > spark-sql (default)> add jar > /Users/yumwang/Downloads/HiveUDFs-1.0-SNAPSHOT.jar; > Time taken: 0.413 seconds > spark-sql (default)> CREATE TEMPORARY FUNCTION long_to_ip as > 'net.petrabarus.hiveudfs.LongToIP'; > Time taken: 0.038 seconds > spark-sql (default)> SELECT long_to_ip(2130706433L) FROM range(10); > 23/08/08 20:17:58 ERROR SparkSQLDriver: Failed in [SELECT > long_to_ip(2130706433L) FROM range(10)] > java.lang.NoClassDefFoundError: org/codehaus/jackson/map/type/TypeFactory > at org.apache.hadoop.hive.ql.udf.UDFJson.(UDFJson.java:64) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:348) > ... > ``` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44719) NoClassDefFoundError when using Hive UDF
Yuming Wang created SPARK-44719: --- Summary: NoClassDefFoundError when using Hive UDF Key: SPARK-44719 URL: https://issues.apache.org/jira/browse/SPARK-44719 Project: Spark Issue Type: Bug Components: Build, SQL Affects Versions: 3.5.0 Reporter: Yuming Wang Attachments: HiveUDFs-1.0-SNAPSHOT.jar How to reproduce: ``` spark-sql (default)> add jar /Users/yumwang/Downloads/HiveUDFs-1.0-SNAPSHOT.jar; Time taken: 0.413 seconds spark-sql (default)> CREATE TEMPORARY FUNCTION long_to_ip as 'net.petrabarus.hiveudfs.LongToIP'; Time taken: 0.038 seconds spark-sql (default)> SELECT long_to_ip(2130706433L) FROM range(10); 23/08/08 20:17:58 ERROR SparkSQLDriver: Failed in [SELECT long_to_ip(2130706433L) FROM range(10)] java.lang.NoClassDefFoundError: org/codehaus/jackson/map/type/TypeFactory at org.apache.hadoop.hive.ql.udf.UDFJson.(UDFJson.java:64) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:348) ... ``` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
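Since the error is raised at class-load time inside Hive's UDFJson, one quick diagnostic is to check whether any jar on the Spark classpath actually bundles the legacy `org/codehaus/jackson` classes. A small helper along those lines (a sketch, not part of Spark):

```python
import io
import zipfile

def jar_contains_class(jar_bytes, class_entry):
    """Return True if the jar (given as bytes) bundles the class file,
    e.g. 'org/codehaus/jackson/map/type/TypeFactory.class'."""
    with zipfile.ZipFile(io.BytesIO(jar_bytes)) as jar:
        return class_entry in jar.namelist()
```

In practice one would feed it each jar under `$SPARK_HOME/jars` via `open(path, "rb").read()` to see whether the class disappeared from the distribution.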
[jira] [Resolved] (SPARK-42500) ConstantPropagation support more cases
[ https://issues.apache.org/jira/browse/SPARK-42500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-42500. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42038 [https://github.com/apache/spark/pull/42038] > ConstantPropagation support more cases > -- > > Key: SPARK-42500 > URL: https://issues.apache.org/jira/browse/SPARK-42500 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Assignee: Tongwei >Priority: Major > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42500) ConstantPropagation support more cases
[ https://issues.apache.org/jira/browse/SPARK-42500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-42500: --- Assignee: Tongwei > ConstantPropagation support more cases > -- > > Key: SPARK-42500 > URL: https://issues.apache.org/jira/browse/SPARK-42500 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Assignee: Tongwei >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44662) SPIP: Improving performance of BroadcastHashJoin queries with stream side join key on non partition columns
[ https://issues.apache.org/jira/browse/SPARK-44662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44662: Target Version/s: (was: 3.3.3) > SPIP: Improving performance of BroadcastHashJoin queries with stream side > join key on non partition columns > --- > > Key: SPARK-44662 > URL: https://issues.apache.org/jira/browse/SPARK-44662 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.3 >Reporter: Asif >Priority: Major > > h2. *Q1. What are you trying to do? Articulate your objectives using > absolutely no jargon.* > On the lines of DPP which helps DataSourceV2 relations when the joining key > is a partition column, the same concept can be extended over to the case > where joining key is not a partition column. > The Keys of BroadcastHashJoin are already available before actual evaluation > of the stream iterator. These keys can be pushed down to the DataSource as a > SortedSet. > For non partition columns, the DataSources like iceberg have max/min stats on > column available at manifest level, and for formats like parquet , they have > max/min stats at various storage level. The passed SortedSet can be used to > prune using ranges at both driver level ( manifests files) as well as > executor level ( while actually going through chunks , row groups etc at > parquet level) > If the data is stored as Columnar Batch format , then it would not be > possible to filter out individual row at DataSource level, even though we > have keys. > But at the scan level, ( ColumnToRowExec) it is still possible to filter out > as many rows as possible , if the query involves nested joins. Thus reducing > the number of rows to join at the higher join levels. > Will be adding more details.. > h2. *Q2. What problem is this proposal NOT designed to solve?* > This can only help in BroadcastHashJoin's performance if the join is Inner or > Left Semi. 
> This will also not work if there are nodes like Expand, Generator , Aggregate > (without group by on keys not part of joining column etc) below the > BroadcastHashJoin node being targeted. > h2. *Q3. How is it done today, and what are the limits of current practice?* > Currently this sort of pruning at DataSource level is being done using DPP > (Dynamic Partition Pruning ) and IFF one of the join key column is a > Partitioning column ( so that cost of DPP query is justified and way less > than amount of data it will be filtering by skipping partitions). > The limitation is that DPP type approach is not implemented ( intentionally I > believe), if the join column is a non partition column ( because of cost of > "DPP type" query would most likely be way high as compared to any possible > pruning ( especially if the column is not stored in a sorted manner). > h2. *Q4. What is new in your approach and why do you think it will be > successful?* > 1) This allows pruning on non partition column based joins. > 2) Because it piggy backs on Broadcasted Keys, there is no extra cost of "DPP > type" query. > 3) The Data can be used by DataSource to prune at driver (possibly) and also > at executor level ( as in case of parquet which has max/min at various > structure levels) > 4) The big benefit should be seen in multilevel nested join queries. In the > current code base, if I am correct, only one join's pruning filter would get > pushed at scan level. Since it is on partition key may be that is sufficient. > But if it is a nested Join query , and may be involving different columns on > streaming side for join, each such filter push could do significant pruning. > This requires some handling in case of AQE, as the stream side iterator ( & > hence stage evaluation needs to be delayed, till all the available join > filters in the nested tree are pushed at their respective target > BatchScanExec). > h4. 
*Single Row Filtration* > 5) In case of nested broadcasted joins, if the datasource is column vector > oriented , then what spark would get is a ColumnarBatch. But because scans > have Filters from multiple joins, they can be retrieved and can be applied in > code generated at ColumnToRowExec level, using a new "containsKey" method on > HashedRelation. Thus only those rows which satisfy all the > BroadcastedHashJoins ( whose keys have been pushed) , will be used for join > evaluation. > The code is already there , will be opening a PR. For non partition table > TPCDS run on laptop with TPCDS data size of ( scale factor 4), I am seeing > 15% gain. > For partition table TPCDS, there is improvement in 4 - 5 queries to the tune > of 10% to 37%. > h2. *Q5. Who cares? If you are successful, what difference will it make?* > If use cases involve
[jira] [Updated] (SPARK-44662) SPIP: Improving performance of BroadcastHashJoin queries with stream side join key on non partition columns
[ https://issues.apache.org/jira/browse/SPARK-44662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44662: Fix Version/s: (was: 3.3.3) > SPIP: Improving performance of BroadcastHashJoin queries with stream side > join key on non partition columns > --- > > Key: SPARK-44662 > URL: https://issues.apache.org/jira/browse/SPARK-44662 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.3 >Reporter: Asif >Priority: Major > > h2. *Q1. What are you trying to do? Articulate your objectives using > absolutely no jargon.* > On the lines of DPP which helps DataSourceV2 relations when the joining key > is a partition column, the same concept can be extended over to the case > where joining key is not a partition column. > The Keys of BroadcastHashJoin are already available before actual evaluation > of the stream iterator. These keys can be pushed down to the DataSource as a > SortedSet. > For non partition columns, the DataSources like iceberg have max/min stats on > column available at manifest level, and for formats like parquet , they have > max/min stats at various storage level. The passed SortedSet can be used to > prune using ranges at both driver level ( manifests files) as well as > executor level ( while actually going through chunks , row groups etc at > parquet level) > If the data is stored as Columnar Batch format , then it would not be > possible to filter out individual row at DataSource level, even though we > have keys. > But at the scan level, ( ColumnToRowExec) it is still possible to filter out > as many rows as possible , if the query involves nested joins. Thus reducing > the number of rows to join at the higher join levels. > Will be adding more details.. > h2. *Q2. What problem is this proposal NOT designed to solve?* > This can only help in BroadcastHashJoin's performance if the join is Inner or > Left Semi. 
> This will also not work if there are nodes like Expand, Generator , Aggregate > (without group by on keys not part of joining column etc) below the > BroadcastHashJoin node being targeted. > h2. *Q3. How is it done today, and what are the limits of current practice?* > Currently this sort of pruning at DataSource level is being done using DPP > (Dynamic Partition Pruning ) and IFF one of the join key column is a > Partitioning column ( so that cost of DPP query is justified and way less > than amount of data it will be filtering by skipping partitions). > The limitation is that DPP type approach is not implemented ( intentionally I > believe), if the join column is a non partition column ( because of cost of > "DPP type" query would most likely be way high as compared to any possible > pruning ( especially if the column is not stored in a sorted manner). > h2. *Q4. What is new in your approach and why do you think it will be > successful?* > 1) This allows pruning on non partition column based joins. > 2) Because it piggy backs on Broadcasted Keys, there is no extra cost of "DPP > type" query. > 3) The Data can be used by DataSource to prune at driver (possibly) and also > at executor level ( as in case of parquet which has max/min at various > structure levels) > 4) The big benefit should be seen in multilevel nested join queries. In the > current code base, if I am correct, only one join's pruning filter would get > pushed at scan level. Since it is on partition key may be that is sufficient. > But if it is a nested Join query , and may be involving different columns on > streaming side for join, each such filter push could do significant pruning. > This requires some handling in case of AQE, as the stream side iterator ( & > hence stage evaluation needs to be delayed, till all the available join > filters in the nested tree are pushed at their respective target > BatchScanExec). > h4. 
*Single Row Filtration* > 5) In case of nested broadcast joins, if the data source is column-vector oriented, then what Spark would get is a ColumnarBatch. But because scans have filters from multiple joins, these can be retrieved and applied in the code generated at the ColumnToRowExec level, using a new "containsKey" method on HashedRelation. Thus only those rows which satisfy all the BroadcastHashJoins (whose keys have been pushed) will be used for join evaluation. > The code is already there; I will be opening a PR. For a non-partitioned-table TPCDS run on a laptop with TPCDS data at scale factor 4, I am seeing a 15% gain. > For partitioned-table TPCDS, there is improvement in 4-5 queries to the tune of 10% to 37%. > h2. *Q5. Who cares? If you are successful, what difference will it make?* > If use cases involve
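The range-pruning idea in Q1 of the SPIP above (push the broadcast join keys down as a SortedSet, then check them against per-row-group max/min stats) can be sketched in a few lines. Everything below is a hypothetical illustration, not Spark or Parquet code: the stats layout, the `prune_row_groups` helper, and the key values are made up for the example.

```python
import bisect

def prune_row_groups(row_groups, join_keys):
    """Keep only the row groups whose [min, max] range can contain
    at least one of the broadcast join keys (hypothetical helper)."""
    keys = sorted(join_keys)  # the SortedSet pushed down from the join
    kept = []
    for rg in row_groups:
        lo, hi = rg["min"], rg["max"]
        # Find the first key >= the group's min; the group survives only
        # if that key also falls at or below the group's max.
        i = bisect.bisect_left(keys, lo)
        if i < len(keys) and keys[i] <= hi:
            kept.append(rg)
    return kept

# Made-up per-row-group stats, as a Parquet footer might expose them.
row_groups = [
    {"name": "rg0", "min": 0, "max": 9},
    {"name": "rg1", "min": 10, "max": 19},
    {"name": "rg2", "min": 20, "max": 29},
]
print([rg["name"] for rg in prune_row_groups(row_groups, {12, 15, 27})])
```

With keys {12, 15, 27}, only the groups whose ranges overlap a key are scanned; the same overlap test applies at the manifest level on the driver or the row-group level on executors.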
[jira] [Assigned] (SPARK-44675) Increase ReservedCodeCacheSize for release build
[ https://issues.apache.org/jira/browse/SPARK-44675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-44675: --- Assignee: Yuming Wang > Increase ReservedCodeCacheSize for release build > > > Key: SPARK-44675 > URL: https://issues.apache.org/jira/browse/SPARK-44675 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44675) Increase ReservedCodeCacheSize for release build
[ https://issues.apache.org/jira/browse/SPARK-44675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-44675. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42344 [https://github.com/apache/spark/pull/42344] > Increase ReservedCodeCacheSize for release build > > > Key: SPARK-44675 > URL: https://issues.apache.org/jira/browse/SPARK-44675 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44675) Increase ReservedCodeCacheSize for release build
Yuming Wang created SPARK-44675: --- Summary: Increase ReservedCodeCacheSize for release build Key: SPARK-44675 URL: https://issues.apache.org/jira/browse/SPARK-44675 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 4.0.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44654) In subquery cannot perform partition pruning
[ https://issues.apache.org/jira/browse/SPARK-44654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17750739#comment-17750739 ] Yuming Wang commented on SPARK-44654: - Another way is to convert the join to a filter if the maximum number of rows on one side is 1: https://github.com/apache/spark/pull/42114 > In subquery cannot perform partition pruning > > > Key: SPARK-44654 > URL: https://issues.apache.org/jira/browse/SPARK-44654 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: 7mming7 >Priority: Minor > Labels: performance > Attachments: image-2023-08-03-17-22-53-981.png > > > The following SQL cannot perform partition pruning: > {code:java} > SELECT * FROM parquet_part WHERE id_type in (SELECT max(id_type) from > parquet_part){code} > As can be seen from the execution plan below, partition pruning on the left side cannot be performed after the IN subquery is converted into a join. > !image-2023-08-03-17-22-53-981.png! > This issue proposes to optimize InSubquery: the InSubquery should be converted into a Join only when the number of IN values exceeds a threshold. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
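The single-row case mentioned in the comment above (convert the join to a filter when one side returns at most one row) can be illustrated with a toy sketch. The partition metadata layout and the `partitions_to_scan` helper are hypothetical, not Spark code; the point is that once the subquery result is a known scalar, partition pruning reduces to a plain filter over partition values.

```python
def partitions_to_scan(partitions, scalar_value):
    """Once the IN-subquery is known to return at most one row, the
    semi-join degenerates into an equality filter over the partition
    column, so pruning can happen before any scan (hypothetical helper)."""
    return [p for p in partitions if p["id_type"] == scalar_value]

# Made-up partition metadata for a table partitioned by id_type.
parts = [
    {"id_type": "a", "path": "/parquet_part/id_type=a"},
    {"id_type": "b", "path": "/parquet_part/id_type=b"},
    {"id_type": "c", "path": "/parquet_part/id_type=c"},
]
# The subquery SELECT max(id_type) FROM parquet_part yields one scalar.
max_id_type = max(p["id_type"] for p in parts)
print(partitions_to_scan(parts, max_id_type))
```

Only one of the three partitions is scanned, whereas the join form in the execution plan above reads all of them.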
[jira] [Created] (SPARK-44651) Make do-release-docker.sh compatible with Mac m2
Yuming Wang created SPARK-44651: --- Summary: Make do-release-docker.sh compatible with Mac m2 Key: SPARK-44651 URL: https://issues.apache.org/jira/browse/SPARK-44651 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 4.0.0 Reporter: Yuming Wang How to test: {code:sh} dev/create-release/do-release-docker.sh -d /Users/yumwang/release-spark/output -s docs -n {code} Install python3-dev and build-essential: {code:sh} $APT_INSTALL python-is-python3 python3-pip python3-setuptools python3-dev build-essential {code} {noformat} Collecting grpcio==1.56.0 Downloading grpcio-1.56.0.tar.gz (24.3 MB) || 24.3 MB 6.7 MB/s ERROR: Command errored out with exit status 1: command: /usr/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-qmfpon02/grpcio/setup.py'"'"'; __file__='"'"'/tmp/pip-install-qmfpon02/grpcio/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-install-qmfpon02/grpcio/pip-egg-info cwd: /tmp/pip-install-qmfpon02/grpcio/ Complete output (11 lines): Traceback (most recent call last): File "", line 1, in File "/tmp/pip-install-qmfpon02/grpcio/setup.py", line 263, in if check_linker_need_libatomic(): File "/tmp/pip-install-qmfpon02/grpcio/setup.py", line 210, in check_linker_need_libatomic cpp_test = subprocess.Popen(cxx + ['-x', 'c++', '-std=c++14', '-'], File "/usr/lib/python3.8/subprocess.py", line 858, in __init__ self._execute_child(args, executable, preexec_fn, close_fds, File "/usr/lib/python3.8/subprocess.py", line 1704, in _execute_child raise child_exception_type(errno_num, err_msg, err_filename) FileNotFoundError: [Errno 2] No such file or directory: 'c++' ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output. ... Could not find . 
This could mean the following: * You're on Ubuntu and haven't run `apt-get install python3-dev`. * You're on RHEL/Fedora and haven't run `yum install python3-devel` or `dnf install python3-devel` (make sure you also have redhat-rpm-config installed) * You're on Mac OS X and the usual Python framework was somehow corrupted (check your environment variables or try re-installing?) * You're on Windows and your Python installation was somehow corrupted (check your environment variables or try re-installing?) {noformat} {noformat} #5 848.0 Successfully built grpcio future #5 848.0 Failed to build pyarrow #5 848.7 ERROR: Could not build wheels for pyarrow which use PEP 517 and cannot be installed directly {noformat} {noformat} root@c57ec74c8d32:/# $APT_INSTALL r-base r-base-dev Reading package lists... Done Building dependency tree Reading state information... Done Some packages could not be installed. This may mean that you have requested an impossible situation or if you are using the unstable distribution that some required packages have not yet been created or been moved out of Incoming. The following information may help to resolve the situation: The following packages have unmet dependencies: r-base : Depends: r-base-core (>= 4.3.1-3.2004.0) but it is not going to be installed Depends: r-recommended (= 4.3.1-3.2004.0) but it is not going to be installed r-base-dev : Depends: r-base-core (>= 4.3.1-3.2004.0) but it is not going to be installed E: Unable to correct problems, you have held broken packages. {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38506) Push partial aggregation through join
[ https://issues.apache.org/jira/browse/SPARK-38506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-38506: Description: Please see https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/SQL-Request-and-Transaction-Processing/Join-Planning-and-Optimization/Partial-GROUP-BY-Block-Optimization for more details. (was: Please see https://docs.teradata.com/r/Teradata-VantageTM-SQL-Request-and-Transaction-Processing/March-2019/Join-Planning-and-Optimization/Partial-GROUP-BY-Block-Optimization for more details.) > Push partial aggregation through join > - > > Key: SPARK-38506 > URL: https://issues.apache.org/jira/browse/SPARK-38506 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Priority: Major > > Please see > https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/SQL-Request-and-Transaction-Processing/Join-Planning-and-Optimization/Partial-GROUP-BY-Block-Optimization > for more details. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
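The Teradata-style partial GROUP BY pushdown referenced above can be illustrated with a toy sketch: pre-aggregating one side on the join key before the join shrinks the number of rows that reach the join. This is a hypothetical illustration in plain Python, not Spark's implementation.

```python
from collections import defaultdict

def partial_agg_then_join(fact_rows, dim):
    """Push a partial SUM below the join: aggregate the fact side by the
    join key first, so each key reaches the join once instead of once
    per input row (toy illustration of partial GROUP BY pushdown)."""
    partial = defaultdict(int)
    for key, amount in fact_rows:
        partial[key] += amount  # partial aggregation before the join
    return [(key, total, dim[key]) for key, total in partial.items()
            if key in dim]

sales = [("k1", 10), ("k1", 5), ("k2", 7)]  # several rows per key
colors = {"k1": "red", "k2": "blue"}        # small dimension side
print(sorted(partial_agg_then_join(sales, colors)))
```

Here three fact rows collapse to two partially aggregated rows before the join; a final aggregation above the join would then merge partial results, which is what makes the pushdown safe for decomposable aggregates like SUM.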
[jira] [Assigned] (SPARK-44562) Add OptimizeOneRowRelationSubquery in batch of Subquery
[ https://issues.apache.org/jira/browse/SPARK-44562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-44562: --- Assignee: Yuming Wang > Add OptimizeOneRowRelationSubquery in batch of Subquery > --- > > Key: SPARK-44562 > URL: https://issues.apache.org/jira/browse/SPARK-44562 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44562) Add OptimizeOneRowRelationSubquery in batch of Subquery
[ https://issues.apache.org/jira/browse/SPARK-44562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-44562. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42180 [https://github.com/apache/spark/pull/42180] > Add OptimizeOneRowRelationSubquery in batch of Subquery > --- > > Key: SPARK-44562 > URL: https://issues.apache.org/jira/browse/SPARK-44562 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44598) spark 3.2+ can not read hive table with hbase serde when hbase StorefileSize is 0
[ https://issues.apache.org/jira/browse/SPARK-44598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-44598. - Resolution: Not A Problem > spark 3.2+ can not read hive table with hbase serde when hbase StorefileSize > is 0 > -- > > Key: SPARK-44598 > URL: https://issues.apache.org/jira/browse/SPARK-44598 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.3 >Reporter: ming95 >Priority: Major > > We are using Spark to read a Hive table with an HBase serde. We found that when the HBase table data is relatively small (HBase StorefileSize is 0), the data read by Spark 3.2 or 3.5 is empty, and there is no error message. > But when reading with Spark 2.4 or Hive, the data can be read normally. Other reports show that Spark 3.1 can also read the data normally; can anyone provide some ideas? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-44598) spark 3.2+ can not read hive table with hbase serde when hbase StorefileSize is 0
[ https://issues.apache.org/jira/browse/SPARK-44598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reopened SPARK-44598: - > spark 3.2+ can not read hive table with hbase serde when hbase StorefileSize > is 0 > -- > > Key: SPARK-44598 > URL: https://issues.apache.org/jira/browse/SPARK-44598 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.3 >Reporter: ming95 >Priority: Major > > We are using Spark to read a Hive table with an HBase serde. We found that when the HBase table data is relatively small (HBase StorefileSize is 0), the data read by Spark 3.2 or 3.5 is empty, and there is no error message. > But when reading with Spark 2.4 or Hive, the data can be read normally. Other reports show that Spark 3.1 can also read the data normally; can anyone provide some ideas? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44598) spark 3.2+ can not read hive table with hbase serde when hbase StorefileSize is 0
[ https://issues.apache.org/jira/browse/SPARK-44598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17749418#comment-17749418 ] Yuming Wang commented on SPARK-44598: - How to reproduce this issue? > spark 3.2+ can not read hive table with hbase serde when hbase StorefileSize > is 0 > -- > > Key: SPARK-44598 > URL: https://issues.apache.org/jira/browse/SPARK-44598 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.3 >Reporter: ming95 >Priority: Major > > We are using Spark to read a Hive table with an HBase serde. We found that when the HBase table data is relatively small (HBase StorefileSize is 0), the data read by Spark 3.2 or 3.5 is empty, and there is no error message. > But when reading with Spark 2.4 or Hive, the data can be read normally. Other reports show that Spark 3.1 can also read the data normally; can anyone provide some ideas? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44454) HiveShim getTablesByType support fallback
[ https://issues.apache.org/jira/browse/SPARK-44454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-44454: --- Assignee: dzcxzl > HiveShim getTablesByType support fallback > - > > Key: SPARK-44454 > URL: https://issues.apache.org/jira/browse/SPARK-44454 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.1 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Minor > > When we use a high version of Hive Client to communicate with a low version > of Hive meta store, we may encounter Invalid method name: > 'get_tables_by_type'. > > {code:java} > 23/07/17 12:45:24,391 [main] DEBUG SparkSqlParser: Parsing command: show views > 23/07/17 12:45:24,489 [main] ERROR log: Got exception: > org.apache.thrift.TApplicationException Invalid method name: > 'get_tables_by_type' > org.apache.thrift.TApplicationException: Invalid method name: > 'get_tables_by_type' > at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79) > at > org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_tables_by_type(ThriftHiveMetastore.java:1433) > at > org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_tables_by_type(ThriftHiveMetastore.java:1418) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTables(HiveMetaStoreClient.java:1411) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:173) > at com.sun.proxy.$Proxy23.getTables(Unknown Source) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient$SynchronizedHandler.invoke(HiveMetaStoreClient.java:2344) > at com.sun.proxy.$Proxy23.getTables(Unknown Source) > at org.apache.hadoop.hive.ql.metadata.Hive.getTablesByType(Hive.java:1427) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.sql.hive.client.Shim_v2_3.getTablesByType(HiveShim.scala:1408) > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$listTablesByType$1(HiveClientImpl.scala:789) > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:294) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:225) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:224) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:274) > at > org.apache.spark.sql.hive.client.HiveClientImpl.listTablesByType(HiveClientImpl.scala:785) > at > org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$listViews$1(HiveExternalCatalog.scala:895) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:108) > at > org.apache.spark.sql.hive.HiveExternalCatalog.listViews(HiveExternalCatalog.scala:893) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listViews(ExternalCatalogWithListener.scala:158) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.listViews(SessionCatalog.scala:1040) > at > org.apache.spark.sql.execution.command.ShowViewsCommand.$anonfun$run$5(views.scala:407) > at 
scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.execution.command.ShowViewsCommand.run(views.scala:407) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44454) HiveShim getTablesByType support fallback
[ https://issues.apache.org/jira/browse/SPARK-44454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-44454. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42033 [https://github.com/apache/spark/pull/42033] > HiveShim getTablesByType support fallback > - > > Key: SPARK-44454 > URL: https://issues.apache.org/jira/browse/SPARK-44454 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.1 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Minor > Fix For: 4.0.0 > > > When we use a high version of Hive Client to communicate with a low version > of Hive meta store, we may encounter Invalid method name: > 'get_tables_by_type'. > > {code:java} > 23/07/17 12:45:24,391 [main] DEBUG SparkSqlParser: Parsing command: show views > 23/07/17 12:45:24,489 [main] ERROR log: Got exception: > org.apache.thrift.TApplicationException Invalid method name: > 'get_tables_by_type' > org.apache.thrift.TApplicationException: Invalid method name: > 'get_tables_by_type' > at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79) > at > org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_tables_by_type(ThriftHiveMetastore.java:1433) > at > org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_tables_by_type(ThriftHiveMetastore.java:1418) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTables(HiveMetaStoreClient.java:1411) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:173) > at com.sun.proxy.$Proxy23.getTables(Unknown Source) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient$SynchronizedHandler.invoke(HiveMetaStoreClient.java:2344) > at com.sun.proxy.$Proxy23.getTables(Unknown Source) > at org.apache.hadoop.hive.ql.metadata.Hive.getTablesByType(Hive.java:1427) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.sql.hive.client.Shim_v2_3.getTablesByType(HiveShim.scala:1408) > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$listTablesByType$1(HiveClientImpl.scala:789) > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:294) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:225) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:224) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:274) > at > org.apache.spark.sql.hive.client.HiveClientImpl.listTablesByType(HiveClientImpl.scala:785) > at > org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$listViews$1(HiveExternalCatalog.scala:895) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:108) > at > org.apache.spark.sql.hive.HiveExternalCatalog.listViews(HiveExternalCatalog.scala:893) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listViews(ExternalCatalogWithListener.scala:158) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.listViews(SessionCatalog.scala:1040) > at > 
org.apache.spark.sql.execution.command.ShowViewsCommand.$anonfun$run$5(views.scala:407) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.execution.command.ShowViewsCommand.run(views.scala:407) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
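The fallback pattern described in SPARK-44454 (try the newer Thrift call first; on "Invalid method name" fall back to listing everything and filtering client-side) can be sketched as follows. The `client` object and its methods are a made-up stand-in for a metastore handle, not the real Hive API:

```python
def list_tables_by_type(client, db, table_type):
    """Try the newer metastore call first; if an old metastore reports
    'Invalid method name', list all tables and filter on this side.
    `client` and its methods are hypothetical, not the real Hive API."""
    try:
        return client.get_tables_by_type(db, table_type)
    except Exception as e:
        if "Invalid method name" not in str(e):
            raise  # a genuine failure, not the missing-method case
        return [t for t in client.get_all_tables(db)
                if client.get_table_type(db, t) == table_type]

class OldMetastore:
    """Fake client mimicking a metastore that predates get_tables_by_type."""
    def get_tables_by_type(self, db, table_type):
        raise RuntimeError("Invalid method name: 'get_tables_by_type'")
    def get_all_tables(self, db):
        return ["v1", "t1"]
    def get_table_type(self, db, name):
        return "VIRTUAL_VIEW" if name.startswith("v") else "MANAGED_TABLE"

print(list_tables_by_type(OldMetastore(), "default", "VIRTUAL_VIEW"))
```

The fallback path is slower (it fetches every table's type), but it keeps `SHOW VIEWS` working against old metastores instead of failing outright.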
[jira] [Updated] (SPARK-44513) Upgrade snappy-java to 1.1.10.3
[ https://issues.apache.org/jira/browse/SPARK-44513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44513: Fix Version/s: 3.4.2 > Upgrade snappy-java to 1.1.10.3 > --- > > Key: SPARK-44513 > URL: https://issues.apache.org/jira/browse/SPARK-44513 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.1 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Trivial > Fix For: 3.4.2, 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org