[jira] [Created] (SPARK-38326) aditya
Vallepu Durga Aditya created SPARK-38326: Summary: aditya Key: SPARK-38326 URL: https://issues.apache.org/jira/browse/SPARK-38326 Project: Spark Issue Type: New Feature Components: PySpark Affects Versions: 3.2.1 Reporter: Vallepu Durga Aditya Fix For: 3.2.1 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38316) Fix SQLViewSuite/TriggerAvailableNowSuite/UnwrapCastInBinaryComparisonSuite/UnwrapCastInComparisonEndToEndSuite under ANSI mode
[ https://issues.apache.org/jira/browse/SPARK-38316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-38316. Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35652 [https://github.com/apache/spark/pull/35652] > Fix > SQLViewSuite/TriggerAvailableNowSuite/UnwrapCastInBinaryComparisonSuite/UnwrapCastInComparisonEndToEndSuite > under ANSI mode > --- > > Key: SPARK-38316 > URL: https://issues.apache.org/jira/browse/SPARK-38316 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.3.0 > >
[jira] [Resolved] (SPARK-38322) Support query stage show runtime statistics in formatted explain mode
[ https://issues.apache.org/jira/browse/SPARK-38322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-38322. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35658 [https://github.com/apache/spark/pull/35658] > Support query stage show runtime statistics in formatted explain mode > - > > Key: SPARK-38322 > URL: https://issues.apache.org/jira/browse/SPARK-38322 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: XiDuo You >Priority: Major > Fix For: 3.3.0 > > > The formatted explain mode is a powerful explain mode that shows the details > of the query plan. In AQE, a query stage knows its statistics once it has > materialized. So it can help to quickly check the conversion of the plan, e.g. join > selection. > A simple example: > {code:java} > SELECT * FROM t JOIN t2 ON t.c = t2.c;{code} > > {code:java} > == Physical Plan == > AdaptiveSparkPlan (21) > +- == Final Plan == >* SortMergeJoin Inner (13) >:- * Sort (6) >: +- AQEShuffleRead (5) >: +- ShuffleQueryStage (4), Statistics(sizeInBytes=16.0 B, rowCount=1) >:+- Exchange (3) >: +- * Filter (2) >: +- Scan hive default.t (1) >+- * Sort (12) > +- AQEShuffleRead (11) > +- ShuffleQueryStage (10), Statistics(sizeInBytes=16.0 B, rowCount=1) > +- Exchange (9) >+- * Filter (8) > +- Scan hive default.t2 (7) > +- == Initial Plan == >SortMergeJoin Inner (20) >:- Sort (16) >: +- Exchange (15) >: +- Filter (14) >:+- Scan hive default.t (1) >+- Sort (19) > +- Exchange (18) > +- Filter (17) > +- Scan hive default.t2 (7){code} > >
[jira] [Assigned] (SPARK-38322) Support query stage show runtime statistics in formatted explain mode
[ https://issues.apache.org/jira/browse/SPARK-38322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-38322: --- Assignee: XiDuo You > Support query stage show runtime statistics in formatted explain mode > - > > Key: SPARK-38322 > URL: https://issues.apache.org/jira/browse/SPARK-38322 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Fix For: 3.3.0 > > > The formatted explain mode is a powerful explain mode that shows the details > of the query plan. In AQE, a query stage knows its statistics once it has > materialized. So it can help to quickly check the conversion of the plan, e.g. join > selection. > A simple example: > {code:java} > SELECT * FROM t JOIN t2 ON t.c = t2.c;{code} > > {code:java} > == Physical Plan == > AdaptiveSparkPlan (21) > +- == Final Plan == >* SortMergeJoin Inner (13) >:- * Sort (6) >: +- AQEShuffleRead (5) >: +- ShuffleQueryStage (4), Statistics(sizeInBytes=16.0 B, rowCount=1) >:+- Exchange (3) >: +- * Filter (2) >: +- Scan hive default.t (1) >+- * Sort (12) > +- AQEShuffleRead (11) > +- ShuffleQueryStage (10), Statistics(sizeInBytes=16.0 B, rowCount=1) > +- Exchange (9) >+- * Filter (8) > +- Scan hive default.t2 (7) > +- == Initial Plan == >SortMergeJoin Inner (20) >:- Sort (16) >: +- Exchange (15) >: +- Filter (14) >:+- Scan hive default.t (1) >+- Sort (19) > +- Exchange (18) > +- Filter (17) > +- Scan hive default.t2 (7){code} > >
[jira] [Resolved] (SPARK-38317) Encoding of java.time.Period always results in "INTERVAL '0-0' YEAR TO MONTH"
[ https://issues.apache.org/jira/browse/SPARK-38317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-38317. -- Resolution: Not A Problem > Encoding of java.time.Period always results in "INTERVAL '0-0' YEAR TO MONTH" > - > > Key: SPARK-38317 > URL: https://issues.apache.org/jira/browse/SPARK-38317 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1 >Reporter: Jolan Rensen >Priority: Major > > {code} > val dates = Seq( > Period.ZERO, > Period.ofWeeks(2), > ).toDS() > dates.show(false) > {code} > Results in: > {code} > +----------------------------+ > |value                       | > +----------------------------+ > |INTERVAL '0-0' YEAR TO MONTH| > |INTERVAL '0-0' YEAR TO MONTH| > +----------------------------+ > {code}
[jira] [Commented] (SPARK-38317) Encoding of java.time.Period always results in "INTERVAL '0-0' YEAR TO MONTH"
[ https://issues.apache.org/jira/browse/SPARK-38317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497926#comment-17497926 ] Max Gekk commented on SPARK-38317: -- This is the expected behavior: Spark truncates java.time.Period to months. > Encoding of java.time.Period always results in "INTERVAL '0-0' YEAR TO MONTH" > - > > Key: SPARK-38317 > URL: https://issues.apache.org/jira/browse/SPARK-38317 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1 >Reporter: Jolan Rensen >Priority: Major > > {code} > val dates = Seq( > Period.ZERO, > Period.ofWeeks(2), > ).toDS() > dates.show(false) > {code} > Results in: > {code} > +----------------------------+ > |value                       | > +----------------------------+ > |INTERVAL '0-0' YEAR TO MONTH| > |INTERVAL '0-0' YEAR TO MONTH| > +----------------------------+ > {code}
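Max Gekk's point can be checked without Spark at all: `java.time.Period` stores weeks as days, and a year-to-month interval keeps only the years and months components, so the 14 days of `Period.ofWeeks(2)` are dropped. A minimal plain-Java sketch (the class name is ours, not Spark's):

```java
import java.time.Period;

public class PeriodTruncation {
    public static void main(String[] args) {
        // Period.ofWeeks(2) is stored as 0 years, 0 months, 14 days.
        Period p = Period.ofWeeks(2);
        System.out.println(p.getYears());   // 0
        System.out.println(p.getMonths());  // 0
        System.out.println(p.getDays());    // 14

        // A YEAR TO MONTH interval only carries total months; the days
        // component is lost, which is why both rows print as '0-0'.
        System.out.println(p.toTotalMonths()); // 0
    }
}
```

To keep the days, the value would need a day-time interval (i.e. a `java.time.Duration`-backed column) rather than a year-month one.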
[jira] [Assigned] (SPARK-38189) Support priority scheduling (Introduce priorityClass) with volcano implementations
[ https://issues.apache.org/jira/browse/SPARK-38189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38189: Assignee: Apache Spark > Support priority scheduling (Introduce priorityClass) with volcano > implementations > -- > > Key: SPARK-38189 > URL: https://issues.apache.org/jira/browse/SPARK-38189 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Yikun Jiang >Assignee: Apache Spark >Priority: Major >
[jira] [Commented] (SPARK-38189) Support priority scheduling (Introduce priorityClass) with volcano implementations
[ https://issues.apache.org/jira/browse/SPARK-38189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497924#comment-17497924 ] Apache Spark commented on SPARK-38189: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/35639 > Support priority scheduling (Introduce priorityClass) with volcano > implementations > -- > > Key: SPARK-38189 > URL: https://issues.apache.org/jira/browse/SPARK-38189 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Yikun Jiang >Priority: Major >
[jira] [Assigned] (SPARK-38189) Support priority scheduling (Introduce priorityClass) with volcano implementations
[ https://issues.apache.org/jira/browse/SPARK-38189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38189: Assignee: (was: Apache Spark) > Support priority scheduling (Introduce priorityClass) with volcano > implementations > -- > > Key: SPARK-38189 > URL: https://issues.apache.org/jira/browse/SPARK-38189 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Yikun Jiang >Priority: Major >
[jira] [Updated] (SPARK-38323) Support the hidden file metadata in Streaming
[ https://issues.apache.org/jira/browse/SPARK-38323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yaohua Zhao updated SPARK-38323: Description: Currently, querying the hidden file metadata struct `_metadata` will fail with `readStream`, `writeStream` APIs. {code:java} spark .readStream ... .select("_metadata") .writeStream ... .start(){code} Need to expose the metadata output to `StreamingRelation` as well. > Support the hidden file metadata in Streaming > - > > Key: SPARK-38323 > URL: https://issues.apache.org/jira/browse/SPARK-38323 > Project: Spark > Issue Type: Improvement > Components: SQL, Structured Streaming >Affects Versions: 3.2.1 >Reporter: Yaohua Zhao >Priority: Major > > Currently, querying the hidden file metadata struct `_metadata` will fail > with `readStream`, `writeStream` APIs. > {code:java} > spark > .readStream > ... > .select("_metadata") > .writeStream > ... > .start(){code} > Need to expose the metadata output to `StreamingRelation` as well.
[jira] [Commented] (SPARK-38325) ANSI mode: avoid potential runtime error in HashJoin.extractKeyExprAt()
[ https://issues.apache.org/jira/browse/SPARK-38325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497919#comment-17497919 ] Apache Spark commented on SPARK-38325: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/35659 > ANSI mode: avoid potential runtime error in HashJoin.extractKeyExprAt() > > > Key: SPARK-38325 > URL: https://issues.apache.org/jira/browse/SPARK-38325 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0, 3.2.2 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > SubqueryBroadcastExec retrieves the partition key from the broadcast results > based on the type of HashedRelation returned. If the key is packed inside a > Long, we extract it through bitwise operations and cast it as Byte/Short/Int > if necessary. > The casting here can cause a potential runtime error. We should fix it.
[jira] [Assigned] (SPARK-38325) ANSI mode: avoid potential runtime error in HashJoin.extractKeyExprAt()
[ https://issues.apache.org/jira/browse/SPARK-38325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38325: Assignee: Gengliang Wang (was: Apache Spark) > ANSI mode: avoid potential runtime error in HashJoin.extractKeyExprAt() > > > Key: SPARK-38325 > URL: https://issues.apache.org/jira/browse/SPARK-38325 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0, 3.2.2 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > SubqueryBroadcastExec retrieves the partition key from the broadcast results > based on the type of HashedRelation returned. If the key is packed inside a > Long, we extract it through bitwise operations and cast it as Byte/Short/Int > if necessary. > The casting here can cause a potential runtime error. We should fix it.
[jira] [Assigned] (SPARK-38325) ANSI mode: avoid potential runtime error in HashJoin.extractKeyExprAt()
[ https://issues.apache.org/jira/browse/SPARK-38325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38325: Assignee: Apache Spark (was: Gengliang Wang) > ANSI mode: avoid potential runtime error in HashJoin.extractKeyExprAt() > > > Key: SPARK-38325 > URL: https://issues.apache.org/jira/browse/SPARK-38325 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0, 3.2.2 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > > SubqueryBroadcastExec retrieves the partition key from the broadcast results > based on the type of HashedRelation returned. If the key is packed inside a > Long, we extract it through bitwise operations and cast it as Byte/Short/Int > if necessary. > The casting here can cause a potential runtime error. We should fix it.
[jira] [Created] (SPARK-38325) ANSI mode: avoid potential runtime error in HashJoin.extractKeyExprAt()
Gengliang Wang created SPARK-38325: -- Summary: ANSI mode: avoid potential runtime error in HashJoin.extractKeyExprAt() Key: SPARK-38325 URL: https://issues.apache.org/jira/browse/SPARK-38325 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0, 3.2.2 Reporter: Gengliang Wang Assignee: Gengliang Wang SubqueryBroadcastExec retrieves the partition key from the broadcast results based on the type of HashedRelation returned. If the key is packed inside a Long, we extract it through bitwise operations and cast it as Byte/Short/Int if necessary. The casting here can cause a potential runtime error. We should fix it.
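The risk described above can be illustrated in plain Java. This is a hypothetical sketch, not Spark's actual packing scheme: extracting a key packed into a long with shifts and masks is lossless, but a narrowing cast on the extracted value silently wraps instead of failing, which is exactly what ANSI mode must guard against.

```java
public class PackedKeyExtract {
    public static void main(String[] args) {
        // Hypothetical layout: two 32-bit keys packed into one long.
        long packed = (300L << 32) | 7L;

        // Bitwise extraction recovers both values exactly.
        int high = (int) (packed >>> 32);        // 300
        int low  = (int) (packed & 0xFFFFFFFFL); // 7

        // Narrowing the extracted value to byte silently wraps:
        // 300 does not fit in [-128, 127], so (byte) 300 == 44.
        byte narrowed = (byte) high;
        System.out.println(high);      // 300
        System.out.println(narrowed);  // 44
    }
}
```

Under ANSI semantics a cast like this should raise an error (or be proven unnecessary) rather than corrupt the join key, which is what the fix addresses.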
[jira] [Commented] (SPARK-38324) The second range is not [0, 59] in the day time ANSI interval
[ https://issues.apache.org/jira/browse/SPARK-38324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497905#comment-17497905 ] Hyukjin Kwon commented on SPARK-38324: -- cc [~Gengliang.Wang] FYI > The second range is not [0, 59] in the day time ANSI interval > - > > Key: SPARK-38324 > URL: https://issues.apache.org/jira/browse/SPARK-38324 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 3.3.0 > Environment: Spark 3.3.0 snapshot >Reporter: chong >Priority: Major > > [https://spark.apache.org/docs/latest/sql-ref-datatypes.html] > * SECOND, seconds within minutes and possibly fractions of a second > [0..59.99] > The doc shows SECOND is seconds within minutes, so its range should be [0, 59]. > > But testing shows 99 seconds is valid: > {{>>> spark.sql("select INTERVAL '10 01:01:99' DAY TO SECOND")}} > {{DataFrame[INTERVAL '10 01:02:39' DAY TO SECOND: interval day to second]}}
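The output in the report is consistent with carry-based normalization rather than range validation: the 99 seconds overflow the [0, 59] range and carry into the minutes field. The arithmetic below is a hypothetical re-creation of that normalization (not Spark's parser code) that reproduces the observed '10 01:02:39'.

```java
public class IntervalCarry {
    public static void main(String[] args) {
        // The literal '10 01:01:99' DAY TO SECOND, field by field.
        int days = 10, hours = 1, minutes = 1, seconds = 99;

        // Collapse to total seconds, then redistribute with carries.
        long total = ((days * 24L + hours) * 60 + minutes) * 60 + seconds;
        long normSeconds = total % 60; total /= 60;  // 39
        long normMinutes = total % 60; total /= 60;  // 2 (1 + carry)
        long normHours   = total % 24; total /= 24;  // 1
        long normDays    = total;                    // 10

        // Prints "10 01:02:39", matching the DataFrame column name.
        System.out.printf("%d %02d:%02d:%02d%n",
                normDays, normHours, normMinutes, normSeconds);
    }
}
```

Whether the parser should instead reject out-of-range fields is exactly the question the ticket raises.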
[jira] [Commented] (SPARK-38317) Encoding of java.time.Period always results in "INTERVAL '0-0' YEAR TO MONTH"
[ https://issues.apache.org/jira/browse/SPARK-38317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497904#comment-17497904 ] Hyukjin Kwon commented on SPARK-38317: -- cc [~maxgekk] FYI > Encoding of java.time.Period always results in "INTERVAL '0-0' YEAR TO MONTH" > - > > Key: SPARK-38317 > URL: https://issues.apache.org/jira/browse/SPARK-38317 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1 >Reporter: Jolan Rensen >Priority: Major > > {code} > val dates = Seq( > Period.ZERO, > Period.ofWeeks(2), > ).toDS() > dates.show(false) > {code} > Results in: > {code} > +----------------------------+ > |value                       | > +----------------------------+ > |INTERVAL '0-0' YEAR TO MONTH| > |INTERVAL '0-0' YEAR TO MONTH| > +----------------------------+ > {code}
[jira] [Updated] (SPARK-38317) Encoding of java.time.Period always results in "INTERVAL '0-0' YEAR TO MONTH"
[ https://issues.apache.org/jira/browse/SPARK-38317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-38317: - Description: {code} val dates = Seq( Period.ZERO, Period.ofWeeks(2), ).toDS() dates.show(false) {code} Results in: {code} +----------------------------+ |value                       | +----------------------------+ |INTERVAL '0-0' YEAR TO MONTH| |INTERVAL '0-0' YEAR TO MONTH| +----------------------------+ {code} was: ```val dates = Seq( Period.ZERO, Period.ofWeeks(2), ).toDS() dates.show(false)``` Results in: ``` +----------------------------+ |value                       | +----------------------------+ |INTERVAL '0-0' YEAR TO MONTH| |INTERVAL '0-0' YEAR TO MONTH| +----------------------------+ ``` > Encoding of java.time.Period always results in "INTERVAL '0-0' YEAR TO MONTH" > - > > Key: SPARK-38317 > URL: https://issues.apache.org/jira/browse/SPARK-38317 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1 >Reporter: Jolan Rensen >Priority: Major > > {code} > val dates = Seq( > Period.ZERO, > Period.ofWeeks(2), > ).toDS() > dates.show(false) > {code} > Results in: > {code} > +----------------------------+ > |value                       | > +----------------------------+ > |INTERVAL '0-0' YEAR TO MONTH| > |INTERVAL '0-0' YEAR TO MONTH| > +----------------------------+ > {code}
[jira] [Commented] (SPARK-38323) Support the hidden file metadata in Streaming
[ https://issues.apache.org/jira/browse/SPARK-38323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497903#comment-17497903 ] Hyukjin Kwon commented on SPARK-38323: -- [~yaohua] mind filling in the description? > Support the hidden file metadata in Streaming > - > > Key: SPARK-38323 > URL: https://issues.apache.org/jira/browse/SPARK-38323 > Project: Spark > Issue Type: Improvement > Components: SQL, Structured Streaming >Affects Versions: 3.2.1 >Reporter: Yaohua Zhao >Priority: Major >
[jira] [Updated] (SPARK-38317) Encoding of java.time.Period always results in "INTERVAL '0-0' YEAR TO MONTH"
[ https://issues.apache.org/jira/browse/SPARK-38317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-38317: - Component/s: SQL (was: Spark Core) > Encoding of java.time.Period always results in "INTERVAL '0-0' YEAR TO MONTH" > - > > Key: SPARK-38317 > URL: https://issues.apache.org/jira/browse/SPARK-38317 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1 >Reporter: Jolan Rensen >Priority: Major > > ```val dates = Seq( > Period.ZERO, > Period.ofWeeks(2), > ).toDS() > dates.show(false)``` > Results in: > ``` > +----------------------------+ > |value                       | > +----------------------------+ > |INTERVAL '0-0' YEAR TO MONTH| > |INTERVAL '0-0' YEAR TO MONTH| > +----------------------------+ > ```
[jira] [Resolved] (SPARK-37614) Support ANSI Aggregate Function: regr_avgx & regr_avgy
[ https://issues.apache.org/jira/browse/SPARK-37614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-37614. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34868 [https://github.com/apache/spark/pull/34868] > Support ANSI Aggregate Function: regr_avgx & regr_avgy > -- > > Key: SPARK-37614 > URL: https://issues.apache.org/jira/browse/SPARK-37614 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > Fix For: 3.3.0 > > > REGR_AVGX and REGR_AVGY are ANSI aggregate functions; many databases support > them.
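Per the ANSI definition, REGR_AVGX(y, x) averages the independent-variable column x and REGR_AVGY(y, x) averages the dependent-variable column y, in both cases over only those rows where both arguments are non-null. A small Java sketch of these semantics, using NaN as a stand-in for SQL NULL (the class and method names are ours, not Spark's):

```java
public class RegrAvg {
    // Average `pick[i]` over the rows where BOTH columns are non-null
    // (NaN stands in for SQL NULL in this sketch).
    static double avgOverPairs(double[] pick, double[] other) {
        double sum = 0;
        int n = 0;
        for (int i = 0; i < pick.length; i++) {
            if (!Double.isNaN(pick[i]) && !Double.isNaN(other[i])) {
                sum += pick[i];
                n++;
            }
        }
        return sum / n;
    }

    public static double regrAvgX(double[] y, double[] x) { return avgOverPairs(x, y); }
    public static double regrAvgY(double[] y, double[] x) { return avgOverPairs(y, x); }

    public static void main(String[] args) {
        double[] y = {1.0, 2.0, Double.NaN};
        double[] x = {10.0, 20.0, 30.0};
        // The third pair is skipped because y is null there.
        System.out.println(regrAvgX(y, x)); // 15.0
        System.out.println(regrAvgY(y, x)); // 1.5
    }
}
```

Note that the pair-wise null filtering is what distinguishes REGR_AVGX(y, x) from a plain AVG(x): x values whose partner y is null do not count.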
[jira] [Assigned] (SPARK-37614) Support ANSI Aggregate Function: regr_avgx & regr_avgy
[ https://issues.apache.org/jira/browse/SPARK-37614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-37614: --- Assignee: jiaan.geng > Support ANSI Aggregate Function: regr_avgx & regr_avgy > -- > > Key: SPARK-37614 > URL: https://issues.apache.org/jira/browse/SPARK-37614 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > > > REGR_AVGX and REGR_AVGY are ANSI aggregate functions; many databases support > them.
[jira] [Assigned] (SPARK-38322) Support query stage show runtime statistics in formatted explain mode
[ https://issues.apache.org/jira/browse/SPARK-38322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38322: Assignee: (was: Apache Spark) > Support query stage show runtime statistics in formatted explain mode > - > > Key: SPARK-38322 > URL: https://issues.apache.org/jira/browse/SPARK-38322 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: XiDuo You >Priority: Major > > The formatted explain mode is a powerful explain mode that shows the details > of the query plan. In AQE, a query stage knows its statistics once it has > materialized. So it can help to quickly check the conversion of the plan, e.g. join > selection. > A simple example: > {code:java} > SELECT * FROM t JOIN t2 ON t.c = t2.c;{code} > > {code:java} > == Physical Plan == > AdaptiveSparkPlan (21) > +- == Final Plan == >* SortMergeJoin Inner (13) >:- * Sort (6) >: +- AQEShuffleRead (5) >: +- ShuffleQueryStage (4), Statistics(sizeInBytes=16.0 B, rowCount=1) >:+- Exchange (3) >: +- * Filter (2) >: +- Scan hive default.t (1) >+- * Sort (12) > +- AQEShuffleRead (11) > +- ShuffleQueryStage (10), Statistics(sizeInBytes=16.0 B, rowCount=1) > +- Exchange (9) >+- * Filter (8) > +- Scan hive default.t2 (7) > +- == Initial Plan == >SortMergeJoin Inner (20) >:- Sort (16) >: +- Exchange (15) >: +- Filter (14) >:+- Scan hive default.t (1) >+- Sort (19) > +- Exchange (18) > +- Filter (17) > +- Scan hive default.t2 (7){code} > >
[jira] [Commented] (SPARK-38322) Support query stage show runtime statistics in formatted explain mode
[ https://issues.apache.org/jira/browse/SPARK-38322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497899#comment-17497899 ] Apache Spark commented on SPARK-38322: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/35658 > Support query stage show runtime statistics in formatted explain mode > - > > Key: SPARK-38322 > URL: https://issues.apache.org/jira/browse/SPARK-38322 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: XiDuo You >Priority: Major > > The formatted explain mode is a powerful explain mode that shows the details > of the query plan. In AQE, a query stage knows its statistics once it has > materialized. So it can help to quickly check the conversion of the plan, e.g. join > selection. > A simple example: > {code:java} > SELECT * FROM t JOIN t2 ON t.c = t2.c;{code} > > {code:java} > == Physical Plan == > AdaptiveSparkPlan (21) > +- == Final Plan == >* SortMergeJoin Inner (13) >:- * Sort (6) >: +- AQEShuffleRead (5) >: +- ShuffleQueryStage (4), Statistics(sizeInBytes=16.0 B, rowCount=1) >:+- Exchange (3) >: +- * Filter (2) >: +- Scan hive default.t (1) >+- * Sort (12) > +- AQEShuffleRead (11) > +- ShuffleQueryStage (10), Statistics(sizeInBytes=16.0 B, rowCount=1) > +- Exchange (9) >+- * Filter (8) > +- Scan hive default.t2 (7) > +- == Initial Plan == >SortMergeJoin Inner (20) >:- Sort (16) >: +- Exchange (15) >: +- Filter (14) >:+- Scan hive default.t (1) >+- Sort (19) > +- Exchange (18) > +- Filter (17) > +- Scan hive default.t2 (7){code} > >
[jira] [Assigned] (SPARK-38322) Support query stage show runtime statistics in formatted explain mode
[ https://issues.apache.org/jira/browse/SPARK-38322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38322: Assignee: Apache Spark > Support query stage show runtime statistics in formatted explain mode > - > > Key: SPARK-38322 > URL: https://issues.apache.org/jira/browse/SPARK-38322 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: XiDuo You >Assignee: Apache Spark >Priority: Major > > The formatted explain mode is a powerful explain mode that shows the details > of the query plan. In AQE, a query stage knows its statistics once it has > materialized. So it can help to quickly check the conversion of the plan, e.g. join > selection. > A simple example: > {code:java} > SELECT * FROM t JOIN t2 ON t.c = t2.c;{code} > > {code:java} > == Physical Plan == > AdaptiveSparkPlan (21) > +- == Final Plan == >* SortMergeJoin Inner (13) >:- * Sort (6) >: +- AQEShuffleRead (5) >: +- ShuffleQueryStage (4), Statistics(sizeInBytes=16.0 B, rowCount=1) >:+- Exchange (3) >: +- * Filter (2) >: +- Scan hive default.t (1) >+- * Sort (12) > +- AQEShuffleRead (11) > +- ShuffleQueryStage (10), Statistics(sizeInBytes=16.0 B, rowCount=1) > +- Exchange (9) >+- * Filter (8) > +- Scan hive default.t2 (7) > +- == Initial Plan == >SortMergeJoin Inner (20) >:- Sort (16) >: +- Exchange (15) >: +- Filter (14) >:+- Scan hive default.t (1) >+- Sort (19) > +- Exchange (18) > +- Filter (17) > +- Scan hive default.t2 (7){code} > >
[jira] [Created] (SPARK-38324) The second range is not [0, 59] in the day time ANSI interval
chong created SPARK-38324: - Summary: The second range is not [0, 59] in the day time ANSI interval Key: SPARK-38324 URL: https://issues.apache.org/jira/browse/SPARK-38324 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 3.3.0 Environment: Spark 3.3.0 snapshot Reporter: chong [https://spark.apache.org/docs/latest/sql-ref-datatypes.html] * SECOND, seconds within minutes and possibly fractions of a second [0..59.99] The doc shows SECOND is seconds within minutes, so its range should be [0, 59]. But testing shows 99 seconds is valid: {{>>> spark.sql("select INTERVAL '10 01:01:99' DAY TO SECOND")}} {{DataFrame[INTERVAL '10 01:02:39' DAY TO SECOND: interval day to second]}}
[jira] [Created] (SPARK-38323) Support the hidden file metadata in Streaming
Yaohua Zhao created SPARK-38323: --- Summary: Support the hidden file metadata in Streaming Key: SPARK-38323 URL: https://issues.apache.org/jira/browse/SPARK-38323 Project: Spark Issue Type: Improvement Components: SQL, Structured Streaming Affects Versions: 3.2.1 Reporter: Yaohua Zhao
[jira] [Resolved] (SPARK-38298) Fix DataExpressionSuite, NullExpressionsSuite, StringExpressionsSuite, complexTypesSuite, CastSuite under ANSI mode
[ https://issues.apache.org/jira/browse/SPARK-38298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-38298. Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35618 [https://github.com/apache/spark/pull/35618] > Fix DataExpressionSuite, NullExpressionsSuite, StringExpressionsSuite, > complexTypesSuite, CastSuite under ANSI mode > --- > > Key: SPARK-38298 > URL: https://issues.apache.org/jira/browse/SPARK-38298 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Xinyi Yu >Assignee: Xinyi Yu >Priority: Major > Fix For: 3.3.0 > >
[jira] [Assigned] (SPARK-38298) Fix DataExpressionSuite, NullExpressionsSuite, StringExpressionsSuite, complexTypesSuite, CastSuite under ANSI mode
[ https://issues.apache.org/jira/browse/SPARK-38298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang reassigned SPARK-38298: -- Assignee: Xinyi Yu > Fix DataExpressionSuite, NullExpressionsSuite, StringExpressionsSuite, > complexTypesSuite, CastSuite under ANSI mode > --- > > Key: SPARK-38298 > URL: https://issues.apache.org/jira/browse/SPARK-38298 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Xinyi Yu >Assignee: Xinyi Yu >Priority: Major >
[jira] [Updated] (SPARK-38322) Support query stage show runtime statistics in formatted explain mode
[ https://issues.apache.org/jira/browse/SPARK-38322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-38322: -- Description: The formatted explalin mode is the powerful explain mode to show the details of query plan. In AQE, the query stage know its statistics if has already materialized. So it can help to quick check the conversion of plan, e.g. join selection. A simple example: {code:java} SELECT * FROM t JOIN t2 ON t.c = t2.c;{code} {code:java} == Physical Plan == AdaptiveSparkPlan (21) +- == Final Plan == * SortMergeJoin Inner (13) :- * Sort (6) : +- AQEShuffleRead (5) : +- ShuffleQueryStage (4), Statistics(sizeInBytes=16.0 B, rowCount=1) :+- Exchange (3) : +- * Filter (2) : +- Scan hive default.t (1) +- * Sort (12) +- AQEShuffleRead (11) +- ShuffleQueryStage (10), Statistics(sizeInBytes=16.0 B, rowCount=1) +- Exchange (9) +- * Filter (8) +- Scan hive default.t2 (7) +- == Initial Plan == SortMergeJoin Inner (20) :- Sort (16) : +- Exchange (15) : +- Filter (14) :+- Scan hive default.t (1) +- Sort (19) +- Exchange (18) +- Filter (17) +- Scan hive default.t2 (7){code} was: The formatted explalin mode is the powerful explain mode to show the details of query plan. In AQE, the query stage know its statistics if has already materialized. So it can help to quick check the conversion of plan, e.g. join selection. 
A simple example: {code:java} SELECT * FROM t JOIN t2 ON t.c = t2.c;{code} {code:java} == Physical Plan == AdaptiveSparkPlan (21) +- == Final Plan == * SortMergeJoin Inner (13) :- * Sort (6) : +- AQEShuffleRead (5) : +- ShuffleQueryStage (4), Statistics(sizeInBytes=16.0 B, rowCount=1) :+- Exchange (3) : +- * Filter (2) : +- Scan hive default.t (1) +- * Sort (12) +- AQEShuffleRead (11) +- ShuffleQueryStage (10), Statistics(sizeInBytes=16.0 B, rowCount=1) +- Exchange (9) +- * Filter (8) +- Scan hive default.t2 (7) +- == Initial Plan == SortMergeJoin Inner (20) :- Sort (16) : +- Exchange (15) : +- Filter (14) :+- Scan hive default.t (1) +- Sort (19) +- Exchange (18) +- Filter (17) +- Scan hive default.t2 (7){code} > Support query stage show runtime statistics in formatted explain mode > - > > Key: SPARK-38322 > URL: https://issues.apache.org/jira/browse/SPARK-38322 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: XiDuo You >Priority: Major > > The formatted explain mode is a powerful explain mode that shows the details > of the query plan. In AQE, a query stage knows its statistics once it has > materialized, so this can help to quickly check plan conversions, e.g. join > selection.
> A simple example: > {code:java} > SELECT * FROM t JOIN t2 ON t.c = t2.c;{code} > > {code:java} > == Physical Plan == > AdaptiveSparkPlan (21) > +- == Final Plan == >* SortMergeJoin Inner (13) >:- * Sort (6) >: +- AQEShuffleRead (5) >: +- ShuffleQueryStage (4), Statistics(sizeInBytes=16.0 B, rowCount=1) >:+- Exchange (3) >: +- * Filter (2) >: +- Scan hive default.t (1) >+- * Sort (12) > +- AQEShuffleRead (11) > +- ShuffleQueryStage (10), Statistics(sizeInBytes=16.0 B, rowCount=1) > +- Exchange (9) >+- * Filter (8) > +- Scan hive default.t2 (7) > +- == Initial Plan == >SortMergeJoin Inner (20) >:- Sort (16) >: +- Exchange (15) >: +- Filter (14) >:+- Scan hive default.t (1) >+- Sort (19) > +- Exchange (18) > +- Filter (17) > +- Scan hive default.t2 (7){code} > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38322) Support query stage show runtime statistics in formatted explain mode
XiDuo You created SPARK-38322: - Summary: Support query stage show runtime statistics in formatted explain mode Key: SPARK-38322 URL: https://issues.apache.org/jira/browse/SPARK-38322 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: XiDuo You The formatted explain mode is a powerful explain mode that shows the details of the query plan. In AQE, a query stage knows its statistics once it has materialized, so this can help to quickly check plan conversions, e.g. join selection. A simple example: {code:java} SELECT * FROM t JOIN t2 ON t.c = t2.c;{code} {code:java} == Physical Plan == AdaptiveSparkPlan (21) +- == Final Plan == * SortMergeJoin Inner (13) :- * Sort (6) : +- AQEShuffleRead (5) : +- ShuffleQueryStage (4), Statistics(sizeInBytes=16.0 B, rowCount=1) :+- Exchange (3) : +- * Filter (2) : +- Scan hive default.t (1) +- * Sort (12) +- AQEShuffleRead (11) +- ShuffleQueryStage (10), Statistics(sizeInBytes=16.0 B, rowCount=1) +- Exchange (9) +- * Filter (8) +- Scan hive default.t2 (7) +- == Initial Plan == SortMergeJoin Inner (20) :- Sort (16) : +- Exchange (15) : +- Filter (14) :+- Scan hive default.t (1) +- Sort (19) +- Exchange (18) +- Filter (17) +- Scan hive default.t2 (7){code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
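The example in the description can be reproduced with a short self-contained sketch; the following Scala snippet uses the DataFrame API instead of the Hive tables above, and the session config, table contents, and local master are illustrative assumptions. Once AQE materializes a ShuffleQueryStage, the formatted plan shows the Statistics line next to it.

```scala
// Hedged sketch: reproduce the formatted-explain example from the description.
// Data and session settings are illustrative, not from the ticket.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("formatted-explain-demo")
  .config("spark.sql.adaptive.enabled", "true")
  .getOrCreate()
import spark.implicits._

val t  = Seq((1, "a")).toDF("c", "v")
val t2 = Seq((1, "b")).toDF("c", "w")

// "formatted" is one of the modes accepted by Dataset.explain; it prints the
// numbered operator tree shown in the description above.
t.join(t2, "c").explain("formatted")
```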
[jira] [Resolved] (SPARK-38311) Fix DynamicPartitionPruning/BucketedReadSuite/ExpressionInfoSuite under ANSI mode
[ https://issues.apache.org/jira/browse/SPARK-38311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-38311. Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35644 [https://github.com/apache/spark/pull/35644] > Fix DynamicPartitionPruning/BucketedReadSuite/ExpressionInfoSuite under ANSI > mode > - > > Key: SPARK-38311 > URL: https://issues.apache.org/jira/browse/SPARK-38311 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38275) Consider to include WriteBatch's memory in the memory usage of RocksDB state store
[ https://issues.apache.org/jira/browse/SPARK-38275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-38275. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35600 [https://github.com/apache/spark/pull/35600] > Consider to include WriteBatch's memory in the memory usage of RocksDB state > store > -- > > Key: SPARK-38275 > URL: https://issues.apache.org/jira/browse/SPARK-38275 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.2.1 >Reporter: Yun Tang >Assignee: Yun Tang >Priority: Major > Fix For: 3.3.0 > > > The current RocksDB state store uses an unlimited {{WriteBatch}} with a > DB; the {{WriteBatch}} is not cleared until the micro-batch data is > committed, so the memory usage of the {{WriteBatch}} can grow very > large. > We should consider adding the approximate memory usage of the WriteBatch to > the total memory usage and also printing it separately. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
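For context, the state store this ticket concerns is opted into via the provider-class config below; this is a minimal sketch (session settings are illustrative) of switching a streaming query to RocksDB-backed state, whose reported memory metrics the fix extends to account for the WriteBatch.

```scala
// Sketch: enable the RocksDB state store provider shipped with Spark 3.2+.
// With this fix, the WriteBatch's approximate memory is included in (and
// reported alongside) the store's total memory usage metrics.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.sql.streaming.stateStore.providerClass",
    "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
  .getOrCreate()
// Per-operator state metrics then surface in StreamingQueryProgress.stateOperators.
```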
[jira] [Assigned] (SPARK-38275) Consider to include WriteBatch's memory in the memory usage of RocksDB state store
[ https://issues.apache.org/jira/browse/SPARK-38275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-38275: Assignee: Yun Tang > Consider to include WriteBatch's memory in the memory usage of RocksDB state > store > -- > > Key: SPARK-38275 > URL: https://issues.apache.org/jira/browse/SPARK-38275 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.2.1 >Reporter: Yun Tang >Assignee: Yun Tang >Priority: Major > > The current RocksDB state store uses an unlimited {{WriteBatch}} with a > DB; the {{WriteBatch}} is not cleared until the micro-batch data is > committed, so the memory usage of the {{WriteBatch}} can grow very > large. > We should consider adding the approximate memory usage of the WriteBatch to > the total memory usage and also printing it separately. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38172) Adaptive coalesce not working with df persist
[ https://issues.apache.org/jira/browse/SPARK-38172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497870#comment-17497870 ] XiDuo You commented on SPARK-38172: --- Thanks [~Naveenmts] for confirming! > Adaptive coalesce not working with df persist > - > > Key: SPARK-38172 > URL: https://issues.apache.org/jira/browse/SPARK-38172 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.1 > Environment: OS: Linux > Spark Version: 3.2.3 >Reporter: Naveen Nagaraj >Priority: Major > Attachments: image-2022-02-10-15-32-30-355.png, > image-2022-02-10-15-33-08-018.png, image-2022-02-10-15-33-32-607.png > > > {code:java} > // code placeholder > val spark = SparkSession.builder().master("local[4]").appName("Test") > .config("spark.sql.adaptive.enabled", "true") > > .config("spark.sql.adaptive.coalescePartitions.enabled", "true") > > .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "50m") > > .config("spark.sql.adaptive.coalescePartitions.minPartitionNum", "1") > > .config("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "1024") > .getOrCreate() > val df = spark.read.csv("") > val df1 = df.distinct() > df1.persist() // On removing this line, the code works as expected > df1.write.csv("") {code} > Without df1.persist, df1.write.csv writes 4 partition files of 50 MB each, > which is expected > [https://i.stack.imgur.com/tDxpV.png] > If I include df1.persist, Spark writes 200 partitions (adaptive coalesce > not working) with persist > [https://i.stack.imgur.com/W13hA.png] > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38172) Adaptive coalesce not working with df persist
[ https://issues.apache.org/jira/browse/SPARK-38172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You resolved SPARK-38172. --- Resolution: Won't Fix > Adaptive coalesce not working with df persist > - > > Key: SPARK-38172 > URL: https://issues.apache.org/jira/browse/SPARK-38172 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.1 > Environment: OS: Linux > Spark Version: 3.2.3 >Reporter: Naveen Nagaraj >Priority: Major > Attachments: image-2022-02-10-15-32-30-355.png, > image-2022-02-10-15-33-08-018.png, image-2022-02-10-15-33-32-607.png > > > {code:java} > // code placeholder > val spark = SparkSession.builder().master("local[4]").appName("Test") > .config("spark.sql.adaptive.enabled", "true") > > .config("spark.sql.adaptive.coalescePartitions.enabled", "true") > > .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "50m") > > .config("spark.sql.adaptive.coalescePartitions.minPartitionNum", "1") > > .config("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "1024") > .getOrCreate() > val df = spark.read.csv("") > val df1 = df.distinct() > df1.persist() // On removing this line. Code works as expected > df1.write.csv("") {code} > Without df1.persist, df1.write.csv writes 4 partition files of 50 MB each > which is expected > [https://i.stack.imgur.com/tDxpV.png] > If I include df1.persist, Spark is writing 200 partitions(adaptive coalesce > not working) With persist > [https://i.stack.imgur.com/W13hA.png] > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
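Since the issue was closed as Won't Fix, a practical workaround (a sketch; the paths and the target partition count are placeholders, not from the ticket) is to bound the output partitioning explicitly rather than relying on AQE coalescing taking effect through the cached plan:

```scala
// Workaround sketch: AQE partition coalescing does not apply to the plan that
// writes out a persisted DataFrame, so pick the output partition count by hand.
val df1 = spark.read.csv("/path/to/input").distinct()
df1.persist()
df1.coalesce(4)   // target chosen from expected output size, e.g. ~50 MB files
   .write.csv("/path/to/output")
```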
[jira] [Resolved] (SPARK-38303) Upgrade ansi-regex from 5.0.0 to 5.0.1 in /dev
[ https://issues.apache.org/jira/browse/SPARK-38303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta resolved SPARK-38303. Fix Version/s: 3.3.0 3.2.2 Assignee: Bjørn Jørgensen Resolution: Fixed Issue resolved in https://github.com/apache/spark/pull/35628 > Upgrade ansi-regex from 5.0.0 to 5.0.1 in /dev > -- > > Key: SPARK-38303 > URL: https://issues.apache.org/jira/browse/SPARK-38303 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.2.1, 3.3.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Major > Fix For: 3.3.0, 3.2.2 > > > [CVE-2021-3807|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-3807] > > [releases notes at github|https://github.com/chalk/ansi-regex/releases] > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38303) Upgrade ansi-regex from 5.0.0 to 5.0.1 in /dev
[ https://issues.apache.org/jira/browse/SPARK-38303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-38303: --- Affects Version/s: 3.2.1 > Upgrade ansi-regex from 5.0.0 to 5.0.1 in /dev > -- > > Key: SPARK-38303 > URL: https://issues.apache.org/jira/browse/SPARK-38303 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.2.1, 3.3.0 >Reporter: Bjørn Jørgensen >Priority: Major > > [CVE-2021-3807|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-3807] > > [releases notes at github|https://github.com/chalk/ansi-regex/releases] > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38302) Use Java 17 in K8S integration tests when setting spark-tgz
[ https://issues.apache.org/jira/browse/SPARK-38302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497833#comment-17497833 ] qian commented on SPARK-38302: -- [~dongjoon] Thanks for your work :) > Use Java 17 in K8S integration tests when setting spark-tgz > --- > > Key: SPARK-38302 > URL: https://issues.apache.org/jira/browse/SPARK-38302 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes, Tests >Affects Versions: 3.3.0 >Reporter: qian >Assignee: qian >Priority: Minor > > When setting parameters `spark-tgz` during integration tests, the error that > `resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17` > cannot be found occurs. This is due to the default value of > `spark.kubernetes.test.dockerFile` being > `resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17`. > When using the tgz, the working directory is > `${spark.kubernetes.test.unpackSparkDir}`, and the relative path > `resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17` > is invalid. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38191) The staging directory of write job only needs to be initialized once in HadoopMapReduceCommitProtocol.
[ https://issues.apache.org/jira/browse/SPARK-38191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-38191. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35492 [https://github.com/apache/spark/pull/35492] > The staging directory of write job only needs to be initialized once in > HadoopMapReduceCommitProtocol. > -- > > Key: SPARK-38191 > URL: https://issues.apache.org/jira/browse/SPARK-38191 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, > 3.2.1 >Reporter: weixiuli >Assignee: weixiuli >Priority: Minor > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38191) The staging directory of write job only needs to be initialized once in HadoopMapReduceCommitProtocol.
[ https://issues.apache.org/jira/browse/SPARK-38191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-38191: Assignee: weixiuli > The staging directory of write job only needs to be initialized once in > HadoopMapReduceCommitProtocol. > -- > > Key: SPARK-38191 > URL: https://issues.apache.org/jira/browse/SPARK-38191 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, > 3.2.1 >Reporter: weixiuli >Assignee: weixiuli >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38285) ClassCastException: GenericArrayData cannot be cast to InternalRow
[ https://issues.apache.org/jira/browse/SPARK-38285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497826#comment-17497826 ] L. C. Hsieh commented on SPARK-38285: - Thanks for reporting this. I will take a look. > ClassCastException: GenericArrayData cannot be cast to InternalRow > -- > > Key: SPARK-38285 > URL: https://issues.apache.org/jira/browse/SPARK-38285 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1 >Reporter: Alessandro Bacchini >Priority: Major > > The following code with Spark 3.2.1 raises an exception: > {code:python} > import pyspark.sql.functions as F > from pyspark.sql.types import StructType, StructField, ArrayType, StringType > t = StructType([ > StructField('o', > ArrayType( > StructType([ > StructField('s', StringType(), False), > StructField('b', ArrayType( > StructType([ > StructField('e', StringType(), False) > ]), > True), > False) > ]), > True), > False)]) > value = { > "o": [ > { > "s": "string1", > "b": [ > { > "e": "string2" > }, > { > "e": "string3" > } > ] > }, > { > "s": "string4", > "b": [ > { > "e": "string5" > }, > { > "e": "string6" > }, > { > "e": "string7" > } > ] > } > ] > } > df = ( > spark.createDataFrame([value], schema=t) > .select(F.explode("o").alias("eo")) > .select("eo.b.e") > ) > df.show() > {code} > The exception message is: > {code} > java.lang.ClassCastException: > org.apache.spark.sql.catalyst.util.GenericArrayData cannot be cast to > org.apache.spark.sql.catalyst.InternalRow > at > org.apache.spark.sql.catalyst.util.GenericArrayData.getStruct(GenericArrayData.scala:76) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759) > at > 
org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:80) > at > org.apache.spark.sql.execution.collect.Collector.$anonfun$processFunc$1(Collector.scala:155) > at > org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75) > at > com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110) > at > org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75) > at > com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55) > at org.apache.spark.scheduler.Task.doRunTask(Task.scala:153) > at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:122) > at > com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110) > at org.apache.spark.scheduler.Task.run(Task.scala:93) > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:824) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1641) > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:827) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at > com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:683) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > {code} > I am using Spark 3.2.1, but I don't know whether Spark 3.3.0 is also affected. > Please note that the issue seems to be related to SPARK-37577: I am using the > same DataFrame schema, but this time I have populated it with a non-empty > value.
> I think that this is a bug because with the following configuration it works as > expected: > {code:python} > spark.conf.set("spark.sql.optimizer.expression.nestedPruning.enabled", False) >
[jira] [Assigned] (SPARK-37377) Refactor V2 Partitioning interface and remove deprecated usage of Distribution
[ https://issues.apache.org/jira/browse/SPARK-37377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37377: Assignee: (was: Apache Spark) > Refactor V2 Partitioning interface and remove deprecated usage of Distribution > -- > > Key: SPARK-37377 > URL: https://issues.apache.org/jira/browse/SPARK-37377 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Priority: Major > > Currently {{Partitioning}} is defined as follows: > {code:scala} > @Evolving > public interface Partitioning { > int numPartitions(); > boolean satisfy(Distribution distribution); > } > {code} > There are two issues with the interface: 1) it uses a deprecated > {{Distribution}} interface, and should switch to > {{org.apache.spark.sql.connector.distributions.Distribution}}. 2) currently > there is no way to use this in joins where we want to compare reported > partitionings from both sides and decide whether they are "compatible" (and > thus allow Spark to eliminate the shuffle). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37377) Refactor V2 Partitioning interface and remove deprecated usage of Distribution
[ https://issues.apache.org/jira/browse/SPARK-37377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37377: Assignee: Apache Spark > Refactor V2 Partitioning interface and remove deprecated usage of Distribution > -- > > Key: SPARK-37377 > URL: https://issues.apache.org/jira/browse/SPARK-37377 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Assignee: Apache Spark >Priority: Major > > Currently {{Partitioning}} is defined as follows: > {code:scala} > @Evolving > public interface Partitioning { > int numPartitions(); > boolean satisfy(Distribution distribution); > } > {code} > There are two issues with the interface: 1) it uses a deprecated > {{Distribution}} interface, and should switch to > {{org.apache.spark.sql.connector.distributions.Distribution}}. 2) currently > there is no way to use this in joins where we want to compare reported > partitionings from both sides and decide whether they are "compatible" (and > thus allow Spark to eliminate the shuffle). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37377) Refactor V2 Partitioning interface and remove deprecated usage of Distribution
[ https://issues.apache.org/jira/browse/SPARK-37377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497819#comment-17497819 ] Apache Spark commented on SPARK-37377: -- User 'sunchao' has created a pull request for this issue: https://github.com/apache/spark/pull/35657 > Refactor V2 Partitioning interface and remove deprecated usage of Distribution > -- > > Key: SPARK-37377 > URL: https://issues.apache.org/jira/browse/SPARK-37377 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Priority: Major > > Currently {{Partitioning}} is defined as follows: > {code:scala} > @Evolving > public interface Partitioning { > int numPartitions(); > boolean satisfy(Distribution distribution); > } > {code} > There are two issues with the interface: 1) it uses a deprecated > {{Distribution}} interface, and should switch to > {{org.apache.spark.sql.connector.distributions.Distribution}}. 2) currently > there is no way to use this in joins where we want to compare reported > partitionings from both sides and decide whether they are "compatible" (and > thus allow Spark to eliminate the shuffle). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
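Point 1) of the description could look roughly like the sketch below; this is only an illustration of the direction, not the merged API, and the trait name and method shapes are assumptions:

```scala
// Illustrative only: Partitioning expressed against the non-deprecated
// org.apache.spark.sql.connector.distributions.Distribution interface.
import org.apache.spark.sql.connector.distributions.Distribution

trait Partitioning {
  def numPartitions(): Int
  // Would replace satisfy(...) taking the deprecated read.partitioning.Distribution.
  def satisfies(distribution: Distribution): Boolean
}
```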
[jira] [Assigned] (SPARK-38107) Use error classes in the compilation errors of python/pandas UDFs
[ https://issues.apache.org/jira/browse/SPARK-38107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38107: Assignee: (was: Apache Spark) > Use error classes in the compilation errors of python/pandas UDFs > - > > Key: SPARK-38107 > URL: https://issues.apache.org/jira/browse/SPARK-38107 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Priority: Major > > Migrate the following errors in QueryCompilationErrors: > * pandasUDFAggregateNotSupportedInPivotError > * groupAggPandasUDFUnsupportedByStreamingAggError > * cannotUseMixtureOfAggFunctionAndGroupAggPandasUDFError > * usePythonUDFInJoinConditionUnsupportedError > to use error classes. Throw an implementation of SparkThrowable. Also write > a test for every error in QueryCompilationErrorsSuite. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38107) Use error classes in the compilation errors of python/pandas UDFs
[ https://issues.apache.org/jira/browse/SPARK-38107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38107: Assignee: Apache Spark > Use error classes in the compilation errors of python/pandas UDFs > - > > Key: SPARK-38107 > URL: https://issues.apache.org/jira/browse/SPARK-38107 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > Migrate the following errors in QueryCompilationErrors: > * pandasUDFAggregateNotSupportedInPivotError > * groupAggPandasUDFUnsupportedByStreamingAggError > * cannotUseMixtureOfAggFunctionAndGroupAggPandasUDFError > * usePythonUDFInJoinConditionUnsupportedError > to use error classes. Throw an implementation of SparkThrowable. Also write > a test for every error in QueryCompilationErrorsSuite. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38107) Use error classes in the compilation errors of python/pandas UDFs
[ https://issues.apache.org/jira/browse/SPARK-38107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497799#comment-17497799 ] Apache Spark commented on SPARK-38107: -- User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/35656 > Use error classes in the compilation errors of python/pandas UDFs > - > > Key: SPARK-38107 > URL: https://issues.apache.org/jira/browse/SPARK-38107 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Priority: Major > > Migrate the following errors in QueryCompilationErrors: > * pandasUDFAggregateNotSupportedInPivotError > * groupAggPandasUDFUnsupportedByStreamingAggError > * cannotUseMixtureOfAggFunctionAndGroupAggPandasUDFError > * usePythonUDFInJoinConditionUnsupportedError > to use error classes. Throw an implementation of SparkThrowable. Also write > a test for every error in QueryCompilationErrorsSuite. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
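The migration pattern the ticket asks for looks roughly like the following sketch; the error-class name and the exact constructor shape are assumptions for illustration, not the merged code:

```scala
// Sketch: raise the error through an error class (looked up in
// error-classes.json), so the thrown exception implements SparkThrowable,
// instead of building a free-form message string.
import org.apache.spark.sql.AnalysisException

def pandasUDFAggregateNotSupportedInPivotError(): Throwable =
  new AnalysisException(
    errorClass = "UNSUPPORTED_FEATURE",   // assumed class name
    messageParameters = Array("Pandas UDF aggregate expressions in pivot."))
```

A matching test in QueryCompilationErrorsSuite would then assert on the error class and message parameters rather than on a hard-coded message string.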
[jira] [Commented] (SPARK-38315) Add a config to control decoding of datetime as Java 8 classes
[ https://issues.apache.org/jira/browse/SPARK-38315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497740#comment-17497740 ] Apache Spark commented on SPARK-38315: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/35655 > Add a config to control decoding of datetime as Java 8 classes > -- > > Key: SPARK-38315 > URL: https://issues.apache.org/jira/browse/SPARK-38315 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > Add a new config that controls collect(), in particular, and allows > enabling/disabling Java 8 types in the Thrift server. The config should solve > the following issue: > When a user connects to the Thrift Server and a query involves a data source > connector which doesn't handle Java 8 types, the user observes the following > exception: > {code:java} > ERROR SparkExecuteStatementOperation: Error executing query with > ac61b10a-486e-463b-8726-3b61da58582e, currentState RUNNING, > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 > (TID 8) (10.157.1.194 executor 0): java.lang.RuntimeException: Error while > encoding: java.lang.RuntimeException: java.sql.Timestamp is not a valid > external type for schema of timestamp > if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null > else staticinvoke(class org.apache.spark.sql.catalyst.util.DateTimeUtils$, > TimestampType, instantToMicros, > validateexternaltype(getexternalrowfield(assertnotnull(input[0, > org.apache.spark.sql.Row, true]), 0, loan_perf_date), TimestampType), true, > false) AS loan_perf_date#1125 > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:239) > > at > 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:210) > > at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > > {code} > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38315) Add a config to control decoding of datetime as Java 8 classes
[ https://issues.apache.org/jira/browse/SPARK-38315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497738#comment-17497738 ] Apache Spark commented on SPARK-38315: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/35655 > Add a config to control decoding of datetime as Java 8 classes > -- > > Key: SPARK-38315 > URL: https://issues.apache.org/jira/browse/SPARK-38315 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > Add a new config that controls collect(), in particular, and allows > enabling/disabling Java 8 types in the Thrift server. The config should solve > the following issue: > When a user connects to the Thrift Server and a query involves a data source > connector which doesn't handle Java 8 types, the user observes the following > exception: > {code:java} > ERROR SparkExecuteStatementOperation: Error executing query with > ac61b10a-486e-463b-8726-3b61da58582e, currentState RUNNING, > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 > (TID 8) (10.157.1.194 executor 0): java.lang.RuntimeException: Error while > encoding: java.lang.RuntimeException: java.sql.Timestamp is not a valid > external type for schema of timestamp > if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null > else staticinvoke(class org.apache.spark.sql.catalyst.util.DateTimeUtils$, > TimestampType, instantToMicros, > validateexternaltype(getexternalrowfield(assertnotnull(input[0, > org.apache.spark.sql.Row, true]), 0, loan_perf_date), TimestampType), true, > false) AS loan_perf_date#1125 > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:239) > > at > 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:210) > > at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > > {code} > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38315) Add a config to control decoding of datetime as Java 8 classes
[ https://issues.apache.org/jira/browse/SPARK-38315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38315: Assignee: Max Gekk (was: Apache Spark) > Add a config to control decoding of datetime as Java 8 classes > -- > > Key: SPARK-38315 > URL: https://issues.apache.org/jira/browse/SPARK-38315 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38315) Add a config to control decoding of datetime as Java 8 classes
[ https://issues.apache.org/jira/browse/SPARK-38315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38315: Assignee: Apache Spark (was: Max Gekk) > Add a config to control decoding of datetime as Java 8 classes > -- > > Key: SPARK-38315 > URL: https://issues.apache.org/jira/browse/SPARK-38315 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38315) Add a config to control decoding of datetime as Java 8 classes
[ https://issues.apache.org/jira/browse/SPARK-38315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-38315: - Summary: Add a config to control decoding of datetime as Java 8 classes (was: Add a config to collect objects as Java 8 types in the Thrift server) > Add a config to control decoding of datetime as Java 8 classes > -- > > Key: SPARK-38315 > URL: https://issues.apache.org/jira/browse/SPARK-38315 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38315) Add a config to control decoding of datetime as Java 8 classes
[ https://issues.apache.org/jira/browse/SPARK-38315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-38315: - Description: Add a new config that controls collect(), in particular, and allows enabling/disabling Java 8 types in the Thrift server. The config should solve the following issue: When a user connects to the Thrift Server and a query involves a data source connector which doesn't handle Java 8 types, the user observes the following exception: {code:java} ERROR SparkExecuteStatementOperation: Error executing query with ac61b10a-486e-463b-8726-3b61da58582e, currentState RUNNING, org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 8) (10.157.1.194 executor 0): java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: java.sql.Timestamp is not a valid external type for schema of timestamp if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.sql.catalyst.util.DateTimeUtils$, TimestampType, instantToMicros, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 0, loan_perf_date), TimestampType), true, false) AS loan_perf_date#1125 at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:239) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:210) at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) {code} was: Add new config that should control collect(), and allow to enable/disable to Java 8 types in the Thrift server. > Add a config to control decoding of datetime as Java 8 classes > -- > > Key: SPARK-38315 > URL: https://issues.apache.org/jira/browse/SPARK-38315 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
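The encoder failure in the trace above can be modeled in a minimal pure-Python sketch (everything here is a stand-in, not Spark code: `LegacyTimestamp` plays java.sql.Timestamp, `Instant` plays java.time.Instant, and `java8_enabled` plays the proposed config). The point is that the row encoder validates the external value against exactly one expected class, so whichever timestamp family the Thrift server produces will fail when the expectation points the other way - hence the need for a switch:

```python
# Toy model of Spark's external-type validation for timestamp columns.
from dataclasses import dataclass


@dataclass
class LegacyTimestamp:  # stand-in for java.sql.Timestamp
    micros: int


@dataclass
class Instant:  # stand-in for java.time.Instant
    micros: int


def encode_timestamp(value, java8_enabled: bool) -> int:
    """Accept only the external type selected by the flag, mirroring the
    'java.sql.Timestamp is not a valid external type for schema of
    timestamp' error from the ticket."""
    expected = Instant if java8_enabled else LegacyTimestamp
    if not isinstance(value, expected):
        raise RuntimeError(
            f"{type(value).__name__} is not a valid external type "
            "for schema of timestamp")
    return value.micros
```

In this model, a config that flips `java8_enabled` per session is exactly what lets a data source that only produces legacy java.sql values keep working.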
[jira] [Assigned] (SPARK-38302) Use Java 17 in K8S integration tests when setting spark-tgz
[ https://issues.apache.org/jira/browse/SPARK-38302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-38302: - Assignee: qian > Use Java 17 in K8S integration tests when setting spark-tgz > --- > > Key: SPARK-38302 > URL: https://issues.apache.org/jira/browse/SPARK-38302 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes, Tests >Affects Versions: 3.3.0 >Reporter: qian >Assignee: qian >Priority: Minor > > When setting parameters `spark-tgz` during integration tests, the error that > `resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17` > cannot be found occurs. This is due to the default value of > `spark.kubernetes.test.dockerFile` being > `resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17`. > When using the tgz, the working directory is > `${spark.kubernetes.test.unpackSparkDir}`, and the relative path > `resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17` > is invalid. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38302) Use Java 17 in K8S integration tests when setting spark-tgz
[ https://issues.apache.org/jira/browse/SPARK-38302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38302: -- Summary: Use Java 17 in K8S integration tests when setting spark-tgz (was: Dockerfile.java17 can't be used in K8s integration tests when ) > Use Java 17 in K8S integration tests when setting spark-tgz > --- > > Key: SPARK-38302 > URL: https://issues.apache.org/jira/browse/SPARK-38302 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes, Tests >Affects Versions: 3.3.0 >Reporter: qian >Priority: Minor > > When setting parameters `spark-tgz` during integration tests, the error that > `resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17` > cannot be found occurs. This is due to the default value of > `spark.kubernetes.test.dockerFile` being > `resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17`. > When using the tgz, the working directory is > `${spark.kubernetes.test.unpackSparkDir}`, and the relative path > `resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17` > is invalid. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38302) Dockerfile.java17 can't be used in K8s integration tests when
[ https://issues.apache.org/jira/browse/SPARK-38302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497676#comment-17497676 ] Dongjoon Hyun commented on SPARK-38302: --- I collected this to a subtask of SPARK-33772 to give more visibility to your issue, [~dcoliversun]. > Dockerfile.java17 can't be used in K8s integration tests when > -- > > Key: SPARK-38302 > URL: https://issues.apache.org/jira/browse/SPARK-38302 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes, Tests >Affects Versions: 3.3.0 >Reporter: qian >Priority: Minor > > When setting parameters `spark-tgz` during integration tests, the error that > `resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17` > cannot be found occurs. This is due to the default value of > `spark.kubernetes.test.dockerFile` being > `resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17`. > When using the tgz, the working directory is > `${spark.kubernetes.test.unpackSparkDir}`, and the relative path > `resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17` > is invalid. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38302) Dockerfile.java17 can't be used in K8s integration tests when
[ https://issues.apache.org/jira/browse/SPARK-38302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38302: -- Parent: SPARK-33772 Issue Type: Sub-task (was: Improvement) > Dockerfile.java17 can't be used in K8s integration tests when > -- > > Key: SPARK-38302 > URL: https://issues.apache.org/jira/browse/SPARK-38302 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes, Tests >Affects Versions: 3.3.0 >Reporter: qian >Priority: Minor > > When setting parameters `spark-tgz` during integration tests, the error that > `resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17` > cannot be found occurs. This is due to the default value of > `spark.kubernetes.test.dockerFile` being > `resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17`. > When using the tgz, the working directory is > `${spark.kubernetes.test.unpackSparkDir}`, and the relative path > `resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17` > is invalid. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
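A minimal sketch of the path problem the SPARK-38302 description walks through (the function name and directories here are illustrative, not the actual test-harness code): the default `spark.kubernetes.test.dockerFile` value is a path relative to the Spark source tree, so once the harness works out of the unpacked tgz directory, joining that same relative path no longer points at an existing file.

```python
import os.path

# Default value of spark.kubernetes.test.dockerFile (a relative path).
DEFAULT_DOCKERFILE = ("resource-managers/kubernetes/docker/src/main/"
                      "dockerfiles/spark/Dockerfile.java17")


def resolve_dockerfile(working_dir: str,
                       dockerfile: str = DEFAULT_DOCKERFILE) -> str:
    """Join the configured dockerfile path onto the harness's working
    directory, which is effectively what the lookup does."""
    return os.path.normpath(os.path.join(working_dir, dockerfile))
```

Resolved against the source checkout the file exists; resolved against `${spark.kubernetes.test.unpackSparkDir}` it does not, which is the "cannot be found" failure described above.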
[jira] [Comment Edited] (SPARK-38285) ClassCastException: GenericArrayData cannot be cast to InternalRow
[ https://issues.apache.org/jira/browse/SPARK-38285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497662#comment-17497662 ] Bruce Robbins edited comment on SPARK-38285 at 2/24/22, 7:19 PM: - I see your point. It appears to be caused by [this commit|https://github.com/apache/spark/commit/c59988aa79] (for SPARK-34638). cc [~viirya] Before that commit, this works: {noformat} create or replace temp view v1 as select * from values (array( named_struct('s', 'string1', 'b', array(named_struct('e', 'string2'), named_struct('e', 'string3'))), named_struct('s', 'string4', 'b', array(named_struct('e', 'string5'), named_struct('e', 'string6'))) ) ) v1(o); select eo.b.e from (select explode(o) as eo from v1); {noformat} It produces: {noformat} ["string2","string3"] ["string5","string6"] {noformat} After that commit, you instead get the following error: {noformat} java.lang.ClassCastException: org.apache.spark.sql.catalyst.util.GenericArrayData cannot be cast to org.apache.spark.sql.catalyst.InternalRow {noformat} You can bypass the error by caching the {{explode}}. For example, this works even after SPARK-34638: {noformat} create or replace temp view v1 as select * from values (array( named_struct('s', 'string1', 'b', array(named_struct('e', 'string2'), named_struct('e', 'string3'))), named_struct('s', 'string4', 'b', array(named_struct('e', 'string5'), named_struct('e', 'string6'))) ) ) v1(o); create or replace temporary view v2 as select explode(o) as eo from v1; cache table v2; select eo.b.e from v2; {noformat} Also you can bypass the error by turning off {{spark.sql.optimizer.expression.nestedPruning.enabled}} and {{spark.sql.optimizer.nestedSchemaPruning.enabled}}, as [~allebacco] mentioned above. was (Author: bersprockets): I see your point. It appears to be caused by [this commit|https://github.com/apache/spark/commit/c59988aa79] (for SPARK-34638). 
> ClassCastException: GenericArrayData cannot be cast to InternalRow > -- > > Key: SPARK-38285 > URL: https://issues.apache.org/jira/browse/SPARK-38285 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1 >Reporter: Alessandro Bacchini >Priority: Major > > The following code with Spark 3.2.1 raises an exception: > {code:python} > import pyspark.sql.functions as F > from pyspark.sql.types import StructType, StructField, ArrayType, StringType > t = StructType([ > StructField('o', > ArrayType( > StructType([ > StructField('s', StringType(), False), > StructField('b', ArrayType( > StructType([ > StructField('e', StringType(), False) > ]), > True), > False) > ]), > True), > False)]) > value = { > "o": [ > { > "s": "string1", > "b": [ > { > "e": "string2" > }, > { > "e": "string3" > } > ] > }, > { > "s": "string4", > "b": [ > { > "e": "string5" > }, > { > "e": "string6" > }, >
[jira] [Commented] (SPARK-38321) Fix BooleanSimplificationSuite under ANSI mode
[ https://issues.apache.org/jira/browse/SPARK-38321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497673#comment-17497673 ] Apache Spark commented on SPARK-38321: -- User 'anchovYu' has created a pull request for this issue: https://github.com/apache/spark/pull/35654 > Fix BooleanSimplificationSuite under ANSI mode > -- > > Key: SPARK-38321 > URL: https://issues.apache.org/jira/browse/SPARK-38321 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Xinyi Yu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38321) Fix BooleanSimplificationSuite under ANSI mode
[ https://issues.apache.org/jira/browse/SPARK-38321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38321: Assignee: (was: Apache Spark) > Fix BooleanSimplificationSuite under ANSI mode > -- > > Key: SPARK-38321 > URL: https://issues.apache.org/jira/browse/SPARK-38321 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Xinyi Yu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38321) Fix BooleanSimplificationSuite under ANSI mode
[ https://issues.apache.org/jira/browse/SPARK-38321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38321: Assignee: Apache Spark > Fix BooleanSimplificationSuite under ANSI mode > -- > > Key: SPARK-38321 > URL: https://issues.apache.org/jira/browse/SPARK-38321 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Xinyi Yu >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38321) Fix BooleanSimplificationSuite under ANSI mode
[ https://issues.apache.org/jira/browse/SPARK-38321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497672#comment-17497672 ] Apache Spark commented on SPARK-38321: -- User 'anchovYu' has created a pull request for this issue: https://github.com/apache/spark/pull/35654 > Fix BooleanSimplificationSuite under ANSI mode > -- > > Key: SPARK-38321 > URL: https://issues.apache.org/jira/browse/SPARK-38321 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Xinyi Yu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38321) Fix BooleanSimplificationSuite under ANSI mode
Xinyi Yu created SPARK-38321: Summary: Fix BooleanSimplificationSuite under ANSI mode Key: SPARK-38321 URL: https://issues.apache.org/jira/browse/SPARK-38321 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: Xinyi Yu -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38285) ClassCastException: GenericArrayData cannot be cast to InternalRow
[ https://issues.apache.org/jira/browse/SPARK-38285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497662#comment-17497662 ] Bruce Robbins commented on SPARK-38285: --- I see your point. It appears to be caused by [this commit|https://github.com/apache/spark/commit/c59988aa79] (for SPARK-34638). cc [~viirya] Before that commit, this works: {noformat} create or replace temp view v1 as select * from values (array( named_struct('s', 'string1', 'b', array(named_struct('e', 'string2'), named_struct('e', 'string3'))), named_struct('s', 'string4', 'b', array(named_struct('e', 'string5'), named_struct('e', 'string6'))) ) ) v1(o); select eo.b.e from (select explode(o) as eo from v1); {noformat} It produces: {noformat} ["string2","string3"] ["string5","string6"] {noformat} On or after that commit, you instead get the following error: {noformat} java.lang.ClassCastException: org.apache.spark.sql.catalyst.util.GenericArrayData cannot be cast to org.apache.spark.sql.catalyst.InternalRow {noformat} You can bypass the error by caching the {{explode}}. For example, this works even after SPARK-34638: {noformat} create or replace temp view v1 as select * from values (array( named_struct('s', 'string1', 'b', array(named_struct('e', 'string2'), named_struct('e', 'string3'))), named_struct('s', 'string4', 'b', array(named_struct('e', 'string5'), named_struct('e', 'string6'))) ) ) v1(o); create or replace temporary view v2 as select explode(o) as eo from v1; cache table v2; select eo.b.e from v2; {noformat} Also you can bypass the error by turning off {{spark.sql.optimizer.expression.nestedPruning.enabled}} and {{spark.sql.optimizer.nestedSchemaPruning.enabled}}, as [~allebacco] mentioned above. 
> ClassCastException: GenericArrayData cannot be cast to InternalRow > -- > > Key: SPARK-38285 > URL: https://issues.apache.org/jira/browse/SPARK-38285 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1 >Reporter: Alessandro Bacchini >Priority: Major > > The following code with Spark 3.2.1 raises an exception: > {code:python} > import pyspark.sql.functions as F > from pyspark.sql.types import StructType, StructField, ArrayType, StringType > t = StructType([ > StructField('o', > ArrayType( > StructType([ > StructField('s', StringType(), False), > StructField('b', ArrayType( > StructType([ > StructField('e', StringType(), False) > ]), > True), > False) > ]), > True), > False)]) > value = { > "o": [ > { > "s": "string1", > "b": [ > { > "e": "string2" > }, > { > "e": "string3" > } > ] > }, > { > "s": "string4", > "b": [ > { > "e": "string5" > }, > { > "e": "string6" > }, > { > "e": "string7" > } > ] > } > ] > } > df = ( > spark.createDataFrame([value], schema=t) > .select(F.explode("o").alias("eo")) > .select("eo.b.e") > ) > df.show() > {code} > The exception message is: > {code} > java.lang.ClassCastException: > org.apache.spark.sql.catalyst.util.GenericArrayData cannot be cast to > org.apache.spark.sql.catalyst.InternalRow > at > org.apache.spark.sql.catalyst.util.GenericArrayData.getStruct(GenericArrayData.scala:76) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759) > at > org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:80) > at > org.apache.spark.sql.execution.collect.Collector.$anonfun$processFunc$1(Collector.scala:155) > at > org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75) > at 
> com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110) > at > org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75) > at > com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55) > at
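The configuration workaround mentioned in the comments, written out as session settings (the two keys are taken verbatim from the comment; `spark` is assumed to be an existing SparkSession, and disabling these flags trades the crash for reduced nested-field pruning, so Spark may read more nested columns than the query needs):

```python
# Workaround from the comments on SPARK-38285: disable nested pruning so
# the plan that miscasts GenericArrayData to InternalRow is not generated.
spark.conf.set("spark.sql.optimizer.expression.nestedPruning.enabled", "false")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "false")
```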
[jira] [Commented] (SPARK-38318) regression when replacing a dataset view
[ https://issues.apache.org/jira/browse/SPARK-38318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497602#comment-17497602 ] Apache Spark commented on SPARK-38318: -- User 'linhongliu-db' has created a pull request for this issue: https://github.com/apache/spark/pull/35653 > regression when replacing a dataset view > > > Key: SPARK-38318 > URL: https://issues.apache.org/jira/browse/SPARK-38318 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.2.0, 3.2.1, 3.3.0 >Reporter: Linhong Liu >Priority: Major > > The below use case works well in 3.1 but failed in 3.2 and master. > {code:java} > sql("select 1").createOrReplaceTempView("v") > sql("select * from v").createOrReplaceTempView("v") > // in 3.1 it works well, and select will output 1 > // in 3.2 it failed with error: "AnalysisException: Recursive view v detected > (cycle: v -> v)"{code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38320) (flat)MapGroupsWithState can timeout groups which just received inputs in the same microbatch
[ https://issues.apache.org/jira/browse/SPARK-38320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Balikov updated SPARK-38320: - Description: We have identified an issue where the RocksDB state store iterator will not pick up store updates made after its creation. As a result of this, the _timeoutProcessorIter_ in [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FlatMapGroupsWithStateExec.scala] will not pick up state changes made during _newDataProcessorIter_ input processing. The user-observed behavior is that a group state may receive input records and also be called with timeout in the same micro batch. This contradicts the public documentation for GroupState - [https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/streaming/GroupState.html] * The timeout is reset every time the function is called on a group, that is, when the group has new data, or the group has timed out. So the user has to set the timeout duration every time the function is called, otherwise, there will not be any timeout set. was: We have identified an issue where the RocksDB state store iterator will not pick up store updates made after its creation. As a result of this, the _timeoutProcessorIter_ in [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FlatMapGroupsWithStateExec.scala] will not pick up state changes made during newDataProcessorIter input processing. The user-observed behavior is that a group state may receive input records and also be called with timeout in the same micro batch. This contradicts the public documentation for GroupState - https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/streaming/GroupState.html * The timeout is reset every time the function is called on a group, that is, when the group has new data, or the group has timed out.
So the user has to set the timeout duration every time the function is called, otherwise, there will not be any timeout set. > (flat)MapGroupsWithState can timeout groups which just received inputs in the > same microbatch > - > > Key: SPARK-38320 > URL: https://issues.apache.org/jira/browse/SPARK-38320 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.2.1 >Reporter: Alex Balikov >Priority: Major > > We have identified an issue where the RocksDB state store iterator will not > pick up store updates made after its creation. As a result of this, the > _timeoutProcessorIter_ in > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FlatMapGroupsWithStateExec.scala] > will not pick up state changes made during _newDataProcessorIter_ input > processing. The user-observed behavior is that a group state may receive > input records and also be called with timeout in the same micro batch. This > contradicts the public documentation for GroupState - > [https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/streaming/GroupState.html] > * The timeout is reset every time the function is called on a group, that > is, when the group has new data, or the group has timed out. So the user has > to set the timeout duration every time the function is called, otherwise, > there will not be any timeout set. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
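The documented GroupState contract quoted above can be sketched as a small model. This is a hypothetical, plain-Python illustration of the invariant SPARK-38320 reports broken (a group that received input in a micro-batch must not also be timed out in that same batch); it is not Spark's implementation, and all names are made up.

```python
# Minimal model of the GroupState timeout contract: processing new data
# resets a group's timeout, and timeout checks run against the UPDATED
# deadlines - so a group with fresh input is never timed out in the same
# micro-batch. Illustrative only.

class GroupStateModel:
    def __init__(self):
        self.timeout_deadline = {}  # group key -> deadline ("event time")

    def process_batch(self, now, inputs, timeout_duration):
        """Return (groups called with data, groups called with timeout)."""
        with_data = set(inputs)
        for key in with_data:
            # The user function is expected to reset the timeout on every call.
            self.timeout_deadline[key] = now + timeout_duration
        # Timeouts are evaluated AFTER new-data processing, against the
        # updated deadlines - the invariant the bug report says is violated.
        timed_out = {k for k, d in self.timeout_deadline.items()
                     if d <= now and k not in with_data}
        for key in timed_out:
            del self.timeout_deadline[key]
        return with_data, timed_out

state = GroupStateModel()
state.process_batch(now=0, inputs=["a"], timeout_duration=5)
# "a" receives input again at t=10; even though its old deadline (5) has
# passed, it must not appear in the timed-out set for this batch.
data, timeouts = state.process_batch(now=10, inputs=["a"], timeout_duration=5)
assert "a" in data and "a" not in timeouts
```

Under the reported bug, the timeout pass iterates a store snapshot taken before the new-data pass, so the stale deadline is still visible and the group is timed out anyway.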
[jira] [Created] (SPARK-38320) (flat)MapGroupsWithState can timeout groups which just received inputs in the same microbatch
Alex Balikov created SPARK-38320: Summary: (flat)MapGroupsWithState can timeout groups which just received inputs in the same microbatch Key: SPARK-38320 URL: https://issues.apache.org/jira/browse/SPARK-38320 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 3.2.1 Reporter: Alex Balikov We have identified an issue where the RocksDB state store iterator will not pick up store updates made after its creation. As a result of this, the _timeoutProcessorIter_ in [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FlatMapGroupsWithStateExec.scala] will not pick up state changes made during newDataProcessorIter input processing. The user-observed behavior is that a group state may receive input records and also be called with timeout in the same micro batch. This contradicts the public documentation for GroupState - https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/streaming/GroupState.html * The timeout is reset every time the function is called on a group, that is, when the group has new data, or the group has timed out. So the user has to set the timeout duration every time the function is called, otherwise, there will not be any timeout set.
[jira] [Assigned] (SPARK-38318) regression when replacing a dataset view
[ https://issues.apache.org/jira/browse/SPARK-38318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38318: Assignee: (was: Apache Spark) > regression when replacing a dataset view > > > Key: SPARK-38318 > URL: https://issues.apache.org/jira/browse/SPARK-38318 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.2.0, 3.2.1, 3.3.0 >Reporter: Linhong Liu >Priority: Major > > The use case below works in 3.1 but fails in 3.2 and master. > {code:java} > sql("select 1").createOrReplaceTempView("v") > sql("select * from v").createOrReplaceTempView("v") > // in 3.1 it works well, and select will output 1 > // in 3.2 it fails with error: "AnalysisException: Recursive view v detected > (cycle: v -> v)"{code}
[jira] [Assigned] (SPARK-38318) regression when replacing a dataset view
[ https://issues.apache.org/jira/browse/SPARK-38318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38318: Assignee: Apache Spark > regression when replacing a dataset view > > > Key: SPARK-38318 > URL: https://issues.apache.org/jira/browse/SPARK-38318 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.2.0, 3.2.1, 3.3.0 >Reporter: Linhong Liu >Assignee: Apache Spark >Priority: Major > > The use case below works in 3.1 but fails in 3.2 and master. > {code:java} > sql("select 1").createOrReplaceTempView("v") > sql("select * from v").createOrReplaceTempView("v") > // in 3.1 it works well, and select will output 1 > // in 3.2 it fails with error: "AnalysisException: Recursive view v detected > (cycle: v -> v)"{code}
[jira] [Commented] (SPARK-38318) regression when replacing a dataset view
[ https://issues.apache.org/jira/browse/SPARK-38318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497601#comment-17497601 ] Apache Spark commented on SPARK-38318: -- User 'linhongliu-db' has created a pull request for this issue: https://github.com/apache/spark/pull/35653 > regression when replacing a dataset view > > > Key: SPARK-38318 > URL: https://issues.apache.org/jira/browse/SPARK-38318 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.2.0, 3.2.1, 3.3.0 >Reporter: Linhong Liu >Priority: Major > > The use case below works in 3.1 but fails in 3.2 and master. > {code:java} > sql("select 1").createOrReplaceTempView("v") > sql("select * from v").createOrReplaceTempView("v") > // in 3.1 it works well, and select will output 1 > // in 3.2 it fails with error: "AnalysisException: Recursive view v detected > (cycle: v -> v)"{code}
[jira] [Commented] (SPARK-37318) Make FallbackStorageSuite robust in terms of DNS
[ https://issues.apache.org/jira/browse/SPARK-37318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497598#comment-17497598 ] Erik Krogen commented on SPARK-37318: - Great point [~dongjoon], thanks for pointing it out! > Make FallbackStorageSuite robust in terms of DNS > > > Key: SPARK-37318 > URL: https://issues.apache.org/jira/browse/SPARK-37318 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 3.2.0, 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.2.1, 3.3.0 > > > Usually, the test case expects the hostname doesn't exist. > {code} > $ ping remote > ping: cannot resolve remote: Unknown host > {code} > In some DNS environments, it always resolves. > {code} > $ ping remote > PING remote (23.217.138.110): 56 data bytes > 64 bytes from 23.217.138.110: icmp_seq=0 ttl=57 time=8.660 ms > {code}
[jira] [Created] (SPARK-38319) Implement Strict Mode to prevent QUERY the entire table
dimtiris kanoute created SPARK-38319: Summary: Implement Strict Mode to prevent QUERY the entire table Key: SPARK-38319 URL: https://issues.apache.org/jira/browse/SPARK-38319 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 3.2.1 Reporter: dimtiris kanoute We are using Spark Thrift Server as a service to run Spark SQL queries along with Hive metastore as the metadata service. We would like to restrict users from querying the entire table and force them to use a {{WHERE}} clause in the query based on a partition column {{(i.e. SELECT * FROM TABLE WHERE partition_column=)}} *and* to {{LIMIT}} the output of the query when {{ORDER BY}} is used. This behaviour is similar to what Hive exposes via the configurations {{hive.strict.checks.no.partition.filter}} and {{hive.strict.checks.orderby.no.limit}}, described here: [https://github.com/apache/hive/blob/master/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L1812] and [https://github.com/apache/hive/blob/master/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L1816] This is a pretty common use case / feature found in other tools as well, for example in BigQuery: [https://cloud.google.com/bigquery/docs/querying-partitioned-tables#require_a_partition_filter_in_queries] . It would be nice to have this feature implemented in Spark when Hive support is enabled in a Spark session.
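The kind of strict-mode check requested above can be sketched as a naive pre-submission gate. This is a hypothetical, text-based illustration only: real enforcement would hook into Spark's analyzed plan rather than raw SQL, and the table name, partition column, and function name here are all made up.

```python
import re

# Naive sketch of a "strict mode" gate: reject a query against a partitioned
# table unless it filters on one of that table's partition columns, in the
# spirit of Hive's hive.strict.checks.no.partition.filter. Illustrative only.

PARTITION_COLUMNS = {"events": ["dt"]}  # assumed table -> partition columns

def check_strict_partition_filter(sql: str) -> None:
    """Raise ValueError if `sql` scans a partitioned table unfiltered."""
    for table, part_cols in PARTITION_COLUMNS.items():
        if not re.search(rf"\bfrom\s+{table}\b", sql, re.IGNORECASE):
            continue
        where = re.search(r"\bwhere\b(.*)", sql, re.IGNORECASE | re.DOTALL)
        has_filter = where is not None and any(
            re.search(rf"\b{col}\b", where.group(1), re.IGNORECASE)
            for col in part_cols)
        if not has_filter:
            raise ValueError(
                f"strict mode: query on partitioned table '{table}' "
                f"must filter on one of {part_cols}")

# A query with a partition filter passes; a full-table scan is rejected.
check_strict_partition_filter("SELECT * FROM events WHERE dt = '2022-02-24'")
try:
    check_strict_partition_filter("SELECT * FROM events")
except ValueError:
    pass  # rejected, as strict mode intends
```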
[jira] [Commented] (SPARK-37318) Make FallbackStorageSuite robust in terms of DNS
[ https://issues.apache.org/jira/browse/SPARK-37318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497578#comment-17497578 ] Dongjoon Hyun commented on SPARK-37318: --- For the record, - Apache Spark 3.2.1 ~ 3.2.x has this test case fix. - For Apache Spark 3.3, the SPARK-38062 improvement patch removes the restriction and logically reverts this test code change of SPARK-37318. > Make FallbackStorageSuite robust in terms of DNS > > > Key: SPARK-37318 > URL: https://issues.apache.org/jira/browse/SPARK-37318 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 3.2.0, 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.2.1, 3.3.0 > > > Usually, the test case expects the hostname doesn't exist. > {code} > $ ping remote > ping: cannot resolve remote: Unknown host > {code} > In some DNS environments, it always resolves. > {code} > $ ping remote > PING remote (23.217.138.110): 56 data bytes > 64 bytes from 23.217.138.110: icmp_seq=0 ttl=57 time=8.660 ms > {code}
[jira] [Commented] (SPARK-37318) Make FallbackStorageSuite robust in terms of DNS
[ https://issues.apache.org/jira/browse/SPARK-37318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497576#comment-17497576 ] Dongjoon Hyun commented on SPARK-37318: --- It's wrong in branch-3.2, [~xkrogen]. Please be careful about the Affected Versions. bq. Note that the changes in this PR were reverted in SPARK-38062, in favor of a solution which fixes the production code rather than disabling the test case in certain environments. > Make FallbackStorageSuite robust in terms of DNS > > > Key: SPARK-37318 > URL: https://issues.apache.org/jira/browse/SPARK-37318 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 3.2.0, 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.2.1, 3.3.0 > > > Usually, the test case expects the hostname doesn't exist. > {code} > $ ping remote > ping: cannot resolve remote: Unknown host > {code} > In some DNS environments, it always resolves. > {code} > $ ping remote > PING remote (23.217.138.110): 56 data bytes > 64 bytes from 23.217.138.110: icmp_seq=0 ttl=57 time=8.660 ms > {code}
[jira] [Created] (SPARK-38318) regression when replacing a dataset view
Linhong Liu created SPARK-38318: --- Summary: regression when replacing a dataset view Key: SPARK-38318 URL: https://issues.apache.org/jira/browse/SPARK-38318 Project: Spark Issue Type: Task Components: SQL Affects Versions: 3.2.1, 3.2.0, 3.3.0 Reporter: Linhong Liu The use case below works in 3.1 but fails in 3.2 and master. {code:java} sql("select 1").createOrReplaceTempView("v") sql("select * from v").createOrReplaceTempView("v") // in 3.1 it works well, and select will output 1 // in 3.2 it fails with error: "AnalysisException: Recursive view v detected (cycle: v -> v)"{code}
[jira] [Resolved] (SPARK-38273) Native memory leak in SparkPlan
[ https://issues.apache.org/jira/browse/SPARK-38273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-38273. --- Fix Version/s: 3.3.0 3.2.2 Resolution: Fixed Issue resolved by pull request 35613 [https://github.com/apache/spark/pull/35613] > Native memory leak in SparkPlan > --- > > Key: SPARK-38273 > URL: https://issues.apache.org/jira/browse/SPARK-38273 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1 >Reporter: Kevin Sewell >Assignee: Kevin Sewell >Priority: Major > Fix For: 3.3.0, 3.2.2 > > > SPARK-34647 replaced the ZstdInputStream with ZstdInputStreamNoFinalizer. > This meant that all usages of `CompressionCodec.compressedInputStream` would > need to manually close the stream, as this would no longer be handled by the > GC finalizer mechanism. > In SparkPlan, the result of `CompressionCodec.compressedInputStream` is > wrapped in an Iterator which never calls close. This implementation needs to > make use of NextIterator, which allows for the closing of underlying streams.
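The close-on-exhaustion pattern described above (Spark's `NextIterator`) can be illustrated with a small Python analogue: an iterator that guarantees the underlying stream is closed once iteration finishes. This is a sketch of the general technique, not Spark's Scala code; the function name is made up.

```python
import io

# An iterator wrapper that closes its underlying resource when iteration
# completes - the guarantee the original Iterator over
# CompressionCodec.compressedInputStream lacked, causing the native leak.

def rows_from_stream(stream):
    """Yield lines from `stream`, closing it when iteration ends."""
    try:
        for line in stream:
            yield line.rstrip("\n")
    finally:
        # Runs on normal exhaustion, on error, and on generator close,
        # so the stream can never be leaked by the consumer.
        stream.close()

buf = io.StringIO("row1\nrow2\n")
rows = list(rows_from_stream(buf))
assert rows == ["row1", "row2"]
assert buf.closed  # the underlying stream was closed by the iterator
```

Without the `finally` (or `NextIterator`'s `close()` hook), a stream whose cleanup relied on a removed finalizer, as with `ZstdInputStreamNoFinalizer`, would never release its native memory.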
[jira] [Updated] (SPARK-38273) decodeUnsafeRows's iterators should close underl… …ying input streams
[ https://issues.apache.org/jira/browse/SPARK-38273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38273: -- Summary: decodeUnsafeRows's iterators should close underl… …ying input streams (was: Native memory leak in SparkPlan) > decodeUnsafeRows's iterators should close underl… …ying input streams > - > > Key: SPARK-38273 > URL: https://issues.apache.org/jira/browse/SPARK-38273 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1 >Reporter: Kevin Sewell >Assignee: Kevin Sewell >Priority: Major > Fix For: 3.3.0, 3.2.2 > > > SPARK-34647 replaced the ZstdInputStream with ZstdInputStreamNoFinalizer. > This meant that all usages of `CompressionCodec.compressedInputStream` would > need to manually close the stream, as this would no longer be handled by the > GC finalizer mechanism. > In SparkPlan, the result of `CompressionCodec.compressedInputStream` is > wrapped in an Iterator which never calls close. This implementation needs to > make use of NextIterator, which allows for the closing of underlying streams.
[jira] [Updated] (SPARK-38273) decodeUnsafeRows's iterators should close underlying input streams
[ https://issues.apache.org/jira/browse/SPARK-38273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38273: -- Summary: decodeUnsafeRows's iterators should close underlying input streams (was: decodeUnsafeRows's iterators should close underl… …ying input streams) > decodeUnsafeRows's iterators should close underlying input streams > -- > > Key: SPARK-38273 > URL: https://issues.apache.org/jira/browse/SPARK-38273 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1 >Reporter: Kevin Sewell >Assignee: Kevin Sewell >Priority: Major > Fix For: 3.3.0, 3.2.2 > > > SPARK-34647 replaced the ZstdInputStream with ZstdInputStreamNoFinalizer. > This meant that all usages of `CompressionCodec.compressedInputStream` would > need to manually close the stream, as this would no longer be handled by the > GC finalizer mechanism. > In SparkPlan, the result of `CompressionCodec.compressedInputStream` is > wrapped in an Iterator which never calls close. This implementation needs to > make use of NextIterator, which allows for the closing of underlying streams.
[jira] [Assigned] (SPARK-38273) Native memory leak in SparkPlan
[ https://issues.apache.org/jira/browse/SPARK-38273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-38273: - Assignee: Kevin Sewell > Native memory leak in SparkPlan > --- > > Key: SPARK-38273 > URL: https://issues.apache.org/jira/browse/SPARK-38273 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1 >Reporter: Kevin Sewell >Assignee: Kevin Sewell >Priority: Major > > SPARK-34647 replaced the ZstdInputStream with ZstdInputStreamNoFinalizer. > This meant that all usages of `CompressionCodec.compressedInputStream` would > need to manually close the stream, as this would no longer be handled by the > GC finalizer mechanism. > In SparkPlan, the result of `CompressionCodec.compressedInputStream` is > wrapped in an Iterator which never calls close. This implementation needs to > make use of NextIterator, which allows for the closing of underlying streams.
[jira] [Resolved] (SPARK-38300) Refactor `fileToString` and `resourceToBytes` in catalyst.util to clean up duplicate codes
[ https://issues.apache.org/jira/browse/SPARK-38300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-38300. --- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35622 [https://github.com/apache/spark/pull/35622] > Refactor `fileToString` and `resourceToBytes` in catalyst.util to clean up > duplicate codes > -- > > Key: SPARK-38300 > URL: https://issues.apache.org/jira/browse/SPARK-38300 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.3.0 > >
[jira] [Updated] (SPARK-38300) Use ByteStreams.toByteArray to simplify fileToString and resourceToBytes in catalyst.uti
[ https://issues.apache.org/jira/browse/SPARK-38300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38300: -- Summary: Use ByteStreams.toByteArray to simplify fileToString and resourceToBytes in catalyst.uti (was: Refactor `fileToString` and `resourceToBytes` in catalyst.util to clean up duplicate codes) > Use ByteStreams.toByteArray to simplify fileToString and resourceToBytes in > catalyst.uti > > > Key: SPARK-38300 > URL: https://issues.apache.org/jira/browse/SPARK-38300 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.3.0 > >
[jira] [Assigned] (SPARK-38300) Refactor `fileToString` and `resourceToBytes` in catalyst.util to clean up duplicate codes
[ https://issues.apache.org/jira/browse/SPARK-38300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-38300: - Assignee: Yang Jie > Refactor `fileToString` and `resourceToBytes` in catalyst.util to clean up > duplicate codes > -- > > Key: SPARK-38300 > URL: https://issues.apache.org/jira/browse/SPARK-38300 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > >
[jira] [Created] (SPARK-38317) Encoding of java.time.Period always results in "INTERVAL '0-0' YEAR TO MONTH"
Jolan Rensen created SPARK-38317: Summary: Encoding of java.time.Period always results in "INTERVAL '0-0' YEAR TO MONTH" Key: SPARK-38317 URL: https://issues.apache.org/jira/browse/SPARK-38317 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.2.1, 3.2.0 Reporter: Jolan Rensen
{code:java}
val dates = Seq(
  Period.ZERO,
  Period.ofWeeks(2),
).toDS()

dates.show(false)
{code}
Results in:
{code:java}
+----------------------------+
|value                       |
+----------------------------+
|INTERVAL '0-0' YEAR TO MONTH|
|INTERVAL '0-0' YEAR TO MONTH|
+----------------------------+
{code}
[jira] [Commented] (SPARK-38316) Fix SQLViewSuite/TriggerAvailableNowSuite/UnwrapCastInBinaryComparisonSuite/UnwrapCastInComparisonEndToEndSuite under ANSI mode
[ https://issues.apache.org/jira/browse/SPARK-38316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497418#comment-17497418 ] Apache Spark commented on SPARK-38316: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/35652 > Fix > SQLViewSuite/TriggerAvailableNowSuite/UnwrapCastInBinaryComparisonSuite/UnwrapCastInComparisonEndToEndSuite > under ANSI mode > --- > > Key: SPARK-38316 > URL: https://issues.apache.org/jira/browse/SPARK-38316 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major >
[jira] [Assigned] (SPARK-38316) Fix SQLViewSuite/TriggerAvailableNowSuite/UnwrapCastInBinaryComparisonSuite/UnwrapCastInComparisonEndToEndSuite under ANSI mode
[ https://issues.apache.org/jira/browse/SPARK-38316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38316: Assignee: Gengliang Wang (was: Apache Spark) > Fix > SQLViewSuite/TriggerAvailableNowSuite/UnwrapCastInBinaryComparisonSuite/UnwrapCastInComparisonEndToEndSuite > under ANSI mode > --- > > Key: SPARK-38316 > URL: https://issues.apache.org/jira/browse/SPARK-38316 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major >
[jira] [Assigned] (SPARK-38316) Fix SQLViewSuite/TriggerAvailableNowSuite/UnwrapCastInBinaryComparisonSuite/UnwrapCastInComparisonEndToEndSuite under ANSI mode
[ https://issues.apache.org/jira/browse/SPARK-38316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38316: Assignee: Apache Spark (was: Gengliang Wang) > Fix > SQLViewSuite/TriggerAvailableNowSuite/UnwrapCastInBinaryComparisonSuite/UnwrapCastInComparisonEndToEndSuite > under ANSI mode > --- > > Key: SPARK-38316 > URL: https://issues.apache.org/jira/browse/SPARK-38316 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major >
[jira] [Created] (SPARK-38316) Fix SQLViewSuite/TriggerAvailableNowSuite/UnwrapCastInBinaryComparisonSuite/UnwrapCastInComparisonEndToEndSuite under ANSI mode
Gengliang Wang created SPARK-38316: -- Summary: Fix SQLViewSuite/TriggerAvailableNowSuite/UnwrapCastInBinaryComparisonSuite/UnwrapCastInComparisonEndToEndSuite under ANSI mode Key: SPARK-38316 URL: https://issues.apache.org/jira/browse/SPARK-38316 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: Gengliang Wang Assignee: Gengliang Wang
[jira] [Commented] (SPARK-36194) Remove the aggregation from left semi/anti join if the same aggregation has already been done on left side
[ https://issues.apache.org/jira/browse/SPARK-36194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497410#comment-17497410 ] Apache Spark commented on SPARK-36194: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/35651 > Remove the aggregation from left semi/anti join if the same aggregation has > already been done on left side > -- > > Key: SPARK-36194 > URL: https://issues.apache.org/jira/browse/SPARK-36194 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Priority: Major >
[jira] [Commented] (SPARK-38314) Fail to read parquet files after writing the hidden file metadata in
[ https://issues.apache.org/jira/browse/SPARK-38314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497407#comment-17497407 ] Apache Spark commented on SPARK-38314: -- User 'Yaohua628' has created a pull request for this issue: https://github.com/apache/spark/pull/35650 > Fail to read parquet files after writing the hidden file metadata in > > > Key: SPARK-38314 > URL: https://issues.apache.org/jira/browse/SPARK-38314 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1 >Reporter: Yaohua Zhao >Priority: Major > > Selecting and then writing a df containing the hidden file metadata column > `_metadata` into a file format like `parquet` or `delta` will still keep the > internal `Attribute` metadata information. When reading those `parquet` or > `delta` files again, this actually breaks, because the reader wrongly treats > a user data column named `_metadata` as a hidden file source metadata > column. > > Reproducible code: > {code:java} > // prepare a file source df > df.select("*", "_metadata") > .write.format("parquet").save(path) > spark.read.format("parquet").load(path) > .select("*").show(){code}
[jira] [Commented] (SPARK-38314) Fail to read parquet files after writing the hidden file metadata in
[ https://issues.apache.org/jira/browse/SPARK-38314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497406#comment-17497406 ] Apache Spark commented on SPARK-38314: -- User 'Yaohua628' has created a pull request for this issue: https://github.com/apache/spark/pull/35650 > Fail to read parquet files after writing the hidden file metadata in > > > Key: SPARK-38314 > URL: https://issues.apache.org/jira/browse/SPARK-38314 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1 >Reporter: Yaohua Zhao >Priority: Major > > Selecting and then writing a df containing the hidden file metadata column > `_metadata` into a file format like `parquet` or `delta` will still keep the > internal `Attribute` metadata information. When reading those `parquet` or > `delta` files again, this actually breaks, because the reader wrongly treats > a user data column named `_metadata` as a hidden file source metadata > column. > > Reproducible code: > {code:java} > // prepare a file source df > df.select("*", "_metadata") > .write.format("parquet").save(path) > spark.read.format("parquet").load(path) > .select("*").show(){code}
[jira] [Assigned] (SPARK-38314) Fail to read parquet files after writing the hidden file metadata in
[ https://issues.apache.org/jira/browse/SPARK-38314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38314: Assignee: Apache Spark > Fail to read parquet files after writing the hidden file metadata in > > > Key: SPARK-38314 > URL: https://issues.apache.org/jira/browse/SPARK-38314 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1 >Reporter: Yaohua Zhao >Assignee: Apache Spark >Priority: Major > > Selecting and then writing a df containing the hidden file metadata column > `_metadata` into a file format like `parquet` or `delta` will still keep the > internal `Attribute` metadata information. When reading those `parquet` or > `delta` files again, this actually breaks, because the reader wrongly treats > a user data column named `_metadata` as a hidden file source metadata > column. > > Reproducible code: > {code:java} > // prepare a file source df > df.select("*", "_metadata") > .write.format("parquet").save(path) > spark.read.format("parquet").load(path) > .select("*").show(){code}
[jira] [Assigned] (SPARK-38314) Fail to read parquet files after writing the hidden file metadata in
[ https://issues.apache.org/jira/browse/SPARK-38314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38314: Assignee: (was: Apache Spark) > Fail to read parquet files after writing the hidden file metadata in > > > Key: SPARK-38314 > URL: https://issues.apache.org/jira/browse/SPARK-38314 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1 >Reporter: Yaohua Zhao >Priority: Major > > Selecting and then writing a df containing the hidden file metadata column > `_metadata` into a file format like `parquet` or `delta` will still keep the > internal `Attribute` metadata information. When reading those `parquet` or > `delta` files again, this actually breaks, because the reader wrongly treats > a user data column named `_metadata` as a hidden file source metadata > column. > > Reproducible code: > {code:java} > // prepare a file source df > df.select("*", "_metadata") > .write.format("parquet").save(path) > spark.read.format("parquet").load(path) > .select("*").show(){code}
[jira] [Commented] (SPARK-38315) Add a config to collect objects as Java 8 types in the Thrift server
[ https://issues.apache.org/jira/browse/SPARK-38315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497362#comment-17497362 ] Max Gekk commented on SPARK-38315: -- I am working on this. > Add a config to collect objects as Java 8 types in the Thrift server > > > Key: SPARK-38315 > URL: https://issues.apache.org/jira/browse/SPARK-38315 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > Add a new config that controls collect() and allows enabling/disabling Java > 8 types in the Thrift server. The config should solve the following issue: > When a user connects to the Thrift Server and a query involves a datasource > connector which doesn't handle Java 8 types, the user observes the following > exception: > {code:java} > ERROR SparkExecuteStatementOperation: Error executing query with > ac61b10a-486e-463b-8726-3b61da58582e, currentState RUNNING, > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 > (TID 8) (10.157.1.194 executor 0): java.lang.RuntimeException: Error while > encoding: java.lang.RuntimeException: java.sql.Timestamp is not a valid > external type for schema of timestamp > if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null > else staticinvoke(class org.apache.spark.sql.catalyst.util.DateTimeUtils$, > TimestampType, instantToMicros, > validateexternaltype(getexternalrowfield(assertnotnull(input[0, > org.apache.spark.sql.Row, true]), 0, loan_perf_date), TimestampType), true, > false) AS loan_perf_date#1125 > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:239) > > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:210) > > at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) > at >
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > > {code} >
[jira] [Created] (SPARK-38315) Add a config to collect objects as Java 8 types in the Thrift server
Max Gekk created SPARK-38315: Summary: Add a config to collect objects as Java 8 types in the Thrift server Key: SPARK-38315 URL: https://issues.apache.org/jira/browse/SPARK-38315 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: Max Gekk Assignee: Max Gekk Add a new config that controls collect() and allows enabling/disabling Java 8 types in the Thrift server. The config should solve the following issue: When a user connects to the Thrift Server and a query involves a data source connector that doesn't handle Java 8 types, the user observes the following exception: {code:java} ERROR SparkExecuteStatementOperation: Error executing query with ac61b10a-486e-463b-8726-3b61da58582e, currentState RUNNING, org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 8) (10.157.1.194 executor 0): java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: java.sql.Timestamp is not a valid external type for schema of timestamp if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.sql.catalyst.util.DateTimeUtils$, TimestampType, instantToMicros, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 0, loan_perf_date), TimestampType), true, false) AS loan_perf_date#1125 at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:239) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:210) at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) {code}
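Spark already exposes a session-wide switch for this behavior, `spark.sql.datetime.java8API.enabled`; the issue asks for a Thrift-server-scoped equivalent precisely because the existing flag is global. A minimal sketch of the existing switch follows — a configuration fragment, not runnable standalone, since it assumes a live `SparkSession` named `spark`:

```python
# Configuration sketch (assumes an existing SparkSession `spark`).
# With the flag off, collect() materializes timestamps/dates as the legacy
# java.sql.Timestamp / java.sql.Date external types on the JVM side;
# with it on, java.time.Instant / java.time.LocalDate are used instead.
spark.conf.set("spark.sql.datetime.java8API.enabled", "false")  # legacy types

df = spark.sql("SELECT timestamp'2022-02-25 00:00:00' AS ts")
rows = df.collect()  # external rows now carry the legacy representation
```

A dedicated Thrift-server config, as proposed here, would let the server opt out of Java 8 types for connectors that cannot handle them, without changing behavior for other sessions.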
[jira] [Commented] (SPARK-37932) Analyzer can fail when join left side and right side are the same view
[ https://issues.apache.org/jira/browse/SPARK-37932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497329#comment-17497329 ] Apache Spark commented on SPARK-37932: -- User 'chenzhx' has created a pull request for this issue: https://github.com/apache/spark/pull/35649 > Analyzer can fail when join left side and right side are the same view > -- > > Key: SPARK-37932 > URL: https://issues.apache.org/jira/browse/SPARK-37932 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Feng Zhu >Priority: Major > Attachments: sql_and_exception > > > See the attachment for details, including SQL and the exception information. > * sql1, there is a normal filter (LO_SUPPKEY > 10) in the right side > subquery, and Analyzer works as expected; > * sql2, there is a HAVING filter (HAVING COUNT(DISTINCT LO_SUPPKEY) > 1) in > the right side subquery, and Analyzer fails with "Resolved attribute(s) > LO_SUPPKEY#337 missing ...". > From the debug info, the problem seems to occur after the rule > DeduplicateRelations is applied.
[jira] [Updated] (SPARK-38314) Fail to read parquet files after writing the hidden file metadata in
[ https://issues.apache.org/jira/browse/SPARK-38314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yaohua Zhao updated SPARK-38314: Description: Selecting and then writing df containing hidden file metadata column `_metadata` into a file format like `parquet`, `delta` will still keep the internal `Attribute` metadata information. Then when reading those `parquet`, `delta` files again, it will actually break the code, because it wrongly thinks user data schema named `_metadata` is a hidden file source metadata column. Reproducible code: {code:java} // prepare a file source df df.select("*", "_metadata") .write.format("parquet").save(path) spark.read.format("parquet").load(path) .select("*").show(){code} was: Selecting and then writing df containing hidden file metadata column `_metadata` into a file format like `parquet`, `delta` will still keep the internal `Attribute` metadata information. Then when reading those `parquet`, `delta` files again, it will actually break the code, because it wrongly thinks user data schema named `_metadata` is a hidden file source metadata column. Reproducible code: ``` // prepare a file source df df.select("*", "_metadata") .write.format("parquet").save(path) spark.read.format("parquet").load(path) .select("*").show() ``` > Fail to read parquet files after writing the hidden file metadata in > > > Key: SPARK-38314 > URL: https://issues.apache.org/jira/browse/SPARK-38314 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1 >Reporter: Yaohua Zhao >Priority: Major > > Selecting and then writing df containing hidden file metadata column > `_metadata` into a file format like `parquet`, `delta` will still keep the > internal `Attribute` metadata information. Then when reading those `parquet`, > `delta` files again, it will actually break the code, because it wrongly > thinks user data schema named `_metadata` is a hidden file source metadata > column. 
> > Reproducible code: > {code:java} > // prepare a file source df > df.select("*", "_metadata") > .write.format("parquet").save(path) > spark.read.format("parquet").load(path) > .select("*").show(){code}
[jira] [Created] (SPARK-38314) Fail to read parquet files after writing the hidden file metadata in
Yaohua Zhao created SPARK-38314: --- Summary: Fail to read parquet files after writing the hidden file metadata in Key: SPARK-38314 URL: https://issues.apache.org/jira/browse/SPARK-38314 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.1 Reporter: Yaohua Zhao Selecting and then writing a df that contains the hidden file metadata column `_metadata` to a file format such as `parquet` or `delta` still keeps the internal `Attribute` metadata information. Reading those `parquet` or `delta` files back then breaks, because the reader wrongly treats the user data column named `_metadata` as a hidden file source metadata column. Reproducible code: ``` // prepare a file source df df.select("*", "_metadata") .write.format("parquet").save(path) spark.read.format("parquet").load(path) .select("*").show() ```
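One possible mitigation for the round-trip failure, sketched below, is to persist the metadata under a different column name and explicitly clear the column-level metadata so the internal hidden-column marker is not written into the Parquet schema. This is an untested, hypothetical sketch — the `source_file_metadata` name is invented, and `df`, `path`, and `spark` are assumed to exist as in the reproduction above — not the project's actual fix:

```python
# Hypothetical workaround sketch (untested): rename the hidden _metadata
# column and reset its column-level metadata before writing, so a later
# reader does not mistake the persisted column for the hidden file-source
# metadata column. Assumes a live SparkSession `spark`, a file-source
# DataFrame `df`, and an output `path`.
from pyspark.sql.functions import col

(df.select("*", col("_metadata").alias("source_file_metadata", metadata={}))
   .write.format("parquet").save(path))

# The files read back now contain an ordinary user column.
spark.read.format("parquet").load(path).show()
```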
[jira] [Updated] (SPARK-38285) ClassCastException: GenericArrayData cannot be cast to InternalRow
[ https://issues.apache.org/jira/browse/SPARK-38285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alessandro Bacchini updated SPARK-38285: Description: The following code with Spark 3.2.1 raises an exception: {code:python} import pyspark.sql.functions as F from pyspark.sql.types import StructType, StructField, ArrayType, StringType t = StructType([ StructField('o', ArrayType( StructType([ StructField('s', StringType(), False), StructField('b', ArrayType( StructType([ StructField('e', StringType(), False) ]), True), False) ]), True), False)]) value = { "o": [ { "s": "string1", "b": [ { "e": "string2" }, { "e": "string3" } ] }, { "s": "string4", "b": [ { "e": "string5" }, { "e": "string6" }, { "e": "string7" } ] } ] } df = ( spark.createDataFrame([value], schema=t) .select(F.explode("o").alias("eo")) .select("eo.b.e") ) df.show() {code} The exception message is: {code} java.lang.ClassCastException: org.apache.spark.sql.catalyst.util.GenericArrayData cannot be cast to org.apache.spark.sql.catalyst.InternalRow at org.apache.spark.sql.catalyst.util.GenericArrayData.getStruct(GenericArrayData.scala:76) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759) at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:80) at org.apache.spark.sql.execution.collect.Collector.$anonfun$processFunc$1(Collector.scala:155) at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75) at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110) at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75) at 
com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55) at org.apache.spark.scheduler.Task.doRunTask(Task.scala:153) at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:122) at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110) at org.apache.spark.scheduler.Task.run(Task.scala:93) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:824) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1641) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:827) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:683) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) {code} I am using Spark 3.2.1, but I don't know whether Spark 3.3.0 is affected as well. Please note that the issue seems related to SPARK-37577: I am using the same DataFrame schema, but this time I have populated it with non-empty values. I think this is a bug because with the following configuration it works as expected: {code:python} spark.conf.set("spark.sql.optimizer.expression.nestedPruning.enabled", False) spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", False) {code} Update: the provided code works with Spark 3.1.2 without problems, so the error seems due to expression pruning. 
The expected result is: {code}
+---------------------------+
|e                          |
+---------------------------+
|[string2, string3]         |
|[string5, string6, string7]|
+---------------------------+
{code} was: The following code with Spark 3.2.1 raises an exception: {code:python} import pyspark.sql.functions as F from pyspark.sql.types import StructType, StructField, ArrayType, StringType t = StructType([ StructField('o', ArrayType( StructType([
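The semantics of `explode("o")` followed by `select("eo.b.e")` can be mirrored in plain Python on the same `value` dictionary, which makes the expected result above easy to check independently of Spark. This is a sketch for illustration only; the variable names are taken from the reproduction:

```python
# Plain-Python mirror of: explode("o"), then select "eo.b.e".
# Each exploded row is one struct from the "o" array; selecting field "e"
# across the array of structs "b" yields the list of "e" values per row.
value = {
    "o": [
        {"s": "string1", "b": [{"e": "string2"}, {"e": "string3"}]},
        {"s": "string4",
         "b": [{"e": "string5"}, {"e": "string6"}, {"e": "string7"}]},
    ]
}

rows = [[inner["e"] for inner in eo["b"]] for eo in value["o"]]
print(rows)  # [['string2', 'string3'], ['string5', 'string6', 'string7']]
```

This matches the two rows shown in the expected `show()` output, so the `ClassCastException` is a failure of the optimized physical plan, not of the query's semantics.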