[jira] [Commented] (SPARK-36387) Fix Series.astype from datetime to nullable string
[ https://issues.apache.org/jira/browse/SPARK-36387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397828#comment-17397828 ] Hyukjin Kwon commented on SPARK-36387: -- If he's inactive, let's go ahead, [~itholic]. We'd better include all of them in Spark 3.2. > Fix Series.astype from datetime to nullable string > -- > > Key: SPARK-36387 > URL: https://issues.apache.org/jira/browse/SPARK-36387 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Priority: Major > Attachments: image-2021-08-12-14-24-31-321.png > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
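For reference, the pandas behavior that pandas-on-Spark's `Series.astype` is being aligned with can be sketched with plain pandas (a hedged illustration only; this is not the PySpark code path, and the exact missing-value rendering may vary across pandas versions):

```python
import pandas as pd

# Cast a datetime series to the nullable "string" extension dtype.
s = pd.Series([pd.Timestamp("2021-08-12 14:24:31"), pd.NaT])
out = s.astype("string")
print(out.dtype)  # string
print(out.tolist())
```

The fix tracked here is about making the pandas-on-Spark result of such a cast match what plain pandas produces.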
[jira] [Commented] (SPARK-36387) Fix Series.astype from datetime to nullable string
[ https://issues.apache.org/jira/browse/SPARK-36387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397825#comment-17397825 ] Haejoon Lee commented on SPARK-36387: - [~dc-heros], are you still working on this? No need to rush, just for confirmation :) > Fix Series.astype from datetime to nullable string > -- > > Key: SPARK-36387 > URL: https://issues.apache.org/jira/browse/SPARK-36387 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Priority: Major > Attachments: image-2021-08-12-14-24-31-321.png >
[jira] [Updated] (SPARK-36387) Fix Series.astype from datetime to nullable string
[ https://issues.apache.org/jira/browse/SPARK-36387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-36387: Attachment: image-2021-08-12-14-24-31-321.png > Fix Series.astype from datetime to nullable string > -- > > Key: SPARK-36387 > URL: https://issues.apache.org/jira/browse/SPARK-36387 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Priority: Major > Attachments: image-2021-08-12-14-24-31-321.png >
[jira] [Resolved] (SPARK-36479) improve datetime test coverage in SQL files
[ https://issues.apache.org/jira/browse/SPARK-36479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-36479. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 33707 [https://github.com/apache/spark/pull/33707] > improve datetime test coverage in SQL files > --- > > Key: SPARK-36479 > URL: https://issues.apache.org/jira/browse/SPARK-36479 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.2.0 >
[jira] [Assigned] (SPARK-36479) improve datetime test coverage in SQL files
[ https://issues.apache.org/jira/browse/SPARK-36479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-36479: --- Assignee: Wenchen Fan > improve datetime test coverage in SQL files > --- > > Key: SPARK-36479 > URL: https://issues.apache.org/jira/browse/SPARK-36479 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major >
[jira] [Updated] (SPARK-33807) Data Source V2: Remove read specific distributions
[ https://issues.apache.org/jira/browse/SPARK-33807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-33807: Affects Version/s: (was: 3.2.0) 3.3.0 > Data Source V2: Remove read specific distributions > -- > > Key: SPARK-33807 > URL: https://issues.apache.org/jira/browse/SPARK-33807 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Anton Okolnychyi >Priority: Blocker > > We should remove the read-specific distributions for DS V2 as discussed > [here|https://github.com/apache/spark/pull/30706#discussion_r543059827].
[jira] [Assigned] (SPARK-36486) Support type constructed string as day time interval
[ https://issues.apache.org/jira/browse/SPARK-36486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36486: Assignee: (was: Apache Spark) > Support type constructed string as day time interval > > > Key: SPARK-36486 > URL: https://issues.apache.org/jira/browse/SPARK-36486 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Priority: Major >
[jira] [Assigned] (SPARK-36486) Support type constructed string as day time interval
[ https://issues.apache.org/jira/browse/SPARK-36486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36486: Assignee: Apache Spark > Support type constructed string as day time interval > > > Key: SPARK-36486 > URL: https://issues.apache.org/jira/browse/SPARK-36486 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: Apache Spark >Priority: Major >
[jira] [Commented] (SPARK-36486) Support type constructed string as day time interval
[ https://issues.apache.org/jira/browse/SPARK-36486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397804#comment-17397804 ] Apache Spark commented on SPARK-36486: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/33717 > Support type constructed string as day time interval > > > Key: SPARK-36486 > URL: https://issues.apache.org/jira/browse/SPARK-36486 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Priority: Major >
[jira] [Commented] (SPARK-36485) Support cast type constructed string to year month interval
[ https://issues.apache.org/jira/browse/SPARK-36485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1739#comment-1739 ] Apache Spark commented on SPARK-36485: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/33716 > Support cast type constructed string to year month interval > --- > > Key: SPARK-36485 > URL: https://issues.apache.org/jira/browse/SPARK-36485 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Priority: Major > > cast('1 year 2 month' as interval year to month)
[jira] [Assigned] (SPARK-36485) Support cast type constructed string to year month interval
[ https://issues.apache.org/jira/browse/SPARK-36485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36485: Assignee: (was: Apache Spark) > Support cast type constructed string to year month interval > --- > > Key: SPARK-36485 > URL: https://issues.apache.org/jira/browse/SPARK-36485 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Priority: Major > > cast('1 year 2 month' as interval year to month)
[jira] [Assigned] (SPARK-36485) Support cast type constructed string to year month interval
[ https://issues.apache.org/jira/browse/SPARK-36485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36485: Assignee: Apache Spark > Support cast type constructed string to year month interval > --- > > Key: SPARK-36485 > URL: https://issues.apache.org/jira/browse/SPARK-36485 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: Apache Spark >Priority: Major > > cast('1 year 2 month' as interval year to month)
[jira] [Commented] (SPARK-36485) Support cast type constructed string to year month interval
[ https://issues.apache.org/jira/browse/SPARK-36485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397778#comment-17397778 ] Apache Spark commented on SPARK-36485: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/33716 > Support cast type constructed string to year month interval > --- > > Key: SPARK-36485 > URL: https://issues.apache.org/jira/browse/SPARK-36485 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Priority: Major > > cast('1 year 2 month' as interval year to month)
[jira] [Created] (SPARK-36486) Support type constructed string as day time interval
angerszhu created SPARK-36486: - Summary: Support type constructed string as day time interval Key: SPARK-36486 URL: https://issues.apache.org/jira/browse/SPARK-36486 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: angerszhu
[jira] [Assigned] (SPARK-36480) SessionWindowStateStoreSaveExec should not filter input rows against watermark
[ https://issues.apache.org/jira/browse/SPARK-36480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh reassigned SPARK-36480: --- Assignee: Jungtaek Lim > SessionWindowStateStoreSaveExec should not filter input rows against watermark > -- > > Key: SPARK-36480 > URL: https://issues.apache.org/jira/browse/SPARK-36480 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Critical > Fix For: 3.2.0 > > > SessionWindowStateStoreSaveExec receives all sessions, including existing > sessions, as input rows and stores them as they are. That said, we should not > filter out input rows before storing them into the state store, but we do. > Fortunately it hasn't shown any actual problem, due to the nature of how we > deal with watermarks against micro-batches, and it seems hard to come up with > a broken case, but it is better to fix it before someone manages to hit the > possible edge case.
[jira] [Commented] (SPARK-36480) SessionWindowStateStoreSaveExec should not filter input rows against watermark
[ https://issues.apache.org/jira/browse/SPARK-36480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397775#comment-17397775 ] L. C. Hsieh commented on SPARK-36480: - The issue was resolved at https://github.com/apache/spark/pull/33708. > SessionWindowStateStoreSaveExec should not filter input rows against watermark > -- > > Key: SPARK-36480 > URL: https://issues.apache.org/jira/browse/SPARK-36480 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: Jungtaek Lim >Priority: Critical > > SessionWindowStateStoreSaveExec receives all sessions, including existing > sessions, as input rows and stores them as they are. That said, we should not > filter out input rows before storing them into the state store, but we do. > Fortunately it hasn't shown any actual problem, due to the nature of how we > deal with watermarks against micro-batches, and it seems hard to come up with > a broken case, but it is better to fix it before someone manages to hit the > possible edge case.
[jira] [Resolved] (SPARK-36480) SessionWindowStateStoreSaveExec should not filter input rows against watermark
[ https://issues.apache.org/jira/browse/SPARK-36480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh resolved SPARK-36480. - Resolution: Fixed > SessionWindowStateStoreSaveExec should not filter input rows against watermark > -- > > Key: SPARK-36480 > URL: https://issues.apache.org/jira/browse/SPARK-36480 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: Jungtaek Lim >Priority: Critical > > SessionWindowStateStoreSaveExec receives all sessions, including existing > sessions, as input rows and stores them as they are. That said, we should not > filter out input rows before storing them into the state store, but we do. > Fortunately it hasn't shown any actual problem, due to the nature of how we > deal with watermarks against micro-batches, and it seems hard to come up with > a broken case, but it is better to fix it before someone manages to hit the > possible edge case.
[jira] [Updated] (SPARK-36480) SessionWindowStateStoreSaveExec should not filter input rows against watermark
[ https://issues.apache.org/jira/browse/SPARK-36480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh updated SPARK-36480: Fix Version/s: 3.2.0 > SessionWindowStateStoreSaveExec should not filter input rows against watermark > -- > > Key: SPARK-36480 > URL: https://issues.apache.org/jira/browse/SPARK-36480 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: Jungtaek Lim >Priority: Critical > Fix For: 3.2.0 > > > SessionWindowStateStoreSaveExec receives all sessions, including existing > sessions, as input rows and stores them as they are. That said, we should not > filter out input rows before storing them into the state store, but we do. > Fortunately it hasn't shown any actual problem, due to the nature of how we > deal with watermarks against micro-batches, and it seems hard to come up with > a broken case, but it is better to fix it before someone manages to hit the > possible edge case.
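The filtering problem described in SPARK-36480 can be illustrated with a toy sketch in plain Python (this is not Spark's implementation; the session layout and the numbers below are made up purely for illustration):

```python
# State store holds one session per key: key -> (session_start, session_end).
watermark = 100
state = {"user-1": (40, 90)}

# Input to the save step: the existing session merged with a late event,
# so it now extends past the watermark even though it *started* before it.
incoming = {"user-1": (40, 120)}

# Incorrect: filtering input rows against the watermark drops the update,
# because the merged session's start time predates the watermark.
filtered = {k: v for k, v in incoming.items() if v[0] >= watermark}
state_wrong = {**state, **filtered}   # user-1 stays (40, 90)

# Correct: store input rows as they are.
state_right = {**state, **incoming}   # user-1 becomes (40, 120)
print(state_wrong["user-1"], state_right["user-1"])
```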
[jira] [Created] (SPARK-36485) Support cast type constructed string to year month interval
angerszhu created SPARK-36485: - Summary: Support cast type constructed string to year month interval Key: SPARK-36485 URL: https://issues.apache.org/jira/browse/SPARK-36485 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: angerszhu cast('1 year 2 month' as interval year to month)
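The intended semantics of the cast above can be sketched in plain Python (a hedged toy parser, far simpler than Spark's actual interval grammar, which is stricter about the accepted forms): a year-month interval string reduces to a single total month count.

```python
import re

def parse_year_month_interval(s: str) -> int:
    """Toy version of CAST(s AS INTERVAL YEAR TO MONTH): returns total months."""
    m = re.fullmatch(r"\s*(?:(\d+)\s+years?)?\s*(?:(\d+)\s+months?)?\s*", s)
    if m is None or m.group(0).strip() == "":
        raise ValueError(f"cannot cast {s!r} to interval year to month")
    years, months = int(m.group(1) or 0), int(m.group(2) or 0)
    return years * 12 + months

print(parse_year_month_interval("1 year 2 month"))  # 14
```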
[jira] [Assigned] (SPARK-32986) Add bucket scan info in explain output of FileSourceScanExec
[ https://issues.apache.org/jira/browse/SPARK-32986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-32986: - Assignee: Cheng Su > Add bucket scan info in explain output of FileSourceScanExec > > > Key: SPARK-32986 > URL: https://issues.apache.org/jira/browse/SPARK-32986 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Cheng Su >Assignee: Cheng Su >Priority: Trivial > > As a followup from the discussion in > [https://github.com/apache/spark/pull/29804#discussion_r493229395], the > `FileSourceScanExec` explain output currently has no information to indicate > whether the table is read as a bucketed table and, if not, what the reason > behind it is. Adding this info to the `FileSourceScanExec` explain output can > help users/developers understand the query plan more easily, without spending > a lot of time debugging why a table is not read as a bucketed table.
[jira] [Resolved] (SPARK-32986) Add bucket scan info in explain output of FileSourceScanExec
[ https://issues.apache.org/jira/browse/SPARK-32986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-32986. --- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 33698 [https://github.com/apache/spark/pull/33698] > Add bucket scan info in explain output of FileSourceScanExec > > > Key: SPARK-32986 > URL: https://issues.apache.org/jira/browse/SPARK-32986 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Cheng Su >Assignee: Cheng Su >Priority: Trivial > Fix For: 3.3.0 > > > As a followup from the discussion in > [https://github.com/apache/spark/pull/29804#discussion_r493229395], the > `FileSourceScanExec` explain output currently has no information to indicate > whether the table is read as a bucketed table and, if not, what the reason > behind it is. Adding this info to the `FileSourceScanExec` explain output can > help users/developers understand the query plan more easily, without spending > a lot of time debugging why a table is not read as a bucketed table.
[jira] [Commented] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data
[ https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397747#comment-17397747 ] Dongjoon Hyun commented on SPARK-18105: --- I only ran the code I posted above because you wrote the following. What did you mean by `can be replaced with this`, then? {code:java} //So this line of code: ... // can be replaced with this Dataset relevantPivots = spark.read().parquet(pathToDataDedup).repartition(5000).cache(); {code} > LZ4 failed to decompress a stream of shuffled data > -- > > Key: SPARK-18105 > URL: https://issues.apache.org/jira/browse/SPARK-18105 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.1, 3.1.1 >Reporter: Davies Liu >Priority: Major > Attachments: TestWeightedGraph.java > > > When lz4 is used to compress the shuffle files, it may fail to decompress > with "Stream is corrupted" > {code} > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 92 in stage 5.0 failed 4 times, most recent failure: Lost task 92.3 in > stage 5.0 (TID 16616, 10.0.27.18): java.io.IOException: Stream is corrupted > at > org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:220) > at > org.apache.spark.io.LZ4BlockInputStream.available(LZ4BlockInputStream.java:109) > at java.io.BufferedInputStream.read(BufferedInputStream.java:353) > at java.io.DataInputStream.read(DataInputStream.java:149) > at com.google.common.io.ByteStreams.read(ByteStreams.java:828) > at com.google.common.io.ByteStreams.readFully(ByteStreams.java:695) > at > org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:127) > at > org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:110) > at scala.collection.Iterator$$anon$13.next(Iterator.scala:372) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at >
org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30) > at > org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:397) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > https://github.com/jpountz/lz4-java/issues/89
[jira] [Updated] (SPARK-34081) Only pushdown LeftSemi/LeftAnti over Aggregate if join can be planned as broadcast join
[ https://issues.apache.org/jira/browse/SPARK-34081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-34081: --- Issue Type: Improvement (was: Bug) > Only pushdown LeftSemi/LeftAnti over Aggregate if join can be planned as > broadcast join > --- > > Key: SPARK-34081 > URL: https://issues.apache.org/jira/browse/SPARK-34081 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.2.0 > > > Should not pushdown LeftSemi/LeftAnti over Aggregate for some cases. > {code:scala} > spark.range(5000L).selectExpr("id % 1 as a", "id % 1 as > b").write.saveAsTable("t1") spark.range(4000L).selectExpr("id % 8000 as > c", "id % 8000 as d").write.saveAsTable("t2") > spark.sql("SELECT distinct a, b FROM t1 INTERSECT SELECT distinct c, d FROM > t2").explain > {code} > Current: > {noformat} > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- HashAggregate(keys=[a#16L, b#17L], functions=[]) >+- HashAggregate(keys=[a#16L, b#17L], functions=[]) > +- HashAggregate(keys=[a#16L, b#17L], functions=[]) > +- Exchange hashpartitioning(a#16L, b#17L, 5), ENSURE_REQUIREMENTS, > [id=#72] > +- HashAggregate(keys=[a#16L, b#17L], functions=[]) >+- SortMergeJoin [coalesce(a#16L, 0), isnull(a#16L), > coalesce(b#17L, 0), isnull(b#17L)], [coalesce(c#18L, 0), isnull(c#18L), > coalesce(d#19L, 0), isnull(d#19L)], LeftSemi > :- Sort [coalesce(a#16L, 0) ASC NULLS FIRST, isnull(a#16L) > ASC NULLS FIRST, coalesce(b#17L, 0) ASC NULLS FIRST, isnull(b#17L) ASC NULLS > FIRST], false, 0 > : +- Exchange hashpartitioning(coalesce(a#16L, 0), > isnull(a#16L), coalesce(b#17L, 0), isnull(b#17L), 5), ENSURE_REQUIREMENTS, > [id=#65] > : +- FileScan parquet default.t1[a#16L,b#17L] Batched: > true, DataFilters: [], Format: Parquet, Location: > InMemoryFileIndex[file:/Users/yumwang/spark/spark-warehouse/org.apache.spark.sql.Data..., > PartitionFilters: [], 
PushedFilters: [], ReadSchema: > struct > +- Sort [coalesce(c#18L, 0) ASC NULLS FIRST, isnull(c#18L) > ASC NULLS FIRST, coalesce(d#19L, 0) ASC NULLS FIRST, isnull(d#19L) ASC NULLS > FIRST], false, 0 > +- Exchange hashpartitioning(coalesce(c#18L, 0), > isnull(c#18L), coalesce(d#19L, 0), isnull(d#19L), 5), ENSURE_REQUIREMENTS, > [id=#66] > +- HashAggregate(keys=[c#18L, d#19L], functions=[]) >+- Exchange hashpartitioning(c#18L, d#19L, 5), > ENSURE_REQUIREMENTS, [id=#61] > +- HashAggregate(keys=[c#18L, d#19L], > functions=[]) > +- FileScan parquet default.t2[c#18L,d#19L] > Batched: true, DataFilters: [], Format: Parquet, Location: > InMemoryFileIndex[file:/Users/yumwang/spark/spark-warehouse/org.apache.spark.sql.Data..., > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct > {noformat} > > Expected: > {noformat} > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- HashAggregate(keys=[a#16L, b#17L], functions=[]) >+- Exchange hashpartitioning(a#16L, b#17L, 5), ENSURE_REQUIREMENTS, > [id=#74] > +- HashAggregate(keys=[a#16L, b#17L], functions=[]) > +- SortMergeJoin [coalesce(a#16L, 0), isnull(a#16L), coalesce(b#17L, > 0), isnull(b#17L)], [coalesce(c#18L, 0), isnull(c#18L), coalesce(d#19L, 0), > isnull(d#19L)], LeftSemi > :- Sort [coalesce(a#16L, 0) ASC NULLS FIRST, isnull(a#16L) ASC > NULLS FIRST, coalesce(b#17L, 0) ASC NULLS FIRST, isnull(b#17L) ASC NULLS > FIRST], false, 0 > : +- Exchange hashpartitioning(coalesce(a#16L, 0), > isnull(a#16L), coalesce(b#17L, 0), isnull(b#17L), 5), ENSURE_REQUIREMENTS, > [id=#67] > : +- HashAggregate(keys=[a#16L, b#17L], functions=[]) > :+- Exchange hashpartitioning(a#16L, b#17L, 5), > ENSURE_REQUIREMENTS, [id=#61] > : +- HashAggregate(keys=[a#16L, b#17L], functions=[]) > : +- FileScan parquet default.t1[a#16L,b#17L] > Batched: true, DataFilters: [], Format: Parquet, Location: > InMemoryFileIndex[file:/Users/yumwang/spark/spark-warehouse/org.apache.spark.sql.Data..., > PartitionFilters: [], PushedFilters: [], 
ReadSchema: > struct > +- Sort [coalesce(c#18L, 0) ASC NULLS FIRST, isnull(c#18L) ASC > NULLS FIRST, coalesce(d#19L, 0) ASC NULLS FIRST, isnull(d#19L) ASC NULLS > FIRST], false, 0 >+- Exchange hashpartitioning(coalesce(c#18L, 0), > isnull(c#18L), coalesce(d#19L, 0), isnull(d#19L), 5),
[jira] [Commented] (SPARK-32285) Add PySpark support for nested timestamps with arrow
[ https://issues.apache.org/jira/browse/SPARK-32285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397729#comment-17397729 ] Hyukjin Kwon commented on SPARK-32285: -- Arrow version is 2.0.0 now. Can we fix this ticket? > Add PySpark support for nested timestamps with arrow > > > Key: SPARK-32285 > URL: https://issues.apache.org/jira/browse/SPARK-32285 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Priority: Major > > Currently with arrow optimizations, there is post-processing done in pandas > for timestamp columns to localize timezone. This is not done for nested > columns with timestamps such as StructType or ArrayType. > Adding support for this is needed for Apache Arrow 1.0.0 upgrade due to use > of structs with timestamps in groupedby key over a window. > As a simple first step, timestamps with 1 level nesting could be done first > and this will satisfy the immediate need. > NOTE: with Arrow 1.0.0, it might be possible to do the timezone processing > with pyarrow.array.cast, which could be easier done than in pandas.
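The timezone-localization post-processing mentioned above can be illustrated for a flat (non-nested) column with plain pandas (a hedged sketch only; the actual PySpark code operates on arrow-converted data, and the nested StructType/ArrayType case is precisely what this ticket says is missing):

```python
import pandas as pd

# Naive timestamps, as a column might look after an arrow conversion.
s = pd.Series(pd.to_datetime(["2021-08-12 00:00:00"]))

# Post-processing: attach a timezone to the naive values.
localized = s.dt.tz_localize("UTC")
print(localized.dtype)  # datetime64[ns, UTC]
```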
[jira] [Commented] (SPARK-36378) Minor changes to address a few identified server side inefficiencies
[ https://issues.apache.org/jira/browse/SPARK-36378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397725#comment-17397725 ] Mridul Muralidharan commented on SPARK-36378: - Thanks for marking this resolved [~Ngone51] ! The cherry pick messed up the merge script and I had to push it manually - which skipped the jira update :-) > Minor changes to address a few identified server side inefficiencies > > > Key: SPARK-36378 > URL: https://issues.apache.org/jira/browse/SPARK-36378 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Affects Versions: 3.2.0 >Reporter: Min Shen >Assignee: Min Shen >Priority: Major > Fix For: 3.2.0, 3.3.0 > > > With the SPIP ticket close to being finished, we have done some performance > evaluations to compare the performance of push-based shuffle in upstream > Spark with the production version we have internally at LinkedIn. > The evaluations have revealed a few regressions and also some additional perf > improvement opportunity. >
[jira] [Created] (SPARK-36484) Handle Stale block fetch failure on the client side by not retrying the requests
Venkata krishnan Sowrirajan created SPARK-36484: --- Summary: Handle Stale block fetch failure on the client side by not retrying the requests Key: SPARK-36484 URL: https://issues.apache.org/jira/browse/SPARK-36484 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.2.0 Reporter: Venkata krishnan Sowrirajan Similar to SPARK-36378, we need to handle stale block fetch failures on the client side; not handling them causes no correctness issues, but doing so would save server-side bandwidth that is otherwise wasted serving stale shuffle fetch requests.
[jira] [Resolved] (SPARK-36481) Expose LogisticRegression.setInitialModel
[ https://issues.apache.org/jira/browse/SPARK-36481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai resolved SPARK-36481. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 33710 [https://github.com/apache/spark/pull/33710] > Expose LogisticRegression.setInitialModel > - > > Key: SPARK-36481 > URL: https://issues.apache.org/jira/browse/SPARK-36481 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.2.0 >Reporter: Sean R. Owen >Assignee: Sean R. Owen >Priority: Minor > Fix For: 3.3.0 > > > Several Spark ML components already allow setting of an initial model, > including KMeans, LogisticRegression, and GaussianMixture. This is useful to > begin training from a known, reasonably good model. > However, the method in LogisticRegression is private to Spark. I don't see a > good reason why it should be, as the counterparts in KMeans et al. are not. > None of these are exposed in PySpark, which I don't necessarily want to > question or deal with now; there are other places one could arguably set an > initial model too, but here I'm just interested in exposing the existing, > tested functionality to callers.
[jira] [Assigned] (SPARK-36483) Fix intermittent test failure due to netty dependency version bump
[ https://issues.apache.org/jira/browse/SPARK-36483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36483: Assignee: Apache Spark > Fix intermittent test failure due to netty dependency version bump > -- > > Key: SPARK-36483 > URL: https://issues.apache.org/jira/browse/SPARK-36483 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Min Shen >Assignee: Apache Spark >Priority: Major > > In SPARK-35132, Spark's netty dependency version was bumped from 4.1.51 to > 4.1.63. > Since Netty version 4.1.52, a Netty-specific > io.netty.channel.StacklessClosedChannelException gets thrown when Netty's > AbstractChannel encounters a closed channel. > This can sometimes break the test > org.apache.spark.network.RPCIntegrationSuite, as reported > [here|https://github.com/apache/spark/pull/33613#issuecomment-896697401]. > This is because the hardcoded list of exception messages checked in > RPCIntegrationSuite does not include this new StacklessClosedChannelException. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
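The failure mode described above, a test that matches error messages against a hardcoded list and breaks when a dependency bump introduces a new exception type, can be sketched in a few lines. This is illustrative Python, not Spark's actual RPCIntegrationSuite code; the list contents and names are assumptions. The fix amounts to extending the allowlist with the new Netty exception.

```python
# Hypothetical allowlist of expected error-message fragments, in the spirit
# of the hardcoded list the ticket describes.
EXPECTED_ERRORS = [
    "java.nio.channels.ClosedChannelException",
    # Added for Netty >= 4.1.52, which throws its own stackless subclass:
    "io.netty.channel.StacklessClosedChannelException",
]

def is_expected_error(message: str) -> bool:
    """True if the observed failure message matches a known-expected error."""
    return any(expected in message for expected in EXPECTED_ERRORS)
```

Without the second entry, a test asserting `is_expected_error(...)` on the new Netty exception would fail intermittently, exactly as reported.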
[jira] [Assigned] (SPARK-36483) Fix intermittent test failure due to netty dependency version bump
[ https://issues.apache.org/jira/browse/SPARK-36483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36483: Assignee: (was: Apache Spark) > Fix intermittent test failure due to netty dependency version bump > -- > > Key: SPARK-36483 > URL: https://issues.apache.org/jira/browse/SPARK-36483 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Min Shen >Priority: Major > > In SPARK-35132, Spark's netty dependency version was bumped from 4.1.51 to > 4.1.63. > Since Netty version 4.1.52, a Netty-specific > io.netty.channel.StacklessClosedChannelException gets thrown when Netty's > AbstractChannel encounters a closed channel. > This can sometimes break the test > org.apache.spark.network.RPCIntegrationSuite, as reported > [here|https://github.com/apache/spark/pull/33613#issuecomment-896697401]. > This is because the hardcoded list of exception messages checked in > RPCIntegrationSuite does not include this new StacklessClosedChannelException. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36483) Fix intermittent test failure due to netty dependency version bump
[ https://issues.apache.org/jira/browse/SPARK-36483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397639#comment-17397639 ] Apache Spark commented on SPARK-36483: -- User 'Victsm' has created a pull request for this issue: https://github.com/apache/spark/pull/33713 > Fix intermittent test failure due to netty dependency version bump > -- > > Key: SPARK-36483 > URL: https://issues.apache.org/jira/browse/SPARK-36483 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Min Shen >Priority: Major > > In SPARK-35132, Spark's netty dependency version was bumped from 4.1.51 to > 4.1.63. > Since Netty version 4.1.52, a Netty-specific > io.netty.channel.StacklessClosedChannelException gets thrown when Netty's > AbstractChannel encounters a closed channel. > This can sometimes break the test > org.apache.spark.network.RPCIntegrationSuite, as reported > [here|https://github.com/apache/spark/pull/33613#issuecomment-896697401]. > This is because the hardcoded list of exception messages checked in > RPCIntegrationSuite does not include this new StacklessClosedChannelException. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36483) Fix intermittent test failure due to netty dependency version bump
[ https://issues.apache.org/jira/browse/SPARK-36483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Min Shen updated SPARK-36483: - Description: In SPARK-35132, Spark's netty dependency version was bumped from 4.1.51 to 4.1.63. Since Netty version 4.1.52, a Netty-specific io.netty.channel.StacklessClosedChannelException gets thrown when Netty's AbstractChannel encounters a closed channel. This can sometimes break the test org.apache.spark.network.RPCIntegrationSuite, as reported [here|https://github.com/apache/spark/pull/33613#issuecomment-896697401]. This is because the hardcoded list of exception messages checked in RPCIntegrationSuite does not include this new StacklessClosedChannelException. > Fix intermittent test failure due to netty dependency version bump > -- > > Key: SPARK-36483 > URL: https://issues.apache.org/jira/browse/SPARK-36483 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Min Shen >Priority: Major > > In SPARK-35132, Spark's netty dependency version was bumped from 4.1.51 to > 4.1.63. > Since Netty version 4.1.52, a Netty-specific > io.netty.channel.StacklessClosedChannelException gets thrown when Netty's > AbstractChannel encounters a closed channel. > This can sometimes break the test > org.apache.spark.network.RPCIntegrationSuite, as reported > [here|https://github.com/apache/spark/pull/33613#issuecomment-896697401]. > This is because the hardcoded list of exception messages checked in > RPCIntegrationSuite does not include this new StacklessClosedChannelException. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36483) Fix intermittent test failure due to netty dependency version bump
Min Shen created SPARK-36483: Summary: Fix intermittent test failure due to netty dependency version bump Key: SPARK-36483 URL: https://issues.apache.org/jira/browse/SPARK-36483 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.2.0 Reporter: Min Shen -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data
[ https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397603#comment-17397603 ] Cameron Todd commented on SPARK-18105: -- Good to hear. Also, the count of 136,935,074 is right. When you say "I downloaded and ran the test code in my laptop. It works well to me", do you mean the Java code I attached, or the read-file-and-count() snippet you just attached above? The error is only raised after the first few aggregations and joins in my code, and I also doubt your laptop is big enough to process that ;). I can attach key config variables on request, but I'm using a cloud Kubernetes Spark cluster managed by OVH (Spark 3.0.1), so it's an out-of-the-box solution managed by OVH, not really by me. > LZ4 failed to decompress a stream of shuffled data > -- > > Key: SPARK-18105 > URL: https://issues.apache.org/jira/browse/SPARK-18105 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.1, 3.1.1 >Reporter: Davies Liu >Priority: Major > Attachments: TestWeightedGraph.java > > > When lz4 is used to compress the shuffle files, it may fail to decompress it > as "stream is corrupt" > {code} > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 92 in stage 5.0 failed 4 times, most recent failure: Lost task 92.3 in > stage 5.0 (TID 16616, 10.0.27.18): java.io.IOException: Stream is corrupted > at > org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:220) > at > org.apache.spark.io.LZ4BlockInputStream.available(LZ4BlockInputStream.java:109) > at java.io.BufferedInputStream.read(BufferedInputStream.java:353) > at java.io.DataInputStream.read(DataInputStream.java:149) > at com.google.common.io.ByteStreams.read(ByteStreams.java:828) > at com.google.common.io.ByteStreams.readFully(ByteStreams.java:695) > at > org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:127) > at > 
org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:110) > at scala.collection.Iterator$$anon$13.next(Iterator.scala:372) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at > org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30) > at > org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:397) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > https://github.com/jpountz/lz4-java/issues/89 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional 
commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data
[ https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397585#comment-17397585 ] Dongjoon Hyun commented on SPARK-18105: --- BTW, I checked that a single zip file contains 157 snappy parquet files which is generated by Apache Spark 3.0.1. {code} $ ls -al /Users/dongjoon/data/SPARK-18105/hash_import.parquet | grep parquet | wc -l 157 $ ls -al /Users/dongjoon/data/SPARK-18105/hash_import.parquet total 5040768 drwxr-xr-x 160 dongjoon staff 5120 Aug 9 02:55 . drwxr-xr-x4 dongjoon staff 128 Aug 11 12:32 .. -rw-r--r--1 dongjoon staff 0 Aug 9 02:21 _SUCCESS -rw-r--r--1 dongjoon staff 7098577 Aug 9 02:19 part-0-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet -rw-r--r--1 dongjoon staff 9310705 Aug 9 02:19 part-1-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet -rw-r--r--1 dongjoon staff 9286598 Aug 9 02:19 part-2-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet -rw-r--r--1 dongjoon staff 900 Aug 9 02:19 part-3-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet -rw-r--r--1 dongjoon staff 7174124 Aug 9 02:19 part-4-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet -rw-r--r--1 dongjoon staff 7269784 Aug 9 02:19 part-5-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet -rw-r--r--1 dongjoon staff 9045563 Aug 9 02:19 part-6-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet -rw-r--r--1 dongjoon staff 8874048 Aug 9 02:19 part-7-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet -rw-r--r--1 dongjoon staff 9088055 Aug 9 02:19 part-8-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet -rw-r--r--1 dongjoon staff 7921907 Aug 9 02:19 part-9-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet -rw-r--r--1 dongjoon staff 9734530 Aug 9 02:19 part-00010-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet -rw-r--r--1 dongjoon staff 11430364 Aug 9 02:19 part-00011-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet -rw-r--r--1 dongjoon staff 10106472 
Aug 9 02:19 part-00012-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet -rw-r--r--1 dongjoon staff 11672118 Aug 9 02:19 part-00013-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet -rw-r--r--1 dongjoon staff 11077873 Aug 9 02:19 part-00014-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet -rw-r--r--1 dongjoon staff 17459766 Aug 9 02:19 part-00015-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet -rw-r--r--1 dongjoon staff 18000141 Aug 9 02:19 part-00016-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet -rw-r--r--1 dongjoon staff 10688008 Aug 9 02:19 part-00017-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet -rw-r--r--1 dongjoon staff 10876962 Aug 9 02:19 part-00018-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet -rw-r--r--1 dongjoon staff 6001747 Aug 9 02:19 part-00019-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet -rw-r--r--1 dongjoon staff 13051685 Aug 9 02:19 part-00020-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet -rw-r--r--1 dongjoon staff 40607547 Aug 9 02:19 part-00021-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet -rw-r--r--1 dongjoon staff 16122066 Aug 9 02:19 part-00022-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet -rw-r--r--1 dongjoon staff 7858387 Aug 9 02:19 part-00023-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet -rw-r--r--1 dongjoon staff 7953870 Aug 9 02:19 part-00024-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet -rw-r--r--1 dongjoon staff 20777892 Aug 9 02:19 part-00025-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet -rw-r--r--1 dongjoon staff 7785559 Aug 9 02:19 part-00026-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet -rw-r--r--1 dongjoon staff 6101494 Aug 9 02:19 part-00027-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet -rw-r--r--1 dongjoon staff 14878775 Aug 9 02:19 part-00028-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet -rw-r--r--1 dongjoon staff 8807412 Aug 9 02:19 
part-00029-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet -rw-r--r--1 dongjoon staff664925 Aug 9 02:19 part-00030-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet -rw-r--r--1 dongjoon staff 36711578 Aug 9 02:19 part-00031-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet -rw-r--r--1 dongjoon staff 6376692 Aug 9 02:19 part-00032-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet -rw-r--r--1 dongjoon staff 9121175 Aug 9 02:19 part-00033-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet -rw-r--r--1 dongjoon staff 12318653 Aug 9 02:19
[jira] [Commented] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data
[ https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397584#comment-17397584 ] Dongjoon Hyun commented on SPARK-18105: --- Thank you, [~cameron.todd]. I downloaded and ran the test code in my laptop. It works well to me. Do you have some other configurations with you? {code} SPARK-18105:$ du -h hash_import.parquet 2.4Ghash_import.parquet {code} {code} scala> val df = spark.read.parquet("/Users/dongjoon/data/SPARK-18105/hash_import.parquet").repartition(5000).cache() df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: bigint, pivot_hash: int ... 1 more field] scala> df.count() res3: Long = 136935074 scala> spark.version res4: String = 3.1.2 {code} > LZ4 failed to decompress a stream of shuffled data > -- > > Key: SPARK-18105 > URL: https://issues.apache.org/jira/browse/SPARK-18105 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.1, 3.1.1 >Reporter: Davies Liu >Priority: Major > Attachments: TestWeightedGraph.java > > > When lz4 is used to compress the shuffle files, it may fail to decompress it > as "stream is corrupt" > {code} > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 92 in stage 5.0 failed 4 times, most recent failure: Lost task 92.3 in > stage 5.0 (TID 16616, 10.0.27.18): java.io.IOException: Stream is corrupted > at > org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:220) > at > org.apache.spark.io.LZ4BlockInputStream.available(LZ4BlockInputStream.java:109) > at java.io.BufferedInputStream.read(BufferedInputStream.java:353) > at java.io.DataInputStream.read(DataInputStream.java:149) > at com.google.common.io.ByteStreams.read(ByteStreams.java:828) > at com.google.common.io.ByteStreams.readFully(ByteStreams.java:695) > at > org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:127) > at > 
org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:110) > at scala.collection.Iterator$$anon$13.next(Iterator.scala:372) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at > org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30) > at > org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:397) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > https://github.com/jpountz/lz4-java/issues/89 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional 
commands, e-mail: issues-h...@spark.apache.org
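The "Stream is corrupted" failure in the quoted stack trace comes from LZ4 block decompression hitting a payload it cannot decode. As a self-contained illustration using stdlib zlib as a stand-in for LZ4 (an assumption made purely so the sketch has no external dependency), this shows how a truncated compressed block surfaces as a decompression error rather than silently yielding short data:

```python
import zlib

def safe_decompress(payload: bytes):
    """Return the decompressed bytes, or a corruption report on failure."""
    try:
        return zlib.decompress(payload)
    except zlib.error as exc:
        # Mirrors the "Stream is corrupted" IOException path in the trace.
        return f"stream is corrupt: {exc}"

data = zlib.compress(b"shuffle block bytes" * 100)
assert safe_decompress(data) == b"shuffle block bytes" * 100

# Simulate a partially transferred shuffle block: the checksum/trailer is
# missing, so decompression fails loudly instead of returning bad data.
truncated = data[: len(data) // 2]
assert isinstance(safe_decompress(truncated), str)
```

The real LZ4 path differs in format details, but the debugging approach is the same: a corrupt-stream error usually means the bytes on the wire or on disk were cut short or altered, not that the decompressor is at fault.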
[jira] [Updated] (SPARK-34827) Support fetching shuffle blocks in batch with i/o encryption
[ https://issues.apache.org/jira/browse/SPARK-34827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-34827: -- Affects Version/s: 3.3.0 > Support fetching shuffle blocks in batch with i/o encryption > > > Key: SPARK-34827 > URL: https://issues.apache.org/jira/browse/SPARK-34827 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0, 3.3.0 >Reporter: Dongjoon Hyun >Priority: Blocker > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34827) Support fetching shuffle blocks in batch with i/o encryption
[ https://issues.apache.org/jira/browse/SPARK-34827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397569#comment-17397569 ] Dongjoon Hyun commented on SPARK-34827: --- Thank you. I moved the Target Version to 3.3.0. > Support fetching shuffle blocks in batch with i/o encryption > > > Key: SPARK-34827 > URL: https://issues.apache.org/jira/browse/SPARK-34827 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Priority: Blocker > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34827) Support fetching shuffle blocks in batch with i/o encryption
[ https://issues.apache.org/jira/browse/SPARK-34827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-34827: -- Target Version/s: 3.3.0 (was: 3.2.0) > Support fetching shuffle blocks in batch with i/o encryption > > > Key: SPARK-34827 > URL: https://issues.apache.org/jira/browse/SPARK-34827 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Priority: Blocker > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36378) Minor changes to address a few identified server side inefficiencies
[ https://issues.apache.org/jira/browse/SPARK-36378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397563#comment-17397563 ] Apache Spark commented on SPARK-36378: -- User 'Victsm' has created a pull request for this issue: https://github.com/apache/spark/pull/33713 > Minor changes to address a few identified server side inefficiencies > > > Key: SPARK-36378 > URL: https://issues.apache.org/jira/browse/SPARK-36378 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Affects Versions: 3.2.0 >Reporter: Min Shen >Assignee: Min Shen >Priority: Major > Fix For: 3.2.0, 3.3.0 > > > With the SPIP ticket close to being finished, we have done some performance > evaluations to compare the performance of push-based shuffle in upstream > Spark with the production version we have internally at LinkedIn. > The evaluations have revealed a few regressions and also some additional perf > improvement opportunity. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36482) Bump orc to 1.6.10
[ https://issues.apache.org/jira/browse/SPARK-36482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-36482. --- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 33712 [https://github.com/apache/spark/pull/33712] > Bump orc to 1.6.10 > -- > > Key: SPARK-36482 > URL: https://issues.apache.org/jira/browse/SPARK-36482 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: William Hyun >Priority: Major > Fix For: 3.2.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36482) Bump orc to 1.6.10
[ https://issues.apache.org/jira/browse/SPARK-36482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-36482: - Assignee: William Hyun > Bump orc to 1.6.10 > -- > > Key: SPARK-36482 > URL: https://issues.apache.org/jira/browse/SPARK-36482 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: William Hyun >Assignee: William Hyun >Priority: Major > Fix For: 3.2.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36482) Bump orc to 1.6.10
[ https://issues.apache.org/jira/browse/SPARK-36482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36482: Assignee: Apache Spark > Bump orc to 1.6.10 > -- > > Key: SPARK-36482 > URL: https://issues.apache.org/jira/browse/SPARK-36482 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: William Hyun >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36482) Bump orc to 1.6.10
[ https://issues.apache.org/jira/browse/SPARK-36482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397486#comment-17397486 ] Apache Spark commented on SPARK-36482: -- User 'williamhyun' has created a pull request for this issue: https://github.com/apache/spark/pull/33712 > Bump orc to 1.6.10 > -- > > Key: SPARK-36482 > URL: https://issues.apache.org/jira/browse/SPARK-36482 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: William Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36482) Bump orc to 1.6.10
[ https://issues.apache.org/jira/browse/SPARK-36482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36482: Assignee: (was: Apache Spark) > Bump orc to 1.6.10 > -- > > Key: SPARK-36482 > URL: https://issues.apache.org/jira/browse/SPARK-36482 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: William Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36482) Bump orc to 1.6.10
William Hyun created SPARK-36482: Summary: Bump orc to 1.6.10 Key: SPARK-36482 URL: https://issues.apache.org/jira/browse/SPARK-36482 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.3.0 Reporter: William Hyun -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36446) YARN shuffle server restart crashes all dynamic allocation jobs that have deallocated an executor
[ https://issues.apache.org/jira/browse/SPARK-36446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397461#comment-17397461 ] Thomas Graves commented on SPARK-36446: --- [~adamkennedy77] ^ > YARN shuffle server restart crashes all dynamic allocation jobs that have > deallocated an executor > - > > Key: SPARK-36446 > URL: https://issues.apache.org/jira/browse/SPARK-36446 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 2.4.8, 3.1.2 >Reporter: Adam Kennedy >Priority: Critical > > When dynamic allocation is enabled, executors that deallocate rely on the > shuffle server to hold blocks and supply them to remaining executors. > When the YARN Shuffle Server restarts (either intentionally or due to a crash), > it loses block information and relies on being able to contact Executors (the > locations of which it durably stores) to refetch the list of blocks. > This mutual dependency on the other to hold block information fails fatally > under some common scenarios. > For example, if a Spark application is running under dynamic allocation, some > number of executors will almost always shut down. > If, after this has occurred, any shuffle server crashes or is restarted > (either directly when running as a standalone service, or as part of a YARN > node manager restart), then there is no way to restore the block data and it is > permanently lost. > Worse, when Executors try to fetch blocks from the shuffle server, the > shuffle server cannot locate the executor, decides it doesn't exist, treats > it as a fatal exception, and causes the application to terminate and crash. > Thus, in a real-world scenario that we observe on a 1000+ node multi-tenant > cluster where dynamic allocation is on by default, a rolling restart of the > YARN node managers will cause ALL jobs that have deallocated any executor and > have shuffled or transferred blocks to the shuffle server to crash. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
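The mutual dependency described in the report above can be made concrete with a toy model (illustrative Python only; the class and block names are assumptions, not Spark's actual data structures): the server durably stores executor locations but keeps block metadata in memory, while deallocated executors are simply gone, so after a restart neither side can reconstruct the block list.

```python
class ShuffleServer:
    """Toy model of the failure mode: durable locations, volatile block map."""

    def __init__(self):
        self.executor_locations = {"exec-1": "host-a"}   # durably stored
        self.blocks = {"exec-1": ["shuffle_0_0"]}        # in-memory only

    def restart(self):
        # Block metadata is lost on restart and must be refetched from
        # the executors themselves.
        self.blocks = {}

    def fetch(self, executor_id, live_executors):
        if executor_id not in self.blocks:
            if executor_id not in live_executors:
                # Deallocated under dynamic allocation: nothing to refetch
                # from, and the error is treated as fatal by the application.
                raise RuntimeError(f"executor {executor_id} not found")
            self.blocks[executor_id] = live_executors[executor_id]
        return self.blocks[executor_id]
```

In this model, `fetch("exec-1", live_executors={})` succeeds before a restart and raises after one, which is the crash path the ticket describes for jobs whose executors have already been deallocated.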
[jira] [Assigned] (SPARK-36481) Expose LogisticRegression.setInitialModel
[ https://issues.apache.org/jira/browse/SPARK-36481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36481: Assignee: Sean R. Owen (was: Apache Spark) > Expose LogisticRegression.setInitialModel > - > > Key: SPARK-36481 > URL: https://issues.apache.org/jira/browse/SPARK-36481 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.2.0 >Reporter: Sean R. Owen >Assignee: Sean R. Owen >Priority: Minor > > Several Spark ML components already allow setting of an initial model, > including KMeans, LogisticRegression, and GaussianMixture. This is useful to > begin training from a known reasonably good model. > However, the method in LogisticRegression is private to Spark. I don't see a > good reason why it should be as the others in KMeans et al are not. > None of these are exposed in Pyspark, which I don't necessarily want to > question or deal with now; there are other places one could arguably set an > initial model too, but, here just interested in exposing the existing, tested > functionality to callers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36481) Expose LogisticRegression.setInitialModel
[ https://issues.apache.org/jira/browse/SPARK-36481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36481: Assignee: Apache Spark (was: Sean R. Owen) > Expose LogisticRegression.setInitialModel > - > > Key: SPARK-36481 > URL: https://issues.apache.org/jira/browse/SPARK-36481 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.2.0 >Reporter: Sean R. Owen >Assignee: Apache Spark >Priority: Minor > > Several Spark ML components already allow setting of an initial model, > including KMeans, LogisticRegression, and GaussianMixture. This is useful to > begin training from a known reasonably good model. > However, the method in LogisticRegression is private to Spark. I don't see a > good reason why it should be as the others in KMeans et al are not. > None of these are exposed in Pyspark, which I don't necessarily want to > question or deal with now; there are other places one could arguably set an > initial model too, but, here just interested in exposing the existing, tested > functionality to callers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36481) Expose LogisticRegression.setInitialModel
[ https://issues.apache.org/jira/browse/SPARK-36481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397445#comment-17397445 ] Apache Spark commented on SPARK-36481: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/33710 > Expose LogisticRegression.setInitialModel > - > > Key: SPARK-36481 > URL: https://issues.apache.org/jira/browse/SPARK-36481 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.2.0 >Reporter: Sean R. Owen >Assignee: Sean R. Owen >Priority: Minor > > Several Spark ML components already allow setting of an initial model, > including KMeans, LogisticRegression, and GaussianMixture. This is useful to > begin training from a known reasonably good model. > However, the method in LogisticRegression is private to Spark. I don't see a > good reason why it should be as the others in KMeans et al are not. > None of these are exposed in Pyspark, which I don't necessarily want to > question or deal with now; there are other places one could arguably set an > initial model too, but, here just interested in exposing the existing, tested > functionality to callers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36481) Expose LogisticRegression.setInitialModel
Sean R. Owen created SPARK-36481: Summary: Expose LogisticRegression.setInitialModel Key: SPARK-36481 URL: https://issues.apache.org/jira/browse/SPARK-36481 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 3.2.0 Reporter: Sean R. Owen Assignee: Sean R. Owen Several Spark ML components already allow setting of an initial model, including KMeans, LogisticRegression, and GaussianMixture. This is useful to begin training from a known, reasonably good model. However, the method in LogisticRegression is private to Spark. I don't see a good reason why it should be, as the analogous methods in KMeans et al. are not. None of these are exposed in PySpark, which I don't necessarily want to question or deal with now; there are other places one could arguably set an initial model too, but here I'm just interested in exposing the existing, tested functionality to callers.
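A minimal sketch of how the exposed setter might be called from Scala once public. The signature shown (`setInitialModel` taking a `LogisticRegressionModel`) and the `training`/`priorModel` values are assumptions for illustration; the existing method is still `private[spark]` until the linked change lands.

```scala
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}

// Hypothetical usage once setInitialModel is exposed: warm-start training
// from a previously fitted model instead of the default initialization.
// `training` (a DataFrame) and `priorModel` are assumed to exist in scope.
val lr = new LogisticRegression().setMaxIter(50)
val warmStarted: LogisticRegressionModel = lr
  .setInitialModel(priorModel) // currently private[spark]; this issue proposes making it public
  .fit(training)
```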
[jira] [Updated] (SPARK-36360) Delete appName from StreamingSource
[ https://issues.apache.org/jira/browse/SPARK-36360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcel Neumann updated SPARK-36360: --- Description: The StreamingSource includes the appName in its sourceName. However, the appName should not be handled by the StreamingSource. It is already handled by the MetricsSystem. See all other MetricSources, e.g. ExecutorMetricsSource. Why is this important? See this part from the [documentation|https://spark.apache.org/docs/latest/monitoring.html#metrics]: ??Often times, users want to be able to track the metrics across apps for driver and executors, which is hard to do with application ID (i.e. {{spark.app.id}}) since it changes with every invocation of the app. For such use cases, a custom namespace can be specified for metrics reporting using {{spark.metrics.namespace}} configuration property. If, say, users wanted to set the metrics namespace to the name of the application, they can set the {{spark.metrics.namespace}} property to a value like {{$\{spark.app.name}}}. This value is then expanded appropriately by Spark and is used as the root namespace of the metrics system.?? This is only possible if the MetricsSystem handles the namespace which it does. But the StreamingSource additionally adds the appName in its sourceName, thus there is no way to configure a namespace that does not include the appName. was: The StreamingSource includes the appName in its sourceName. The appName should not be handled by the StreamingSource. It is already handled by the MetricsSystem. See all other MetricSources, e.g. ExecutorMetricsSource. Why is this important? See this part from the [documentation|https://spark.apache.org/docs/latest/monitoring.html#metrics]: ??Often times, users want to be able to track the metrics across apps for driver and executors, which is hard to do with application ID (i.e. {{spark.app.id}}) since it changes with every invocation of the app. 
For such use cases, a custom namespace can be specified for metrics reporting using {{spark.metrics.namespace}} configuration property. If, say, users wanted to set the metrics namespace to the name of the application, they can set the {{spark.metrics.namespace}} property to a value like {{${spark.app.name}}}. This value is then expanded appropriately by Spark and is used as the root namespace of the metrics system.?? This is only possible if the MetricsSystem handles the namespace which it does. But the StreamingSource additionally adds the appName in its sourceName, thus there is no way to configure a namespace that does not include the appName. > Delete appName from StreamingSource > > > Key: SPARK-36360 > URL: https://issues.apache.org/jira/browse/SPARK-36360 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Marcel Neumann >Priority: Blocker > > The StreamingSource includes the appName in its sourceName. However, the > appName should not be handled by the StreamingSource. It is already handled > by the MetricsSystem. See all other MetricSources, e.g. ExecutorMetricsSource. > Why is this important? See this part from the > [documentation|https://spark.apache.org/docs/latest/monitoring.html#metrics]: > ??Often times, users want to be able to track the metrics across apps for > driver and executors, which is hard to do with application ID (i.e. > {{spark.app.id}}) since it changes with every invocation of the app. For such > use cases, a custom namespace can be specified for metrics reporting using > {{spark.metrics.namespace}} configuration property. If, say, users wanted to > set the metrics namespace to the name of the application, they can set the > {{spark.metrics.namespace}} property to a value like {{$\{spark.app.name}}}. > This value is then expanded appropriately by Spark and is used as the root > namespace of the metrics system.?? 
> This is only possible if the MetricsSystem handles the namespace which it > does. But the StreamingSource additionally adds the appName in its > sourceName, thus there is no way to configure a namespace that does not > include the appName. >
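The namespace mechanism quoted from the monitoring docs can be sketched at submit time. This is a hedged sketch of the documented `spark.metrics.namespace` behavior, not a fix for the issue; the Graphite sink is just an assumed example sink.

```
# Sketch: set a custom root namespace for all metrics (documented behavior).
# Even with this setting, StreamingSource still prepends the appName in its
# own sourceName, which is exactly what this issue proposes to remove.
spark-submit \
  --conf spark.metrics.namespace='${spark.app.name}' \
  --conf 'spark.metrics.conf.*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink' \
  your-app.jar
```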
[jira] [Updated] (SPARK-36360) Delete appName from StreamingSource
[ https://issues.apache.org/jira/browse/SPARK-36360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcel Neumann updated SPARK-36360: --- Description: The StreamingSource includes the appName in its sourceName. The appName should not be handled by the StreamingSource. It is already handled by the MetricsSystem. See all other MetricSources, e.g. ExecutorMetricsSource. Why is this important? See this part from the [documentation|https://spark.apache.org/docs/latest/monitoring.html#metrics]: ??Often times, users want to be able to track the metrics across apps for driver and executors, which is hard to do with application ID (i.e. {{spark.app.id}}) since it changes with every invocation of the app. For such use cases, a custom namespace can be specified for metrics reporting using {{spark.metrics.namespace}} configuration property. If, say, users wanted to set the metrics namespace to the name of the application, they can set the {{spark.metrics.namespace}} property to a value like {{${spark.app.name}}}. This value is then expanded appropriately by Spark and is used as the root namespace of the metrics system.?? This is only possible if the MetricsSystem handles the namespace which it does. But the StreamingSource additionally adds the appName in its sourceName, thus there is no way to configure a namespace that does not include the appName. was: The StreamingSource includes the appName in its sourceName. This is not desired when using a custom namespace for metrics reporting using {{spark.metrics.namespace}} configuration property as the {{spark.app.name}} cannot be excluded in the name of the metric. Using a metrics namespace results in a duplicated indicator for {{spark.app.name}}. All other metrics do not include the appName in their sourceName, e.g. ExecutorMetricsSource. If desired you can configure that with a metrics namespace. The same should be done in StreamingSource. 
> Delete appName from StreamingSource > > > Key: SPARK-36360 > URL: https://issues.apache.org/jira/browse/SPARK-36360 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Marcel Neumann >Priority: Blocker > > The StreamingSource includes the appName in its sourceName. The appName > should not be handled by the StreamingSource. It is already handled by the > MetricsSystem. See all other MetricSources, e.g. ExecutorMetricsSource. > Why is this important? See this part from the > [documentation|https://spark.apache.org/docs/latest/monitoring.html#metrics]: > ??Often times, users want to be able to track the metrics across apps for > driver and executors, which is hard to do with application ID (i.e. > {{spark.app.id}}) since it changes with every invocation of the app. For such > use cases, a custom namespace can be specified for metrics reporting using > {{spark.metrics.namespace}} configuration property. If, say, users wanted to > set the metrics namespace to the name of the application, they can set the > {{spark.metrics.namespace}} property to a value like {{${spark.app.name}}}. > This value is then expanded appropriately by Spark and is used as the root > namespace of the metrics system.?? > This is only possible if the MetricsSystem handles the namespace which it > does. But the StreamingSource additionally adds the appName in its > sourceName, thus there is no way to configure a namespace that does not > include the appName. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36478) Removes outer join if all grouping and aggregate expressions are from the streamed side
[ https://issues.apache.org/jira/browse/SPARK-36478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wan Kun updated SPARK-36478: Description: Removes outer join if all grouping and aggregate expressions are from the streamed side. For example: {code:java} spark.range(200L).selectExpr("id AS a", "id as b", "id as c").createTempView("t1") spark.range(300L).selectExpr("id AS a").createTempView("t2") spark.sql("SELECT t1.b, max(t1.c) as c FROM t1 LEFT JOIN t2 ON t1.a = t2.a GROUP BY t1.b").explain(true) {code} Current optimized plan: {code:java} == Optimized Logical Plan == Aggregate [b#3L], [b#3L, max(c#4L) AS c#20L] +- Project [b#3L, c#4L] +- Join LeftOuter, (a#2L = a#10L) :- Project [id#0L AS a#2L, id#0L AS b#3L, id#0L AS c#4L] : +- Range (0, 200, step=1, splits=Some(1)) +- Project [id#8L AS a#10L] +- Range (0, 300, step=1, splits=Some(1)) {code} Expected optimized plan: {code:java} == Optimized Logical Plan == Aggregate [b#277L], [b#277L, max(c#278L) AS c#290L] +- Project [id#274L AS b#277L, id#274L AS c#278L] +- Range (0, 200, step=1, splits=Some(2)) {code} was: Removes outer join if all grouping and aggregate expressions are from the streamed side. 
For example: {code:java} spark.range(200L).selectExpr("id AS a", "id as b", "id as c").createTempView("t1") spark.range(300L).selectExpr("id AS a", "id as b", "id as c").createTempView("t2") spark.sql("SELECT t1.b, max(t1.c) as c FROM t1 LEFT JOIN t2 ON t1.a = t2.a GROUP BY t1.b").explain(true) {code} Current optimized plan: {code:java} == Optimized Logical Plan == Aggregate [b#3L], [b#3L, max(c#4L) AS c#20L] +- Project [b#3L, c#4L] +- Join LeftOuter, (a#2L = a#10L) :- Project [id#0L AS a#2L, id#0L AS b#3L, id#0L AS c#4L] : +- Range (0, 200, step=1, splits=Some(1)) +- Project [id#8L AS a#10L] +- Range (0, 300, step=1, splits=Some(1)) {code} Expected optimized plan: {code:java} == Optimized Logical Plan == Aggregate [b#277L], [b#277L, max(c#278L) AS c#290L] +- Project [id#274L AS b#277L, id#274L AS c#278L] +- Range (0, 200, step=1, splits=Some(2)) {code} > Removes outer join if all grouping and aggregate expressions are from the > streamed side > --- > > Key: SPARK-36478 > URL: https://issues.apache.org/jira/browse/SPARK-36478 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wan Kun >Priority: Minor > > Removes outer join if all grouping and aggregate expressions are from the > streamed side. 
> For example: > {code:java} > spark.range(200L).selectExpr("id AS a", "id as b", "id as > c").createTempView("t1") > spark.range(300L).selectExpr("id AS a").createTempView("t2") > spark.sql("SELECT t1.b, max(t1.c) as c FROM t1 LEFT JOIN t2 ON t1.a = t2.a > GROUP BY t1.b").explain(true) > {code} > Current optimized plan: > {code:java} > == Optimized Logical Plan == > Aggregate [b#3L], [b#3L, max(c#4L) AS c#20L] > +- Project [b#3L, c#4L] >+- Join LeftOuter, (a#2L = a#10L) > :- Project [id#0L AS a#2L, id#0L AS b#3L, id#0L AS c#4L] > : +- Range (0, 200, step=1, splits=Some(1)) > +- Project [id#8L AS a#10L] > +- Range (0, 300, step=1, splits=Some(1)) > {code} > Expected optimized plan: > {code:java} > == Optimized Logical Plan == > Aggregate [b#277L], [b#277L, max(c#278L) AS c#290L] > +- Project [id#274L AS b#277L, id#274L AS c#278L] >+- Range (0, 200, step=1, splits=Some(2)) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36360) Delete appName from StreamingSource
[ https://issues.apache.org/jira/browse/SPARK-36360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcel Neumann updated SPARK-36360: --- Description: The StreamingSource includes the appName in its sourceName. This is not desired when using a custom namespace for metrics reporting using {{spark.metrics.namespace}} configuration property as the {{spark.app.name}} cannot be excluded in the name of the metric. Using a metrics namespace results in a duplicated indicator for {{spark.app.name}}. All other metrics do not include the appName in their sourceName, e.g. ExecutorMetricsSource. If desired you can configure that with a metrics namespace. The same should be done in StreamingSource. was: The StreamingSource includes the appName in its sourceName. This is not desired for people using a custom namespace for metrics reporting using {{spark.metrics.namespace}} configuration property as the {{spark.app.name}} cannot be excluded in the name of the metric. Using a metrics namespace results in a duplicated indicator for {{spark.app.name}}. All other metrics do not include the appName in their sourceName, e.g. ExecutorMetricsSource. If desired you can configure that with a metrics namespace. The same should be done in StreamingSource. > Delete appName from StreamingSource > > > Key: SPARK-36360 > URL: https://issues.apache.org/jira/browse/SPARK-36360 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Marcel Neumann >Priority: Blocker > > The StreamingSource includes the appName in its sourceName. This is not > desired when using a custom namespace for metrics reporting using > {{spark.metrics.namespace}} configuration property as the {{spark.app.name}} > cannot be excluded in the name of the metric. Using a metrics namespace > results in a duplicated indicator for {{spark.app.name}}. > All other metrics do not include the appName in their sourceName, e.g. > ExecutorMetricsSource. 
If desired you can configure that with a metrics > namespace. The same should be done in StreamingSource.
[jira] [Updated] (SPARK-36360) Delete appName from StreamingSource
[ https://issues.apache.org/jira/browse/SPARK-36360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcel Neumann updated SPARK-36360: --- Summary: Delete appName from StreamingSource (was: StreamingSource duplicates appName) > Delete appName from StreamingSource > > > Key: SPARK-36360 > URL: https://issues.apache.org/jira/browse/SPARK-36360 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Marcel Neumann >Priority: Blocker > > The StreamingSource includes the appName in its sourceName. This is not > desired for people using a custom namespace for metrics reporting using > {{spark.metrics.namespace}} configuration property as the {{spark.app.name}} > cannot be excluded in the name of the metric. Using a metrics namespace > results in a duplicated indicator for {{spark.app.name}}. > All other metrics do not include the appName in their sourceName, e.g. > ExecutorMetricsSource. If desired you can configure that with a metrics > namespace. The same should be done in StreamingSource. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36360) StreamingSource duplicates appName
[ https://issues.apache.org/jira/browse/SPARK-36360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcel Neumann updated SPARK-36360: --- Description: The StreamingSource includes the appName in its sourceName. This is not desired for people using a custom namespace for metrics reporting using {{spark.metrics.namespace}} configuration property as the {{spark.app.name}} cannot be excluded in the name of the metric. Using a metrics namespace results in a duplicated indicator for {{spark.app.name}}. All other metrics do not include the appName in their sourceName, e.g. ExecutorMetricsSource. If desired you can configure that with a metrics namespace. The same should be done in StreamingSource. was: The StreamingSource includes the appName in its sourceName. This is not desired for people using a custom namespace for metrics reporting using {{spark.metrics.namespace}} configuration property as the {{spark.app.name}} cannot be excluded in the name of the metric. Using a metrics namespace results in a duplicated indicator for {{spark.app.name}}. All other metrics do not include the appName in their sourceName, e.g. ExecutorMetricsSource. If desired you can configure that with a metrics namespace. > StreamingSource duplicates appName > -- > > Key: SPARK-36360 > URL: https://issues.apache.org/jira/browse/SPARK-36360 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Marcel Neumann >Priority: Blocker > > The StreamingSource includes the appName in its sourceName. This is not > desired for people using a custom namespace for metrics reporting using > {{spark.metrics.namespace}} configuration property as the {{spark.app.name}} > cannot be excluded in the name of the metric. Using a metrics namespace > results in a duplicated indicator for {{spark.app.name}}. > All other metrics do not include the appName in their sourceName, e.g. > ExecutorMetricsSource. 
If desired you can configure that with a metrics > namespace. The same should be done in StreamingSource.
[jira] [Updated] (SPARK-36360) StreamingSource duplicates appName
[ https://issues.apache.org/jira/browse/SPARK-36360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcel Neumann updated SPARK-36360: --- Description: The StreamingSource includes the appName in its sourceName. This is not desired for people using a custom namespace for metrics reporting using {{spark.metrics.namespace}} configuration property as the {{spark.app.name}} cannot be excluded in the name of the metric. Using a metrics namespace results in a duplicated indicator for {{spark.app.name}}. All other metrics do not include the appName in their sourceName, e.g. ExecutorMetricsSource. If desired you can configure that with a metrics namespace. was:The StreamingSource includes the appName in its sourceName. This is not desired for people using a custom namespace for metrics reporting using {{spark.metrics.namespace}} configuration property as the {{spark.app.name}} cannot be excluded in the name of the metric. Using a metrics namespace results in a duplicated indicator for {{spark.app.name}}. > StreamingSource duplicates appName > -- > > Key: SPARK-36360 > URL: https://issues.apache.org/jira/browse/SPARK-36360 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Marcel Neumann >Priority: Blocker > > The StreamingSource includes the appName in its sourceName. This is not > desired for people using a custom namespace for metrics reporting using > {{spark.metrics.namespace}} configuration property as the {{spark.app.name}} > cannot be excluded in the name of the metric. Using a metrics namespace > results in a duplicated indicator for {{spark.app.name}}. > All other metrics do not include the appName in their sourceName, e.g. > ExecutorMetricsSource. If desired you can configure that with a metrics > namespace. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36360) StreamingSource duplicates appName
[ https://issues.apache.org/jira/browse/SPARK-36360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcel Neumann updated SPARK-36360: --- Priority: Blocker (was: Trivial) > StreamingSource duplicates appName > -- > > Key: SPARK-36360 > URL: https://issues.apache.org/jira/browse/SPARK-36360 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Marcel Neumann >Priority: Blocker > > The StreamingSource includes the appName in its sourceName. This is not > desired for people using a custom namespace for metrics reporting using > {{spark.metrics.namespace}} configuration property as the {{spark.app.name}} > cannot be excluded in the name of the metric. Using a metrics namespace > results in a duplicated indicator for {{spark.app.name}}.
[jira] [Updated] (SPARK-36360) StreamingSource duplicates appName
[ https://issues.apache.org/jira/browse/SPARK-36360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcel Neumann updated SPARK-36360: --- Priority: Trivial (was: Minor) > StreamingSource duplicates appName > -- > > Key: SPARK-36360 > URL: https://issues.apache.org/jira/browse/SPARK-36360 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Marcel Neumann >Priority: Trivial > > The StreamingSource includes the appName in its sourceName. This is not > desired for people using a custom namespace for metrics reporting using > {{spark.metrics.namespace}} configuration property as the {{spark.app.name}} > cannot be excluded in the name of the metric. Using a metrics namespace > results in a duplicated indicator for {{spark.app.name}}.
[jira] [Updated] (SPARK-36478) Removes outer join if all grouping and aggregate expressions are from the streamed side
[ https://issues.apache.org/jira/browse/SPARK-36478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wan Kun updated SPARK-36478: Description: Removes outer join if all grouping and aggregate expressions are from the streamed side. For example: {code:java} spark.range(200L).selectExpr("id AS a", "id as b", "id as c").createTempView("t1") spark.range(300L).selectExpr("id AS a", "id as b", "id as c").createTempView("t2") spark.sql("SELECT t1.b, max(t1.c) as c FROM t1 LEFT JOIN t2 ON t1.a = t2.a GROUP BY t1.b").explain(true) {code} Current optimized plan: {code:java} == Optimized Logical Plan == Aggregate [b#3L], [b#3L, max(c#4L) AS c#20L] +- Project [b#3L, c#4L] +- Join LeftOuter, (a#2L = a#10L) :- Project [id#0L AS a#2L, id#0L AS b#3L, id#0L AS c#4L] : +- Range (0, 200, step=1, splits=Some(1)) +- Project [id#8L AS a#10L] +- Range (0, 300, step=1, splits=Some(1)) {code} Expected optimized plan: {code:java} == Optimized Logical Plan == Aggregate [b#277L], [b#277L, max(c#278L) AS c#290L] +- Project [id#274L AS b#277L, id#274L AS c#278L] +- Range (0, 200, step=1, splits=Some(2)) {code} was: Removes outer join if all grouping and aggregate expressions are from the streamed side. 
For example: {code:java} spark.range(200L).selectExpr("id AS a").createTempView("t1") spark.range(300L).selectExpr("id AS b").createTempView("t2") spark.sql("SELECT DISTINCT a FROM t1 LEFT JOIN t2 ON a = b").explain(true){code} Current optimized plan: {code:java} == Optimized Logical Plan == Aggregate [b#3L], [b#3L, max(c#4L) AS c#20L] +- Project [b#3L, c#4L] +- Join LeftOuter, (a#2L = a#10L) :- Project [id#0L AS a#2L, id#0L AS b#3L, id#0L AS c#4L] : +- Range (0, 200, step=1, splits=Some(1)) +- Project [id#8L AS a#10L] +- Range (0, 300, step=1, splits=Some(1)) {code} Expected optimized plan: {code:java} == Optimized Logical Plan == Aggregate [b#277L], [b#277L, max(c#278L) AS c#290L] +- Project [id#274L AS b#277L, id#274L AS c#278L] +- Range (0, 200, step=1, splits=Some(2)) {code} > Removes outer join if all grouping and aggregate expressions are from the > streamed side > --- > > Key: SPARK-36478 > URL: https://issues.apache.org/jira/browse/SPARK-36478 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wan Kun >Priority: Minor > > Removes outer join if all grouping and aggregate expressions are from the > streamed side. 
> For example: > {code:java} > spark.range(200L).selectExpr("id AS a", "id as b", "id as > c").createTempView("t1") > spark.range(300L).selectExpr("id AS a", "id as b", "id as > c").createTempView("t2") > spark.sql("SELECT t1.b, max(t1.c) as c FROM t1 LEFT JOIN t2 ON t1.a = t2.a > GROUP BY t1.b").explain(true) > {code} > Current optimized plan: > {code:java} > == Optimized Logical Plan == > Aggregate [b#3L], [b#3L, max(c#4L) AS c#20L] > +- Project [b#3L, c#4L] >+- Join LeftOuter, (a#2L = a#10L) > :- Project [id#0L AS a#2L, id#0L AS b#3L, id#0L AS c#4L] > : +- Range (0, 200, step=1, splits=Some(1)) > +- Project [id#8L AS a#10L] > +- Range (0, 300, step=1, splits=Some(1)) > {code} > Expected optimized plan: > {code:java} > == Optimized Logical Plan == > Aggregate [b#277L], [b#277L, max(c#278L) AS c#290L] > +- Project [id#274L AS b#277L, id#274L AS c#278L] >+- Range (0, 200, step=1, splits=Some(2)) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35030) ANSI SQL compliance
[ https://issues.apache.org/jira/browse/SPARK-35030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang reassigned SPARK-35030: -- Assignee: Apache Spark > ANSI SQL compliance > --- > > Key: SPARK-35030 > URL: https://issues.apache.org/jira/browse/SPARK-35030 > Project: Spark > Issue Type: Epic > Components: SQL >Affects Versions: 3.0.0, 3.1.1, 3.2.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > > Build an ANSI compliant dialect in Spark, for better data quality and easier > migration from traditional DBMS to Spark. For example, Spark will throw an > exception at runtime instead of returning null results when the inputs to a > SQL operator/function are invalid. > The new dialect is controlled by SQL Configuration `spark.sql.ansi.enabled`: > {code:java} > -- `spark.sql.ansi.enabled=true` > SELECT 2147483647 + 1; > java.lang.ArithmeticException: integer overflow > -- `spark.sql.ansi.enabled=false` > SELECT 2147483647 + 1; > ++ > |(2147483647 + 1)| > ++ > | -2147483648| > ++ > {code} > Full details of this dialect are documented in > [https://spark.apache.org/docs/latest/sql-ref-ansi-compliance.html|https://spark.apache.org/docs/latest/sql-ref-ansi-compliance.html]. > Note that some ANSI dialect features maybe not from the ANSI SQL standard > directly, but their behaviors align with ANSI SQL's style. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36418) Use CAST in parsing of dates/timestamps with default pattern
[ https://issues.apache.org/jira/browse/SPARK-36418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36418: Assignee: (was: Apache Spark) > Use CAST in parsing of dates/timestamps with default pattern > > > Key: SPARK-36418 > URL: https://issues.apache.org/jira/browse/SPARK-36418 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Priority: Major > > In functions, CSV/JSON datasources and other places, when the pattern is > default, use CAST logic in parsing strings to dates/timestamps. > Currently, TimestampFormatter.getFormatter() applies the default pattern > *yyyy-MM-dd HH:mm:ss* when the pattern is not set, see > https://github.com/apache/spark/blob/f2492772baf1d00d802e704f84c22a9c410929e9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala#L344 > . Instead of that, we need to create a special formatter which invokes the cast > logic.
[jira] [Assigned] (SPARK-36418) Use CAST in parsing of dates/timestamps with default pattern
[ https://issues.apache.org/jira/browse/SPARK-36418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36418: Assignee: Apache Spark > Use CAST in parsing of dates/timestamps with default pattern > > > Key: SPARK-36418 > URL: https://issues.apache.org/jira/browse/SPARK-36418 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > In functions, CSV/JSON datasources and other places, when the pattern is > default, use CAST logic in parsing strings to dates/timestamps. > Currently, TimestampFormatter.getFormatter() applies the default pattern > *yyyy-MM-dd HH:mm:ss* when the pattern is not set, see > https://github.com/apache/spark/blob/f2492772baf1d00d802e704f84c22a9c410929e9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala#L344 > . Instead of that, we need to create a special formatter which invokes the cast > logic.
[jira] [Commented] (SPARK-36418) Use CAST in parsing of dates/timestamps with default pattern
[ https://issues.apache.org/jira/browse/SPARK-36418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397376#comment-17397376 ] Apache Spark commented on SPARK-36418: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/33709 > Use CAST in parsing of dates/timestamps with default pattern > > > Key: SPARK-36418 > URL: https://issues.apache.org/jira/browse/SPARK-36418 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Priority: Major > > In functions, CSV/JSON datasources and other places, when the pattern is > default, use CAST logic in parsing strings to dates/timestamps. > Currently, TimestampFormatter.getFormatter() applies the default pattern > *yyyy-MM-dd HH:mm:ss* when the pattern is not set, see > https://github.com/apache/spark/blob/f2492772baf1d00d802e704f84c22a9c410929e9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala#L344 > . Instead of that, we need to create a special formatter which invokes the cast > logic.
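The gap between the two parsing paths can be illustrated in SQL. This is only a sketch of the documented, permissive CAST semantics; the exact set of string shapes the new formatter will accept is an assumption until the linked PR is merged.

```sql
-- Strings in the fixed default-pattern shape parse either way:
SELECT CAST('2021-08-12 14:24:31' AS TIMESTAMP);
-- CAST additionally handles SQL/ISO variants that a strict
-- 'yyyy-MM-dd HH:mm:ss' pattern would reject, e.g.:
SELECT CAST('2021-08-12T14:24:31' AS TIMESTAMP);
SELECT CAST('2021-08-12' AS TIMESTAMP);
```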
[jira] [Assigned] (SPARK-36474) Mention pandas API on Spark in Spark overview pages
[ https://issues.apache.org/jira/browse/SPARK-36474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-36474: Assignee: Hyukjin Kwon > Mention pandas API on Spark in Spark overview pages > --- > > Key: SPARK-36474 > URL: https://issues.apache.org/jira/browse/SPARK-36474 > Project: Spark > Issue Type: Documentation > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Trivial > > We can mention that https://spark.apache.org/docs/latest/index.html as an > example. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36474) Mention pandas API on Spark in Spark overview pages
[ https://issues.apache.org/jira/browse/SPARK-36474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-36474. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 33699 [https://github.com/apache/spark/pull/33699] > Mention pandas API on Spark in Spark overview pages > --- > > Key: SPARK-36474 > URL: https://issues.apache.org/jira/browse/SPARK-36474 > Project: Spark > Issue Type: Documentation > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Trivial > Fix For: 3.3.0 > > > We can mention that https://spark.apache.org/docs/latest/index.html as an > example. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36378) Minor changes to address a few identified server side inefficiencies
[ https://issues.apache.org/jira/browse/SPARK-36378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi resolved SPARK-36378. -- Fix Version/s: 3.3.0 3.2.0 Assignee: Min Shen Resolution: Fixed Issue resolved by https://github.com/apache/spark/pull/33613 > Minor changes to address a few identified server side inefficiencies > > > Key: SPARK-36378 > URL: https://issues.apache.org/jira/browse/SPARK-36378 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Affects Versions: 3.2.0 >Reporter: Min Shen >Assignee: Min Shen >Priority: Major > Fix For: 3.2.0, 3.3.0 > > > With the SPIP ticket close to being finished, we have done some performance > evaluations to compare the performance of push-based shuffle in upstream > Spark with the production version we have internally at LinkedIn. > The evaluations have revealed a few regressions and also some additional perf > improvement opportunities. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36480) SessionWindowStateStoreSaveExec should not filter input rows against watermark
[ https://issues.apache.org/jira/browse/SPARK-36480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397336#comment-17397336 ] Apache Spark commented on SPARK-36480: -- User 'HeartSaVioR' has created a pull request for this issue: https://github.com/apache/spark/pull/33708 > SessionWindowStateStoreSaveExec should not filter input rows against watermark > -- > > Key: SPARK-36480 > URL: https://issues.apache.org/jira/browse/SPARK-36480 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: Jungtaek Lim >Priority: Critical > > SessionWindowStateStoreSaveExec receives all sessions including existing > sessions into input rows and stores as they are. That said, we should not > filter out input rows before storing into state store, but we do. > Fortunately it hasn't showed any actual problem due to the nature how we deal > with watermark against micro-batch and it seems hard to come up with the > broken case, but it should be better to fix it before someone succeeds to > touch the possible edge case. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36480) SessionWindowStateStoreSaveExec should not filter input rows against watermark
[ https://issues.apache.org/jira/browse/SPARK-36480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36480: Assignee: Apache Spark > SessionWindowStateStoreSaveExec should not filter input rows against watermark > -- > > Key: SPARK-36480 > URL: https://issues.apache.org/jira/browse/SPARK-36480 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: Jungtaek Lim >Assignee: Apache Spark >Priority: Critical > > SessionWindowStateStoreSaveExec receives all sessions including existing > sessions into input rows and stores as they are. That said, we should not > filter out input rows before storing into state store, but we do. > Fortunately it hasn't showed any actual problem due to the nature how we deal > with watermark against micro-batch and it seems hard to come up with the > broken case, but it should be better to fix it before someone succeeds to > touch the possible edge case. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36480) SessionWindowStateStoreSaveExec should not filter input rows against watermark
[ https://issues.apache.org/jira/browse/SPARK-36480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36480: Assignee: (was: Apache Spark) > SessionWindowStateStoreSaveExec should not filter input rows against watermark > -- > > Key: SPARK-36480 > URL: https://issues.apache.org/jira/browse/SPARK-36480 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: Jungtaek Lim >Priority: Critical > > SessionWindowStateStoreSaveExec receives all sessions including existing > sessions into input rows and stores as they are. That said, we should not > filter out input rows before storing into state store, but we do. > Fortunately it hasn't showed any actual problem due to the nature how we deal > with watermark against micro-batch and it seems hard to come up with the > broken case, but it should be better to fix it before someone succeeds to > touch the possible edge case. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36480) SessionWindowStateStoreSaveExec should not filter input rows against watermark
[ https://issues.apache.org/jira/browse/SPARK-36480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397334#comment-17397334 ] Apache Spark commented on SPARK-36480: -- User 'HeartSaVioR' has created a pull request for this issue: https://github.com/apache/spark/pull/33708 > SessionWindowStateStoreSaveExec should not filter input rows against watermark > -- > > Key: SPARK-36480 > URL: https://issues.apache.org/jira/browse/SPARK-36480 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: Jungtaek Lim >Priority: Critical > > SessionWindowStateStoreSaveExec receives all sessions including existing > sessions into input rows and stores as they are. That said, we should not > filter out input rows before storing into state store, but we do. > Fortunately it hasn't showed any actual problem due to the nature how we deal > with watermark against micro-batch and it seems hard to come up with the > broken case, but it should be better to fix it before someone succeeds to > touch the possible edge case. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36480) SessionWindowStateStoreSaveExec should not filter input rows against watermark
Jungtaek Lim created SPARK-36480: Summary: SessionWindowStateStoreSaveExec should not filter input rows against watermark Key: SPARK-36480 URL: https://issues.apache.org/jira/browse/SPARK-36480 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 3.2.0 Reporter: Jungtaek Lim SessionWindowStateStoreSaveExec receives all sessions, including existing sessions, as input rows and stores them as they are. That said, we should not filter out input rows before storing them into the state store, but we do. Fortunately it hasn't shown any actual problem, due to the nature of how we deal with the watermark in micro-batch processing, and it seems hard to come up with a broken case, but it is better to fix it before someone hits the possible edge case. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36132) Support initial state for flatMapGroupsWithState in batch mode
[ https://issues.apache.org/jira/browse/SPARK-36132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-36132. Resolution: Done This issue is resolved in https://github.com/apache/spark/pull/6 > Support initial state for flatMapGroupsWithState in batch mode > -- > > Key: SPARK-36132 > URL: https://issues.apache.org/jira/browse/SPARK-36132 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: Rahul Shivu Mahadev >Assignee: Rahul Shivu Mahadev >Priority: Major > > SPARK-35897 added support for an initial state with flatMapGroupsWithState. > However, this lacked support for an initial state in the batch mode of > flatMapGroupsWithState. This Jira is for adding support for batch mode. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36132) Support initial state for flatMapGroupsWithState in batch mode
[ https://issues.apache.org/jira/browse/SPARK-36132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang reassigned SPARK-36132: -- Assignee: Rahul Shivu Mahadev > Support initial state for flatMapGroupsWithState in batch mode > -- > > Key: SPARK-36132 > URL: https://issues.apache.org/jira/browse/SPARK-36132 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: Rahul Shivu Mahadev >Assignee: Rahul Shivu Mahadev >Priority: Major > > SPARK-35897 added support for an initial state with flatMapGroupsWithState. > However, this lacked support for an initial state in the batch mode of > flatMapGroupsWithState. This Jira is for adding support for batch mode. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data
[ https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397301#comment-17397301 ] Cameron Todd commented on SPARK-18105: -- Ok I added the zip file on this public S3 bucket, it holds a parquet file with snappy compression, it's 1.75gb. You can retrieve the data as such {code:java} wget https://storage.gra.cloud.ovh.net/v1/AUTH_147b880980f148f5ad1af09542e6f37a/public_data/SPARK18105/hashed_data.zip {code} > LZ4 failed to decompress a stream of shuffled data > -- > > Key: SPARK-18105 > URL: https://issues.apache.org/jira/browse/SPARK-18105 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.1, 3.1.1 >Reporter: Davies Liu >Priority: Major > Attachments: TestWeightedGraph.java > > > When lz4 is used to compress the shuffle files, it may fail to decompress it > as "stream is corrupt" > {code} > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 92 in stage 5.0 failed 4 times, most recent failure: Lost task 92.3 in > stage 5.0 (TID 16616, 10.0.27.18): java.io.IOException: Stream is corrupted > at > org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:220) > at > org.apache.spark.io.LZ4BlockInputStream.available(LZ4BlockInputStream.java:109) > at java.io.BufferedInputStream.read(BufferedInputStream.java:353) > at java.io.DataInputStream.read(DataInputStream.java:149) > at com.google.common.io.ByteStreams.read(ByteStreams.java:828) > at com.google.common.io.ByteStreams.readFully(ByteStreams.java:695) > at > org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:127) > at > org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:110) > at scala.collection.Iterator$$anon$13.next(Iterator.scala:372) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at > 
org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30) > at > org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:397) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > https://github.com/jpountz/lz4-java/issues/89 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36413) Spark and Hive are inconsistent in inferring the expression type in the view, resulting in AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-36413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] bingfeng.guo updated SPARK-36413: - Priority: Blocker (was: Critical) > Spark and Hive are inconsistent in inferring the expression type in the view, > resulting in AnalysisException > > > Key: SPARK-36413 > URL: https://issues.apache.org/jira/browse/SPARK-36413 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.1, 3.1.1 >Reporter: bingfeng.guo >Priority: Blocker > > The issue can be reproduced as follows: > > Suppose I have a Hive table: > {quote}> desc sales; > id bigint > name varchar(4096) > price decimal(19,4) > {quote} > and I create a view: > {quote}CREATE VIEW `sales_view_bf` AS select `sales`.`id`, > case when `sales`.`name` = 'abc' then `sales`.`price`*1.1 else 0 end as > `price_new` > from `SALES`; > > desc sales_view_bf; > id bigint > price double > {quote} > then I run the following SQL against Hive with Spark: > {quote}select PRICE_NEW from SALES_VIEW_BF; > {quote} > which throws: > {quote}Caused by: org.apache.spark.sql.AnalysisException: Cannot up cast > `price_new` from decimal(22,5) to double as it may truncateCaused by: > org.apache.spark.sql.AnalysisException: Cannot up cast `price_new` from > decimal(22,5) to double as it may truncate; at > org.apache.spark.sql.catalyst.analysis.AliasViewChild$$anonfun$apply$1$$anonfun$2.apply(view.scala:78) > at > org.apache.spark.sql.catalyst.analysis.AliasViewChild$$anonfun$apply$1$$anonfun$2.apply(view.scala:72) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:392) at > scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at > scala.collection.immutable.List.map(List.scala:296) at > 
org.apache.spark.sql.catalyst.analysis.AliasViewChild$$anonfun$apply$1.applyOrElse(view.scala:72) > at > org.apache.spark.sql.catalyst.analysis.AliasViewChild$$anonfun$apply$1.applyOrElse(view.scala:51) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:89) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:86) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperatorsUp(AnalysisHelper.scala:86) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsUp(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$1.apply(AnalysisHelper.scala:87) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$1.apply(AnalysisHelper.scala:87) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:87) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:86) > at > 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperatorsUp(AnalysisHelper.scala:86) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsUp(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$1.apply(AnalysisHelper.scala:87) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$1.apply(AnalysisHelper.scala:87) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326) > at >
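The `decimal(22,5)` in the SPARK-36413 stack trace above follows directly from the decimal multiplication typing rule Spark inherits from Hive: `price` is `decimal(19,4)` and the literal `1.1` is `decimal(2,1)`, so the product gets precision p1+p2+1 and scale s1+s2, which no longer fits losslessly into the `double` that Hive recorded in the view schema. A minimal sketch of that rule (a simplified illustration that ignores Spark's precision-loss adjustments near the 38-digit cap):

```python
def multiply_result_type(p1: int, s1: int, p2: int, s2: int):
    """Result type of decimal(p1,s1) * decimal(p2,s2) under the Hive/Spark rule."""
    precision = min(p1 + p2 + 1, 38)  # capped at Spark's maximum decimal precision
    scale = s1 + s2
    return precision, scale

# price decimal(19,4) multiplied by the literal 1.1, i.e. decimal(2,1):
assert multiply_result_type(19, 4, 2, 1) == (22, 5)  # the decimal(22,5) from the error
```

Hive, by contrast, types the view column as `double`, and Spark refuses the implicit `decimal(22,5)` to `double` up-cast because it may truncate, hence the AnalysisException.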
[jira] [Commented] (SPARK-36478) Removes outer join if all grouping and aggregate expressions are from the streamed side
[ https://issues.apache.org/jira/browse/SPARK-36478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397279#comment-17397279 ] Apache Spark commented on SPARK-36478: -- User 'wankunde' has created a pull request for this issue: https://github.com/apache/spark/pull/33702 > Removes outer join if all grouping and aggregate expressions are from the > streamed side > --- > > Key: SPARK-36478 > URL: https://issues.apache.org/jira/browse/SPARK-36478 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wan Kun >Priority: Minor > > Removes outer join if all grouping and aggregate expressions are from the > streamed side. > For example: > {code:java} > spark.range(200L).selectExpr("id AS a").createTempView("t1") > spark.range(300L).selectExpr("id AS b").createTempView("t2") > spark.sql("SELECT DISTINCT a FROM t1 LEFT JOIN t2 ON a = > b").explain(true){code} > Current optimized plan: > {code:java} > == Optimized Logical Plan == > Aggregate [b#3L], [b#3L, max(c#4L) AS c#20L] > +- Project [b#3L, c#4L] >+- Join LeftOuter, (a#2L = a#10L) > :- Project [id#0L AS a#2L, id#0L AS b#3L, id#0L AS c#4L] > : +- Range (0, 200, step=1, splits=Some(1)) > +- Project [id#8L AS a#10L] > +- Range (0, 300, step=1, splits=Some(1)) > {code} > Expected optimized plan: > {code:java} > == Optimized Logical Plan == > Aggregate [b#277L], [b#277L, max(c#278L) AS c#290L] > +- Project [id#274L AS b#277L, id#274L AS c#278L] >+- Range (0, 200, step=1, splits=Some(2)) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36479) improve datetime test coverage in SQL files
[ https://issues.apache.org/jira/browse/SPARK-36479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397278#comment-17397278 ] Apache Spark commented on SPARK-36479: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/33707 > improve datetime test coverage in SQL files > --- > > Key: SPARK-36479 > URL: https://issues.apache.org/jira/browse/SPARK-36479 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36479) improve datetime test coverage in SQL files
[ https://issues.apache.org/jira/browse/SPARK-36479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36479: Assignee: Apache Spark > improve datetime test coverage in SQL files > --- > > Key: SPARK-36479 > URL: https://issues.apache.org/jira/browse/SPARK-36479 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36478) Removes outer join if all grouping and aggregate expressions are from the streamed side
[ https://issues.apache.org/jira/browse/SPARK-36478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397276#comment-17397276 ] Apache Spark commented on SPARK-36478: -- User 'wankunde' has created a pull request for this issue: https://github.com/apache/spark/pull/33702 > Removes outer join if all grouping and aggregate expressions are from the > streamed side > --- > > Key: SPARK-36478 > URL: https://issues.apache.org/jira/browse/SPARK-36478 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wan Kun >Priority: Minor > > Removes outer join if all grouping and aggregate expressions are from the > streamed side. > For example: > {code:java} > spark.range(200L).selectExpr("id AS a").createTempView("t1") > spark.range(300L).selectExpr("id AS b").createTempView("t2") > spark.sql("SELECT DISTINCT a FROM t1 LEFT JOIN t2 ON a = > b").explain(true){code} > Current optimized plan: > {code:java} > == Optimized Logical Plan == > Aggregate [b#3L], [b#3L, max(c#4L) AS c#20L] > +- Project [b#3L, c#4L] >+- Join LeftOuter, (a#2L = a#10L) > :- Project [id#0L AS a#2L, id#0L AS b#3L, id#0L AS c#4L] > : +- Range (0, 200, step=1, splits=Some(1)) > +- Project [id#8L AS a#10L] > +- Range (0, 300, step=1, splits=Some(1)) > {code} > Expected optimized plan: > {code:java} > == Optimized Logical Plan == > Aggregate [b#277L], [b#277L, max(c#278L) AS c#290L] > +- Project [id#274L AS b#277L, id#274L AS c#278L] >+- Range (0, 200, step=1, splits=Some(2)) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36478) Removes outer join if all grouping and aggregate expressions are from the streamed side
[ https://issues.apache.org/jira/browse/SPARK-36478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36478: Assignee: Apache Spark > Removes outer join if all grouping and aggregate expressions are from the > streamed side > --- > > Key: SPARK-36478 > URL: https://issues.apache.org/jira/browse/SPARK-36478 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wan Kun >Assignee: Apache Spark >Priority: Minor > > Removes outer join if all grouping and aggregate expressions are from the > streamed side. > For example: > {code:java} > spark.range(200L).selectExpr("id AS a").createTempView("t1") > spark.range(300L).selectExpr("id AS b").createTempView("t2") > spark.sql("SELECT DISTINCT a FROM t1 LEFT JOIN t2 ON a = > b").explain(true){code} > Current optimized plan: > {code:java} > == Optimized Logical Plan == > Aggregate [b#3L], [b#3L, max(c#4L) AS c#20L] > +- Project [b#3L, c#4L] >+- Join LeftOuter, (a#2L = a#10L) > :- Project [id#0L AS a#2L, id#0L AS b#3L, id#0L AS c#4L] > : +- Range (0, 200, step=1, splits=Some(1)) > +- Project [id#8L AS a#10L] > +- Range (0, 300, step=1, splits=Some(1)) > {code} > Expected optimized plan: > {code:java} > == Optimized Logical Plan == > Aggregate [b#277L], [b#277L, max(c#278L) AS c#290L] > +- Project [id#274L AS b#277L, id#274L AS c#278L] >+- Range (0, 200, step=1, splits=Some(2)) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36479) improve datetime test coverage in SQL files
[ https://issues.apache.org/jira/browse/SPARK-36479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36479: Assignee: (was: Apache Spark) > improve datetime test coverage in SQL files > --- > > Key: SPARK-36479 > URL: https://issues.apache.org/jira/browse/SPARK-36479 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36478) Removes outer join if all grouping and aggregate expressions are from the streamed side
[ https://issues.apache.org/jira/browse/SPARK-36478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36478: Assignee: (was: Apache Spark) > Removes outer join if all grouping and aggregate expressions are from the > streamed side > --- > > Key: SPARK-36478 > URL: https://issues.apache.org/jira/browse/SPARK-36478 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wan Kun >Priority: Minor > > Removes outer join if all grouping and aggregate expressions are from the > streamed side. > For example: > {code:java} > spark.range(200L).selectExpr("id AS a").createTempView("t1") > spark.range(300L).selectExpr("id AS b").createTempView("t2") > spark.sql("SELECT DISTINCT a FROM t1 LEFT JOIN t2 ON a = > b").explain(true){code} > Current optimized plan: > {code:java} > == Optimized Logical Plan == > Aggregate [b#3L], [b#3L, max(c#4L) AS c#20L] > +- Project [b#3L, c#4L] >+- Join LeftOuter, (a#2L = a#10L) > :- Project [id#0L AS a#2L, id#0L AS b#3L, id#0L AS c#4L] > : +- Range (0, 200, step=1, splits=Some(1)) > +- Project [id#8L AS a#10L] > +- Range (0, 300, step=1, splits=Some(1)) > {code} > Expected optimized plan: > {code:java} > == Optimized Logical Plan == > Aggregate [b#277L], [b#277L, max(c#278L) AS c#290L] > +- Project [id#274L AS b#277L, id#274L AS c#278L] >+- Range (0, 200, step=1, splits=Some(2)) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36479) improve datetime test coverage in SQL files
Wenchen Fan created SPARK-36479: --- Summary: improve datetime test coverage in SQL files Key: SPARK-36479 URL: https://issues.apache.org/jira/browse/SPARK-36479 Project: Spark Issue Type: Test Components: SQL Affects Versions: 3.2.0 Reporter: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36478) Removes outer join if all grouping and aggregate expressions are from the streamed side
Wan Kun created SPARK-36478: --- Summary: Removes outer join if all grouping and aggregate expressions are from the streamed side Key: SPARK-36478 URL: https://issues.apache.org/jira/browse/SPARK-36478 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Wan Kun Removes outer join if all grouping and aggregate expressions are from the streamed side. For example: {code:java} spark.range(200L).selectExpr("id AS a").createTempView("t1") spark.range(300L).selectExpr("id AS b").createTempView("t2") spark.sql("SELECT DISTINCT a FROM t1 LEFT JOIN t2 ON a = b").explain(true){code} Current optimized plan: {code:java} == Optimized Logical Plan == Aggregate [b#3L], [b#3L, max(c#4L) AS c#20L] +- Project [b#3L, c#4L] +- Join LeftOuter, (a#2L = a#10L) :- Project [id#0L AS a#2L, id#0L AS b#3L, id#0L AS c#4L] : +- Range (0, 200, step=1, splits=Some(1)) +- Project [id#8L AS a#10L] +- Range (0, 300, step=1, splits=Some(1)) {code} Expected optimized plan: {code:java} == Optimized Logical Plan == Aggregate [b#277L], [b#277L, max(c#278L) AS c#290L] +- Project [id#274L AS b#277L, id#274L AS c#278L] +- Range (0, 200, step=1, splits=Some(2)) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
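The intuition behind the SPARK-36478 rewrite can be sketched outside Spark (a toy model in plain Python, not Spark code): a left outer join preserves every streamed-side row, possibly duplicating it, so a DISTINCT over columns drawn only from the streamed side yields the same result if the join is removed entirely:

```python
def left_join_keys(left, right):
    """Left outer join on key equality; returns the left key for every output row."""
    out = []
    for a in left:
        matches = [b for b in right if b == a]
        # Unmatched left rows still survive, paired with a NULL right side.
        out.extend([a] * max(len(matches), 1))
    return out

t1 = list(range(200))  # mirrors spark.range(200L) AS a
t2 = list(range(300))  # mirrors spark.range(300L) AS b

# SELECT DISTINCT a FROM t1 LEFT JOIN t2 ON a = b  ==  SELECT DISTINCT a FROM t1
assert set(left_join_keys(t1, t2)) == set(t1)
```

The same reasoning covers aggregates over streamed-side columns under grouping keys from the streamed side, which is exactly the condition the proposed optimizer rule checks before eliminating the join.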
[jira] [Resolved] (SPARK-36468) Update docs about ANSI interval literals
[ https://issues.apache.org/jira/browse/SPARK-36468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-36468. -- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 33693 [https://github.com/apache/spark/pull/33693] > Update docs about ANSI interval literals > > > Key: SPARK-36468 > URL: https://issues.apache.org/jira/browse/SPARK-36468 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 3.2.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.2.0 > > > Update the page > https://spark.apache.org/docs/latest/sql-ref-literals.html#interval-literal -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36472) Improve SQL syntax for MERGE
[ https://issues.apache.org/jira/browse/SPARK-36472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Krivenko updated SPARK-36472: --- Description: Existing SQL syntax for *MERGE* (see Delta Lake examples [here|https://docs.delta.io/latest/delta-update.html#upsert-into-a-table-using-merge] and [here|https://docs.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/language-manual/delta-merge-into]) could be improved by adding an alternative for {{}} *Main assumption* In common cases target and source tables have the same column names used in {{}} as merge keys, for example: {code:sql} ON target.key1 = source.key1 AND target.key2 = source.key2{code} It would be more convenient to use a syntax similar to: {code:sql} ON COLUMNS (key1, key2) -- or ON MATCHING (key1, key2) {code} The same approach is used for [JOIN|https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-join.html] where {{join_criteria}} syntax is {code:sql} ON boolean_expression | USING ( column_name [ , ... ] ) {code} *Improvement proposal* Syntax {code:sql} MERGE INTO target_table_identifier [AS target_alias] USING source_table_identifier [] [AS source_alias] ON { | COLUMNS ( column_name [ , ... 
] ) } [ WHEN MATCHED [ AND ] THEN ] [ WHEN MATCHED [ AND ] THEN ] [ WHEN NOT MATCHED [ AND ] THEN ] {code} Example {code:sql} MERGE INTO target USING source ON COLUMNS (key1, key2) WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT * {code} was: Existing SQL syntax for *MERGE* (see Delta Lake examples [here|https://docs.delta.io/latest/delta-update.html#upsert-into-a-table-using-merge] and [here|https://docs.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/language-manual/delta-merge-into]) could be improved by adding an alternative for {{}} *Main assumption* In common cases target and source tables have the same column names used in {{}} as merge keys, for example: {code:sql} ON target.key1 = source.key1 AND target.key2 = source.key2{code} It would be more convenient to use a syntax similar to: {code:sql} ON COLUMNS (key1, key2) {code} The same approach is used for [JOIN|https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-join.html] where {{join_criteria}} syntax is {code:sql} ON boolean_expression | USING ( column_name [ , ... ] ) {code} *Improvement proposal* Syntax {code:sql} MERGE INTO target_table_identifier [AS target_alias] USING source_table_identifier [] [AS source_alias] ON { | COLUMNS ( column_name [ , ... 
] ) } [ WHEN MATCHED [ AND ] THEN ] [ WHEN MATCHED [ AND ] THEN ] [ WHEN NOT MATCHED [ AND ] THEN ] {code} Example {code:sql} MERGE INTO target USING source ON COLUMNS (key1, key2) WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT * {code} > Improve SQL syntax for MERGE > > > Key: SPARK-36472 > URL: https://issues.apache.org/jira/browse/SPARK-36472 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2 >Reporter: Denis Krivenko >Priority: Trivial > > Existing SQL syntax for *MERGE* (see Delta Lake examples > [here|https://docs.delta.io/latest/delta-update.html#upsert-into-a-table-using-merge] > and > [here|https://docs.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/language-manual/delta-merge-into]) > could be improved by adding an alternative for {{}} > *Main assumption* > In common cases target and source tables have the same column names used in > {{}} as merge keys, for example: > {code:sql} > ON target.key1 = source.key1 AND target.key2 = source.key2{code} > It would be more convenient to use a syntax similar to: > {code:sql} > ON COLUMNS (key1, key2) > -- or > ON MATCHING (key1, key2) > {code} > The same approach is used for > [JOIN|https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-join.html] > where {{join_criteria}} syntax is > {code:sql} > ON boolean_expression | USING ( column_name [ , ... ] ) > {code} > *Improvement proposal* > Syntax > {code:sql} > MERGE INTO target_table_identifier [AS target_alias] > USING source_table_identifier [] [AS source_alias] > ON { | COLUMNS ( column_name [ , ... 
] ) } > [ WHEN MATCHED [ AND ] THEN ] > [ WHEN MATCHED [ AND ] THEN ] > [ WHEN NOT MATCHED [ AND ] THEN ] > {code} > Example > {code:sql} > MERGE INTO target > USING source > ON COLUMNS (key1, key2) > WHEN MATCHED THEN > UPDATE SET * > WHEN NOT MATCHED THEN > INSERT * > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
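The proposed {{ON COLUMNS (...)}} form is essentially sugar for the usual equality conjunction. A hypothetical desugaring sketch (the function name and default aliases are illustrative, not part of any Spark API):

```python
def desugar_on_columns(columns, target="target", source="source"):
    """Expand ON COLUMNS (k1, k2, ...) into the equivalent boolean
    merge condition: target.k1 = source.k1 AND target.k2 = source.k2 ..."""
    if not columns:
        raise ValueError("ON COLUMNS requires at least one column")
    return " AND ".join(f"{target}.{c} = {source}.{c}" for c in columns)

print(desugar_on_columns(["key1", "key2"]))
# target.key1 = source.key1 AND target.key2 = source.key2
```

This mirrors how a USING clause in a JOIN desugars into equality predicates over the named columns on both sides.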
[jira] [Commented] (SPARK-36353) RemoveNoopOperators should keep output schema
[ https://issues.apache.org/jira/browse/SPARK-36353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397236#comment-17397236 ] Apache Spark commented on SPARK-36353: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/33705 > RemoveNoopOperators should keep output schema > - > > Key: SPARK-36353 > URL: https://issues.apache.org/jira/browse/SPARK-36353 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.2.0 > > Attachments: image-2021-07-30-17-46-59-196.png > > > !image-2021-07-30-17-46-59-196.png|width=539,height=220! > [https://github.com/apache/spark/pull/33587] > > Only first level? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36353) RemoveNoopOperators should keep output schema
[ https://issues.apache.org/jira/browse/SPARK-36353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397238#comment-17397238 ] Apache Spark commented on SPARK-36353: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/33704 > RemoveNoopOperators should keep output schema > - > > Key: SPARK-36353 > URL: https://issues.apache.org/jira/browse/SPARK-36353 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.2.0 > > Attachments: image-2021-07-30-17-46-59-196.png > > > !image-2021-07-30-17-46-59-196.png|width=539,height=220! > [https://github.com/apache/spark/pull/33587] > > Only first level? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36477) Inferring schema from JSON file shall respect ignoreCorruptFiles and handle IOE
[ https://issues.apache.org/jira/browse/SPARK-36477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36477: Assignee: Apache Spark > Inferring schema from JSON file shall respect ignoreCorruptFiles and handle > IOE > --- > > Key: SPARK-36477 > URL: https://issues.apache.org/jira/browse/SPARK-36477 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 >Reporter: Kent Yao >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36477) Inferring schema from JSON file shall respect ignoreCorruptFiles and handle IOE
[ https://issues.apache.org/jira/browse/SPARK-36477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397237#comment-17397237 ] Apache Spark commented on SPARK-36477: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/33706 > Inferring schema from JSON file shall respect ignoreCorruptFiles and handle > IOE > --- > > Key: SPARK-36477 > URL: https://issues.apache.org/jira/browse/SPARK-36477 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 >Reporter: Kent Yao >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36477) Inferring schema from JSON file shall respect ignoreCorruptFiles and handle IOE
[ https://issues.apache.org/jira/browse/SPARK-36477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36477: Assignee: (was: Apache Spark) > Inferring schema from JSON file shall respect ignoreCorruptFiles and handle > IOE > --- > > Key: SPARK-36477 > URL: https://issues.apache.org/jira/browse/SPARK-36477 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 >Reporter: Kent Yao >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36477) Inferring schema from JSON file shall respect ignoreCorruptFiles and handle IOE
Kent Yao created SPARK-36477: Summary: Inferring schema from JSON file shall respect ignoreCorruptFiles and handle IOE Key: SPARK-36477 URL: https://issues.apache.org/jira/browse/SPARK-36477 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.2 Reporter: Kent Yao -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
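The intended behavior can be sketched in pure Python (function names are hypothetical stand-ins, not Spark internals): schema inference over many files should skip unreadable ones when ignoreCorruptFiles is enabled, instead of failing the whole job on the first IOException.

```python
def infer_schema(paths, read_file_schema, ignore_corrupt_files=False):
    """Collect per-file schemas, optionally tolerating unreadable files."""
    schemas = []
    for path in paths:
        try:
            schemas.append(read_file_schema(path))
        except IOError:
            if not ignore_corrupt_files:
                raise  # default: fail fast, as before
    return schemas

def read_file_schema(path):
    """Stand-in for per-file JSON schema inference."""
    if "corrupt" in path:
        raise IOError(f"cannot read {path}")
    return {"id": "bigint"}

paths = ["part-0.json", "corrupt-part-1.json", "part-2.json"]

# With the flag set, the corrupt file is skipped and inference proceeds.
tolerant = infer_schema(paths, read_file_schema, ignore_corrupt_files=True)
assert tolerant == [{"id": "bigint"}, {"id": "bigint"}]

# Without it, the IOError propagates unchanged.
try:
    infer_schema(paths, read_file_schema)
    raise AssertionError("expected IOError without ignoreCorruptFiles")
except IOError:
    pass
```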
[jira] [Commented] (SPARK-36352) Spark should check result plan's output schema name
[ https://issues.apache.org/jira/browse/SPARK-36352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397206#comment-17397206 ] Apache Spark commented on SPARK-36352: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/33703 > Spark should check result plan's output schema name > --- > > Key: SPARK-36352 > URL: https://issues.apache.org/jira/browse/SPARK-36352 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.2.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36475) Add doc about spark.shuffle.service.fetch.rdd.enabled
[ https://issues.apache.org/jira/browse/SPARK-36475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397152#comment-17397152 ] Apache Spark commented on SPARK-36475: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/33700 > Add doc about spark.shuffle.service.fetch.rdd.enabled > - > > Key: SPARK-36475 > URL: https://issues.apache.org/jira/browse/SPARK-36475 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.2.0 >Reporter: angerszhu >Priority: Major > > Add doc about spark.shuffle.service.fetch.rdd.enabled -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36475) Add doc about spark.shuffle.service.fetch.rdd.enabled
[ https://issues.apache.org/jira/browse/SPARK-36475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36475: Assignee: (was: Apache Spark) > Add doc about spark.shuffle.service.fetch.rdd.enabled > - > > Key: SPARK-36475 > URL: https://issues.apache.org/jira/browse/SPARK-36475 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.2.0 >Reporter: angerszhu >Priority: Major > > Add doc about spark.shuffle.service.fetch.rdd.enabled -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36475) Add doc about spark.shuffle.service.fetch.rdd.enabled
[ https://issues.apache.org/jira/browse/SPARK-36475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36475: Assignee: Apache Spark > Add doc about spark.shuffle.service.fetch.rdd.enabled > - > > Key: SPARK-36475 > URL: https://issues.apache.org/jira/browse/SPARK-36475 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: Apache Spark >Priority: Major > > Add doc about spark.shuffle.service.fetch.rdd.enabled -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34808) Removes outer join if it only has distinct on streamed side
[ https://issues.apache.org/jira/browse/SPARK-34808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397148#comment-17397148 ] Apache Spark commented on SPARK-34808: -- User 'wankunde' has created a pull request for this issue: https://github.com/apache/spark/pull/33702 > Removes outer join if it only has distinct on streamed side > --- > > Key: SPARK-34808 > URL: https://issues.apache.org/jira/browse/SPARK-34808 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.2.0 > > > For example: > {code:scala} > spark.range(200L).selectExpr("id AS a").createTempView("t1") > spark.range(300L).selectExpr("id AS b").createTempView("t2") > spark.sql("SELECT DISTINCT a FROM t1 LEFT JOIN t2 ON a = b").explain(true) > {code} > Current optimized plan: > {noformat} > == Optimized Logical Plan == > Aggregate [a#2L], [a#2L] > +- Project [a#2L] >+- Join LeftOuter, (a#2L = b#6L) > :- Project [id#0L AS a#2L] > : +- Range (0, 200, step=1, splits=Some(2)) > +- Project [id#4L AS b#6L] > +- Range (0, 300, step=1, splits=Some(2)) > {noformat} > Expected optimized plan: > {noformat} > == Optimized Logical Plan == > Aggregate [a#2L], [a#2L] > +- Project [id#0L AS a#2L] >+- Range (0, 200, step=1, splits=Some(2)) > {noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
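The equivalence this rewrite relies on can be checked directly. A pure-Python sketch of the example above (not Spark code): since only the left-side column a survives the projection and DISTINCT deduplicates, the LEFT JOIN can neither add nor remove distinct values of a.

```python
t1 = [{"a": i} for i in range(200)]
t2 = [{"b": i} for i in range(300)]

# Naive LEFT OUTER JOIN of t1 with t2 ON a = b.
joined = []
for left in t1:
    matches = [r for r in t2 if r["b"] == left["a"]]
    if matches:
        joined.extend({"a": left["a"], "b": r["b"]} for r in matches)
    else:
        joined.append({"a": left["a"], "b": None})

with_join = {row["a"] for row in joined}   # SELECT DISTINCT a ... LEFT JOIN t2
without_join = {row["a"] for row in t1}    # SELECT DISTINCT a FROM t1
assert with_join == without_join           # the join is removable
```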