[jira] [Commented] (SPARK-36387) Fix Series.astype from datetime to nullable string

2021-08-11 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397828#comment-17397828
 ] 

Hyukjin Kwon commented on SPARK-36387:
--

If he's inactive, let's go ahead, [~itholic]. We'd better include all of
them in Spark 3.2.
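
For context, the conversion this ticket covers looks roughly like the sketch below, assuming the pandas-on-Spark API in Spark 3.2; the sample values and the comparison against plain pandas are illustrative, not taken from the ticket.

{code:python}
# Hypothetical repro sketch: Series.astype from datetime to the nullable
# "string" dtype, in plain pandas and in pandas-on-Spark.
from datetime import datetime

import pandas as pd
import pyspark.pandas as ps

data = [datetime(2021, 8, 11, 12, 0), None]

# Plain pandas: the missing value becomes <NA> under the nullable dtype.
print(pd.Series(data).astype("string"))

# pandas-on-Spark: this ticket tracks making the same conversion match
# the pandas result.
print(ps.Series(data).astype("string"))
{code}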

> Fix Series.astype from datetime to nullable string
> --
>
> Key: SPARK-36387
> URL: https://issues.apache.org/jira/browse/SPARK-36387
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Major
> Attachments: image-2021-08-12-14-24-31-321.png
>
>







[jira] [Commented] (SPARK-36387) Fix Series.astype from datetime to nullable string

2021-08-11 Thread Haejoon Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397825#comment-17397825
 ] 

Haejoon Lee commented on SPARK-36387:
-

[~dc-heros], are you still working on this?

No need to rush, just for confirmation :)

> Fix Series.astype from datetime to nullable string
> --
>
> Key: SPARK-36387
> URL: https://issues.apache.org/jira/browse/SPARK-36387
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Major
> Attachments: image-2021-08-12-14-24-31-321.png
>
>







[jira] [Updated] (SPARK-36387) Fix Series.astype from datetime to nullable string

2021-08-11 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-36387:

Attachment: image-2021-08-12-14-24-31-321.png

> Fix Series.astype from datetime to nullable string
> --
>
> Key: SPARK-36387
> URL: https://issues.apache.org/jira/browse/SPARK-36387
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Major
> Attachments: image-2021-08-12-14-24-31-321.png
>
>







[jira] [Resolved] (SPARK-36479) improve datetime test coverage in SQL files

2021-08-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-36479.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 33707
[https://github.com/apache/spark/pull/33707]

> improve datetime test coverage in SQL files
> ---
>
> Key: SPARK-36479
> URL: https://issues.apache.org/jira/browse/SPARK-36479
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.2.0
>
>







[jira] [Assigned] (SPARK-36479) improve datetime test coverage in SQL files

2021-08-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-36479:
---

Assignee: Wenchen Fan

> improve datetime test coverage in SQL files
> ---
>
> Key: SPARK-36479
> URL: https://issues.apache.org/jira/browse/SPARK-36479
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>







[jira] [Updated] (SPARK-33807) Data Source V2: Remove read specific distributions

2021-08-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-33807:

Affects Version/s: (was: 3.2.0)
   3.3.0

> Data Source V2: Remove read specific distributions
> --
>
> Key: SPARK-33807
> URL: https://issues.apache.org/jira/browse/SPARK-33807
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Anton Okolnychyi
>Priority: Blocker
>
> We should remove the read-specific distributions for DS V2 as discussed 
> [here|https://github.com/apache/spark/pull/30706#discussion_r543059827].






[jira] [Assigned] (SPARK-36486) Support type constructed string as day time interval

2021-08-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36486:


Assignee: (was: Apache Spark)

> Support type constructed string as day time interval
> 
>
> Key: SPARK-36486
> URL: https://issues.apache.org/jira/browse/SPARK-36486
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>







[jira] [Assigned] (SPARK-36486) Support type constructed string as day time interval

2021-08-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36486:


Assignee: Apache Spark

> Support type constructed string as day time interval
> 
>
> Key: SPARK-36486
> URL: https://issues.apache.org/jira/browse/SPARK-36486
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Commented] (SPARK-36486) Support type constructed string as day time interval

2021-08-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397804#comment-17397804
 ] 

Apache Spark commented on SPARK-36486:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/33717

> Support type constructed string as day time interval
> 
>
> Key: SPARK-36486
> URL: https://issues.apache.org/jira/browse/SPARK-36486
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>







[jira] [Commented] (SPARK-36485) Support cast type constructed string to year month interval

2021-08-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1739#comment-1739
 ] 

Apache Spark commented on SPARK-36485:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/33716

> Support cast type constructed string to year month interval
> ---
>
> Key: SPARK-36485
> URL: https://issues.apache.org/jira/browse/SPARK-36485
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> cast('1 year 2 month' as interval year to month)






[jira] [Assigned] (SPARK-36485) Support cast type constructed string to year month interval

2021-08-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36485:


Assignee: (was: Apache Spark)

> Support cast type constructed string to year month interval
> ---
>
> Key: SPARK-36485
> URL: https://issues.apache.org/jira/browse/SPARK-36485
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> cast('1 year 2 month' as interval year to month)






[jira] [Assigned] (SPARK-36485) Support cast type constructed string to year month interval

2021-08-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36485:


Assignee: Apache Spark

> Support cast type constructed string to year month interval
> ---
>
> Key: SPARK-36485
> URL: https://issues.apache.org/jira/browse/SPARK-36485
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: Apache Spark
>Priority: Major
>
> cast('1 year 2 month' as interval year to month)






[jira] [Commented] (SPARK-36485) Support cast type constructed string to year month interval

2021-08-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397778#comment-17397778
 ] 

Apache Spark commented on SPARK-36485:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/33716

> Support cast type constructed string to year month interval
> ---
>
> Key: SPARK-36485
> URL: https://issues.apache.org/jira/browse/SPARK-36485
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> cast('1 year 2 month' as interval year to month)






[jira] [Created] (SPARK-36486) Support type constructed string as day time interval

2021-08-11 Thread angerszhu (Jira)
angerszhu created SPARK-36486:
-

 Summary: Support type constructed string as day time interval
 Key: SPARK-36486
 URL: https://issues.apache.org/jira/browse/SPARK-36486
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: angerszhu
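
A minimal sketch of the behavior this sub-task appears to propose, mirroring the year-month cast in the sibling ticket SPARK-36485; the interval literal and target type below are illustrative assumptions.

{code:python}
# Hypothetical sketch: cast a type-constructed interval string to a
# day-time interval type in Spark SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql(
    "SELECT CAST('3 days 4 hours' AS INTERVAL DAY TO HOUR) AS dt_interval"
).show()
{code}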









[jira] [Assigned] (SPARK-36480) SessionWindowStateStoreSaveExec should not filter input rows against watermark

2021-08-11 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh reassigned SPARK-36480:
---

Assignee: Jungtaek Lim

> SessionWindowStateStoreSaveExec should not filter input rows against watermark
> --
>
> Key: SPARK-36480
> URL: https://issues.apache.org/jira/browse/SPARK-36480
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Critical
> Fix For: 3.2.0
>
>
> SessionWindowStateStoreSaveExec receives all sessions, including existing
> ones, as input rows and stores them as they are. That is, we should not
> filter out input rows against the watermark before storing them into the
> state store, but we currently do.
> Fortunately this hasn't shown any actual problem, thanks to the way we deal
> with the watermark within a micro-batch, and it seems hard to come up with a
> broken case, but it is better to fix it before someone manages to hit the
> possible edge case.
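
For context, a minimal sketch of the kind of session-window streaming aggregation whose state SessionWindowStateStoreSaveExec persists each micro-batch, assuming the session_window API added in Spark 3.2; the source, watermark, gap, and grouping key are illustrative.

{code:python}
# Illustrative session-window aggregation over a streaming source.
from pyspark.sql import SparkSession
from pyspark.sql.functions import session_window

spark = SparkSession.builder.getOrCreate()

# Assumed event-time source; any streaming source with a timestamp works.
events = (
    spark.readStream.format("rate").load()
    .withColumnRenamed("timestamp", "eventTime")
)

sessions = (
    events
    .withWatermark("eventTime", "10 minutes")
    # A session closes after a 5-minute gap with no events for the key.
    .groupBy(
        session_window("eventTime", "5 minutes"),
        (events.value % 10).alias("key"),
    )
    .count()
)
{code}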






[jira] [Commented] (SPARK-36480) SessionWindowStateStoreSaveExec should not filter input rows against watermark

2021-08-11 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397775#comment-17397775
 ] 

L. C. Hsieh commented on SPARK-36480:
-

The issue was resolved at https://github.com/apache/spark/pull/33708.

> SessionWindowStateStoreSaveExec should not filter input rows against watermark
> --
>
> Key: SPARK-36480
> URL: https://issues.apache.org/jira/browse/SPARK-36480
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: Jungtaek Lim
>Priority: Critical
>
> SessionWindowStateStoreSaveExec receives all sessions, including existing
> ones, as input rows and stores them as they are. That is, we should not
> filter out input rows against the watermark before storing them into the
> state store, but we currently do.
> Fortunately this hasn't shown any actual problem, thanks to the way we deal
> with the watermark within a micro-batch, and it seems hard to come up with a
> broken case, but it is better to fix it before someone manages to hit the
> possible edge case.






[jira] [Resolved] (SPARK-36480) SessionWindowStateStoreSaveExec should not filter input rows against watermark

2021-08-11 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh resolved SPARK-36480.
-
Resolution: Fixed

> SessionWindowStateStoreSaveExec should not filter input rows against watermark
> --
>
> Key: SPARK-36480
> URL: https://issues.apache.org/jira/browse/SPARK-36480
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: Jungtaek Lim
>Priority: Critical
>
> SessionWindowStateStoreSaveExec receives all sessions, including existing
> ones, as input rows and stores them as they are. That is, we should not
> filter out input rows against the watermark before storing them into the
> state store, but we currently do.
> Fortunately this hasn't shown any actual problem, thanks to the way we deal
> with the watermark within a micro-batch, and it seems hard to come up with a
> broken case, but it is better to fix it before someone manages to hit the
> possible edge case.






[jira] [Updated] (SPARK-36480) SessionWindowStateStoreSaveExec should not filter input rows against watermark

2021-08-11 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh updated SPARK-36480:

Fix Version/s: 3.2.0

> SessionWindowStateStoreSaveExec should not filter input rows against watermark
> --
>
> Key: SPARK-36480
> URL: https://issues.apache.org/jira/browse/SPARK-36480
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: Jungtaek Lim
>Priority: Critical
> Fix For: 3.2.0
>
>
> SessionWindowStateStoreSaveExec receives all sessions, including existing
> ones, as input rows and stores them as they are. That is, we should not
> filter out input rows against the watermark before storing them into the
> state store, but we currently do.
> Fortunately this hasn't shown any actual problem, thanks to the way we deal
> with the watermark within a micro-batch, and it seems hard to come up with a
> broken case, but it is better to fix it before someone manages to hit the
> possible edge case.






[jira] [Created] (SPARK-36485) Support cast type constructed string to year month interval

2021-08-11 Thread angerszhu (Jira)
angerszhu created SPARK-36485:
-

 Summary: Support cast type constructed string to year month 
interval
 Key: SPARK-36485
 URL: https://issues.apache.org/jira/browse/SPARK-36485
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: angerszhu


cast('1 year 2 month' as interval year to month)
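
A minimal sketch of the cast from the description above, run through the Python API; whether this string parses is exactly what the sub-task tracks, and the SparkSession usage is only for illustration.

{code:python}
# The cast proposed in this sub-task: a type-constructed string cast to a
# year-month interval.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql(
    "SELECT CAST('1 year 2 month' AS INTERVAL YEAR TO MONTH) AS ym_interval"
).show()
{code}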






[jira] [Assigned] (SPARK-32986) Add bucket scan info in explain output of FileSourceScanExec

2021-08-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-32986:
-

Assignee: Cheng Su

> Add bucket scan info in explain output of FileSourceScanExec
> 
>
> Key: SPARK-32986
> URL: https://issues.apache.org/jira/browse/SPARK-32986
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Trivial
>
> As a follow-up to the discussion in
> [https://github.com/apache/spark/pull/29804#discussion_r493229395], the
> `FileSourceScanExec` explain output currently has no information indicating
> whether the table is read as a bucketed table and, if not, what the reason
> is. Adding this info to the `FileSourceScanExec` explain output can help
> users and developers understand the query plan more easily, without spending
> a lot of time debugging why a table is not read as a bucketed table.
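
A minimal sketch of how one might observe this, assuming an illustrative table name and bucket spec; the exact wording added to the explain output is whatever the pull request settles on.

{code:python}
# Write a bucketed table, then inspect the physical plan: the scan node is
# a FileSourceScanExec, and this change adds bucket-scan info to its output.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.range(100).write.bucketBy(4, "id").sortBy("id").saveAsTable("bucketed_demo")

spark.table("bucketed_demo").groupBy("id").count().explain()
{code}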






[jira] [Resolved] (SPARK-32986) Add bucket scan info in explain output of FileSourceScanExec

2021-08-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-32986.
---
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 33698
[https://github.com/apache/spark/pull/33698]

> Add bucket scan info in explain output of FileSourceScanExec
> 
>
> Key: SPARK-32986
> URL: https://issues.apache.org/jira/browse/SPARK-32986
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Trivial
> Fix For: 3.3.0
>
>
> As a follow-up to the discussion in
> [https://github.com/apache/spark/pull/29804#discussion_r493229395], the
> `FileSourceScanExec` explain output currently has no information indicating
> whether the table is read as a bucketed table and, if not, what the reason
> is. Adding this info to the `FileSourceScanExec` explain output can help
> users and developers understand the query plan more easily, without spending
> a lot of time debugging why a table is not read as a bucketed table.






[jira] [Commented] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data

2021-08-11 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397747#comment-17397747
 ] 

Dongjoon Hyun commented on SPARK-18105:
---

I only ran the code I posted above, because you wrote the following. What did
you mean by `can be replaced with this`, then?
{code:java}
// So this line of code:
...
// can be replaced with this
Dataset<Row> relevantPivots =
    spark.read().parquet(pathToDataDedup).repartition(5000).cache();
{code}
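
For anyone trying to reproduce this, shuffle compression is governed by spark.io.compression.codec, which defaults to lz4; pinning it explicitly (or switching to another codec such as snappy) is a common way to isolate whether the failure is codec-related. This is an illustrative diagnostic step, not the fix.

{code:python}
# Sketch: set the shuffle/spill compression codec explicitly to rule the
# codec in or out when chasing "Stream is corrupted" failures.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.io.compression.codec", "snappy")
    .getOrCreate()
)
{code}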

> LZ4 failed to decompress a stream of shuffled data
> --
>
> Key: SPARK-18105
> URL: https://issues.apache.org/jira/browse/SPARK-18105
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1, 3.1.1
>Reporter: Davies Liu
>Priority: Major
> Attachments: TestWeightedGraph.java
>
>
> When lz4 is used to compress the shuffle files, it may fail to decompress it 
> as "stream is corrupt"
> {code}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 92 in stage 5.0 failed 4 times, most recent failure: Lost task 92.3 in 
> stage 5.0 (TID 16616, 10.0.27.18): java.io.IOException: Stream is corrupted
>   at 
> org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:220)
>   at 
> org.apache.spark.io.LZ4BlockInputStream.available(LZ4BlockInputStream.java:109)
>   at java.io.BufferedInputStream.read(BufferedInputStream.java:353)
>   at java.io.DataInputStream.read(DataInputStream.java:149)
>   at com.google.common.io.ByteStreams.read(ByteStreams.java:828)
>   at com.google.common.io.ByteStreams.readFully(ByteStreams.java:695)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:127)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:110)
>   at scala.collection.Iterator$$anon$13.next(Iterator.scala:372)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
>   at 
> org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:397)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> https://github.com/jpountz/lz4-java/issues/89






[jira] [Updated] (SPARK-34081) Only pushdown LeftSemi/LeftAnti over Aggregate if join can be planned as broadcast join

2021-08-11 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-34081:
---
Issue Type: Improvement  (was: Bug)

> Only pushdown LeftSemi/LeftAnti over Aggregate if join can be planned as 
> broadcast join
> ---
>
> Key: SPARK-34081
> URL: https://issues.apache.org/jira/browse/SPARK-34081
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> We should not push down LeftSemi/LeftAnti over Aggregate in some cases.
> {code:scala}
> spark.range(5000L).selectExpr("id % 1 as a", "id % 1 as b").write.saveAsTable("t1")
> spark.range(4000L).selectExpr("id % 8000 as c", "id % 8000 as d").write.saveAsTable("t2")
> spark.sql("SELECT distinct a, b FROM t1 INTERSECT SELECT distinct c, d FROM t2").explain
> {code}
> Current:
> {noformat}
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- HashAggregate(keys=[a#16L, b#17L], functions=[])
>+- HashAggregate(keys=[a#16L, b#17L], functions=[])
>   +- HashAggregate(keys=[a#16L, b#17L], functions=[])
>  +- Exchange hashpartitioning(a#16L, b#17L, 5), ENSURE_REQUIREMENTS, 
> [id=#72]
> +- HashAggregate(keys=[a#16L, b#17L], functions=[])
>+- SortMergeJoin [coalesce(a#16L, 0), isnull(a#16L), 
> coalesce(b#17L, 0), isnull(b#17L)], [coalesce(c#18L, 0), isnull(c#18L), 
> coalesce(d#19L, 0), isnull(d#19L)], LeftSemi
>   :- Sort [coalesce(a#16L, 0) ASC NULLS FIRST, isnull(a#16L) 
> ASC NULLS FIRST, coalesce(b#17L, 0) ASC NULLS FIRST, isnull(b#17L) ASC NULLS 
> FIRST], false, 0
>   :  +- Exchange hashpartitioning(coalesce(a#16L, 0), 
> isnull(a#16L), coalesce(b#17L, 0), isnull(b#17L), 5), ENSURE_REQUIREMENTS, 
> [id=#65]
>   : +- FileScan parquet default.t1[a#16L,b#17L] Batched: 
> true, DataFilters: [], Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/yumwang/spark/spark-warehouse/org.apache.spark.sql.Data...,
>  PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct
>   +- Sort [coalesce(c#18L, 0) ASC NULLS FIRST, isnull(c#18L) 
> ASC NULLS FIRST, coalesce(d#19L, 0) ASC NULLS FIRST, isnull(d#19L) ASC NULLS 
> FIRST], false, 0
>  +- Exchange hashpartitioning(coalesce(c#18L, 0), 
> isnull(c#18L), coalesce(d#19L, 0), isnull(d#19L), 5), ENSURE_REQUIREMENTS, 
> [id=#66]
> +- HashAggregate(keys=[c#18L, d#19L], functions=[])
>+- Exchange hashpartitioning(c#18L, d#19L, 5), 
> ENSURE_REQUIREMENTS, [id=#61]
>   +- HashAggregate(keys=[c#18L, d#19L], 
> functions=[])
>  +- FileScan parquet default.t2[c#18L,d#19L] 
> Batched: true, DataFilters: [], Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/yumwang/spark/spark-warehouse/org.apache.spark.sql.Data...,
>  PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct
> {noformat}
>  
> Expected:
> {noformat}
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- HashAggregate(keys=[a#16L, b#17L], functions=[])
>+- Exchange hashpartitioning(a#16L, b#17L, 5), ENSURE_REQUIREMENTS, 
> [id=#74]
>   +- HashAggregate(keys=[a#16L, b#17L], functions=[])
>  +- SortMergeJoin [coalesce(a#16L, 0), isnull(a#16L), coalesce(b#17L, 
> 0), isnull(b#17L)], [coalesce(c#18L, 0), isnull(c#18L), coalesce(d#19L, 0), 
> isnull(d#19L)], LeftSemi
> :- Sort [coalesce(a#16L, 0) ASC NULLS FIRST, isnull(a#16L) ASC 
> NULLS FIRST, coalesce(b#17L, 0) ASC NULLS FIRST, isnull(b#17L) ASC NULLS 
> FIRST], false, 0
> :  +- Exchange hashpartitioning(coalesce(a#16L, 0), 
> isnull(a#16L), coalesce(b#17L, 0), isnull(b#17L), 5), ENSURE_REQUIREMENTS, 
> [id=#67]
> : +- HashAggregate(keys=[a#16L, b#17L], functions=[])
> :+- Exchange hashpartitioning(a#16L, b#17L, 5), 
> ENSURE_REQUIREMENTS, [id=#61]
> :   +- HashAggregate(keys=[a#16L, b#17L], functions=[])
> :  +- FileScan parquet default.t1[a#16L,b#17L] 
> Batched: true, DataFilters: [], Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/yumwang/spark/spark-warehouse/org.apache.spark.sql.Data...,
>  PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct
> +- Sort [coalesce(c#18L, 0) ASC NULLS FIRST, isnull(c#18L) ASC 
> NULLS FIRST, coalesce(d#19L, 0) ASC NULLS FIRST, isnull(d#19L) ASC NULLS 
> FIRST], false, 0
>+- Exchange hashpartitioning(coalesce(c#18L, 0), 
> isnull(c#18L), coalesce(d#19L, 0), isnull(d#19L), 5), 

[jira] [Commented] (SPARK-32285) Add PySpark support for nested timestamps with arrow

2021-08-11 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397729#comment-17397729
 ] 

Hyukjin Kwon commented on SPARK-32285:
--

The Arrow version is 2.0.0 now. Can we fix this ticket?

> Add PySpark support for nested timestamps with arrow
> 
>
> Key: SPARK-32285
> URL: https://issues.apache.org/jira/browse/SPARK-32285
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Major
>
> Currently, with Arrow optimizations, there is post-processing done in pandas
> for timestamp columns to localize the timezone. This is not done for nested
> columns with timestamps, such as StructType or ArrayType.
> Adding support for this is needed for the Apache Arrow 1.0.0 upgrade, due to
> the use of structs with timestamps in a groupby key over a window.
> As a simple first step, timestamps with one level of nesting could be done
> first; this would satisfy the immediate need.
> NOTE: with Arrow 1.0.0, it might be possible to do the timezone processing
> with pyarrow.array.cast, which could be easier than doing it in pandas.
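
A minimal sketch of the pyarrow route the note mentions, with illustrative values and timezone; the idea is to cast a timezone-naive Arrow timestamp array to a timezone-aware type directly, instead of post-processing in pandas.

{code:python}
# Timezone handling via pyarrow: cast a naive timestamp array to an
# equivalent timezone-aware type.
import pyarrow as pa

naive = pa.array([1628683200000000], type=pa.timestamp("us"))
aware = naive.cast(pa.timestamp("us", tz="UTC"))
print(aware)
{code}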






[jira] [Commented] (SPARK-36378) Minor changes to address a few identified server side inefficiencies

2021-08-11 Thread Mridul Muralidharan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397725#comment-17397725
 ] 

Mridul Muralidharan commented on SPARK-36378:
-

Thanks for marking this resolved, [~Ngone51]! The cherry-pick messed up the
merge script and I had to push it manually, which skipped the JIRA update :-)

> Minor changes to address a few identified server side inefficiencies
> 
>
> Key: SPARK-36378
> URL: https://issues.apache.org/jira/browse/SPARK-36378
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.2.0
>Reporter: Min Shen
>Assignee: Min Shen
>Priority: Major
> Fix For: 3.2.0, 3.3.0
>
>
> With the SPIP ticket close to being finished, we have done some performance
> evaluations to compare the performance of push-based shuffle in upstream
> Spark with the production version we have internally at LinkedIn.
> The evaluations have revealed a few regressions and also some additional perf
> improvement opportunities.






[jira] [Created] (SPARK-36484) Handle Stale block fetch failure on the client side by not retrying the requests

2021-08-11 Thread Venkata krishnan Sowrirajan (Jira)
Venkata krishnan Sowrirajan created SPARK-36484:
---

 Summary: Handle Stale block fetch failure on the client side by 
not retrying the requests
 Key: SPARK-36484
 URL: https://issues.apache.org/jira/browse/SPARK-36484
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.2.0
Reporter: Venkata krishnan Sowrirajan


Similar to SPARK-36378, we need to handle stale block fetch failures on the
client side, although not handling them would not cause any correctness
issues. Handling them would help save some server-side bandwidth that is
otherwise wasted serving stale shuffle fetch requests.






[jira] [Resolved] (SPARK-36481) Expose LogisticRegression.setInitialModel

2021-08-11 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-36481.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 33710
[https://github.com/apache/spark/pull/33710]

> Expose LogisticRegression.setInitialModel
> -
>
> Key: SPARK-36481
> URL: https://issues.apache.org/jira/browse/SPARK-36481
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.2.0
>Reporter: Sean R. Owen
>Assignee: Sean R. Owen
>Priority: Minor
> Fix For: 3.3.0
>
>
> Several Spark ML components already allow setting an initial model,
> including KMeans, LogisticRegression, and GaussianMixture. This is useful for
> beginning training from a known, reasonably good model.
> However, the method in LogisticRegression is private to Spark. I don't see a
> good reason why it should be, as the equivalents in KMeans et al. are not.
> None of these are exposed in PySpark, which I don't necessarily want to
> question or deal with now; there are other places one could arguably set an
> initial model too, but here I'm just interested in exposing the existing,
> tested functionality to callers.






[jira] [Assigned] (SPARK-36483) Fix intermittent test failure due to netty dependency version bump

2021-08-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36483:


Assignee: Apache Spark

> Fix intermittent test failure due to netty dependency version bump
> --
>
> Key: SPARK-36483
> URL: https://issues.apache.org/jira/browse/SPARK-36483
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Min Shen
>Assignee: Apache Spark
>Priority: Major
>
> In SPARK-35132, Spark's Netty dependency version was bumped from 4.1.51 to
> 4.1.63.
> Since Netty version 4.1.52, a Netty-specific
> io.netty.channel.StacklessClosedChannelException is thrown when Netty's
> AbstractChannel encounters a closed channel.
> This can sometimes break the test
> org.apache.spark.network.RPCIntegrationSuite, as reported
> [here|https://github.com/apache/spark/pull/33613#issuecomment-896697401].
> This is because the hardcoded list of exception messages checked in
> RPCIntegrationSuite does not include the new StacklessClosedChannelException.






[jira] [Assigned] (SPARK-36483) Fix intermittent test failure due to netty dependency version bump

2021-08-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36483:


Assignee: (was: Apache Spark)

> Fix intermittent test failure due to netty dependency version bump
> --
>
> Key: SPARK-36483
> URL: https://issues.apache.org/jira/browse/SPARK-36483
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Min Shen
>Priority: Major
>
> In SPARK-35132, Spark's Netty dependency version was bumped from 4.1.51 to
> 4.1.63.
> Since Netty version 4.1.52, a Netty-specific
> io.netty.channel.StacklessClosedChannelException is thrown when Netty's
> AbstractChannel encounters a closed channel.
> This can sometimes break the test
> org.apache.spark.network.RPCIntegrationSuite, as reported
> [here|https://github.com/apache/spark/pull/33613#issuecomment-896697401].
> This is because the hardcoded list of exception messages checked in
> RPCIntegrationSuite does not include the new StacklessClosedChannelException.






[jira] [Commented] (SPARK-36483) Fix intermittent test failure due to netty dependency version bump

2021-08-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397639#comment-17397639
 ] 

Apache Spark commented on SPARK-36483:
--

User 'Victsm' has created a pull request for this issue:
https://github.com/apache/spark/pull/33713

> Fix intermittent test failure due to netty dependency version bump
> --
>
> Key: SPARK-36483
> URL: https://issues.apache.org/jira/browse/SPARK-36483
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Min Shen
>Priority: Major
>
> In SPARK-35132, Spark's Netty dependency version was bumped from 4.1.51 to
> 4.1.63.
> Since Netty version 4.1.52, a Netty-specific
> io.netty.channel.StacklessClosedChannelException is thrown when Netty's
> AbstractChannel encounters a closed channel.
> This can sometimes break the test
> org.apache.spark.network.RPCIntegrationSuite, as reported
> [here|https://github.com/apache/spark/pull/33613#issuecomment-896697401].
> This is because the hardcoded list of exception messages checked in
> RPCIntegrationSuite does not include the new StacklessClosedChannelException.






[jira] [Updated] (SPARK-36483) Fix intermittent test failure due to netty dependency version bump

2021-08-11 Thread Min Shen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Min Shen updated SPARK-36483:
-
Description: 
In SPARK-35132, Spark's netty dependency version was bumped from 4.1.51 to 
4.1.63.

Since Netty version 4.1.52, a Netty specific 
io.netty.channel.StacklessClosedChannelException gets thrown when Netty's 
AbstractChannel encounters a closed channel.

This can sometimes break the test org.apache.spark.network.RPCIntegrationSuite 
as reported 
[here|[https://github.com/apache/spark/pull/33613#issuecomment-896697401].]

This is due to the hardcoded list of exception messages to check in 
RPCIntegrationSuite does not include this new StacklessClosedChannelException.

> Fix intermittent test failure due to netty dependency version bump
> --
>
> Key: SPARK-36483
> URL: https://issues.apache.org/jira/browse/SPARK-36483
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Min Shen
>Priority: Major
>
> In SPARK-35132, Spark's Netty dependency version was bumped from 4.1.51 to
> 4.1.63.
> Since Netty version 4.1.52, a Netty-specific
> io.netty.channel.StacklessClosedChannelException is thrown when Netty's
> AbstractChannel encounters a closed channel.
> This can sometimes break the test
> org.apache.spark.network.RPCIntegrationSuite, as reported
> [here|https://github.com/apache/spark/pull/33613#issuecomment-896697401].
> This is because the hardcoded list of exception messages checked in
> RPCIntegrationSuite does not include the new StacklessClosedChannelException.






[jira] [Created] (SPARK-36483) Fix intermittent test failure due to netty dependency version bump

2021-08-11 Thread Min Shen (Jira)
Min Shen created SPARK-36483:


 Summary: Fix intermittent test failure due to netty dependency 
version bump
 Key: SPARK-36483
 URL: https://issues.apache.org/jira/browse/SPARK-36483
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.2.0
Reporter: Min Shen









[jira] [Commented] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data

2021-08-11 Thread Cameron Todd (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397603#comment-17397603
 ] 

Cameron Todd commented on SPARK-18105:
--

Good to hear. Also, the count of 136,935,074 is right.

When you say "I downloaded and ran the test code on my laptop. It works well
for me", do you mean the Java code I attached, or the read-and-count() snippet
you just posted above? The error is only raised after the first few
aggregations and joins in my code, and I also doubt your laptop is big enough
to process that ;)

I can attach key config variables on request, but I'm using a cloud Kubernetes
Spark cluster managed by OVH (Spark 3.0.1), so it's an out-of-the-box solution
managed by OVH, not really by me.

> LZ4 failed to decompress a stream of shuffled data
> --
>
> Key: SPARK-18105
> URL: https://issues.apache.org/jira/browse/SPARK-18105
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1, 3.1.1
>Reporter: Davies Liu
>Priority: Major
> Attachments: TestWeightedGraph.java
>
>
> When lz4 is used to compress the shuffle files, it may fail to decompress it 
> as "stream is corrupt"
> {code}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 92 in stage 5.0 failed 4 times, most recent failure: Lost task 92.3 in 
> stage 5.0 (TID 16616, 10.0.27.18): java.io.IOException: Stream is corrupted
>   at 
> org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:220)
>   at 
> org.apache.spark.io.LZ4BlockInputStream.available(LZ4BlockInputStream.java:109)
>   at java.io.BufferedInputStream.read(BufferedInputStream.java:353)
>   at java.io.DataInputStream.read(DataInputStream.java:149)
>   at com.google.common.io.ByteStreams.read(ByteStreams.java:828)
>   at com.google.common.io.ByteStreams.readFully(ByteStreams.java:695)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:127)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:110)
>   at scala.collection.Iterator$$anon$13.next(Iterator.scala:372)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
>   at 
> org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:397)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> https://github.com/jpountz/lz4-java/issues/89






[jira] [Commented] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data

2021-08-11 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397585#comment-17397585
 ] 

Dongjoon Hyun commented on SPARK-18105:
---

BTW, I checked that the single zip file contains 157 snappy parquet files,
which were generated by Apache Spark 3.0.1.

{code}
$ ls -al /Users/dongjoon/data/SPARK-18105/hash_import.parquet | grep parquet | wc -l
 157

$ ls -al /Users/dongjoon/data/SPARK-18105/hash_import.parquet
total 5040768
drwxr-xr-x  160 dongjoon  staff      5120 Aug  9 02:55 .
drwxr-xr-x    4 dongjoon  staff       128 Aug 11 12:32 ..
-rw-r--r--    1 dongjoon  staff         0 Aug  9 02:21 _SUCCESS
-rw-r--r--    1 dongjoon  staff   7098577 Aug  9 02:19 part-00000-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet
-rw-r--r--    1 dongjoon  staff   9310705 Aug  9 02:19 part-00001-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet
-rw-r--r--    1 dongjoon  staff   9286598 Aug  9 02:19 part-00002-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet
-rw-r--r--    1 dongjoon  staff       900 Aug  9 02:19 part-00003-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet
-rw-r--r--    1 dongjoon  staff   7174124 Aug  9 02:19 part-00004-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet
-rw-r--r--    1 dongjoon  staff   7269784 Aug  9 02:19 part-00005-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet
-rw-r--r--    1 dongjoon  staff   9045563 Aug  9 02:19 part-00006-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet
-rw-r--r--    1 dongjoon  staff   8874048 Aug  9 02:19 part-00007-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet
-rw-r--r--    1 dongjoon  staff   9088055 Aug  9 02:19 part-00008-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet
-rw-r--r--    1 dongjoon  staff   7921907 Aug  9 02:19 part-00009-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet
-rw-r--r--    1 dongjoon  staff   9734530 Aug  9 02:19 part-00010-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet
-rw-r--r--    1 dongjoon  staff  11430364 Aug  9 02:19 part-00011-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet
-rw-r--r--    1 dongjoon  staff  10106472 Aug  9 02:19 part-00012-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet
-rw-r--r--    1 dongjoon  staff  11672118 Aug  9 02:19 part-00013-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet
-rw-r--r--    1 dongjoon  staff  11077873 Aug  9 02:19 part-00014-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet
-rw-r--r--    1 dongjoon  staff  17459766 Aug  9 02:19 part-00015-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet
-rw-r--r--    1 dongjoon  staff  18000141 Aug  9 02:19 part-00016-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet
-rw-r--r--    1 dongjoon  staff  10688008 Aug  9 02:19 part-00017-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet
-rw-r--r--    1 dongjoon  staff  10876962 Aug  9 02:19 part-00018-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet
-rw-r--r--    1 dongjoon  staff   6001747 Aug  9 02:19 part-00019-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet
-rw-r--r--    1 dongjoon  staff  13051685 Aug  9 02:19 part-00020-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet
-rw-r--r--    1 dongjoon  staff  40607547 Aug  9 02:19 part-00021-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet
-rw-r--r--    1 dongjoon  staff  16122066 Aug  9 02:19 part-00022-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet
-rw-r--r--    1 dongjoon  staff   7858387 Aug  9 02:19 part-00023-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet
-rw-r--r--    1 dongjoon  staff   7953870 Aug  9 02:19 part-00024-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet
-rw-r--r--    1 dongjoon  staff  20777892 Aug  9 02:19 part-00025-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet
-rw-r--r--    1 dongjoon  staff   7785559 Aug  9 02:19 part-00026-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet
-rw-r--r--    1 dongjoon  staff   6101494 Aug  9 02:19 part-00027-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet
-rw-r--r--    1 dongjoon  staff  14878775 Aug  9 02:19 part-00028-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet
-rw-r--r--    1 dongjoon  staff   8807412 Aug  9 02:19 part-00029-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet
-rw-r--r--    1 dongjoon  staff    664925 Aug  9 02:19 part-00030-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet
-rw-r--r--    1 dongjoon  staff  36711578 Aug  9 02:19 part-00031-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet
-rw-r--r--    1 dongjoon  staff   6376692 Aug  9 02:19 part-00032-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet
-rw-r--r--    1 dongjoon  staff   9121175 Aug  9 02:19 part-00033-51e43836-6ce9-419d-b632-2a8ca8185b9b-c000.snappy.parquet
-rw-r--r--    1 dongjoon  staff  12318653 Aug  9 02:19

[jira] [Commented] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data

2021-08-11 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397584#comment-17397584
 ] 

Dongjoon Hyun commented on SPARK-18105:
---

Thank you, [~cameron.todd]. I downloaded and ran the test code on my laptop. It
works well for me. Do you have some other configurations on your side?
{code}
SPARK-18105:$ du -h hash_import.parquet
2.4Ghash_import.parquet
{code}

{code}
scala> val df = 
spark.read.parquet("/Users/dongjoon/data/SPARK-18105/hash_import.parquet").repartition(5000).cache()
df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: bigint, 
pivot_hash: int ... 1 more field]

scala> df.count()
res3: Long = 136935074

scala> spark.version
res4: String = 3.1.2
{code}

> LZ4 failed to decompress a stream of shuffled data
> --
>
> Key: SPARK-18105
> URL: https://issues.apache.org/jira/browse/SPARK-18105
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1, 3.1.1
>Reporter: Davies Liu
>Priority: Major
> Attachments: TestWeightedGraph.java
>
>
> When lz4 is used to compress the shuffle files, it may fail to decompress it 
> as "stream is corrupt"
> {code}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 92 in stage 5.0 failed 4 times, most recent failure: Lost task 92.3 in 
> stage 5.0 (TID 16616, 10.0.27.18): java.io.IOException: Stream is corrupted
>   at 
> org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:220)
>   at 
> org.apache.spark.io.LZ4BlockInputStream.available(LZ4BlockInputStream.java:109)
>   at java.io.BufferedInputStream.read(BufferedInputStream.java:353)
>   at java.io.DataInputStream.read(DataInputStream.java:149)
>   at com.google.common.io.ByteStreams.read(ByteStreams.java:828)
>   at com.google.common.io.ByteStreams.readFully(ByteStreams.java:695)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:127)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:110)
>   at scala.collection.Iterator$$anon$13.next(Iterator.scala:372)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
>   at 
> org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:397)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> https://github.com/jpountz/lz4-java/issues/89






[jira] [Updated] (SPARK-34827) Support fetching shuffle blocks in batch with i/o encryption

2021-08-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-34827:
--
Affects Version/s: 3.3.0

> Support fetching shuffle blocks in batch with i/o encryption
> 
>
> Key: SPARK-34827
> URL: https://issues.apache.org/jira/browse/SPARK-34827
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>







[jira] [Commented] (SPARK-34827) Support fetching shuffle blocks in batch with i/o encryption

2021-08-11 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397569#comment-17397569
 ] 

Dongjoon Hyun commented on SPARK-34827:
---

Thank you. I moved the Target Version to 3.3.0.
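
For context, the two features whose interaction this ticket tracks are enabled roughly as below, assuming the standard config names; whether batched shuffle-block fetch actually takes effect when I/O encryption is on is the open question here.

{code:python}
# Sketch: enable both I/O encryption and AQE's batched shuffle-block fetch.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.io.encryption.enabled", "true")
    .config("spark.sql.adaptive.fetchShuffleBlocksInBatch", "true")
    .getOrCreate()
)
{code}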

> Support fetching shuffle blocks in batch with i/o encryption
> 
>
> Key: SPARK-34827
> URL: https://issues.apache.org/jira/browse/SPARK-34827
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>







[jira] [Updated] (SPARK-34827) Support fetching shuffle blocks in batch with i/o encryption

2021-08-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-34827:
--
Target Version/s: 3.3.0  (was: 3.2.0)

> Support fetching shuffle blocks in batch with i/o encryption
> 
>
> Key: SPARK-34827
> URL: https://issues.apache.org/jira/browse/SPARK-34827
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36378) Minor changes to address a few identified server side inefficiencies

2021-08-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397563#comment-17397563
 ] 

Apache Spark commented on SPARK-36378:
--

User 'Victsm' has created a pull request for this issue:
https://github.com/apache/spark/pull/33713

> Minor changes to address a few identified server side inefficiencies
> 
>
> Key: SPARK-36378
> URL: https://issues.apache.org/jira/browse/SPARK-36378
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.2.0
>Reporter: Min Shen
>Assignee: Min Shen
>Priority: Major
> Fix For: 3.2.0, 3.3.0
>
>
> With the SPIP ticket close to being finished, we have done some performance 
> evaluations to compare the performance of push-based shuffle in upstream 
> Spark with the production version we have internally at LinkedIn.
> The evaluations have revealed a few regressions and also some additional perf 
> improvement opportunities.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36482) Bump orc to 1.6.10

2021-08-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-36482.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 33712
[https://github.com/apache/spark/pull/33712]

> Bump orc to 1.6.10
> --
>
> Key: SPARK-36482
> URL: https://issues.apache.org/jira/browse/SPARK-36482
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: William Hyun
>Priority: Major
> Fix For: 3.2.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36482) Bump orc to 1.6.10

2021-08-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-36482:
-

Assignee: William Hyun

> Bump orc to 1.6.10
> --
>
> Key: SPARK-36482
> URL: https://issues.apache.org/jira/browse/SPARK-36482
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: William Hyun
>Assignee: William Hyun
>Priority: Major
> Fix For: 3.2.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36482) Bump orc to 1.6.10

2021-08-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36482:


Assignee: Apache Spark

> Bump orc to 1.6.10
> --
>
> Key: SPARK-36482
> URL: https://issues.apache.org/jira/browse/SPARK-36482
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: William Hyun
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36482) Bump orc to 1.6.10

2021-08-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397486#comment-17397486
 ] 

Apache Spark commented on SPARK-36482:
--

User 'williamhyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/33712

> Bump orc to 1.6.10
> --
>
> Key: SPARK-36482
> URL: https://issues.apache.org/jira/browse/SPARK-36482
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: William Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36482) Bump orc to 1.6.10

2021-08-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36482:


Assignee: (was: Apache Spark)

> Bump orc to 1.6.10
> --
>
> Key: SPARK-36482
> URL: https://issues.apache.org/jira/browse/SPARK-36482
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: William Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36482) Bump orc to 1.6.10

2021-08-11 Thread William Hyun (Jira)
William Hyun created SPARK-36482:


 Summary: Bump orc to 1.6.10
 Key: SPARK-36482
 URL: https://issues.apache.org/jira/browse/SPARK-36482
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.3.0
Reporter: William Hyun






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36446) YARN shuffle server restart crashes all dynamic allocation jobs that have deallocated an executor

2021-08-11 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397461#comment-17397461
 ] 

Thomas Graves commented on SPARK-36446:
---

[~adamkennedy77] ^

> YARN shuffle server restart crashes all dynamic allocation jobs that have 
> deallocated an executor
> -
>
> Key: SPARK-36446
> URL: https://issues.apache.org/jira/browse/SPARK-36446
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.4.8, 3.1.2
>Reporter: Adam Kennedy
>Priority: Critical
>
> When dynamic allocation is enabled, executors that deallocate rely on the 
> shuffle server to hold their blocks and supply them to the remaining executors.
> When the YARN shuffle server restarts (either intentionally or due to a crash), 
> it loses its block information and relies on being able to contact the 
> executors (whose locations it durably stores) to refetch the list of blocks.
> This mutual dependency, where each side relies on the other to hold block 
> information, fails fatally under some common scenarios.
> For example, if a Spark application is running under dynamic allocation, some 
> number of executors will almost always shut down.
> If, after this has occurred, any shuffle server crashes or is restarted 
> (either directly when running as a standalone service, or as part of a YARN 
> node manager restart), then there is no way to restore the block data and it 
> is permanently lost.
> Worse, when executors try to fetch blocks from the shuffle server, the 
> shuffle server cannot locate the executor, decides it doesn't exist, treats 
> this as a fatal exception, and causes the application to terminate and crash.
> Thus, in a real-world scenario that we observe on a 1000+ node multi-tenant 
> cluster where dynamic allocation is on by default, a rolling restart of the 
> YARN node managers will crash ALL jobs that have deallocated any executor and 
> have shuffled or transferred blocks to the shuffle server.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
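
To make the failure scenario concrete, this is a minimal sketch of the client-side 
settings under which it arises; the YARN-side aux-service setup and the NodeManager 
rolling restart are as described in the report above.

{code:java}
import org.apache.spark.sql.SparkSession

// Minimal sketch: the combination under which the reported crash arises.
// With both flags on, deallocated executors leave their shuffle blocks with
// the YARN shuffle service; a NodeManager restart then loses that state.
val spark = SparkSession.builder()
  .appName("dyn-alloc-with-external-shuffle")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true") // requires the YARN aux service
  .getOrCreate()
{code}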



[jira] [Assigned] (SPARK-36481) Expose LogisticRegression.setInitialModel

2021-08-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36481:


Assignee: Sean R. Owen  (was: Apache Spark)

> Expose LogisticRegression.setInitialModel
> -
>
> Key: SPARK-36481
> URL: https://issues.apache.org/jira/browse/SPARK-36481
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.2.0
>Reporter: Sean R. Owen
>Assignee: Sean R. Owen
>Priority: Minor
>
> Several Spark ML components already allow setting of an initial model, 
> including KMeans, LogisticRegression, and GaussianMixture. This is useful to 
> begin training from a known reasonably good model.
> However, the method in LogisticRegression is private to Spark. I don't see a 
> good reason why it should be, as the others in KMeans et al. are not.
> None of these are exposed in PySpark, which I don't necessarily want to 
> question or deal with now; there are other places one could arguably set an 
> initial model too, but here I'm just interested in exposing the existing, 
> tested functionality to callers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36481) Expose LogisticRegression.setInitialModel

2021-08-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36481:


Assignee: Apache Spark  (was: Sean R. Owen)

> Expose LogisticRegression.setInitialModel
> -
>
> Key: SPARK-36481
> URL: https://issues.apache.org/jira/browse/SPARK-36481
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.2.0
>Reporter: Sean R. Owen
>Assignee: Apache Spark
>Priority: Minor
>
> Several Spark ML components already allow setting of an initial model, 
> including KMeans, LogisticRegression, and GaussianMixture. This is useful to 
> begin training from a known reasonably good model.
> However, the method in LogisticRegression is private to Spark. I don't see a 
> good reason why it should be, as the others in KMeans et al. are not.
> None of these are exposed in PySpark, which I don't necessarily want to 
> question or deal with now; there are other places one could arguably set an 
> initial model too, but here I'm just interested in exposing the existing, 
> tested functionality to callers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36481) Expose LogisticRegression.setInitialModel

2021-08-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397445#comment-17397445
 ] 

Apache Spark commented on SPARK-36481:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/33710

> Expose LogisticRegression.setInitialModel
> -
>
> Key: SPARK-36481
> URL: https://issues.apache.org/jira/browse/SPARK-36481
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.2.0
>Reporter: Sean R. Owen
>Assignee: Sean R. Owen
>Priority: Minor
>
> Several Spark ML components already allow setting of an initial model, 
> including KMeans, LogisticRegression, and GaussianMixture. This is useful to 
> begin training from a known reasonably good model.
> However, the method in LogisticRegression is private to Spark. I don't see a 
> good reason why it should be, as the others in KMeans et al. are not.
> None of these are exposed in PySpark, which I don't necessarily want to 
> question or deal with now; there are other places one could arguably set an 
> initial model too, but here I'm just interested in exposing the existing, 
> tested functionality to callers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36481) Expose LogisticRegression.setInitialModel

2021-08-11 Thread Sean R. Owen (Jira)
Sean R. Owen created SPARK-36481:


 Summary: Expose LogisticRegression.setInitialModel
 Key: SPARK-36481
 URL: https://issues.apache.org/jira/browse/SPARK-36481
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 3.2.0
Reporter: Sean R. Owen
Assignee: Sean R. Owen


Several Spark ML components already allow setting of an initial model, 
including KMeans, LogisticRegression, and GaussianMixture. This is useful to 
begin training from a known reasonably good model.

However, the method in LogisticRegression is private to Spark. I don't see a 
good reason why it should be, as the others in KMeans et al. are not.

None of these are exposed in PySpark, which I don't necessarily want to 
question or deal with now; there are other places one could arguably set an 
initial model too, but here I'm just interested in exposing the existing, 
tested functionality to callers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
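
For comparison, the RDD-based API already exposes this pattern publicly. A minimal 
sketch with org.apache.spark.mllib.clustering.KMeans follows; the spark.ml 
LogisticRegression setter that this ticket proposes to open up is private today, so 
it is not shown.

{code:java}
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Seed training from a known reasonably good model instead of random centers.
val initial = new KMeansModel(Array(Vectors.dense(0.0, 0.0), Vectors.dense(10.0, 10.0)))

def train(data: RDD[Vector]): KMeansModel =
  new KMeans()
    .setK(2)
    .setInitialModel(initial) // public in mllib; SPARK-36481 exposes the ml analogue
    .run(data)
{code}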



[jira] [Updated] (SPARK-36360) Delete appName from StreamingSource

2021-08-11 Thread Marcel Neumann (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcel Neumann updated SPARK-36360:
---
Description: 
The StreamingSource includes the appName in its sourceName. However, the 
appName should not be handled by the StreamingSource. It is already handled by 
the MetricsSystem. See all other MetricSources, e.g. ExecutorMetricsSource.

Why is this important? See this part from the 
[documentation|https://spark.apache.org/docs/latest/monitoring.html#metrics]:

??Often times, users want to be able to track the metrics across apps for 
driver and executors, which is hard to do with application ID (i.e. 
{{spark.app.id}}) since it changes with every invocation of the app. For such 
use cases, a custom namespace can be specified for metrics reporting using 
{{spark.metrics.namespace}} configuration property. If, say, users wanted to 
set the metrics namespace to the name of the application, they can set the 
{{spark.metrics.namespace}} property to a value like {{$\{spark.app.name}}}. 
This value is then expanded appropriately by Spark and is used as the root 
namespace of the metrics system.??

This is only possible if the MetricsSystem handles the namespace, which it does. 
But the StreamingSource additionally adds the appName in its sourceName, thus 
there is no way to configure a namespace that does not include the appName.

 

  was:
The StreamingSource includes the appName in its sourceName. The appName should 
not be handled by the StreamingSource. It is already handled by the 
MetricsSystem. See all other MetricSources, e.g. ExecutorMetricsSource.

Why is this important? See this part from the 
[documentation|https://spark.apache.org/docs/latest/monitoring.html#metrics]:

??Often times, users want to be able to track the metrics across apps for 
driver and executors, which is hard to do with application ID (i.e. 
{{spark.app.id}}) since it changes with every invocation of the app. For such 
use cases, a custom namespace can be specified for metrics reporting using 
{{spark.metrics.namespace}} configuration property. If, say, users wanted to 
set the metrics namespace to the name of the application, they can set the 
{{spark.metrics.namespace}} property to a value like {{${spark.app.name}}}. 
This value is then expanded appropriately by Spark and is used as the root 
namespace of the metrics system.??

This is only possible if the MetricsSystem handles the namespace, which it does. 
But the StreamingSource additionally adds the appName in its sourceName, thus 
there is no way to configure a namespace that does not include the appName.

 


> Delete appName from StreamingSource 
> 
>
> Key: SPARK-36360
> URL: https://issues.apache.org/jira/browse/SPARK-36360
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.2
>Reporter: Marcel Neumann
>Priority: Blocker
>
> The StreamingSource includes the appName in its sourceName. However, the 
> appName should not be handled by the StreamingSource. It is already handled 
> by the MetricsSystem. See all other MetricSources, e.g. ExecutorMetricsSource.
> Why is this important? See this part from the 
> [documentation|https://spark.apache.org/docs/latest/monitoring.html#metrics]:
> ??Often times, users want to be able to track the metrics across apps for 
> driver and executors, which is hard to do with application ID (i.e. 
> {{spark.app.id}}) since it changes with every invocation of the app. For such 
> use cases, a custom namespace can be specified for metrics reporting using 
> {{spark.metrics.namespace}} configuration property. If, say, users wanted to 
> set the metrics namespace to the name of the application, they can set the 
> {{spark.metrics.namespace}} property to a value like {{$\{spark.app.name}}}. 
> This value is then expanded appropriately by Spark and is used as the root 
> namespace of the metrics system.??
> This is only possible if the MetricsSystem handles the namespace, which it 
> does. But the StreamingSource additionally adds the appName in its 
> sourceName, thus there is no way to configure a namespace that does not 
> include the appName.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
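
For reference, the documented setup quoted above comes down to a single property; a 
minimal spark-defaults.conf sketch of it follows, which this ticket argues 
StreamingSource currently undermines by prefixing the appName on its own.

{code}
# spark-defaults.conf
# Use the stable app name, not the per-run spark.app.id, as the metrics root namespace.
spark.metrics.namespace=${spark.app.name}
{code}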



[jira] [Updated] (SPARK-36360) Delete appName from StreamingSource

2021-08-11 Thread Marcel Neumann (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcel Neumann updated SPARK-36360:
---
Description: 
The StreamingSource includes the appName in its sourceName. The appName should 
not be handled by the StreamingSource. It is already handled by the 
MetricsSystem. See all other MetricSources, e.g. ExecutorMetricsSource.

Why is this important? See this part from the 
[documentation|https://spark.apache.org/docs/latest/monitoring.html#metrics]:

??Often times, users want to be able to track the metrics across apps for 
driver and executors, which is hard to do with application ID (i.e. 
{{spark.app.id}}) since it changes with every invocation of the app. For such 
use cases, a custom namespace can be specified for metrics reporting using 
{{spark.metrics.namespace}} configuration property. If, say, users wanted to 
set the metrics namespace to the name of the application, they can set the 
{{spark.metrics.namespace}} property to a value like {{${spark.app.name}}}. 
This value is then expanded appropriately by Spark and is used as the root 
namespace of the metrics system.??

This is only possible if the MetricsSystem handles the namespace, which it does. 
But the StreamingSource additionally adds the appName in its sourceName, thus 
there is no way to configure a namespace that does not include the appName.

 

  was:
The StreamingSource includes the appName in its sourceName. This is not desired 
when using a custom namespace for metrics reporting using 
{{spark.metrics.namespace}} configuration property as the {{spark.app.name}} 
cannot be excluded in the name of the metric. Using a metrics namespace results 
in a duplicated indicator for {{spark.app.name}}.

All other metrics do not include the appName in their sourceName, e.g. 
ExecutorMetricsSource. If desired you can configure that with a metrics 
namespace. The same should be done in StreamingSource.


> Delete appName from StreamingSource 
> 
>
> Key: SPARK-36360
> URL: https://issues.apache.org/jira/browse/SPARK-36360
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.2
>Reporter: Marcel Neumann
>Priority: Blocker
>
> The StreamingSource includes the appName in its sourceName. The appName 
> should not be handled by the StreamingSource. It is already handled by the 
> MetricsSystem. See all other MetricSources, e.g. ExecutorMetricsSource.
> Why is this important? See this part from the 
> [documentation|https://spark.apache.org/docs/latest/monitoring.html#metrics]:
> ??Often times, users want to be able to track the metrics across apps for 
> driver and executors, which is hard to do with application ID (i.e. 
> {{spark.app.id}}) since it changes with every invocation of the app. For such 
> use cases, a custom namespace can be specified for metrics reporting using 
> {{spark.metrics.namespace}} configuration property. If, say, users wanted to 
> set the metrics namespace to the name of the application, they can set the 
> {{spark.metrics.namespace}} property to a value like {{${spark.app.name}}}. 
> This value is then expanded appropriately by Spark and is used as the root 
> namespace of the metrics system.??
> This is only possible if the MetricsSystem handles the namespace, which it 
> does. But the StreamingSource additionally adds the appName in its 
> sourceName, thus there is no way to configure a namespace that does not 
> include the appName.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36478) Removes outer join if all grouping and aggregate expressions are from the streamed side

2021-08-11 Thread Wan Kun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wan Kun updated SPARK-36478:

Description: 
Removes outer join if all grouping and aggregate expressions are from the 
streamed side.

For example:
{code:java}
spark.range(200L).selectExpr("id AS a", "id as b", "id as 
c").createTempView("t1")
spark.range(300L).selectExpr("id AS a").createTempView("t2")
spark.sql("SELECT t1.b, max(t1.c) as c FROM t1 LEFT JOIN t2 ON t1.a = t2.a 
GROUP BY t1.b").explain(true)
{code}
Current optimized plan:
{code:java}
== Optimized Logical Plan ==
Aggregate [b#3L], [b#3L, max(c#4L) AS c#20L]
+- Project [b#3L, c#4L]
   +- Join LeftOuter, (a#2L = a#10L)
  :- Project [id#0L AS a#2L, id#0L AS b#3L, id#0L AS c#4L]
  :  +- Range (0, 200, step=1, splits=Some(1))
  +- Project [id#8L AS a#10L]
 +- Range (0, 300, step=1, splits=Some(1))
{code}
Expected optimized plan:
{code:java}
== Optimized Logical Plan ==
Aggregate [b#277L], [b#277L, max(c#278L) AS c#290L]
+- Project [id#274L AS b#277L, id#274L AS c#278L]
   +- Range (0, 200, step=1, splits=Some(2))
{code}

  was:
Removes outer join if all grouping and aggregate expressions are from the 
streamed side.

For example:
{code:java}
spark.range(200L).selectExpr("id AS a", "id as b", "id as 
c").createTempView("t1")
spark.range(300L).selectExpr("id AS a", "id as b", "id as 
c").createTempView("t2")
spark.sql("SELECT t1.b, max(t1.c) as c FROM t1 LEFT JOIN t2 ON t1.a = t2.a 
GROUP BY t1.b").explain(true)
{code}
Current optimized plan:
{code:java}
== Optimized Logical Plan ==
Aggregate [b#3L], [b#3L, max(c#4L) AS c#20L]
+- Project [b#3L, c#4L]
   +- Join LeftOuter, (a#2L = a#10L)
  :- Project [id#0L AS a#2L, id#0L AS b#3L, id#0L AS c#4L]
  :  +- Range (0, 200, step=1, splits=Some(1))
  +- Project [id#8L AS a#10L]
 +- Range (0, 300, step=1, splits=Some(1))
{code}
Expected optimized plan:
{code:java}
== Optimized Logical Plan ==
Aggregate [b#277L], [b#277L, max(c#278L) AS c#290L]
+- Project [id#274L AS b#277L, id#274L AS c#278L]
   +- Range (0, 200, step=1, splits=Some(2))
{code}


> Removes outer join if all grouping and aggregate expressions are from the 
> streamed side
> ---
>
> Key: SPARK-36478
> URL: https://issues.apache.org/jira/browse/SPARK-36478
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wan Kun
>Priority: Minor
>
> Removes outer join if all grouping and aggregate expressions are from the 
> streamed side.
> For example:
> {code:java}
> spark.range(200L).selectExpr("id AS a", "id as b", "id as 
> c").createTempView("t1")
> spark.range(300L).selectExpr("id AS a").createTempView("t2")
> spark.sql("SELECT t1.b, max(t1.c) as c FROM t1 LEFT JOIN t2 ON t1.a = t2.a 
> GROUP BY t1.b").explain(true)
> {code}
> Current optimized plan:
> {code:java}
> == Optimized Logical Plan ==
> Aggregate [b#3L], [b#3L, max(c#4L) AS c#20L]
> +- Project [b#3L, c#4L]
>+- Join LeftOuter, (a#2L = a#10L)
>   :- Project [id#0L AS a#2L, id#0L AS b#3L, id#0L AS c#4L]
>   :  +- Range (0, 200, step=1, splits=Some(1))
>   +- Project [id#8L AS a#10L]
>  +- Range (0, 300, step=1, splits=Some(1))
> {code}
> Expected optimized plan:
> {code:java}
> == Optimized Logical Plan ==
> Aggregate [b#277L], [b#277L, max(c#278L) AS c#290L]
> +- Project [id#274L AS b#277L, id#274L AS c#278L]
>+- Range (0, 200, step=1, splits=Some(2))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
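
In other words, under the proposed rule the query above can be answered without the 
join at all. A sketch of the equivalent query follows; the rewrite is safe here 
because a left outer join keeps every t1 row, t2.a matches each row at most once, and 
max() is insensitive to duplicates anyway.

{code:java}
spark.sql("SELECT t1.b, max(t1.c) AS c FROM t1 GROUP BY t1.b").explain(true)
{code}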



[jira] [Updated] (SPARK-36360) Delete appName from StreamingSource

2021-08-11 Thread Marcel Neumann (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcel Neumann updated SPARK-36360:
---
Description: 
The StreamingSource includes the appName in its sourceName. This is not desired 
when using a custom namespace for metrics reporting using 
{{spark.metrics.namespace}} configuration property as the {{spark.app.name}} 
cannot be excluded from the name of the metric. Using a metrics namespace results 
in a duplicated indicator for {{spark.app.name}}.

All other metrics do not include the appName in their sourceName, e.g. 
ExecutorMetricsSource. If desired you can configure that with a metrics 
namespace. The same should be done in StreamingSource.

  was:
The StreamingSource includes the appName in its sourceName. This is not desired 
for people using a custom namespace for metrics reporting using 
{{spark.metrics.namespace}} configuration property as the {{spark.app.name}} 
cannot be excluded from the name of the metric. Using a metrics namespace results 
in a duplicated indicator for {{spark.app.name}}.

All other metrics do not include the appName in their sourceName, e.g. 
ExecutorMetricsSource. If desired you can configure that with a metrics 
namespace. The same should be done in StreamingSource.


> Delete appName from StreamingSource 
> 
>
> Key: SPARK-36360
> URL: https://issues.apache.org/jira/browse/SPARK-36360
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.2
>Reporter: Marcel Neumann
>Priority: Blocker
>
> The StreamingSource includes the appName in its sourceName. This is not 
> desired when using a custom namespace for metrics reporting using 
> {{spark.metrics.namespace}} configuration property as the {{spark.app.name}} 
> cannot be excluded from the name of the metric. Using a metrics namespace 
> results in a duplicated indicator for {{spark.app.name}}.
> All other metrics do not include the appName in their sourceName, e.g. 
> ExecutorMetricsSource. If desired you can configure that with a metrics 
> namespace. The same should be done in StreamingSource.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36360) Delete appName from StreamingSource

2021-08-11 Thread Marcel Neumann (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcel Neumann updated SPARK-36360:
---
Summary: Delete appName from StreamingSource   (was: StreamingSource 
duplicates appName)

> Delete appName from StreamingSource 
> 
>
> Key: SPARK-36360
> URL: https://issues.apache.org/jira/browse/SPARK-36360
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.2
>Reporter: Marcel Neumann
>Priority: Blocker
>
> The StreamingSource includes the appName in its sourceName. This is not 
> desired for people using a custom namespace for metrics reporting using 
> {{spark.metrics.namespace}} configuration property as the {{spark.app.name}} 
> cannot be excluded from the name of the metric. Using a metrics namespace 
> results in a duplicated indicator for {{spark.app.name}}.
> All other metrics do not include the appName in their sourceName, e.g. 
> ExecutorMetricsSource. If desired you can configure that with a metrics 
> namespace. The same should be done in StreamingSource.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36360) StreamingSource duplicates appName

2021-08-11 Thread Marcel Neumann (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcel Neumann updated SPARK-36360:
---
Description: 
The StreamingSource includes the appName in its sourceName. This is not desired 
for people using a custom namespace for metrics reporting using 
{{spark.metrics.namespace}} configuration property as the {{spark.app.name}} 
cannot be excluded from the name of the metric. Using a metrics namespace results 
in a duplicated indicator for {{spark.app.name}}.

All other metrics do not include the appName in their sourceName, e.g. 
ExecutorMetricsSource. If desired you can configure that with a metrics 
namespace. The same should be done in StreamingSource.

  was:
The StreamingSource includes the appName in its sourceName. This is not desired 
for people using a custom namespace for metrics reporting using 
{{spark.metrics.namespace}} configuration property as the {{spark.app.name}} 
cannot be excluded from the name of the metric. Using a metrics namespace results 
in a duplicated indicator for {{spark.app.name}}.

All other metrics do not include the appName in their sourceName, e.g. 
ExecutorMetricsSource. If desired you can configure that with a metrics 
namespace.


> StreamingSource duplicates appName
> --
>
> Key: SPARK-36360
> URL: https://issues.apache.org/jira/browse/SPARK-36360
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.2
>Reporter: Marcel Neumann
>Priority: Blocker
>
> The StreamingSource includes the appName in its sourceName. This is not 
> desired for people using a custom namespace for metrics reporting using 
> {{spark.metrics.namespace}} configuration property as the {{spark.app.name}} 
> cannot be excluded from the name of the metric. Using a metrics namespace 
> results in a duplicated indicator for {{spark.app.name}}.
> All other metrics do not include the appName in their sourceName, e.g. 
> ExecutorMetricsSource. If desired you can configure that with a metrics 
> namespace. The same should be done in StreamingSource.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36360) StreamingSource duplicates appName

2021-08-11 Thread Marcel Neumann (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcel Neumann updated SPARK-36360:
---
Description: 
The StreamingSource includes the appName in its sourceName. This is not desired 
for people using a custom namespace for metrics reporting using 
{{spark.metrics.namespace}} configuration property as the {{spark.app.name}} 
cannot be excluded from the name of the metric. Using a metrics namespace results 
in a duplicated indicator for {{spark.app.name}}.

All other metrics do not include the appName in their sourceName, e.g. 
ExecutorMetricsSource. If desired you can configure that with a metrics 
namespace.

  was:The StreamingSource includes the appName in its sourceName. This is not 
desired for people using a custom namespace for metrics reporting using 
{{spark.metrics.namespace}} configuration property as the {{spark.app.name}} 
cannot be excluded from the name of the metric. Using a metrics namespace results 
in a duplicated indicator for {{spark.app.name}}.


> StreamingSource duplicates appName
> --
>
> Key: SPARK-36360
> URL: https://issues.apache.org/jira/browse/SPARK-36360
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.2
>Reporter: Marcel Neumann
>Priority: Blocker
>
> The StreamingSource includes the appName in its sourceName. This is not 
> desired for people using a custom namespace for metrics reporting using 
> {{spark.metrics.namespace}} configuration property as the {{spark.app.name}} 
> cannot be excluded from the name of the metric. Using a metrics namespace 
> results in a duplicated indicator for {{spark.app.name}}.
> All other metrics do not include the appName in their sourceName, e.g. 
> ExecutorMetricsSource. If desired you can configure that with a metrics 
> namespace.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36360) StreamingSource duplicates appName

2021-08-11 Thread Marcel Neumann (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcel Neumann updated SPARK-36360:
---
Priority: Blocker  (was: Trivial)

> StreamingSource duplicates appName
> --
>
> Key: SPARK-36360
> URL: https://issues.apache.org/jira/browse/SPARK-36360
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.2
>Reporter: Marcel Neumann
>Priority: Blocker
>
> The StreamingSource includes the appName in its sourceName. This is not 
> desired for people using a custom namespace for metrics reporting using 
> {{spark.metrics.namespace}} configuration property as the {{spark.app.name}} 
> cannot be excluded from the name of the metric. Using a metrics namespace 
> results in a duplicated indicator for {{spark.app.name}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36360) StreamingSource duplicates appName

2021-08-11 Thread Marcel Neumann (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcel Neumann updated SPARK-36360:
---
Priority: Trivial  (was: Minor)

> StreamingSource duplicates appName
> --
>
> Key: SPARK-36360
> URL: https://issues.apache.org/jira/browse/SPARK-36360
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.2
>Reporter: Marcel Neumann
>Priority: Trivial
>
> The StreamingSource includes the appName in its sourceName. This is not 
> desired for people using a custom namespace for metrics reporting using 
> {{spark.metrics.namespace}} configuration property as the {{spark.app.name}} 
> cannot be excluded from the name of the metric. Using a metrics namespace 
> results in a duplicated indicator for {{spark.app.name}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36478) Removes outer join if all grouping and aggregate expressions are from the streamed side

2021-08-11 Thread Wan Kun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wan Kun updated SPARK-36478:

Description: 
Removes outer join if all grouping and aggregate expressions are from the 
streamed side.

For example:
{code:java}
spark.range(200L).selectExpr("id AS a", "id as b", "id as 
c").createTempView("t1")
spark.range(300L).selectExpr("id AS a", "id as b", "id as 
c").createTempView("t2")
spark.sql("SELECT t1.b, max(t1.c) as c FROM t1 LEFT JOIN t2 ON t1.a = t2.a 
GROUP BY t1.b").explain(true)
{code}
Current optimized plan:
{code:java}
== Optimized Logical Plan ==
Aggregate [b#3L], [b#3L, max(c#4L) AS c#20L]
+- Project [b#3L, c#4L]
   +- Join LeftOuter, (a#2L = a#10L)
  :- Project [id#0L AS a#2L, id#0L AS b#3L, id#0L AS c#4L]
  :  +- Range (0, 200, step=1, splits=Some(1))
  +- Project [id#8L AS a#10L]
 +- Range (0, 300, step=1, splits=Some(1))
{code}
Expected optimized plan:
{code:java}
== Optimized Logical Plan ==
Aggregate [b#277L], [b#277L, max(c#278L) AS c#290L]
+- Project [id#274L AS b#277L, id#274L AS c#278L]
   +- Range (0, 200, step=1, splits=Some(2))
{code}

  was:
Removes outer join if all grouping and aggregate expressions are from the 
streamed side.

For example:
{code:java}
spark.range(200L).selectExpr("id AS a").createTempView("t1")
spark.range(300L).selectExpr("id AS b").createTempView("t2")
spark.sql("SELECT DISTINCT a FROM t1 LEFT JOIN t2 ON a = b").explain(true){code}
Current optimized plan:
{code:java}
== Optimized Logical Plan ==
Aggregate [b#3L], [b#3L, max(c#4L) AS c#20L]
+- Project [b#3L, c#4L]
   +- Join LeftOuter, (a#2L = a#10L)
  :- Project [id#0L AS a#2L, id#0L AS b#3L, id#0L AS c#4L]
  :  +- Range (0, 200, step=1, splits=Some(1))
  +- Project [id#8L AS a#10L]
 +- Range (0, 300, step=1, splits=Some(1))
{code}
Expected optimized plan:
{code:java}
== Optimized Logical Plan ==
Aggregate [b#277L], [b#277L, max(c#278L) AS c#290L]
+- Project [id#274L AS b#277L, id#274L AS c#278L]
   +- Range (0, 200, step=1, splits=Some(2))
{code}


> Removes outer join if all grouping and aggregate expressions are from the 
> streamed side
> ---
>
> Key: SPARK-36478
> URL: https://issues.apache.org/jira/browse/SPARK-36478
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wan Kun
>Priority: Minor
>
> Removes outer join if all grouping and aggregate expressions are from the 
> streamed side.
> For example:
> {code:java}
> spark.range(200L).selectExpr("id AS a", "id as b", "id as 
> c").createTempView("t1")
> spark.range(300L).selectExpr("id AS a", "id as b", "id as 
> c").createTempView("t2")
> spark.sql("SELECT t1.b, max(t1.c) as c FROM t1 LEFT JOIN t2 ON t1.a = t2.a 
> GROUP BY t1.b").explain(true)
> {code}
> Current optimized plan:
> {code:java}
> == Optimized Logical Plan ==
> Aggregate [b#3L], [b#3L, max(c#4L) AS c#20L]
> +- Project [b#3L, c#4L]
>+- Join LeftOuter, (a#2L = a#10L)
>   :- Project [id#0L AS a#2L, id#0L AS b#3L, id#0L AS c#4L]
>   :  +- Range (0, 200, step=1, splits=Some(1))
>   +- Project [id#8L AS a#10L]
>  +- Range (0, 300, step=1, splits=Some(1))
> {code}
> Expected optimized plan:
> {code:java}
> == Optimized Logical Plan ==
> Aggregate [b#277L], [b#277L, max(c#278L) AS c#290L]
> +- Project [id#274L AS b#277L, id#274L AS c#278L]
>+- Range (0, 200, step=1, splits=Some(2))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35030) ANSI SQL compliance

2021-08-11 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang reassigned SPARK-35030:
--

Assignee: Apache Spark

> ANSI SQL compliance
> ---
>
> Key: SPARK-35030
> URL: https://issues.apache.org/jira/browse/SPARK-35030
> Project: Spark
>  Issue Type: Epic
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.1, 3.2.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> Build an ANSI compliant dialect in Spark, for better data quality and easier 
> migration from traditional DBMS to Spark. For example, Spark will throw an 
> exception at runtime instead of returning null results when the inputs to a 
> SQL operator/function are invalid. 
> The new dialect is controlled by SQL Configuration `spark.sql.ansi.enabled`:
> {code:java}
> -- `spark.sql.ansi.enabled=true`
> SELECT 2147483647 + 1;
> java.lang.ArithmeticException: integer overflow
> -- `spark.sql.ansi.enabled=false`
> SELECT 2147483647 + 1;
> +----------------+
> |(2147483647 + 1)|
> +----------------+
> |     -2147483648|
> +----------------+
> {code}
> Full details of this dialect are documented in 
> [https://spark.apache.org/docs/latest/sql-ref-ansi-compliance.html|https://spark.apache.org/docs/latest/sql-ref-ansi-compliance.html].
> Note that some ANSI dialect features may not come from the ANSI SQL standard 
> directly, but their behaviors align with ANSI SQL's style.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
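
A minimal sketch of toggling the dialect at runtime from Scala, mirroring the SQL 
above; spark.sql.ansi.enabled is an ordinary runtime SQL config.

{code:java}
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT 2147483647 + 1").show()  // throws java.lang.ArithmeticException: integer overflow

spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT 2147483647 + 1").show()  // wraps around to -2147483648
{code}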



[jira] [Assigned] (SPARK-36418) Use CAST in parsing of dates/timestamps with default pattern

2021-08-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36418:


Assignee: (was: Apache Spark)

> Use CAST in parsing of dates/timestamps with default pattern
> 
>
> Key: SPARK-36418
> URL: https://issues.apache.org/jira/browse/SPARK-36418
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Priority: Major
>
> In functions, CSV/JSON datasources and other places, when the pattern is 
> default, use CAST logic in parsing strings to dates/timestamps.
> Currently, TimestampFormatter.getFormatter() applies the default pattern 
> *yyyy-MM-dd HH:mm:ss* when the pattern is not set, see 
> https://github.com/apache/spark/blob/f2492772baf1d00d802e704f84c22a9c410929e9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala#L344
>  . Instead of that, we need to create a special formatter which invokes the 
> cast logic.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
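
To illustrate the two code paths, a small Spark SQL sketch; today the second 
expression goes through TimestampFormatter with the default pattern, and the proposal 
is to route such default-pattern parsing through the CAST logic of the first.

{code:java}
-- Parsing via CAST logic
SELECT CAST('2021-08-11 12:34:56' AS TIMESTAMP);

-- to_timestamp with no explicit pattern currently falls back to the default pattern
SELECT to_timestamp('2021-08-11 12:34:56');
{code}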



[jira] [Assigned] (SPARK-36418) Use CAST in parsing of dates/timestamps with default pattern

2021-08-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36418:


Assignee: Apache Spark

> Use CAST in parsing of dates/timestamps with default pattern
> 
>
> Key: SPARK-36418
> URL: https://issues.apache.org/jira/browse/SPARK-36418
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> In functions, CSV/JSON datasources and other places, when the pattern is 
> default, use CAST logic in parsing strings to dates/timestamps.
> Currently, TimestampFormatter.getFormatter() applies the default pattern 
> *yyyy-MM-dd HH:mm:ss* when the pattern is not set, see 
> https://github.com/apache/spark/blob/f2492772baf1d00d802e704f84c22a9c410929e9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala#L344
>  . Instead of that, we need to create a special formatter which invokes the 
> cast logic.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36418) Use CAST in parsing of dates/timestamps with default pattern

2021-08-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397376#comment-17397376
 ] 

Apache Spark commented on SPARK-36418:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/33709

> Use CAST in parsing of dates/timestamps with default pattern
> 
>
> Key: SPARK-36418
> URL: https://issues.apache.org/jira/browse/SPARK-36418
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Priority: Major
>
> In functions, CSV/JSON datasources and other places, when the pattern is 
> default, use CAST logic in parsing strings to dates/timestamps.
> Currently, TimestampFormatter.getFormatter() applies the default pattern 
> *yyyy-MM-dd HH:mm:ss* when the pattern is not set, see 
> https://github.com/apache/spark/blob/f2492772baf1d00d802e704f84c22a9c410929e9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala#L344
>  . Instead of that, we need to create a special formatter which invokes the 
> cast logic.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36474) Mention pandas API on Spark in Spark overview pages

2021-08-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-36474:


Assignee: Hyukjin Kwon

> Mention pandas API on Spark in Spark overview pages
> ---
>
> Key: SPARK-36474
> URL: https://issues.apache.org/jira/browse/SPARK-36474
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Trivial
>
> We can mention it in https://spark.apache.org/docs/latest/index.html as an 
> example.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36474) Mention pandas API on Spark in Spark overview pages

2021-08-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-36474.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 33699
[https://github.com/apache/spark/pull/33699]

> Mention pandas API on Spark in Spark overview pages
> ---
>
> Key: SPARK-36474
> URL: https://issues.apache.org/jira/browse/SPARK-36474
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Trivial
> Fix For: 3.3.0
>
>
> We can mention it in https://spark.apache.org/docs/latest/index.html as an 
> example.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36378) Minor changes to address a few identified server side inefficiencies

2021-08-11 Thread wuyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuyi resolved SPARK-36378.
--
Fix Version/s: 3.3.0
   3.2.0
 Assignee: Min Shen
   Resolution: Fixed

Issue resolved by https://github.com/apache/spark/pull/33613

> Minor changes to address a few identified server side inefficiencies
> 
>
> Key: SPARK-36378
> URL: https://issues.apache.org/jira/browse/SPARK-36378
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.2.0
>Reporter: Min Shen
>Assignee: Min Shen
>Priority: Major
> Fix For: 3.2.0, 3.3.0
>
>
> With the SPIP ticket close to being finished, we have done some performance 
> evaluations to compare the performance of push-based shuffle in upstream 
> Spark with the production version we have internally at LinkedIn.
> The evaluations have revealed a few regressions and also some additional perf 
> improvement opportunities.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36480) SessionWindowStateStoreSaveExec should not filter input rows against watermark

2021-08-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397336#comment-17397336
 ] 

Apache Spark commented on SPARK-36480:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/33708

> SessionWindowStateStoreSaveExec should not filter input rows against watermark
> --
>
> Key: SPARK-36480
> URL: https://issues.apache.org/jira/browse/SPARK-36480
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: Jungtaek Lim
>Priority: Critical
>
> SessionWindowStateStoreSaveExec receives all sessions, including existing 
> ones, as input rows and stores them as they are. That is, we should not 
> filter input rows against the watermark before storing them into the state 
> store, yet we currently do. 
> Fortunately this hasn't shown any actual problem, due to the way we deal 
> with the watermark in each micro-batch, and it seems hard to construct a 
> broken case, but it is better to fix this before someone manages to hit the 
> possible edge case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36480) SessionWindowStateStoreSaveExec should not filter input rows against watermark

2021-08-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36480:


Assignee: Apache Spark

> SessionWindowStateStoreSaveExec should not filter input rows against watermark
> --
>
> Key: SPARK-36480
> URL: https://issues.apache.org/jira/browse/SPARK-36480
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: Jungtaek Lim
>Assignee: Apache Spark
>Priority: Critical
>
> SessionWindowStateStoreSaveExec receives all sessions, including existing 
> ones, as input rows and stores them as they are. That is, we should not 
> filter input rows against the watermark before storing them into the state 
> store, yet we currently do. 
> Fortunately this hasn't shown any actual problem, due to the way we deal 
> with the watermark in each micro-batch, and it seems hard to construct a 
> broken case, but it is better to fix this before someone manages to hit the 
> possible edge case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36480) SessionWindowStateStoreSaveExec should not filter input rows against watermark

2021-08-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36480:


Assignee: (was: Apache Spark)

> SessionWindowStateStoreSaveExec should not filter input rows against watermark
> --
>
> Key: SPARK-36480
> URL: https://issues.apache.org/jira/browse/SPARK-36480
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: Jungtaek Lim
>Priority: Critical
>
> SessionWindowStateStoreSaveExec receives all sessions, including existing 
> ones, as input rows and stores them as they are. That is, we should not 
> filter input rows against the watermark before storing them into the state 
> store, yet we currently do. 
> Fortunately this hasn't shown any actual problem, due to the way we deal 
> with the watermark in each micro-batch, and it seems hard to construct a 
> broken case, but it is better to fix this before someone manages to hit the 
> possible edge case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36480) SessionWindowStateStoreSaveExec should not filter input rows against watermark

2021-08-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397334#comment-17397334
 ] 

Apache Spark commented on SPARK-36480:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/33708

> SessionWindowStateStoreSaveExec should not filter input rows against watermark
> --
>
> Key: SPARK-36480
> URL: https://issues.apache.org/jira/browse/SPARK-36480
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: Jungtaek Lim
>Priority: Critical
>
> SessionWindowStateStoreSaveExec receives all sessions, including existing 
> ones, as input rows and stores them as they are. That is, we should not 
> filter input rows against the watermark before storing them into the state 
> store, yet we currently do. 
> Fortunately this hasn't shown any actual problem, due to the way we deal 
> with the watermark in each micro-batch, and it seems hard to construct a 
> broken case, but it is better to fix this before someone manages to hit the 
> possible edge case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36480) SessionWindowStateStoreSaveExec should not filter input rows against watermark

2021-08-11 Thread Jungtaek Lim (Jira)
Jungtaek Lim created SPARK-36480:


 Summary: SessionWindowStateStoreSaveExec should not filter input 
rows against watermark
 Key: SPARK-36480
 URL: https://issues.apache.org/jira/browse/SPARK-36480
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 3.2.0
Reporter: Jungtaek Lim


SessionWindowStateStoreSaveExec receives all sessions, including existing 
sessions, as input rows and stores them as they are. That is, we should not 
filter input rows against the watermark before storing them into the state 
store, yet we currently do. 

Fortunately this hasn't shown any actual problem, due to the way we deal with 
the watermark in each micro-batch, and it seems hard to construct a broken 
case, but it is better to fix this before someone manages to hit the possible 
edge case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36132) Support initial state for flatMapGroupsWithState in batch mode

2021-08-11 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-36132.

Resolution: Done

This issue is resolved in https://github.com/apache/spark/pull/6

> Support initial state for flatMapGroupsWithState in batch mode
> --
>
> Key: SPARK-36132
> URL: https://issues.apache.org/jira/browse/SPARK-36132
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: Rahul Shivu Mahadev
>Assignee: Rahul Shivu Mahadev
>Priority: Major
>
> SPARK-35897 added support for an initial state with flatMapGroupsWithState. 
> However, this lacked support for an initial state in the batch mode of 
> flatMapGroupsWithState. This Jira adds that support for batch mode.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
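
A minimal sketch of the feature in batch mode, assuming the Spark 3.2-era overload of 
flatMapGroupsWithState that takes an initialState KeyValueGroupedDataset (added by 
SPARK-35897); the names and sample values here are illustrative.

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Batch input and a pre-seeded state for key "a".
val events = Seq(("a", 1), ("b", 2), ("a", 3)).toDS().groupByKey(_._1)
val initialState = Seq(("a", 100)).toDS().groupByKey(_._1).mapValues(_._2)

def runningSum(key: String, rows: Iterator[(String, Int)],
               state: GroupState[Int]): Iterator[(String, Int)] = {
  val total = state.getOption.getOrElse(0) + rows.map(_._2).sum
  state.update(total)
  Iterator((key, total))
}

// In batch mode each key's function is invoked once, seeded from initialState,
// so this yields ("a", 104) and ("b", 2).
events
  .flatMapGroupsWithState(OutputMode.Update, GroupStateTimeout.NoTimeout, initialState)(runningSum)
  .show()
{code}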



[jira] [Assigned] (SPARK-36132) Support initial state for flatMapGroupsWithState in batch mode

2021-08-11 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang reassigned SPARK-36132:
--

Assignee: Rahul Shivu Mahadev

> Support initial state for flatMapGroupsWithState in batch mode
> --
>
> Key: SPARK-36132
> URL: https://issues.apache.org/jira/browse/SPARK-36132
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: Rahul Shivu Mahadev
>Assignee: Rahul Shivu Mahadev
>Priority: Major
>
> SPARK-35897 added support for an initial state with flatMapGroupsWithState. 
> However, it lacked support for an initial state in the batch mode of 
> flatMapGroupsWithState. This Jira adds that support for batch mode.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data

2021-08-11 Thread Cameron Todd (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397301#comment-17397301
 ] 

Cameron Todd commented on SPARK-18105:
--

OK, I added the zip file to this public storage bucket; it holds a Parquet 
file with Snappy compression and is 1.75 GB. You can retrieve the data as 
follows:
{code:bash}
wget https://storage.gra.cloud.ovh.net/v1/AUTH_147b880980f148f5ad1af09542e6f37a/public_data/SPARK18105/hashed_data.zip
{code}
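
A possible repro sketch with that data; the local path and the shuffle shape 
are my assumptions, not part of the report:
{code:scala}
// Force a wide shuffle with the LZ4 codec, the code path that fails with
// "Stream is corrupted". Assumes hashed_data.zip was unzipped to /tmp/hashed_data.
val spark = org.apache.spark.sql.SparkSession.builder()
  .config("spark.io.compression.codec", "lz4")
  .getOrCreate()

val df = spark.read.parquet("/tmp/hashed_data")
df.repartition(400).count()  // round-robin repartition shuffles every row
{code}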

> LZ4 failed to decompress a stream of shuffled data
> --
>
> Key: SPARK-18105
> URL: https://issues.apache.org/jira/browse/SPARK-18105
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1, 3.1.1
>Reporter: Davies Liu
>Priority: Major
> Attachments: TestWeightedGraph.java
>
>
> When LZ4 is used to compress the shuffle files, decompression may fail with 
> "Stream is corrupted":
> {code}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 92 in stage 5.0 failed 4 times, most recent failure: Lost task 92.3 in 
> stage 5.0 (TID 16616, 10.0.27.18): java.io.IOException: Stream is corrupted
>   at 
> org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:220)
>   at 
> org.apache.spark.io.LZ4BlockInputStream.available(LZ4BlockInputStream.java:109)
>   at java.io.BufferedInputStream.read(BufferedInputStream.java:353)
>   at java.io.DataInputStream.read(DataInputStream.java:149)
>   at com.google.common.io.ByteStreams.read(ByteStreams.java:828)
>   at com.google.common.io.ByteStreams.readFully(ByteStreams.java:695)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:127)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:110)
>   at scala.collection.Iterator$$anon$13.next(Iterator.scala:372)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
>   at 
> org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:397)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> https://github.com/jpountz/lz4-java/issues/89



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36413) Spark and Hive are inconsistent in inferring the expression type in the view, resulting in AnalysisException

2021-08-11 Thread bingfeng.guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

bingfeng.guo updated SPARK-36413:
-
Priority: Blocker  (was: Critical)

> Spark and Hive are inconsistent in inferring the expression type in the view, 
> resulting in AnalysisException
> 
>
> Key: SPARK-36413
> URL: https://issues.apache.org/jira/browse/SPARK-36413
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.1, 3.1.1
>Reporter: bingfeng.guo
>Priority: Blocker
>
> The steps to reproduce are as follows:
>  
> Suppose I have a Hive table:
> {quote}> desc sales;
>  id              bigint
> name       varchar(4096)
> price        decimal(19,4)
> {quote}
> and I create a View:
> {quote}CREATE VIEW `sales_view_bf` AS select `sales`.`id`,
>  case when `sales`.`name` = 'abc' then `sales`.`price`*1.1 else 0 end as 
> `price_new`
>  from `SALES`;
> > desc sales_view_bf;
>  id              bigint
> price        double
> {quote}
> then I run the following SQL on Hive with Spark:
> {quote}select PRICE_NEW from SALES_VIEW_BF;
> {quote}
> it throws:
> {quote}Caused by: org.apache.spark.sql.AnalysisException: Cannot up cast 
> `price_new` from decimal(22,5) to double as it may truncate; at 
> org.apache.spark.sql.catalyst.analysis.AliasViewChild$$anonfun$apply$1$$anonfun$2.apply(view.scala:78)
>  at 
> org.apache.spark.sql.catalyst.analysis.AliasViewChild$$anonfun$apply$1$$anonfun$2.apply(view.scala:72)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at scala.collection.immutable.List.foreach(List.scala:392) at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at 
> scala.collection.immutable.List.map(List.scala:296) at 
> org.apache.spark.sql.catalyst.analysis.AliasViewChild$$anonfun$apply$1.applyOrElse(view.scala:72)
>  at 
> org.apache.spark.sql.catalyst.analysis.AliasViewChild$$anonfun$apply$1.applyOrElse(view.scala:51)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:89)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:86)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperatorsUp(AnalysisHelper.scala:86)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsUp(LogicalPlan.scala:29)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$1.apply(AnalysisHelper.scala:87)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$1.apply(AnalysisHelper.scala:87)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324) 
> at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:87)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:86)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperatorsUp(AnalysisHelper.scala:86)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsUp(LogicalPlan.scala:29)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$1.apply(AnalysisHelper.scala:87)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$1.apply(AnalysisHelper.scala:87)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326)
>  at 
> 
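
A hedged workaround sketch, assuming the view can be redefined: pinning the 
expression type with an explicit CAST keeps Hive's recorded column type and 
Spark's inferred type in agreement:
{code:scala}
// Recreate the view so price_new is declared DOUBLE on both sides, instead
// of letting Spark infer decimal(22,5) against Hive's recorded double.
spark.sql("""
  CREATE OR REPLACE VIEW sales_view_bf AS
  SELECT id,
         CAST(CASE WHEN name = 'abc' THEN price * 1.1 ELSE 0 END AS DOUBLE) AS price_new
  FROM sales
""")
{code}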

[jira] [Commented] (SPARK-36478) Removes outer join if all grouping and aggregate expressions are from the streamed side

2021-08-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397279#comment-17397279
 ] 

Apache Spark commented on SPARK-36478:
--

User 'wankunde' has created a pull request for this issue:
https://github.com/apache/spark/pull/33702

> Removes outer join if all grouping and aggregate expressions are from the 
> streamed side
> ---
>
> Key: SPARK-36478
> URL: https://issues.apache.org/jira/browse/SPARK-36478
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wan Kun
>Priority: Minor
>
> Removes outer join if all grouping and aggregate expressions are from the 
> streamed side.
> For example:
> {code:java}
> spark.range(200L).selectExpr("id AS a").createTempView("t1")
> spark.range(300L).selectExpr("id AS b").createTempView("t2")
> spark.sql("SELECT DISTINCT a FROM t1 LEFT JOIN t2 ON a = 
> b").explain(true){code}
> Current optimized plan:
> {code:java}
> == Optimized Logical Plan ==
> Aggregate [b#3L], [b#3L, max(c#4L) AS c#20L]
> +- Project [b#3L, c#4L]
>+- Join LeftOuter, (a#2L = a#10L)
>   :- Project [id#0L AS a#2L, id#0L AS b#3L, id#0L AS c#4L]
>   :  +- Range (0, 200, step=1, splits=Some(1))
>   +- Project [id#8L AS a#10L]
>  +- Range (0, 300, step=1, splits=Some(1))
> {code}
> Expected optimized plan:
> {code:java}
> == Optimized Logical Plan ==
> Aggregate [b#277L], [b#277L, max(c#278L) AS c#290L]
> +- Project [id#274L AS b#277L, id#274L AS c#278L]
>+- Range (0, 200, step=1, splits=Some(2))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36479) improve datetime test coverage in SQL files

2021-08-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397278#comment-17397278
 ] 

Apache Spark commented on SPARK-36479:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/33707

> improve datetime test coverage in SQL files
> ---
>
> Key: SPARK-36479
> URL: https://issues.apache.org/jira/browse/SPARK-36479
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36479) improve datetime test coverage in SQL files

2021-08-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36479:


Assignee: Apache Spark

> improve datetime test coverage in SQL files
> ---
>
> Key: SPARK-36479
> URL: https://issues.apache.org/jira/browse/SPARK-36479
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36478) Removes outer join if all grouping and aggregate expressions are from the streamed side

2021-08-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397276#comment-17397276
 ] 

Apache Spark commented on SPARK-36478:
--

User 'wankunde' has created a pull request for this issue:
https://github.com/apache/spark/pull/33702

> Removes outer join if all grouping and aggregate expressions are from the 
> streamed side
> ---
>
> Key: SPARK-36478
> URL: https://issues.apache.org/jira/browse/SPARK-36478
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wan Kun
>Priority: Minor
>
> Removes outer join if all grouping and aggregate expressions are from the 
> streamed side.
> For example:
> {code:java}
> spark.range(200L).selectExpr("id AS a").createTempView("t1")
> spark.range(300L).selectExpr("id AS b").createTempView("t2")
> spark.sql("SELECT DISTINCT a FROM t1 LEFT JOIN t2 ON a = 
> b").explain(true){code}
> Current optimized plan:
> {code:java}
> == Optimized Logical Plan ==
> Aggregate [b#3L], [b#3L, max(c#4L) AS c#20L]
> +- Project [b#3L, c#4L]
>+- Join LeftOuter, (a#2L = a#10L)
>   :- Project [id#0L AS a#2L, id#0L AS b#3L, id#0L AS c#4L]
>   :  +- Range (0, 200, step=1, splits=Some(1))
>   +- Project [id#8L AS a#10L]
>  +- Range (0, 300, step=1, splits=Some(1))
> {code}
> Expected optimized plan:
> {code:java}
> == Optimized Logical Plan ==
> Aggregate [b#277L], [b#277L, max(c#278L) AS c#290L]
> +- Project [id#274L AS b#277L, id#274L AS c#278L]
>+- Range (0, 200, step=1, splits=Some(2))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36478) Removes outer join if all grouping and aggregate expressions are from the streamed side

2021-08-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36478:


Assignee: Apache Spark

> Removes outer join if all grouping and aggregate expressions are from the 
> streamed side
> ---
>
> Key: SPARK-36478
> URL: https://issues.apache.org/jira/browse/SPARK-36478
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wan Kun
>Assignee: Apache Spark
>Priority: Minor
>
> Removes outer join if all grouping and aggregate expressions are from the 
> streamed side.
> For example:
> {code:java}
> spark.range(200L).selectExpr("id AS a").createTempView("t1")
> spark.range(300L).selectExpr("id AS b").createTempView("t2")
> spark.sql("SELECT DISTINCT a FROM t1 LEFT JOIN t2 ON a = 
> b").explain(true){code}
> Current optimized plan:
> {code:java}
> == Optimized Logical Plan ==
> Aggregate [b#3L], [b#3L, max(c#4L) AS c#20L]
> +- Project [b#3L, c#4L]
>+- Join LeftOuter, (a#2L = a#10L)
>   :- Project [id#0L AS a#2L, id#0L AS b#3L, id#0L AS c#4L]
>   :  +- Range (0, 200, step=1, splits=Some(1))
>   +- Project [id#8L AS a#10L]
>  +- Range (0, 300, step=1, splits=Some(1))
> {code}
> Expected optimized plan:
> {code:java}
> == Optimized Logical Plan ==
> Aggregate [b#277L], [b#277L, max(c#278L) AS c#290L]
> +- Project [id#274L AS b#277L, id#274L AS c#278L]
>+- Range (0, 200, step=1, splits=Some(2))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36479) improve datetime test coverage in SQL files

2021-08-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36479:


Assignee: (was: Apache Spark)

> improve datetime test coverage in SQL files
> ---
>
> Key: SPARK-36479
> URL: https://issues.apache.org/jira/browse/SPARK-36479
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36478) Removes outer join if all grouping and aggregate expressions are from the streamed side

2021-08-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36478:


Assignee: (was: Apache Spark)

> Removes outer join if all grouping and aggregate expressions are from the 
> streamed side
> ---
>
> Key: SPARK-36478
> URL: https://issues.apache.org/jira/browse/SPARK-36478
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wan Kun
>Priority: Minor
>
> Removes outer join if all grouping and aggregate expressions are from the 
> streamed side.
> For example:
> {code:java}
> spark.range(200L).selectExpr("id AS a").createTempView("t1")
> spark.range(300L).selectExpr("id AS b").createTempView("t2")
> spark.sql("SELECT DISTINCT a FROM t1 LEFT JOIN t2 ON a = 
> b").explain(true){code}
> Current optimized plan:
> {code:java}
> == Optimized Logical Plan ==
> Aggregate [b#3L], [b#3L, max(c#4L) AS c#20L]
> +- Project [b#3L, c#4L]
>+- Join LeftOuter, (a#2L = a#10L)
>   :- Project [id#0L AS a#2L, id#0L AS b#3L, id#0L AS c#4L]
>   :  +- Range (0, 200, step=1, splits=Some(1))
>   +- Project [id#8L AS a#10L]
>  +- Range (0, 300, step=1, splits=Some(1))
> {code}
> Expected optimized plan:
> {code:java}
> == Optimized Logical Plan ==
> Aggregate [b#277L], [b#277L, max(c#278L) AS c#290L]
> +- Project [id#274L AS b#277L, id#274L AS c#278L]
>+- Range (0, 200, step=1, splits=Some(2))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36479) improve datetime test coverage in SQL files

2021-08-11 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-36479:
---

 Summary: improve datetime test coverage in SQL files
 Key: SPARK-36479
 URL: https://issues.apache.org/jira/browse/SPARK-36479
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 3.2.0
Reporter: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36478) Removes outer join if all grouping and aggregate expressions are from the streamed side

2021-08-11 Thread Wan Kun (Jira)
Wan Kun created SPARK-36478:
---

 Summary: Removes outer join if all grouping and aggregate 
expressions are from the streamed side
 Key: SPARK-36478
 URL: https://issues.apache.org/jira/browse/SPARK-36478
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Wan Kun


Removes outer join if all grouping and aggregate expressions are from the 
streamed side.

For example:
{code:java}
spark.range(200L).selectExpr("id AS a").createTempView("t1")
spark.range(300L).selectExpr("id AS b").createTempView("t2")
spark.sql("SELECT DISTINCT a FROM t1 LEFT JOIN t2 ON a = b").explain(true){code}
Current optimized plan:
{code:java}
== Optimized Logical Plan ==
Aggregate [b#3L], [b#3L, max(c#4L) AS c#20L]
+- Project [b#3L, c#4L]
   +- Join LeftOuter, (a#2L = a#10L)
  :- Project [id#0L AS a#2L, id#0L AS b#3L, id#0L AS c#4L]
  :  +- Range (0, 200, step=1, splits=Some(1))
  +- Project [id#8L AS a#10L]
 +- Range (0, 300, step=1, splits=Some(1))
{code}
Expected optimized plan:
{code:java}
== Optimized Logical Plan ==
Aggregate [b#277L], [b#277L, max(c#278L) AS c#290L]
+- Project [id#274L AS b#277L, id#274L AS c#278L]
   +- Range (0, 200, step=1, splits=Some(2))
{code}
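
Note the plans above reference b and max(c) rather than DISTINCT a, so they 
presumably come from a grouped-aggregate variant of the example. A hedged 
reconstruction of such a query, with table shapes inferred from the plan:
{code:scala}
// t1 exposes a, b, c (all derived from id), matching the plan's
// Project [id#0L AS a#2L, id#0L AS b#3L, id#0L AS c#4L].
spark.range(200L).selectExpr("id AS a", "id AS b", "id AS c").createTempView("t1")
spark.range(300L).selectExpr("id AS a").createTempView("t2")
spark.sql(
  "SELECT b, max(c) AS c FROM t1 LEFT JOIN t2 ON t1.a = t2.a GROUP BY b"
).explain(true)
{code}
The rewrite is safe because a left outer join never drops streamed-side rows, 
and max is insensitive to the duplicates the join could introduce.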



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36468) Update docs about ANSI interval literals

2021-08-11 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-36468.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 33693
[https://github.com/apache/spark/pull/33693]

> Update docs about ANSI interval literals
> 
>
> Key: SPARK-36468
> URL: https://issues.apache.org/jira/browse/SPARK-36468
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.2.0
>
>
> Update the page 
> https://spark.apache.org/docs/latest/sql-ref-literals.html#interval-literal
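
For reference, examples of the ANSI interval literal forms that page covers 
(values are illustrative; assumes a SparkSession named spark):
{code:scala}
spark.sql("SELECT INTERVAL '2-11' YEAR TO MONTH").show()
spark.sql("SELECT INTERVAL '10 09:08:07.123' DAY TO SECOND").show()
{code}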



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36472) Improve SQL syntax for MERGE

2021-08-11 Thread Denis Krivenko (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Krivenko updated SPARK-36472:
---
Description: 
Existing SQL syntax for *MERGE* (see Delta Lake examples 
[here|https://docs.delta.io/latest/delta-update.html#upsert-into-a-table-using-merge]
 and 
[here|https://docs.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/language-manual/delta-merge-into])
 could be improved by adding an alternative for {{}}

*Main assumption*
 In common cases target and source tables have the same column names used in 
{{}} as merge keys, for example:
{code:sql}
ON target.key1 = source.key1 AND target.key2 = source.key2{code}
It would be more convenient to use a syntax similar to:
{code:sql}
ON COLUMNS (key1, key2)
-- or
ON MATCHING (key1, key2)
{code}
The same approach is used for 
[JOIN|https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-join.html] 
where {{join_criteria}} syntax is
{code:sql}
ON boolean_expression | USING ( column_name [ , ... ] )
{code}
*Improvement proposal*
 Syntax
{code:sql}
MERGE INTO target_table_identifier [AS target_alias]
USING source_table_identifier [] [AS source_alias]
ON {  | COLUMNS ( column_name [ , ... ] ) }
[ WHEN MATCHED [ AND  ] THEN  ]
[ WHEN MATCHED [ AND  ] THEN  ]
[ WHEN NOT MATCHED [ AND  ]  THEN  ]
{code}
Example
{code:sql}
MERGE INTO target
USING source
ON COLUMNS (key1, key2)
WHEN MATCHED THEN
UPDATE SET *
WHEN NOT MATCHED THEN
INSERT *
{code}

  was:
Existing SQL syntax for *MERGE* (see Delta Lake examples 
[here|https://docs.delta.io/latest/delta-update.html#upsert-into-a-table-using-merge]
 and 
[here|https://docs.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/language-manual/delta-merge-into])
 could be improved by adding an alternative for {{}}

*Main assumption*
 In common cases target and source tables have the same column names used in 
{{}} as merge keys, for example:
{code:sql}
ON target.key1 = source.key1 AND target.key2 = source.key2{code}
It would be more convenient to use a syntax similar to:
{code:sql}
ON COLUMNS (key1, key2)
{code}
The same approach is used for 
[JOIN|https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-join.html] 
where {{join_criteria}} syntax is
{code:sql}
ON boolean_expression | USING ( column_name [ , ... ] )
{code}
*Improvement proposal*
 Syntax
{code:sql}
MERGE INTO target_table_identifier [AS target_alias]
USING source_table_identifier [] [AS source_alias]
ON {  | COLUMNS ( column_name [ , ... ] ) }
[ WHEN MATCHED [ AND  ] THEN  ]
[ WHEN MATCHED [ AND  ] THEN  ]
[ WHEN NOT MATCHED [ AND  ]  THEN  ]
{code}
Example
{code:sql}
MERGE INTO target
USING source
ON COLUMNS (key1, key2)
WHEN MATCHED THEN
UPDATE SET *
WHEN NOT MATCHED THEN
INSERT *
{code}


> Improve SQL syntax for MERGE
> 
>
> Key: SPARK-36472
> URL: https://issues.apache.org/jira/browse/SPARK-36472
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Denis Krivenko
>Priority: Trivial
>
> Existing SQL syntax for *MERGE* (see Delta Lake examples 
> [here|https://docs.delta.io/latest/delta-update.html#upsert-into-a-table-using-merge]
>  and 
> [here|https://docs.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/language-manual/delta-merge-into])
>  could be improved by adding an alternative for {{}}
> *Main assumption*
>  In common cases target and source tables have the same column names used in 
> {{}} as merge keys, for example:
> {code:sql}
> ON target.key1 = source.key1 AND target.key2 = source.key2{code}
> It would be more convenient to use a syntax similar to:
> {code:sql}
> ON COLUMNS (key1, key2)
> -- or
> ON MATCHING (key1, key2)
> {code}
> The same approach is used for 
> [JOIN|https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-join.html]
>  where {{join_criteria}} syntax is
> {code:sql}
> ON boolean_expression | USING ( column_name [ , ... ] )
> {code}
> *Improvement proposal*
>  Syntax
> {code:sql}
> MERGE INTO target_table_identifier [AS target_alias]
> USING source_table_identifier [] [AS source_alias]
> ON {  | COLUMNS ( column_name [ , ... ] ) }
> [ WHEN MATCHED [ AND  ] THEN  ]
> [ WHEN MATCHED [ AND  ] THEN  ]
> [ WHEN NOT MATCHED [ AND  ]  THEN  ]
> {code}
> Example
> {code:sql}
> MERGE INTO target
> USING source
> ON COLUMNS (key1, key2)
> WHEN MATCHED THEN
> UPDATE SET *
> WHEN NOT MATCHED THEN
> INSERT *
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36353) RemoveNoopOperators should keep output schema

2021-08-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397236#comment-17397236
 ] 

Apache Spark commented on SPARK-36353:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/33705

> RemoveNoopOperators should keep output schema
> -
>
> Key: SPARK-36353
> URL: https://issues.apache.org/jira/browse/SPARK-36353
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: image-2021-07-30-17-46-59-196.png
>
>
> !image-2021-07-30-17-46-59-196.png|width=539,height=220!
> [https://github.com/apache/spark/pull/33587]
>  
> Only first level?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36353) RemoveNoopOperators should keep output schema

2021-08-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397238#comment-17397238
 ] 

Apache Spark commented on SPARK-36353:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/33704

> RemoveNoopOperators should keep output schema
> -
>
> Key: SPARK-36353
> URL: https://issues.apache.org/jira/browse/SPARK-36353
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: image-2021-07-30-17-46-59-196.png
>
>
> !image-2021-07-30-17-46-59-196.png|width=539,height=220!
> [https://github.com/apache/spark/pull/33587]
>  
> Only first level?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36477) Inferring schema from JSON file shall respect ignoreCorruptFiles and handle IOE

2021-08-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36477:


Assignee: Apache Spark

> Inferring schema from JSON file shall respect ignoreCorruptFiles and handle 
> IOE 
> ---
>
> Key: SPARK-36477
> URL: https://issues.apache.org/jira/browse/SPARK-36477
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Kent Yao
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36477) Inferring schema from JSON file shall respect ignoreCorruptFiles and handle IOE

2021-08-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397237#comment-17397237
 ] 

Apache Spark commented on SPARK-36477:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/33706

> Inferring schema from JSON file shall respect ignoreCorruptFiles and handle 
> IOE 
> ---
>
> Key: SPARK-36477
> URL: https://issues.apache.org/jira/browse/SPARK-36477
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Kent Yao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36477) Inferring schema from JSON file shall respect ignoreCorruptFiles and handle IOE

2021-08-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36477:


Assignee: (was: Apache Spark)

> Inferring schema from JSON file shall respect ignoreCorruptFiles and handle 
> IOE 
> ---
>
> Key: SPARK-36477
> URL: https://issues.apache.org/jira/browse/SPARK-36477
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Kent Yao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36477) Inferring schema from JSON file shall respect ignoreCorruptFiles and handle IOE

2021-08-11 Thread Kent Yao (Jira)
Kent Yao created SPARK-36477:


 Summary: Inferring schema from JSON file shall respect 
ignoreCorruptFiles and handle IOE 
 Key: SPARK-36477
 URL: https://issues.apache.org/jira/browse/SPARK-36477
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.2
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36352) Spark should check result plan's output schema name

2021-08-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397206#comment-17397206
 ] 

Apache Spark commented on SPARK-36352:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/33703

> Spark should check result plan's output schema name
> ---
>
> Key: SPARK-36352
> URL: https://issues.apache.org/jira/browse/SPARK-36352
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.2.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36352) Spark should check result plan's output schema name

2021-08-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397205#comment-17397205
 ] 

Apache Spark commented on SPARK-36352:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/33703

> Spark should check result plan's output schema name
> ---
>
> Key: SPARK-36352
> URL: https://issues.apache.org/jira/browse/SPARK-36352
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.2.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36475) Add doc about spark.shuffle.service.fetch.rdd.enabled

2021-08-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397152#comment-17397152
 ] 

Apache Spark commented on SPARK-36475:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/33700

> Add doc about spark.shuffle.service.fetch.rdd.enabled
> -
>
> Key: SPARK-36475
> URL: https://issues.apache.org/jira/browse/SPARK-36475
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> Add doc about spark.shuffle.service.fetch.rdd.enabled
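
For reference, a hedged illustration of the setting such a doc section would 
cover (both config keys exist today; the summary of the effect is my 
paraphrase):
{code:scala}
// With the external shuffle service enabled, this flag additionally lets the
// service serve persisted RDD blocks, so executors holding only cached data
// can be released under dynamic allocation.
val spark = org.apache.spark.sql.SparkSession.builder()
  .config("spark.shuffle.service.enabled", "true")
  .config("spark.shuffle.service.fetch.rdd.enabled", "true")
  .getOrCreate()
{code}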



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36475) Add doc about spark.shuffle.service.fetch.rdd.enabled

2021-08-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397151#comment-17397151
 ] 

Apache Spark commented on SPARK-36475:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/33700

> Add doc about spark.shuffle.service.fetch.rdd.enabled
> -
>
> Key: SPARK-36475
> URL: https://issues.apache.org/jira/browse/SPARK-36475
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> Add doc about spark.shuffle.service.fetch.rdd.enabled



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36475) Add doc about spark.shuffle.service.fetch.rdd.enabled

2021-08-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36475:


Assignee: (was: Apache Spark)

> Add doc about spark.shuffle.service.fetch.rdd.enabled
> -
>
> Key: SPARK-36475
> URL: https://issues.apache.org/jira/browse/SPARK-36475
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> Add doc about spark.shuffle.service.fetch.rdd.enabled



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36475) Add doc about spark.shuffle.service.fetch.rdd.enabled

2021-08-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36475:


Assignee: Apache Spark

> Add doc about spark.shuffle.service.fetch.rdd.enabled
> -
>
> Key: SPARK-36475
> URL: https://issues.apache.org/jira/browse/SPARK-36475
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: Apache Spark
>Priority: Major
>
> Add doc about spark.shuffle.service.fetch.rdd.enabled



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34808) Removes outer join if it only has distinct on streamed side

2021-08-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397148#comment-17397148
 ] 

Apache Spark commented on SPARK-34808:
--

User 'wankunde' has created a pull request for this issue:
https://github.com/apache/spark/pull/33702

> Removes outer join if it only has distinct on streamed side
> ---
>
> Key: SPARK-34808
> URL: https://issues.apache.org/jira/browse/SPARK-34808
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> For example:
> {code:scala}
> spark.range(200L).selectExpr("id AS a").createTempView("t1")
> spark.range(300L).selectExpr("id AS b").createTempView("t2")
> spark.sql("SELECT DISTINCT a FROM t1 LEFT JOIN t2 ON a = b").explain(true)
> {code}
> Current optimized plan:
> {noformat}
> == Optimized Logical Plan ==
> Aggregate [a#2L], [a#2L]
> +- Project [a#2L]
>+- Join LeftOuter, (a#2L = b#6L)
>   :- Project [id#0L AS a#2L]
>   :  +- Range (0, 200, step=1, splits=Some(2))
>   +- Project [id#4L AS b#6L]
>  +- Range (0, 300, step=1, splits=Some(2))
> {noformat}
> Expected optimized plan:
> {noformat}
> == Optimized Logical Plan ==
> Aggregate [a#2L], [a#2L]
> +- Project [id#0L AS a#2L]
>+- Range (0, 200, step=1, splits=Some(2))
> {noformat}
>  
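
A hedged check of the equivalence this rule exploits, reusing the example 
above: DISTINCT over streamed-side columns is unchanged by dropping the left 
outer join:
{code:scala}
spark.range(200L).selectExpr("id AS a").createTempView("t1")
spark.range(300L).selectExpr("id AS b").createTempView("t2")

val withJoin    = spark.sql("SELECT DISTINCT a FROM t1 LEFT JOIN t2 ON a = b")
val withoutJoin = spark.sql("SELECT DISTINCT a FROM t1")

// Same result set: the outer join keeps every t1 row and DISTINCT collapses
// any duplicates the join could add.
assert(withJoin.except(withoutJoin).isEmpty && withoutJoin.except(withJoin).isEmpty)
{code}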



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


