[GitHub] [spark] AngersZhuuuu commented on pull request #29692: [SPARK-32830][SQL] Optimize Skewed BroadcastNestedLoopJoin with AQE

2020-09-19 Thread GitBox
AngersZh commented on pull request #29692: URL: https://github.com/apache/spark/pull/29692#issuecomment-695464072 > Then we need some estimation work, as the shuffle/scan node may be far away from the join node. We also need to carefully justify whether the extra shuffle cost is worth the

[GitHub] [spark] AngersZhuuuu commented on pull request #29692: [SPARK-32830][SQL] Optimize Skewed BroadcastNestedLoopJoin with AQE

2020-09-12 Thread GitBox
AngersZh commented on pull request #29692: URL: https://github.com/apache/spark/pull/29692#issuecomment-690873256

[GitHub] [spark] AngersZhuuuu commented on pull request #29692: [SPARK-32830][SQL] Optimize Skewed BroadcastNestedLoopJoin with AQE

2020-09-11 Thread GitBox
AngersZh commented on pull request #29692: URL: https://github.com/apache/spark/pull/29692#issuecomment-690936960 > Spark query optimizer should not add the extra shuffle by itself, as it's likely to cause perf regression. With this rule, we can't handle such data skew case

[GitHub] [spark] AngersZhuuuu commented on pull request #29692: [SPARK-32830][SQL] Optimize Skewed BroadcastNestedLoopJoin with AQE

2020-09-10 Thread GitBox
AngersZh commented on pull request #29692: URL: https://github.com/apache/spark/pull/29692#issuecomment-690873256 > > after stream side executed, we will get the raw count of each partition and judge if it's skewed seriously, if skewed seriously and volume is large, repartition stream

[GitHub] [spark] AngersZhuuuu commented on pull request #29692: [SPARK-32830][SQL] Optimize Skewed BroadcastNestedLoopJoin with AQE

2020-09-10 Thread GitBox
AngersZh commented on pull request #29692: URL: https://github.com/apache/spark/pull/29692#issuecomment-690369334 > What's the high-level idea? We can handle skew SMJ because there is a shuffle and we can split the partition with the granularity of shuffle blocks. Broadcast join

[GitHub] [spark] AngersZhuuuu commented on pull request #29692: [SPARK-32830][SQL] Optimize Skewed BroadcastNestedLoopJoin with AQE

2020-09-10 Thread GitBox
AngersZh commented on pull request #29692: URL: https://github.com/apache/spark/pull/29692#issuecomment-690156089 > Ah, one more comment; could you update the code comment in `OptimizeSkewedJoin`? It looks like most comments assume SMJ only, e.g.,