AngersZh commented on pull request #29692:
URL: https://github.com/apache/spark/pull/29692#issuecomment-695464072
> Then we need some estimation work, as the shuffle/scan node may be far away from the join node. We also need to carefully justify whether the extra shuffle cost is worth the
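A minimal sketch of the kind of estimation being asked for, assuming per-partition byte sizes are already available; `shouldAddExtraShuffle` and its factor/threshold values are illustrative, not part of Spark's optimizer:

```scala
// Hypothetical cost check: only pay for an extra shuffle when one partition
// dominates the others badly AND is large enough for rebalancing to matter.
// Nothing here is a Spark internal; it only illustrates the trade-off.
def shouldAddExtraShuffle(partitionSizes: Seq[Long],
                          skewFactor: Double = 5.0,
                          minSkewBytes: Long = 256L * 1024 * 1024): Boolean = {
  if (partitionSizes.isEmpty) {
    false
  } else {
    val max = partitionSizes.max
    val median = partitionSizes.sorted.apply(partitionSizes.size / 2)
    max > median * skewFactor && max > minSkewBytes
  }
}
```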
AngersZh commented on pull request #29692:
URL: https://github.com/apache/spark/pull/29692#issuecomment-690936960
> The Spark query optimizer should not add an extra shuffle by itself, as it's likely to cause a perf regression.

With this rule, we can't handle such a data skew case
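If the optimizer won't add that shuffle itself, the usual manual workaround for a skewed stream side is salting. A minimal sketch, assuming DataFrames `streamDf` and `smallDf` joined on a column `key` (these names and the bucket count are illustrative):

```scala
import org.apache.spark.sql.functions._

// Salt the skewed stream side so a hot key lands in many partitions,
// then explode the small side across every salt value so the join still matches.
val saltBuckets = 16

val saltedStream = streamDf.withColumn("salt", (rand() * saltBuckets).cast("int"))
val explodedSmall = smallDf.withColumn("salt",
  explode(array((0 until saltBuckets).map(lit): _*)))

val joined = saltedStream.join(explodedSmall, Seq("key", "salt"))
```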
AngersZh commented on pull request #29692:
URL: https://github.com/apache/spark/pull/29692#issuecomment-690873256
> > after the stream side is executed, we will get the row count of each partition and judge whether it's seriously skewed; if seriously skewed and the volume is large, repartition the stream
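A minimal sketch of that check done at the application level, counting rows per partition after the stream side materializes; the skew factor and count threshold are illustrative, and Spark's AQE itself uses map output statistics rather than a recount like this:

```scala
// Count rows in each partition of the (already computed) stream side,
// then repartition only if one partition dominates and is genuinely large.
val counts: Array[Long] = streamDf.rdd
  .mapPartitions(it => Iterator(it.size.toLong))
  .collect()

val maxCnt = counts.max
val medianCnt = counts.sorted.apply(counts.length / 2)

val rebalanced =
  if (maxCnt > medianCnt * 5 && maxCnt > 1000000L) {
    streamDf.repartition(streamDf.rdd.getNumPartitions * 2)
  } else {
    streamDf
  }
```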
AngersZh commented on pull request #29692:
URL: https://github.com/apache/spark/pull/29692#issuecomment-690369334
> What's the high-level idea? We can handle skewed SMJ because there is a shuffle, and we can split the partitions at the granularity of shuffle blocks. Broadcast join
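For context, the shuffle-block mechanism the quote describes is what drives AQE's skewed-join handling for sort-merge joins. These config keys are real Spark 3.0 settings; the factor and threshold values shown are the Spark 3.0 defaults:

```scala
// Skewed SMJ partitions can be split at shuffle-block granularity;
// a broadcast join has no shuffle on the stream side to split.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
// Skewed if bigger than factor * median partition size...
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
// ...and bigger than this absolute threshold.
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")
```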
AngersZh commented on pull request #29692:
URL: https://github.com/apache/spark/pull/29692#issuecomment-690156089
> Ah, one more comment; could you update the code comment in `OptimizeSkewedJoin`? It looks like most comments assume SMJ only, e.g.,