wangyum commented on pull request #28032:
URL: https://github.com/apache/spark/pull/28032#issuecomment-855549147
@cloud-fan @HyukjinKwon Could we review
https://github.com/apache/spark/pull/32781 first? This PR needs to add support
for DataSource V2, and the test is not very robust. I plan to
wangyum commented on pull request #28032:
URL: https://github.com/apache/spark/pull/28032#issuecomment-851119698
@HyukjinKwon I mainly want to make the whole cluster more stable. If a user
does not add it manually, a large number of files may be generated. Please see
this picture:
wangyum commented on pull request #28032:
URL: https://github.com/apache/spark/pull/28032#issuecomment-744131292
Thank you all. Merged it to our internal Spark version.
wangyum commented on pull request #28032:
URL: https://github.com/apache/spark/pull/28032#issuecomment-643936564
Yes, this strategy may introduce data skew, but skewed data only affects the
job itself. Creating a large number of files affects the
NameNode, which will
wangyum commented on pull request #28032:
URL: https://github.com/apache/spark/pull/28032#issuecomment-643706474
> @wangyum Question, if we have a repartition hint on p1 and p2 in the
SELECT query would it have similar effect ?
Yes. It has a similar effect.
wangyum commented on pull request #28032:
URL: https://github.com/apache/spark/pull/28032#issuecomment-643575562
Thank you @gengliangwang. The root cause is that repartitioning by the dynamic
partition columns can significantly reduce the number of files:
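
The effect can be illustrated with a rough simulation. This is a plain-Python sketch, not Spark code and not taken from the original comment. It assumes a simplified model in which each write task produces one file per partition value it holds. The function `count_files`, the task count, and the row layout are all hypothetical choices made for illustration.

```python
def count_files(rows, num_tasks, repartition):
    """Count output files under a simplified model:
    one file per (task, partition value) pair that receives at least one row."""
    files = set()
    for i, part_value in enumerate(rows):
        if repartition:
            # Rows sharing a partition value are routed to the same task,
            # mimicking a hash repartition on the partition column.
            task = hash(part_value) % num_tasks
        else:
            # No repartition: rows stay where they are, modeled as round-robin.
            task = i % num_tasks
        files.add((task, part_value))
    return len(files)


num_tasks = 200
# 10,000 rows spread over 10 partition values, laid out so that every task
# sees every partition value when no repartition is applied (assumed layout).
rows = [(i // num_tasks) % 10 for i in range(10_000)]

files_without = count_files(rows, num_tasks, repartition=False)
files_with = count_files(rows, num_tasks, repartition=True)
print(files_without, files_with)
```

In this model the file count falls from tasks × partition values to the number of partition values. The trade-off is that all rows for one partition value land on a single task, which is where skew can appear.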