[GitHub] [spark] wangyum commented on pull request #28032: [SPARK-31264][SQL] Repartition by dynamic partition columns before insert partition table

2021-06-06 Thread GitBox
wangyum commented on pull request #28032: URL: https://github.com/apache/spark/pull/28032#issuecomment-855549147 @cloud-fan @HyukjinKwon Could we review https://github.com/apache/spark/pull/32781 first? This PR still needs to add support for DataSource V2, and the test is not very robust. I plan to

[GitHub] [spark] wangyum commented on pull request #28032: [SPARK-31264][SQL] Repartition by dynamic partition columns before insert partition table

2021-05-30 Thread GitBox
wangyum commented on pull request #28032: URL: https://github.com/apache/spark/pull/28032#issuecomment-851119698 @HyukjinKwon I mainly want to make the whole cluster more stable. If a user does not add the repartition manually, a large number of files may be generated. Please see this picture:

[GitHub] [spark] wangyum commented on pull request #28032: [SPARK-31264][SQL] Repartition by dynamic partition columns before insert partition table

2020-12-13 Thread GitBox
wangyum commented on pull request #28032: URL: https://github.com/apache/spark/pull/28032#issuecomment-744131292 Thank you all. Merged it to our internal Spark version.

[GitHub] [spark] wangyum commented on pull request #28032: [SPARK-31264][SQL] Repartition by dynamic partition columns before insert partition table

2020-06-15 Thread GitBox
wangyum commented on pull request #28032: URL: https://github.com/apache/spark/pull/28032#issuecomment-643936564 Yes, this strategy may introduce data skew, but a job with skewed data only affects itself. Creating a large number of files will affect the NameNode, which will

[GitHub] [spark] wangyum commented on pull request #28032: [SPARK-31264][SQL] Repartition by dynamic partition columns before insert partition table

2020-06-13 Thread GitBox
wangyum commented on pull request #28032: URL: https://github.com/apache/spark/pull/28032#issuecomment-643706474 > @wangyum Question, if we have a repartition hint on p1 and p2 in the SELECT query would it have similar effect? Yes, it has a similar effect.

[GitHub] [spark] wangyum commented on pull request #28032: [SPARK-31264][SQL] Repartition by dynamic partition columns before insert partition table

2020-06-13 Thread GitBox
wangyum commented on pull request #28032: URL: https://github.com/apache/spark/pull/28032#issuecomment-643575562 Thank you @gengliangwang. The root cause is that repartitioning by dynamic partition columns can significantly reduce the number of files:
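
The figures from the original comment are not reproduced in this archive. As a rough illustration of the file-count argument only, here is a minimal pure-Python sketch that does not use Spark. It assumes a simplified model in which each write task emits one file per dynamic partition value it sees; the function `count_files`, the task count of 200, and the 10 partition values are all hypothetical and chosen just for the example.

```python
def count_files(num_tasks, partition_values, repartition_by_key):
    """Count output files in a toy model: one file per (task, partition value)."""
    files = set()
    for i, value in enumerate(partition_values):
        if repartition_by_key:
            # Rows with the same partition value are routed to the same task.
            task = hash(value) % num_tasks
        else:
            # Rows are spread over tasks regardless of their partition value.
            task = i % num_tasks
        files.add((task, value))
    return len(files)


# Hypothetical input: 2000 rows, 10 partition values (0..9), arranged so that
# every one of the 200 tasks sees every partition value when not repartitioned.
rows = [i // 200 for i in range(2000)]

files_without = count_files(200, rows, repartition_by_key=False)
files_with = count_files(200, rows, repartition_by_key=True)

print(files_without)  # 2000: tasks x partition values
print(files_with)     # 10: one file per partition value
```

In this toy model the unrepartitioned write produces tasks x partition values files, while routing rows by partition value collapses that to one file per partition value. It also shows the trade-off raised earlier in the thread: all rows for a given partition value land on a single task, so a skewed partition makes that task slow.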