[GitHub] [spark] dongjoon-hyun edited a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

2020-07-20 Thread GitBox
dongjoon-hyun edited a comment on pull request #29089: URL: https://github.com/apache/spark/pull/29089#issuecomment-661344352 @hvanhovell . Thank you for your feedback. The following looks a little wrong to me because the above optimization was one of the recommendations for many

[GitHub] [spark] dongjoon-hyun edited a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

2020-07-20 Thread GitBox
dongjoon-hyun edited a comment on pull request #29089: URL: https://github.com/apache/spark/pull/29089#issuecomment-661344352 @hvanhovell . The following is completely wrong because the above optimization was one of the recommendations for many Hortonworks customers to save their HDFS

[GitHub] [spark] dongjoon-hyun edited a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

2020-07-15 Thread GitBox
dongjoon-hyun edited a comment on pull request #29089: URL: https://github.com/apache/spark/pull/29089#issuecomment-658831236 BTW, @aokolnychyi . I merged the corner-case test PR, https://github.com/apache/spark/pull/29118. Could you rebase this PR onto master? Then, we can discuss

[GitHub] [spark] dongjoon-hyun edited a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

2020-07-15 Thread GitBox
dongjoon-hyun edited a comment on pull request #29089: URL: https://github.com/apache/spark/pull/29089#issuecomment-658828708 @hvanhovell . I agree with you on the following. > AFAIK nested ordering can be ignored from a relational algebra point of view. > Regarding the shuffles.

[GitHub] [spark] dongjoon-hyun edited a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

2020-07-14 Thread GitBox
dongjoon-hyun edited a comment on pull request #29089: URL: https://github.com/apache/spark/pull/29089#issuecomment-658559706 The biggest factor is the file format, not the Spark side. For example, in the above example, ORC files are small because the format supports a special encoding when the

[GitHub] [spark] dongjoon-hyun edited a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

2020-07-14 Thread GitBox
dongjoon-hyun edited a comment on pull request #29089: URL: https://github.com/apache/spark/pull/29089#issuecomment-658559706 No. It depends on the file format, not on the Spark side. For example, in the above example, ORC files are small because the format supports a special encoding when the

[GitHub] [spark] dongjoon-hyun edited a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

2020-07-14 Thread GitBox
dongjoon-hyun edited a comment on pull request #29089: URL: https://github.com/apache/spark/pull/29089#issuecomment-658544475 To generate small final Parquet/ORC files, we do the above tricks, don't we? This may cause a regression on the size of output storage.

[GitHub] [spark] dongjoon-hyun edited a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

2020-07-14 Thread GitBox
dongjoon-hyun edited a comment on pull request #29089: URL: https://github.com/apache/spark/pull/29089#issuecomment-658544475 To generate small final Parquet/ORC files, we do the above tricks, don't we? This PR may cause a regression on the size of output storage.

[GitHub] [spark] dongjoon-hyun edited a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

2020-07-14 Thread GitBox
dongjoon-hyun edited a comment on pull request #29089: URL: https://github.com/apache/spark/pull/29089#issuecomment-658544475 To generate small final Parquet/ORC files, we do the above tricks, don't we?