I am trying to write a dataset to HDFS via df.write.partitionBy(column_a, column_b, column_c).parquet(output_path). However, it takes several minutes to write only hundreds of MB of data to HDFS.

According to this article <https://stackoverflow.com/questions/45269658/spark-df-write-partitionby-run-very-slow>, adding a repartition before the write should help. But if there is data skew, some tasks may take much longer than average, so the job is still slow overall. How can I solve this problem? Thanks in advance!
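For reference, this is roughly the repartition-before-write pattern the linked answer suggests (just a sketch; df, output_path and the column names are the same placeholders as above):

    # Repartition on the same columns used by partitionBy so that rows for
    # each output partition land in as few tasks as possible before writing.
    (df.repartition(column_a, column_b, column_c)
       .write
       .partitionBy(column_a, column_b, column_c)
       .parquet(output_path))

With skewed keys, though, the tasks holding the largest partitions still dominate the runtime.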
Regards,
Junfeng Chen