I am trying to write a dataset to HDFS via df.write.partitionBy(column_a,
column_b, column_c).parquet(output_path).
However, it takes several minutes to write only a few hundred MB of data to
HDFS.
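For reference, the write looks roughly like this (a minimal sketch in PySpark;
the column names and paths below are placeholders, not my real ones):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioned-write").getOrCreate()

# The input is only a few hundred MB in total.
df = spark.read.parquet("/path/to/input")

# This is the write that takes several minutes.
df.write.partitionBy("column_a", "column_b", "column_c").parquet("/path/to/output")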
According to this article
<https://stackoverflow.com/questions/45269658/spark-df-write-partitionby-run-very-slow>,
adding a repartition before the write should help. But if the data is skewed,
some tasks take much longer than average, so the write is still slow overall.
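If I understand that suggestion correctly, it would look roughly like this
(again just a sketch with placeholder column names and path):

# Repartition by the same columns used in partitionBy, so each output
# partition directory is written by as few tasks as possible.
df.repartition("column_a", "column_b", "column_c") \
  .write.partitionBy("column_a", "column_b", "column_c") \
  .parquet("/path/to/output")

This should reduce the number of small files, but the skewed key combinations
still each end up in a single task, and those tasks dominate the total write
time.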
How can I solve this problem? Thanks in advance!


Regards,
Junfeng Chen
