I am trying to write a dataset to HDFS via df.write.partitionBy(column_a,
column_b, column_c).parquet(output_path).
However, it takes several minutes to write only a few hundred MB of data to
HDFS.
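For reference, the write looks roughly like this (a minimal sketch in PySpark;
the column names and paths below are placeholders, not my real ones):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioned-write").getOrCreate()

# The input is only a few hundred MB in total.
df = spark.read.parquet("/path/to/input")

# This is the write that takes several minutes.
df.write.partitionBy("column_a", "column_b", "column_c").parquet("/path/to/output")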
According to this article
<https://stackoverflow.com/questions/45269658/spark-df-write-partitionby-run-very-slow>,
adding a repartition before the write should help. But if the data is skewed,
some tasks take much longer than average, so the write is still slow overall.
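If I understand that suggestion correctly, it would look roughly like this
(again just a sketch with placeholder column names and path):

# Repartition by the same columns used in partitionBy, so each output
# partition directory is written by as few tasks as possible.
df.repartition("column_a", "column_b", "column_c") \
  .write.partitionBy("column_a", "column_b", "column_c") \
  .parquet("/path/to/output")

This should reduce the number of small files, but the skewed key combinations
still each end up in a single task, and those tasks dominate the total write
time.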
How can I solve this problem? Thanks in advance!


Regards,
Junfeng Chen
