Hi JF,

Try executing this before df.write to see how the rows are distributed:

// count rows by partition id
import org.apache.spark.sql.functions.spark_partition_id
df.groupBy(spark_partition_id()).count.show()
This will show you how the data is partitioned inside df.

A small trick we can apply here: when using partitionBy(column_a, column_b, column_c), make sure that (column_a partitions) > (column_b partitions) > (column_c partitions). Try this (a sketch follows below the quoted message).

Regards,
Shyam

On Mon, Mar 4, 2019 at 4:09 PM JF Chen <darou...@gmail.com> wrote:
> I am trying to write data in a dataset to HDFS via
> df.write.partitionBy(column_a, column_b, column_c).parquet(output_path)
> However, it takes several minutes to write only hundreds of MB of data to HDFS.
> According to this article
> <https://stackoverflow.com/questions/45269658/spark-df-write-partitionby-run-very-slow>,
> adding a repartition before the write should help. But if there is data skew,
> some tasks may take much longer than average, which still costs a lot of time.
> How can I solve this problem? Thanks in advance!
>
> Regards,
> Junfeng Chen
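
Here is a minimal sketch of the repartition-before-write pattern mentioned in the article you linked, combined with the column-ordering suggestion above. The SparkSession setup, the column names (column_a/column_b/column_c), the toy data, and the output path are all placeholders, not your actual job:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("partitioned-write").getOrCreate()
import spark.implicits._

// Toy DataFrame standing in for your dataset; names and values are placeholders.
// column_a has the most distinct values and column_c the fewest,
// matching the ordering suggested above.
val df = Seq(
  ("a1", "b1", "c1", 1),
  ("a2", "b1", "c1", 2),
  ("a3", "b2", "c1", 3)
).toDF("column_a", "column_b", "column_c", "value")

// Repartition on the same columns used in partitionBy so that each task
// writes to as few output directories (and as few small files) as possible.
df.repartition(col("column_a"), col("column_b"), col("column_c"))
  .write
  .mode("overwrite")
  .partitionBy("column_a", "column_b", "column_c")
  .parquet("hdfs:///tmp/output")   // placeholder for your output_path

Note that if a single key combination holds most of the rows, repartitioning on these columns alone will still send that key to one task; in that case, adding an extra high-cardinality or random expression to repartition() (but not to partitionBy()) is a common way to spread the heavy key across tasks.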