You can do this in two passes (not one):
A) Save your dataset into HDFS as you do now.
B) Calculate the number of partitions, n = (size of your dataset) / (HDFS block size), then run a simple Spark job that reads the data back and repartitions it to n before the final write. A sketch of this approach follows.
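A minimal sketch of this two-pass idea, assuming Parquet output and the Hadoop FileSystem API to measure the staged data; the paths and table name are hypothetical:

// Pass A: write the dataset as-is to a staging location,
// Pass B: derive n from the HDFS block size and rewrite with n partitions.
import org.apache.spark.sql.SparkSession
import org.apache.hadoop.fs.{FileSystem, Path}

val spark = SparkSession.builder().appName("repartition-to-block-size").getOrCreate()
val df = spark.table("source_table")            // hypothetical source
val stagingPath = "/tmp/my_dataset_staging"     // hypothetical staging path

// Pass A: persist with whatever partitioning you currently have.
df.write.parquet(stagingPath)

// Pass B: measure the staged output and compute n = dataset size / block size.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val datasetSize = fs.getContentSummary(new Path(stagingPath)).getLength  // bytes
val blockSize = fs.getDefaultBlockSize(new Path(stagingPath))            // bytes, e.g. 128 MB
val n = math.max(1, math.ceil(datasetSize.toDouble / blockSize).toInt)

// Re-read and rewrite with n partitions so each output file is roughly one block.
spark.read.parquet(stagingPath)
  .repartition(n)
  .write.parquet("/warehouse/my_dataset")       // hypothetical final path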
Hichame

From: felixcheun...@hotmail.com
Sent: January 19, 2019 2:06 PM
To: 28shivamsha...@gmail.com; user@spark.apache.org
Subject: Re: Persist Dataframe to HDFS considering HDFS Block Size

You can call coalesce to combine partitions.

________________________________
From: Shivam Sharma <28shivamsha...@gmail.com>
Sent: Saturday, January 19, 2019 7:43 AM
To: user@spark.apache.org
Subject: Persist Dataframe to HDFS considering HDFS Block Size

Hi All,

I want to persist a DataFrame to HDFS. Basically, I am inserting data into a Hive table using Spark. Currently, at the time of writing to the Hive table I have set total shuffle partitions = 400, so 400 files are created, which does not take the HDFS block size into account. How can I tell Spark to persist according to the HDFS block size?

Hive has settings like these which solve this problem:

set hive.merge.sparkfiles=true;
set hive.merge.smallfiles.avgsize=2048000000;
set hive.merge.size.per.task=4096000000;

Thanks

--
Shivam Sharma
Indian Institute Of Information Technology, Design and Manufacturing Jabalpur
Mobile No- (+91) 8882114744
Email:- 28shivamsha...@gmail.com
LinkedIn:- https://www.linkedin.com/in/28shivamsharma
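For completeness, a minimal sketch of the coalesce approach suggested above in Felix's reply; the table names and the target file count of 50 are hypothetical, and in practice that count would come from the size / block-size calculation described at the top of the thread:

// Reduce the number of output files before inserting into the Hive table.
// coalesce merges existing partitions without a full shuffle, unlike repartition.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("coalesce-before-insert")
  .enableHiveSupport()
  .getOrCreate()

spark.table("source_table")        // hypothetical source
  .coalesce(50)                    // hypothetical target file count
  .write
  .mode("append")
  .insertInto("target_hive_table") // hypothetical Hive target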