I would suggest repartitioning it into a reasonable number of partitions, maybe 500, and saving it to an intermediate working directory. Then read all the files back from that working directory, coalesce to 1, and save to the final location.
Thanks,
Deepak

On Fri, Mar 9, 2018, 20:12 Vadim Semenov <va...@datadoghq.com> wrote:

> Because `coalesce` gets propagated further up in the DAG into the last
> stage, your last stage only has one task.
>
> You need to break your DAG so that your expensive operations run in a
> previous stage, before the stage with `.coalesce(1)`.
>
> On Fri, Mar 9, 2018 at 5:23 AM, Md. Rezaul Karim <
> rezaul.ka...@insight-centre.org> wrote:
>
>> Dear All,
>>
>> I have a small CSV file, around 250 MB, with only 30 columns in the
>> DataFrame. I'm trying to save the pre-processed DataFrame as another
>> CSV file on disk for later use.
>>
>> However, writing the resulting DataFrame is taking far too long, about
>> 4 to 5 hours. Moreover, the size of the file written to disk is about
>> 58 GB!
>>
>> Here's the sample code that I tried:
>>
>> # Using repartition()
>> myDF.repartition(1).write.format("com.databricks.spark.csv").save("data/file.csv")
>>
>> # Using coalesce()
>> myDF.coalesce(1).write.format("com.databricks.spark.csv").save("data/file.csv")
>>
>> Any better suggestions?
>>
>> ----
>> Md. Rezaul Karim, BSc, MSc
>> Research Scientist, Fraunhofer FIT, Germany
>>
>> Ph.D. Researcher, Information Systems, RWTH Aachen University, Germany
>>
>> eMail: rezaul.ka...@fit.fraunhofer.de
>> Tel: +49 241 80-21527
>
> --
> Sent from my iPhone