How about running a count step to force Spark to materialise the dataframe, and then repartitioning to 1?
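Something along these lines, perhaps (a rough, untested PySpark sketch - it assumes df and output_path are as in your snippet below, and that you have room to cache the result):

    # Cache so the count's work isn't thrown away, then force materialisation
    # with an action - this runs the filters/calculations across all partitions.
    df = df.cache()
    df.count()

    # repartition(1) shuffles the already-computed result down to one partition,
    # so only the final write goes through a single task.
    (df.repartition(1)
       .write
       .format("com.databricks.spark.csv")
       .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
       .save(output_path))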
On 9 Aug 2016 17:11, "Adrian Bridgett" <adr...@opensignal.com> wrote:

> In short: df.coalesce(1).write seems to make all the earlier calculations
> about the dataframe go through a single task (rather than running on
> multiple tasks, with only the final dataframe being sent through a single
> worker). Any idea how we can force the job to run in parallel?
>
> In more detail:
>
> We have a job that we wish to write out as a single CSV file. We have two
> approaches (code below):
>
>     df = ....(filtering, calculations)
>     df.coalesce(num).write.
>         format("com.databricks.spark.csv").
>         option("codec", "org.apache.hadoop.io.compress.GzipCodec").
>         save(output_path)
>
> Option A: (num=100)
> - dataframe calculated in parallel
> - upload in parallel
> - gzip in parallel
> - but we then have to run "hdfs dfs -getmerge" to download all the data
>   and then write it back again
>
> Option B: (num=1)
> - single gzip (but gzip is pretty quick)
> - uploads go through a single machine
> - no HDFS commands
> - dataframe is _not_ calculated in parallel (we can see filters getting
>   just a single task)
>
> What I'm not sure about is why Spark (1.6.1) is deciding to run just a
> single task for the calculation - and what we can do about it. We really
> want the df to be calculated in parallel and only _then_ coalesced before
> being written. (It may be that the -getmerge approach will still be
> faster.)
>
> df.coalesce(100).coalesce(1).write..... doesn't look very likely to help!
>
> Adrian
>
> --
> *Adrian Bridgett*
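For what it's worth, you can see the difference before running the job by comparing the physical plans (a quick sketch, same assumptions as above): coalesce(1) avoids a shuffle, so the whole lineage gets folded into that single partition - which is why the filters end up on one task - whereas repartition(1) inserts an Exchange (shuffle), so everything upstream keeps its parallelism:

    df.coalesce(1).explain()      # typically no Exchange: the whole plan collapses to one task
    df.repartition(1).explain()   # Exchange present: upstream stays parallel, only the write is single-task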