How about running a count step to force Spark to materialise the dataframe, and then repartitioning to 1?
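Something along these lines, perhaps (a rough, untested PySpark sketch - it assumes df and output_path are as in your snippet below, and that you have room to cache the result):

    # Cache so the count's work isn't thrown away, then force materialisation
    # with an action - this runs the filters/calculations across all partitions.
    df = df.cache()
    df.count()

    # repartition(1) shuffles the already-computed result down to one partition,
    # so only the final write goes through a single task.
    (df.repartition(1)
       .write
       .format("com.databricks.spark.csv")
       .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
       .save(output_path))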
On 9 Aug 2016 17:11, "Adrian Bridgett" <adr...@opensignal.com> wrote:

> In short: df.coalesce(1).write seems to make all the earlier calculations
> about the dataframe go through a single task (rather than running on
> multiple tasks, with only the final dataframe being sent through a single
> worker). Any idea how we can force the job to run in parallel?
>
> In more detail:
>
> We have a job that we wish to write out as a single CSV file. We have two
> approaches (code below):
>
>     df = ....(filtering, calculations)
>     df.coalesce(num).write.
>         format("com.databricks.spark.csv").
>         option("codec", "org.apache.hadoop.io.compress.GzipCodec").
>         save(output_path)
>
> Option A: (num=100)
> - dataframe calculated in parallel
> - upload in parallel
> - gzip in parallel
> - but we then have to run "hdfs dfs -getmerge" to download all the data
>   and then write it back again
>
> Option B: (num=1)
> - single gzip (but gzip is pretty quick)
> - uploads go through a single machine
> - no HDFS commands
> - dataframe is _not_ calculated in parallel (we can see filters getting
>   just a single task)
>
> What I'm not sure about is why Spark (1.6.1) is deciding to run just a
> single task for the calculation - and what we can do about it. We really
> want the df to be calculated in parallel and only _then_ coalesced before
> being written. (It may be that the -getmerge approach will still be
> faster.)
>
> df.coalesce(100).coalesce(1).write..... doesn't look very likely to help!
>
> Adrian
>
> --
> *Adrian Bridgett*
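For what it's worth, you can see the difference before running the job by comparing the physical plans (a quick sketch, same assumptions as above): coalesce(1) avoids a shuffle, so the whole lineage gets folded into that single partition - which is why the filters end up on one task - whereas repartition(1) inserts an Exchange (shuffle), so everything upstream keeps its parallelism:

    df.coalesce(1).explain()      # typically no Exchange: the whole plan collapses to one task
    df.repartition(1).explain()   # Exchange present: upstream stays parallel, only the write is single-task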