Re: Impact of coalesce operation before writing dataframe

John Compitello Tue, 23 May 2017 13:43:24 -0700

Spark is doing operations on each partition in parallel. If you decrease number 
of partitions, you’re potentially doing less work in parallel depending on your 
cluster setup.


> On May 23, 2017, at 4:23 PM, Andrii Biletskyi 
> <andrii.bilets...@yahoo.com.INVALID> wrote:
> 
>  
> No, I didn't try to use repartition, how exactly it impacts the parallelism?
> In my understanding coalesce simply "unions" multiple partitions located on 
> same executor "one on on top of the other", while repartition does hash-based 
> shuffle decreasing the number of output partitions. So how this exactly 
> affects the parallelism, which stage of the job?
> 
> Thanks,
> Andrii
> 
> 
> 
> On Tuesday, May 23, 2017 10:20 PM, Michael Armbrust <mich...@databricks.com> 
> wrote:
> 
> 
> coalesce is nice because it does not shuffle, but the consequence of avoiding 
> a shuffle is it will also reduce parallelism of the preceding computation.  
> Have you tried using repartition instead?
> 
> On Tue, May 23, 2017 at 12:14 PM, Andrii Biletskyi 
> <andrii.bilets...@yahoo.com.invalid 
> <mailto:andrii.bilets...@yahoo.com.invalid>> wrote:
> Hi all,
> 
> I'm trying to understand the impact of coalesce operation on spark job 
> performance.
> 
> As a side note: were are using emrfs (i.e. aws s3) as source and a target for 
> the job.
> 
> Omitting unnecessary details job can be explained as: join 200M records 
> Dataframe stored in orc format on emrfs with another 200M records cached 
> Dataframe, the result of the join put back to emrfs. First DF is a set of 
> wide rows (Spark UI shows 300 GB) and the second is relatively small (Spark 
> shows 20 GB).
> 
> I have enough resources in my cluster to perform the job but I don't like the 
> fact that output datasource contains 200 part orc files (as 
> spark.sql.shuffle. partitions defaults to 200) so before saving orc to emrfs 
> I'm doing .coalesce(10). From documentation coalesce looks like a quite 
> harmless operations: no repartitioning etc.
> 
> But with such setup my job fails to write dataset on the last stage. Right 
> now the error is OOM: GC overhead. When I change  .coalesce(10) to 
> .coalesce(100) the job runs much faster and finishes without errors.
> 
> So what's the impact of .coalesce in this case? And how to do in place 
> concatenation of files (not involving hive) to end up with smaller amount of 
> bigger files, as with .coalesce(100) job generates 100 orc snappy encoded 
> files ~300MB each.
> 
> Thanks,
> Andrii
> 
> 
>

Re: Impact of coalesce operation before writing dataframe

Reply via email to