Spark is doing operations on each partition in parallel. If you decrease number of partitions, you’re potentially doing less work in parallel depending on your cluster setup.
> On May 23, 2017, at 4:23 PM, Andrii Biletskyi > <andrii.bilets...@yahoo.com.INVALID> wrote: > > > No, I didn't try to use repartition, how exactly it impacts the parallelism? > In my understanding coalesce simply "unions" multiple partitions located on > same executor "one on on top of the other", while repartition does hash-based > shuffle decreasing the number of output partitions. So how this exactly > affects the parallelism, which stage of the job? > > Thanks, > Andrii > > > > On Tuesday, May 23, 2017 10:20 PM, Michael Armbrust <mich...@databricks.com> > wrote: > > > coalesce is nice because it does not shuffle, but the consequence of avoiding > a shuffle is it will also reduce parallelism of the preceding computation. > Have you tried using repartition instead? > > On Tue, May 23, 2017 at 12:14 PM, Andrii Biletskyi > <andrii.bilets...@yahoo.com.invalid > <mailto:andrii.bilets...@yahoo.com.invalid>> wrote: > Hi all, > > I'm trying to understand the impact of coalesce operation on spark job > performance. > > As a side note: were are using emrfs (i.e. aws s3) as source and a target for > the job. > > Omitting unnecessary details job can be explained as: join 200M records > Dataframe stored in orc format on emrfs with another 200M records cached > Dataframe, the result of the join put back to emrfs. First DF is a set of > wide rows (Spark UI shows 300 GB) and the second is relatively small (Spark > shows 20 GB). > > I have enough resources in my cluster to perform the job but I don't like the > fact that output datasource contains 200 part orc files (as > spark.sql.shuffle. partitions defaults to 200) so before saving orc to emrfs > I'm doing .coalesce(10). From documentation coalesce looks like a quite > harmless operations: no repartitioning etc. > > But with such setup my job fails to write dataset on the last stage. Right > now the error is OOM: GC overhead. When I change .coalesce(10) to > .coalesce(100) the job runs much faster and finishes without errors. > > So what's the impact of .coalesce in this case? And how to do in place > concatenation of files (not involving hive) to end up with smaller amount of > bigger files, as with .coalesce(100) job generates 100 orc snappy encoded > files ~300MB each. > > Thanks, > Andrii > > >