HI Pedro,

What is your usecase ,why you used coqlesce ,coalesce() is very expensive
operations as they shuffle the data across many partitions hence try to
minimize repartition as much as possible.

Regards,
Vaquar khan


On Thu, Mar 18, 2021, 5:47 PM Pedro Tuero <tuerope...@gmail.com> wrote:

> I was reviewing a spark java application running on aws emr.
>
> The code was like:
> RDD.reduceByKey(func).coalesce(number).saveAsTextFile()
>
> That stage took hours to complete.
> I changed to:
> RDD.reduceByKey(func, number).saveAsTextFile()
> And it now takes less than 2 minutes, and the final output is the same.
>
> So, is it a bug or a feature?
> Why spark doesn't treat a coalesce after a reduce like a reduce with
> output partitions parameterized?
>
> Just for understanding,
> Thanks,
> Pedro.
>
>
>
>

Reply via email to