Hi!
Actually *coalesce()* is usually a cheap operation as it moves some
existing partitions from one node to another. So it is not a (full) shuffle.
See the documentation
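To illustrate the difference, here is a toy Java sketch using plain collections, not Spark code. The class and method names are made up for illustration, and the contiguous grouping is a simplifying assumption: Spark's real coalesce also takes data locality into account when it groups partitions.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: a "partition" here is just a list of records.
final class CoalesceSketch {

    // Coalesce without shuffle: merge whole neighbouring partitions into
    // at most `target` groups. Individual records are never inspected.
    static List<List<String>> coalesce(List<List<String>> parts, int target) {
        int n = parts.size();
        int groups = Math.min(target, n); // cannot increase the partition count
        List<List<String>> out = new ArrayList<>();
        for (int g = 0; g < groups; g++) {
            List<String> merged = new ArrayList<>();
            int start = (int) ((long) g * n / groups);
            int end = (int) ((long) (g + 1) * n / groups);
            for (int p = start; p < end; p++) {
                merged.addAll(parts.get(p));
            }
            out.add(merged);
        }
        return out;
    }

    // Full shuffle: every record is routed individually by the hash of its key.
    static List<List<String>> shuffle(List<List<String>> parts, int target) {
        List<List<String>> out = new ArrayList<>();
        for (int i = 0; i < target; i++) {
            out.add(new ArrayList<>());
        }
        for (List<String> part : parts) {
            for (String rec : part) {
                out.get(Math.floorMod(rec.hashCode(), target)).add(rec);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> parts = List.of(
            List.of("a", "b"), List.of("c"), List.of("d", "e"), List.of("f"));
        System.out.println(coalesce(parts, 2)); // [[a, b, c], [d, e, f]]
        System.out.println(shuffle(parts, 2));  // [[b, d, f], [a, c, e]]
    }
}
```

In the coalesce case the original partitions stay intact and are only grouped together, whereas in the shuffle case records from every input partition can end up in every output partition.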
Hi Pedro,
What is your use case, and why did you use coalesce()? coalesce() is a very
expensive operation, as it shuffles the data across many partitions, so try
to minimize repartitioning as much as possible.
Regards,
Vaquar khan
On Thu, Mar 18, 2021, 5:47 PM Pedro Tuero wrote:
I was reviewing a Spark Java application running on AWS EMR.
The code was like:
RDD.reduceByKey(func).coalesce(number).saveAsTextFile()
That stage took hours to complete.
I changed to:
RDD.reduceByKey(func, number).saveAsTextFile()
And it now takes less than 2 minutes, and the final output is