Re: Coalesce vs reduce operation parameter

2021-03-20 Thread Attila Zsolt Piros
Hi! Actually *coalesce()* is usually a cheap operation as it moves some existing partitions from one node to another. So it is not a (full) shuffle. See the documentation

Re: Coalesce vs reduce operation parameter

2021-03-20 Thread vaquar khan
Hi Pedro, What is your use case, and why did you use coalesce? coalesce() is a very expensive operation, as it shuffles the data across many partitions, so try to minimize repartitioning as much as possible. Regards, Vaquar khan On Thu, Mar 18, 2021, 5:47 PM Pedro Tuero wrote: > I was reviewing a

Coalesce vs reduce operation parameter

2021-03-18 Thread Pedro Tuero
I was reviewing a Spark Java application running on AWS EMR. The code was like: RDD.reduceByKey(func).coalesce(number).saveAsTextFile() That stage took hours to complete. I changed it to: RDD.reduceByKey(func, number).saveAsTextFile() And it now takes less than 2 minutes, and the final output is