Poor performance caused by coalesce to 1

James Yu Wed, 03 Feb 2021 10:55:06 -0800

Hi Team,

We are running into a performance issue and would like your suggestions on 
how to improve it:


We have a dataset that we aggregate from other datasets and would like to 
write out as a single file (it is small enough).  After a series of 
transformations (GROUP BYs, FLATMAPs), we coalesce the final RDD to 1 
partition before writing it out. This coalesce degrades performance: not 
because the coalesce operation itself adds runtime, but because it somehow 
dictates the number of partitions used in the upstream transformations.

We hope there is a simple way to solve this kind of issue, which we believe 
is quite common.


Thanks

James
