Hi Team,

We are running into a performance problem and would appreciate your suggestions on how to fix it:
We have a dataset that we aggregate from other datasets and would like to write out as a single file (it is small enough). After a series of transformations (GROUP BYs, FLATMAPs), we coalesce the final RDD to 1 partition before writing it out, and this coalesce degrades performance. The problem is not that the coalesce operation itself adds runtime, but that it somehow dictates the number of partitions used by the upstream transformations.

We hope there is a simple and effective way to solve this kind of issue, which we believe is common for many people.

Thanks,
James