Why does the order matter? Coalesce runs in parallel, and if it's just
writing to the file, I imagine it would do so in whatever order each
thread happens to execute. If you want to sort the resulting
data, I imagine you'd need to save it to some sort of data structure
instead of
My streaming job is creating files on S3.
The problem is that those files end up very small if I just write them to S3
directly.
This is why I use coalesce() to reduce the number of files and make them
larger.
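
To make the small-files point concrete, here is a toy model in plain Python.
It is not Spark code: the coalesce() function below is a hypothetical
stand-in that just merges adjacent partitions, whereas Spark's own coalesce
also takes data locality into account. It assumes each partition is written
out as one file, and the record names and partition counts are made up.

```python
# Toy model, NOT Spark: each inner list stands for one partition, and we
# assume one output file is written per partition.

def coalesce(partitions, num_partitions):
    """Merge adjacent partitions into at most num_partitions groups.

    Rows are only concatenated, never re-sorted or redistributed by key.
    """
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    n = len(partitions)
    if num_partitions >= n:
        # Nothing to merge: return a copy with the same partitioning.
        return [list(p) for p in partitions]
    merged = []
    for i in range(num_partitions):
        # Contiguous slice of parent partitions assigned to output group i.
        start = i * n // num_partitions
        end = (i + 1) * n // num_partitions
        group = []
        for p in partitions[start:end]:
            group.extend(p)
        merged.append(group)
    return merged


# One hypothetical micro-batch: 8 partitions with 3 records each.
batch = [[f"rec-{p}-{i}" for i in range(3)] for p in range(8)]

before = [len(p) for p in batch]               # 8 small "files"
after = [len(p) for p in coalesce(batch, 2)]   # 2 larger "files"

print("records per file without coalesce:", before)
print("records per file with coalesce(2):", after)
```

The total number of records is the same in both cases; only the number of
files per batch (and so their size) changes.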
However, coalesce() shuffles data, and my job's processing time ends up higher
than