Re: Best way to merge files from streaming jobs on S3

2016-03-04 Thread Chris Miller
Why does the order matter? Coalesce runs in parallel, and if it's just writing to the file, I imagine it writes in whatever order each thread happens to execute. If you want to sort the resulting data, I imagine you'd need to save it to some sort of data structure instead of

Best way to merge files from streaming jobs on S3

2016-03-04 Thread jelez
My streaming job is creating files on S3. The problem is that those files end up very small if I just write them to S3 directly. This is why I use coalesce() to reduce the number of files and make them larger. However, coalesce shuffles data and my job processing time ends up higher than