My streaming job is writing files to S3. The problem is that if I write each batch directly, the files end up very small, so I use coalesce() to reduce the number of files per batch and make them larger.
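For context, the write path looks roughly like this; it's a minimal sketch, and the source, batch interval, partition count, and S3 path are stand-ins rather than my real job:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("s3-merge-example")
val ssc = new StreamingContext(conf, Seconds(60))

val lines = ssc.socketTextStream("localhost", 9999) // stand-in source

lines.foreachRDD { (rdd, time) =>
  if (!rdd.isEmpty()) {
    // Fewer partitions per batch => fewer, larger files on S3
    rdd.coalesce(8)
       .saveAsTextFile(s"s3a://my-bucket/output/batch-${time.milliseconds}")
  }
}

ssc.start()
ssc.awaitTermination()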
However, coalescing moves data between partitions, and my job's processing time ends up higher than sparkBatchIntervalMilliseconds. I have observed that if I coalesce down to a number of partitions equal to the number of cores in the cluster, I get less shuffling, but I can't substantiate that.

Is there any dependency/rule between the number of executors, number of cores, etc. that I can use to minimize shuffling while still producing the minimum number of output files per batch? What is the best practice?
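For completeness, the heuristic I tried looks roughly like this, continuing the sketch above (the config keys and fallback values here are assumptions, not necessarily what my cluster sets):

val numExecutors = ssc.sparkContext.getConf.getInt("spark.executor.instances", 2)
val coresPerExecutor = ssc.sparkContext.getConf.getInt("spark.executor.cores", 4)
val totalCores = numExecutors * coresPerExecutor

lines.foreachRDD { (rdd, time) =>
  if (!rdd.isEmpty()) {
    // coalesce with shuffle = false (the default) narrows partitions
    // without a full shuffle, which may be why matching the core count
    // appeared to shuffle less
    rdd.coalesce(totalCores, shuffle = false)
       .saveAsTextFile(s"s3a://my-bucket/output/batch-${time.milliseconds}")
  }
}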