Thank you, liu. Can you please explain what you mean by enabling Spark's fault-tolerance mechanism? I observed that after all tasks finish, Spark works on concatenating the same partitions from all tasks on the file system, e.g.:

    task1 - partition1, partition2, partition3
    task2 - partition1, partition2, partition3

Then, after task1 and task2 finish, Spark concatenates partition1 from task1 and task2 to create the final partition1. This takes longer when we have a large number of files. I am not sure if there is a way to tell Spark not to concatenate the partitions from each task.
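If the slow step really is the output committer renaming each task's files into place at job end, then the "algorithm version 2" committer is supposed to move files as each task commits instead of in one sequential pass afterwards. A sketch of what I plan to try (untested; assumes Hadoop 2.7+, and the input/output paths below are made up stand-ins for our real datasetA/outputPath):

    import org.apache.spark.sql.SparkSession

    // Untested sketch: FileOutputCommitter algorithm version 2 moves each
    // task's output directly into the destination directory at task commit,
    // instead of the driver renaming/merging everything at job end.
    val spark = SparkSession.builder()
      .appName("partitioned-parquet-write")
      .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
      .getOrCreate()

    // hypothetical stand-ins for the real dataset and output path
    val datasetA = spark.read.parquet("/data/input")
    datasetA.write.partitionBy("column1").parquet("/data/output")

The trade-off, as I understand it, is that with version 2 a failed job can leave partially committed files visible in the output directory.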
Thanks
Swapnil

On Tue, Mar 7, 2017 at 10:47 PM, cht liu <liucht...@gmail.com> wrote:

> Did you enable Spark's fault-tolerance mechanism? With RDD checkpointing
> enabled, Spark starts a separate job at the end of the main job to write
> the checkpoint data to the file system, for high availability of the
> persisted data.
>
> 2017-03-08 2:45 GMT+08:00 Swapnil Shinde <swapnilushi...@gmail.com>:
>
>> Hello all
>> I have a Spark job that reads Parquet data and partitions it based on
>> one of the columns. I made sure the partitions are equally distributed
>> and not skewed. My code looks like this -
>>
>> datasetA.write.partitionBy("column1").parquet(outputPath)
>>
>> Execution plan -
>> [image: Inline image 1]
>>
>> All tasks (~12,000) finish in 30-35 mins, but it takes another 40-45
>> mins to close the application. I am not sure what Spark is doing after
>> all tasks are processed successfully.
>> I checked the thread dump (using the UI executor tab) on a few
>> executors but couldn't find anything major. Overall, a few
>> shuffle-client processes are "RUNNABLE" and a few dispatched-* processes
>> are "WAITING".
>>
>> Please let me know what Spark is doing at this stage (after all tasks
>> have finished) and whether there is any way I can optimize it.
>>
>> Thanks
>> Swapnil
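PS: If by fault tolerance you mean RDD checkpointing, here is what I understand that to look like; a sketch only, and the checkpoint directory path is made up:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("checkpoint-demo").getOrCreate()
    val sc = spark.sparkContext

    // Checkpoint data is written here by a separate job, as you describe.
    // Hypothetical path; a reliable shared FS such as HDFS would be used
    // in a real cluster.
    sc.setCheckpointDir("/tmp/spark-checkpoints")

    val rdd = sc.parallelize(1 to 1000).map(_ * 2)
    rdd.checkpoint()  // marked now, materialized after the first action
    rdd.count()       // triggers the main job plus the checkpoint job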