Hi, you are definitely not using Spark 2.1 the way it should be used.
Try using sessions and follow their guidelines; this issue was specifically resolved as part of the Spark 2.1 release.

Regards,
Gourav

On Wed, Mar 8, 2017 at 8:00 PM, Swapnil Shinde <swapnilushi...@gmail.com> wrote:

> Thank you, liu. Can you please explain what you mean by enabling the Spark
> fault tolerance mechanism?
> I observed that after all tasks finish, Spark works on concatenating the
> same partitions from all tasks on the file system, e.g.,
>
> task1 - partition1, partition2, partition3
> task2 - partition1, partition2, partition3
>
> Then, after task1 and task2 finish, Spark concatenates partition1 from
> task1 and task2 to create partition1. This takes longer if we have a large
> number of files. I am not sure if there is a way to stop Spark from
> concatenating partitions from each task.
>
> Thanks
> Swapnil
>
> On Tue, Mar 7, 2017 at 10:47 PM, cht liu <liucht...@gmail.com> wrote:
>
>> Did you enable the Spark fault tolerance mechanism? At the end of the
>> job, the RDD will start a separate job to write the checkpoint data to
>> the file system before persisting it for high availability.
>>
>> 2017-03-08 2:45 GMT+08:00 Swapnil Shinde <swapnilushi...@gmail.com>:
>>
>>> Hello all
>>> I have a Spark job that reads Parquet data and partitions it based on
>>> one of the columns. I made sure the partitions are equally distributed
>>> and not skewed. My code looks like this -
>>>
>>> datasetA.write.partitionBy("column1").parquet(outputPath)
>>>
>>> Execution plan -
>>> [image: Inline image 1]
>>>
>>> All tasks (~12,000) finish in 30-35 mins, but it takes another 40-45
>>> mins to close the application. I am not sure what Spark is doing after
>>> all tasks are processed successfully.
>>> I checked the thread dump (using the UI executor tab) on a few
>>> executors but couldn't find anything major. Overall, a few
>>> shuffle-client processes are "RUNNABLE" and a few dispatcher-*
>>> processes are "WAITING".
>>> Please let me know what Spark is doing at this stage (after all tasks
>>> finished) and any way I can optimize it.
>>>
>>> Thanks
>>> Swapnil
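For what it's worth, the post-task "concatenation" described above is often the output committer's commit phase, where task attempt files are renamed into their final partition directories one by one. A sketch of two mitigations that are sometimes suggested for this, assuming a Spark 2.x `SparkSession` named `spark` and the `datasetA`, `"column1"`, and `outputPath` names from the original post (whether they help depends on the file system and job; this is not a confirmed fix for this thread):

```scala
import spark.implicits._

// FileOutputCommitter algorithm version 2 has each task commit its output
// directly to the final location, avoiding the serial job-level rename pass
// that runs after all tasks finish (at some cost in atomicity on failure).
spark.conf.set(
  "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")

// Repartitioning by the partition column before writing means each task
// writes files for only one partition value, which reduces the total number
// of output files the commit phase has to move and list.
datasetA
  .repartition($"column1")
  .write
  .partitionBy("column1")
  .parquet(outputPath)
```

Note the trade-off: `repartition` adds a shuffle, so it only pays off when the commit phase (many small files) dominates the job time, as it appears to here.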