Thank you, liu. Could you please explain what you mean by enabling the Spark
fault tolerance mechanism?
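
Just to make sure I am following - do you mean explicit RDD checkpointing,
roughly like the sketch below? (This is only my guess at what you mean; the
path and data here are illustrative placeholders, not from my actual job.)

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("checkpoint-sketch").getOrCreate()
val sc = spark.sparkContext

// Illustrative checkpoint directory, not the one from my job
sc.setCheckpointDir("/tmp/checkpoints")

val rdd = sc.parallelize(1 to 100)
rdd.checkpoint()   // mark the RDD for checkpointing
rdd.count()        // the action runs the main job; the checkpoint data is
                   // then written by a separate job afterwards
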
I observed that after all tasks finish, Spark works on concatenating the
same partitions from all tasks on the file system, e.g.,
task1 - partition1, partition2, partition3
task2 - partition1, partition2, partition3

Then after task1 and task2 finish, Spark concatenates partition1 from task1
and task2 to create the final partition1. This takes longer when we have a
large number of files. I am not sure if there is a way to tell Spark not to
concatenate partitions from each task.
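
In case the time is going into this commit/move step, one thing I am thinking
of trying is Hadoop's FileOutputCommitter algorithm version 2, which commits
each task's output directly instead of doing a serial pass at job commit.
This is only a hedged sketch - whether it actually removes the slow phase in
my job is an assumption I have not verified, and inputPath/outputPath below
are just placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partitioned-write")
  // Hadoop setting picked up by the FileOutputCommitter used for the write
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .getOrCreate()

val inputPath = "/path/to/input"    // placeholder
val outputPath = "/path/to/output"  // placeholder

val datasetA = spark.read.parquet(inputPath)
datasetA.write.partitionBy("column1").parquet(outputPath)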

Thanks
Swapnil


On Tue, Mar 7, 2017 at 10:47 PM, cht liu <liucht...@gmail.com> wrote:

> Did you enable the Spark fault tolerance mechanism? If an RDD is
> checkpointed, then at the end of the job Spark will start a separate job to
> write the checkpoint data to the file system, persisting it for high
> availability.
>
> 2017-03-08 2:45 GMT+08:00 Swapnil Shinde <swapnilushi...@gmail.com>:
>
>> Hello all
>>    I have a Spark job that reads Parquet data and partitions it based on
>> one of the columns. I made sure the partitions are equally distributed and
>> not skewed. My code looks like this -
>>
>> datasetA.write.partitionBy("column1").parquet(outputPath)
>>
>> Execution plan -
>> [image: Inline image 1]
>>
>> All tasks (~12,000) finish in 30-35 mins, but it takes another 40-45 mins
>> to close the application. I am not sure what Spark is doing after all tasks
>> have been processed successfully.
>> I checked the thread dump (using the UI executor tab) on a few executors
>> but couldn't find anything major. Overall, a few shuffle-client threads are
>> "RUNNABLE" and a few dispatcher-* threads are "WAITING".
>>
>> Please let me know what Spark is doing at this stage (after all tasks have
>> finished) and whether there is any way I can optimize it.
>>
>> Thanks
>> Swapnil
>>
>>
>>
>
