Hi,

You are definitely not using Spark 2.1 the way it is meant to be used.

Try using SparkSession and follow the guidelines; this issue was
specifically addressed as part of the Spark 2.1 release.
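
For reference, a minimal sketch of the SparkSession entry point introduced in Spark 2.x. The app name and paths below are placeholders, not values from this thread:

```scala
import org.apache.spark.sql.SparkSession

// SparkSession is the single entry point introduced in Spark 2.x,
// replacing the separate SparkContext / SQLContext setup.
val spark = SparkSession.builder()
  .appName("partitioned-write-example") // placeholder name
  .getOrCreate()

// Reads and writes then go through the session:
val datasetA = spark.read.parquet("/path/to/input") // placeholder path
datasetA.write.partitionBy("column1").parquet("/path/to/output")
```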


Regards,
Gourav

On Wed, Mar 8, 2017 at 8:00 PM, Swapnil Shinde <swapnilushi...@gmail.com>
wrote:

> Thank you, Liu. Can you please explain what you mean by enabling Spark's
> fault-tolerance mechanism?
> I observed that after all tasks finish, Spark works on concatenating the
> same partitions from all tasks on the file system, e.g.:
> task1 - partition1, partition2, partition3
> task2 - partition1, partition2, partition3
>
> Then, after task1 and task2 finish, Spark concatenates partition1 from
> task1 and task2 to create the final partition1. This takes longer when we
> have a large number of files. I am not sure if there is a way to tell
> Spark not to concatenate partitions from each task.
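
[Editorial note: the behavior described above matches the job-commit phase of Hadoop's FileOutputCommitter, in which each task writes to a temporary directory and the output is then moved into the final partition directories at the end. On some setups the v2 commit algorithm shortens this phase. The snippet below is a hedged spark-defaults.conf sketch, not a setting taken from this thread, and note that v2 weakens atomicity guarantees if the job fails mid-commit.]

```
# spark-defaults.conf (sketch; verify against your Spark/Hadoop versions)
# Use the v2 output-committer algorithm: task output is moved to its
# final location at task commit rather than in a single job-commit pass.
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version  2
```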
>
> Thanks
> Swapnil
>
>
> On Tue, Mar 7, 2017 at 10:47 PM, cht liu <liucht...@gmail.com> wrote:
>
>> Did you enable Spark's fault-tolerance (checkpointing) mechanism? If an
>> RDD is checkpointed, Spark starts a separate job at the end of the main
>> job to write the checkpoint data to the file system for high
>> availability before persisting it.
>>
>> 2017-03-08 2:45 GMT+08:00 Swapnil Shinde <swapnilushi...@gmail.com>:
>>
>>> Hello all
>>>    I have a Spark job that reads Parquet data and partitions it based on
>>> one of the columns. I made sure the partitions are equally distributed
>>> and not skewed. My code looks like this:
>>>
>>> datasetA.write.partitionBy("column1").parquet(outputPath)
>>>
>>> Execution plan: [inline image omitted]
>>>
>>> All tasks (~12,000) finish in 30-35 minutes, but it takes another 40-45
>>> minutes to close the application. I am not sure what Spark is doing
>>> after all tasks have completed successfully.
>>> I checked the thread dump (using the UI executor tab) on a few executors
>>> but couldn't find anything major. Overall, a few shuffle-client threads
>>> are "RUNNABLE" and a few dispatched-* threads are "WAITING".
>>>
>>> Please let me know what Spark is doing at this stage (after all tasks
>>> have finished) and whether there is any way I can optimize it.
>>>
>>> Thanks
>>> Swapnil
>>>
>>>
>>>
>>
>
