Thank you for the suggestion.

I tried the df.coalesce(1000).write.parquet() and yes, the parquet file
number drops to 1000, but the parition of parquet stills is like 5000+.
When I read the parquet and do a count, it still has the 5000+ tasks.

So I guess I need to do a repartition here to drop task number?  But
repartition never works for me, always failed due to out of memory.

And regarding the large number task delay problem, I found a similar
problem: https://issues.apache.org/jira/browse/SPARK-7447.

I am unionALL like 10 parquet folder, with totally 70K+ parquet files,
generating 70k+ taskes. It took around 5-8 mins before all tasks start just
like the ticket abover.

It also happens if I do a partition discovery with base path.    Is there
any schema inference or checking doing, which causes the slowness?

Thanks,
Gavin



On Mon, Jan 11, 2016 at 1:21 PM, Shixiong(Ryan) Zhu <[email protected]
> wrote:

> Could you use "coalesce" to reduce the number of partitions?
>
>
> Shixiong Zhu
>
>
> On Mon, Jan 11, 2016 at 12:21 AM, Gavin Yue <[email protected]>
> wrote:
>
>> Here is more info.
>>
>> The job stuck at:
>> INFO cluster.YarnScheduler: Adding task set 1.0 with 79212 tasks
>>
>> Then got the error:
>> Caused by: org.apache.spark.rpc.RpcTimeoutException: Futures timed out
>> after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout
>>
>> So I increased spark.network.timeout from 120s to 600s.  It sometimes
>> works.
>>
>> Each task is a parquet file.  I could not repartition due to out of GC
>> problem.
>>
>> Is there any way I could to improve the performance?
>>
>> Thanks,
>> Gavin
>>
>>
>> On Sun, Jan 10, 2016 at 1:51 AM, Gavin Yue <[email protected]>
>> wrote:
>>
>>> Hey,
>>>
>>> I have 10 days data, each day has a parquet directory with over 7000
>>> partitions.
>>> So when I union 10 days and do a count, then it submits over 70K tasks.
>>>
>>> Then the job failed silently with one container exit with code 1.  The
>>> union with like 5, 6 days data is fine.
>>> In the spark-shell, it just hang showing: Yarn scheduler submit 70000+
>>> tasks.
>>>
>>> I am running spark 1.6 over hadoop 2.7.  Is there any setting I could
>>> change to make this work?
>>>
>>> Thanks,
>>> Gavin
>>>
>>>
>>>
>>
>

Reply via email to