Thank you for the suggestion. I tried the df.coalesce(1000).write.parquet() and yes, the parquet file number drops to 1000, but the parition of parquet stills is like 5000+. When I read the parquet and do a count, it still has the 5000+ tasks.
So I guess I need to do a repartition here to drop task number? But repartition never works for me, always failed due to out of memory. And regarding the large number task delay problem, I found a similar problem: https://issues.apache.org/jira/browse/SPARK-7447. I am unionALL like 10 parquet folder, with totally 70K+ parquet files, generating 70k+ taskes. It took around 5-8 mins before all tasks start just like the ticket abover. It also happens if I do a partition discovery with base path. Is there any schema inference or checking doing, which causes the slowness? Thanks, Gavin On Mon, Jan 11, 2016 at 1:21 PM, Shixiong(Ryan) Zhu <[email protected] > wrote: > Could you use "coalesce" to reduce the number of partitions? > > > Shixiong Zhu > > > On Mon, Jan 11, 2016 at 12:21 AM, Gavin Yue <[email protected]> > wrote: > >> Here is more info. >> >> The job stuck at: >> INFO cluster.YarnScheduler: Adding task set 1.0 with 79212 tasks >> >> Then got the error: >> Caused by: org.apache.spark.rpc.RpcTimeoutException: Futures timed out >> after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout >> >> So I increased spark.network.timeout from 120s to 600s. It sometimes >> works. >> >> Each task is a parquet file. I could not repartition due to out of GC >> problem. >> >> Is there any way I could to improve the performance? >> >> Thanks, >> Gavin >> >> >> On Sun, Jan 10, 2016 at 1:51 AM, Gavin Yue <[email protected]> >> wrote: >> >>> Hey, >>> >>> I have 10 days data, each day has a parquet directory with over 7000 >>> partitions. >>> So when I union 10 days and do a count, then it submits over 70K tasks. >>> >>> Then the job failed silently with one container exit with code 1. The >>> union with like 5, 6 days data is fine. >>> In the spark-shell, it just hang showing: Yarn scheduler submit 70000+ >>> tasks. >>> >>> I am running spark 1.6 over hadoop 2.7. Is there any setting I could >>> change to make this work? >>> >>> Thanks, >>> Gavin >>> >>> >>> >> >
