Hi there,

I meet with a “many Active jobs” issue when using direct kafka streaming on 
YARN. (spark 1.5, hadoop 2.6, CDH5.5.1)

The problem happens when kafka has almost NO traffic.

From application UI, I see many ‘active’ jobs are pending for hours. And 
finally the driver “Requesting 4 new executors because tasks are backlogged”.

But, when looking at the driver log of a ‘activity’ job, the log says the job 
is finished. So, why the application UI shows this job is activity like forever?

Thanks!


Here are related log info about one of the ‘activity’ jobs.
There are two stages: a reduceByKey follows a flatmap. The log says both stages 
are finished in ~20ms and the job also finishes in 64 ms.

Got job 6567
Final stage: ResultStage 9851(foreachRDD at
Parents of final stage: List(ShuffleMapStage 9850)
Missing parents: List(ShuffleMapStage 9850)
…
Finished task 0.0 in stage 9850.0 (TID 29551) in 20 ms
Removed TaskSet 9850.0, whose tasks have all completed, from pool
ShuffleMapStage 9850 (flatMap at OpaTransLogAnalyzeWithShuffle.scala:83) 
finished in 0.022 s
…
Submitting ResultStage 9851 (ShuffledRDD[16419] at reduceByKey at 
OpaTransLogAnalyzeWithShuffle.scala:83), which is now runnable
…
ResultStage 9851 (foreachRDD at OpaTransLogAnalyzeWithShuffle.scala:84) 
finished in 0.023 s
Job 6567 finished: foreachRDD at OpaTransLogAnalyzeWithShuffle.scala:84, took 
0.064372 s
Finished job streaming job 1468592373000 ms.1 from job set of time 
1468592373000 ms

Wei Lu

Reply via email to