Hi Jeroen,

Can I get a few pieces of additional information please?

Which other applications have you enabled in the EMR cluster (Hive, Flume,
Livy, etc.)?
Are you using a SparkSession? If so, is your application running in cluster
mode or client mode? (See the sketch after these questions.)
Have you read the EC2 service level agreement?
Is your cluster in an auto scaling group?
Are you scheduling your job by adding a new step to the EMR cluster, or is
it the same job triggered every time by some background process?
Since EMR clusters are meant to be ephemeral, have you tried creating a new
cluster and running your job there?
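
If it helps to answer the cluster vs client mode question, here is a
minimal, illustrative sketch (hypothetical class name, not your actual job)
that prints how the application was submitted; spark.submit.deployMode is a
standard Spark property set by spark-submit:

----------------------------------------
import org.apache.spark.sql.SparkSession;

public class DeployModeCheck {
    public static void main(String[] args) {
        // Reuses the session configuration provided by spark-submit on EMR.
        SparkSession spark = SparkSession.builder()
                .appName("DeployModeCheck")
                .getOrCreate();
        // Prints "cluster" or "client" depending on --deploy-mode;
        // falls back to "client" if the property is not set.
        System.out.println("deployMode = "
                + spark.conf().get("spark.submit.deployMode", "client"));
        spark.stop();
    }
}
----------------------------------------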


Regards,
Gourav Sengupta

On Thu, Dec 28, 2017 at 4:06 PM, Jeroen Miller <bluedasya...@gmail.com>
wrote:

> Dear Sparkers,
>
> Once again in times of desperation, I leave what remains of my mental
> sanity to this wise and knowledgeable community.
>
> I have a Spark job (on EMR 5.8.0) which had been running daily for months,
> if not the whole year, with absolutely no supervision. This changed all of
> a sudden for reasons I do not understand.
>
> The volume of data processed daily has been slowly increasing over the
> past year but has been stable for the last couple of months. Since I'm
> only processing the past 8 days' worth of data, I do not think that
> increased data volume is to blame here. Yes, I did check the volume of
> data for the past few days.
>
> Here is a short description of the issue.
>
> - The Spark job starts normally and proceeds successfully through the
> first few stages.
> - Once we reach the dreaded stage, all tasks complete successfully
> (they typically take no more than 1 minute each), except for the /very/
> first one (task 0.0), which never finishes.
>
> Here is what the log looks like (simplified for readability):
>
> ----------------------------------------
> INFO TaskSetManager: Finished task 243.0 in stage 4.0 (TID 929) in 49412
> ms on ... (executor 12) (254/256)
> INFO TaskSetManager: Finished task 255.0 in stage 4.0 (TID 941) in 48394
> ms on ... (executor 7) (255/256)
> INFO ExecutorAllocationManager: Request to remove executorIds: 14
> INFO YarnClusterSchedulerBackend: Requesting to kill executor(s) 14
> INFO YarnClusterSchedulerBackend: Actual list of executor(s) to be killed
> is 14
> INFO YarnAllocator: Driver requested a total number of 0 executor(s).
> ----------------------------------------
>
> Why is that? There is still a task waiting to be completed, right? Isn't
> an executor needed for that?
>
> Afterwards, all executors are getting killed (dynamic allocation is turned
> on):
>
> ----------------------------------------
> INFO ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 14.
> INFO ExecutorAllocationManager: Removing executor 14 because it has been
> idle for 60 seconds (new desired total will be 5)
>     .
>     .
>     .
> INFO ExecutorAllocationManager: Request to remove executorIds: 7
> INFO YarnClusterSchedulerBackend: Requesting to kill executor(s) 7
> INFO YarnClusterSchedulerBackend: Actual list of executor(s) to be killed
> is 7
> INFO ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 7.
> INFO ExecutorAllocationManager: Removing executor 7 because it has been
> idle for 60 seconds (new desired total will be 1)
> INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 7.
> INFO DAGScheduler: Executor lost: 7 (epoch 4)
> INFO BlockManagerMasterEndpoint: Trying to remove executor 7 from
> BlockManagerMaster.
> INFO YarnClusterScheduler: Executor 7 on ... killed by driver.
> INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(7,
> ..., 44289, None)
> INFO BlockManagerMaster: Removed 7 successfully in removeExecutor
> INFO ExecutorAllocationManager: Existing executor 7 has been removed (new
> total is 1)
> ----------------------------------------
>
> Then, there's nothing more in the driver's log. Nothing. The cluster then
> runs for hours, with no progress being made and no executors allocated.
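>
> For what it's worth, here is a minimal sketch of the dynamic allocation
> settings that govern this behavior (the property names are standard Spark
> configuration; the class name and values are only illustrative, not what
> my job actually uses):
>
> ----------------------------------------
> import org.apache.spark.SparkConf;
> import org.apache.spark.sql.SparkSession;
>
> public class DynamicAllocationSketch {
>     public static void main(String[] args) {
>         SparkConf conf = new SparkConf()
>                 .set("spark.dynamicAllocation.enabled", "true")
>                 // Executors idle longer than this are released; 60s is
>                 // the default, which matches the "idle for 60 seconds"
>                 // messages above.
>                 .set("spark.dynamicAllocation.executorIdleTimeout", "60s")
>                 // A floor of executors keeps the pool from draining
>                 // while a task is still outstanding.
>                 .set("spark.dynamicAllocation.minExecutors", "2");
>         SparkSession spark = SparkSession.builder()
>                 .config(conf)
>                 .appName("DynamicAllocationSketch")
>                 .getOrCreate();
>         spark.stop();
>     }
> }
> ----------------------------------------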
>
> Here is what I tried:
>
>     - More memory per executor: from 13 GB to 24 GB by increments.
>     - Explicit repartition() on the RDD: from 128 to 256 partitions.
>
> The offending stage used to be a rather innocent-looking keyBy(). After
> adding some repartition() calls, the offending stage became a mapToPair().
> During my last experiments, it turned out that the repartition(256) itself
> is now the culprit.
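>
> For context, the pipeline roughly has this shape (a simplified,
> hypothetical reconstruction, not my actual code; the class name, input
> path, and key extraction are made up):
>
> ----------------------------------------
> import org.apache.spark.api.java.JavaPairRDD;
> import org.apache.spark.api.java.JavaRDD;
> import org.apache.spark.api.java.JavaSparkContext;
> import org.apache.spark.sql.SparkSession;
> import scala.Tuple2;
>
> public class PipelineShape {
>     public static void main(String[] args) {
>         SparkSession spark = SparkSession.builder()
>                 .appName("PipelineShape")
>                 .getOrCreate();
>         JavaSparkContext jsc =
>                 new JavaSparkContext(spark.sparkContext());
>
>         // Hypothetical input location passed on the command line.
>         JavaRDD<String> records = jsc.textFile(args[0]);
>
>         // repartition() forces a full shuffle; the stage boundary it
>         // creates is where task 0.0 hangs.
>         JavaPairRDD<String, String> keyed = records
>                 .repartition(256)
>                 .mapToPair(line -> new Tuple2<>(line.split(",")[0], line));
>
>         System.out.println(keyed.count());
>         spark.stop();
>     }
> }
> ----------------------------------------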
>
> I like Spark, but its mysteries will manage to send me to a mental
> hospital one of these days.
>
> Can anyone shed light on what is going on here, or maybe offer some
> suggestions or pointers to relevant sources of information?
>
> I am completely clueless.
>
> Season's greetings,
>
> Jeroen
>
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
