Most likely that particular executor is stuck in a GC pause. What operation are
you performing? You can try increasing the parallelism if you see that only one
executor is doing the task.
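
Just to illustrate what I mean, here is a rough sketch: GC logging on the
executors plus a higher partition count. The app name and the numbers are only
placeholders, tune them for your own job:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("my-app")  // placeholder; the master comes from spark-submit on YARN
  // Print GC activity to each executor's stderr so long pauses show up in the
  // YARN container logs.
  .set("spark.executor.extraJavaOptions",
    "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
  // More partitions so the work is spread over all executors instead of one.
  .set("spark.default.parallelism", "200")

val sc = new SparkContext(conf)

// Or repartition just the RDD that feeds the slow stage:
// val wider = someRdd.repartition(200)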

Thanks
Best Regards

On Fri, Feb 27, 2015 at 11:39 AM, twinkle sachdeva <
twinkle.sachd...@gmail.com> wrote:

> Hi,
>
> I am running a Spark application on YARN in cluster mode.
> One of my executors appears to be in a hung state for a long time, and is
> finally killed by the driver.
>
> Unlike the other executors, it has not received a StopExecutor message
> from the driver.
>
> Here are the logs at the end of the hung container (C_1):
>
> --------------------------------------------------------------------------------
> 15/02/26 18:17:07 DEBUG storage.BlockManagerSlaveActor: Done removing
> broadcast 36, response is 2
> 15/02/26 18:17:07 DEBUG storage.BlockManagerSlaveActor: Sent response: 2
> to Actor[akka.tcp://sparkDriver@TMO-DN73:37906/temp/$aB]
> 15/02/26 18:17:09 DEBUG ipc.Client: IPC Client (1206963429) connection to
> TMO-GCR70/192.168.162.70:9000 from admin: closed
> 15/02/26 18:17:09 DEBUG ipc.Client: IPC Client (1206963429) connection to
> TMO-GCR70/192.168.162.70:9000 from admin: stopped, remaining connections 0
> 15/02/26 18:17:32 DEBUG hdfs.LeaseRenewer: Lease renewer daemon for []
> with renew id 1 executed
> 15/02/26 18:18:00 DEBUG hdfs.LeaseRenewer: Lease renewer daemon for []
> with renew id 1 expired
> 15/02/26 18:18:00 DEBUG hdfs.LeaseRenewer: Lease renewer daemon for []
> with renew id 1 exited
> 15/02/26 20:33:13 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED
> SIGNAL 15: SIGTERM
>
> Note that it produced no logs for more than 2 hours.
>
> Here are the logs at the end of a normal container (C_2):
>
> ------------------------------------------------------------------------------------
> 15/02/26 20:33:09 DEBUG storage.BlockManagerSlaveActor: Sent response: 2
> to Actor[akka.tcp://sparkDriver@TMO-DN73:37906/temp/$D+b]
> 15/02/26 20:33:10 DEBUG executor.CoarseGrainedExecutorBackend: [actor]
> received message StopExecutor from Actor[akka.tcp://sparkDriver@TMO-DN73
> :37906/user/CoarseGrainedScheduler#160899257]
> 15/02/26 20:33:10 INFO executor.CoarseGrainedExecutorBackend: Driver
> commanded a shutdown
> 15/02/26 20:33:10 INFO storage.MemoryStore: MemoryStore cleared
> 15/02/26 20:33:10 INFO storage.BlockManager: BlockManager stopped
> 15/02/26 20:33:10 DEBUG executor.CoarseGrainedExecutorBackend: [actor] 
> *handled
> message (181.499835 ms) StopExecutor* from
> Actor[akka.tcp://sparkDriver@TMO-DN73
> :37906/user/CoarseGrainedScheduler#160899257]
> 15/02/26 20:33:10 INFO remote.RemoteActorRefProvider$RemotingTerminator:
> Shutting down remote daemon.
> 15/02/26 20:33:10 INFO remote.RemoteActorRefProvider$RemotingTerminator:
> Remote daemon shut down; proceeding with flushing remote transports.
> 15/02/26 20:33:10 INFO remote.RemoteActorRefProvider$RemotingTerminator:
> Remoting shut down.
> 15/02/26 20:33:10 DEBUG ipc.Client: stopping client from cache:
> org.apache.hadoop.ipc.Client@76a68bd4
> 15/02/26 20:33:10 DEBUG ipc.Client: stopping client from cache:
> org.apache.hadoop.ipc.Client@76a68bd4
> 15/02/26 20:33:10 DEBUG ipc.Client: removing client from cache:
> org.apache.hadoop.ipc.Client@76a68bd4
> 15/02/26 20:33:10 DEBUG ipc.Client: stopping actual client because no more
> references remain: org.apache.hadoop.ipc.Client@76a68bd4
> 15/02/26 20:33:10 DEBUG ipc.Client: Stopping client
> 15/02/26 20:33:10 DEBUG storage.DiskBlockManager: Shutdown hook called
> 15/02/26 20:33:10 DEBUG util.Utils: Shutdown hook called
>
> On the driver side, I can see logs for heartbeat messages from
> C_1 until 20:05:00:
>
> ------------------------------------------------------------------------------------------
> 15/02/26 20:05:00 DEBUG spark.HeartbeatReceiver: [actor] received message
> Heartbeat(7,[Lscala.Tuple2;@151e5ce6,BlockManagerId(7, TMO-DN73, 34106))
> from Actor[akka.tcp://sparkExecutor@TMO-DN73:43671/temp/$fn]
>
> After this, the driver continues to receive heartbeats from the other
> executors, but not from this one. Here is the message responsible for its SIGTERM:
>
>
> ------------------------------------------------------------------------------------------------------------
>
> 15/02/26 20:06:20 WARN storage.BlockManagerMasterActor: Removing
> BlockManager BlockManagerId(7, TMO-DN73, 34106) with no recent heart beats:
> 80515ms exceeds 45000ms
>
>
> I am using Spark 1.2.1.
>
> Any pointers?
>
>
> Thanks,
>
> Twinkle
>
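One more note on that last warning: the 45000 ms threshold is configurable -- if I
remember correctly it is spark.storage.blockManagerSlaveTimeoutMs -- so raising it
together with spark.executor.heartbeatInterval can keep the driver from killing the
executor while you investigate. That is only a stop-gap for debugging, not a fix for
a long GC pause. A sketch, with example values:

import org.apache.spark.SparkConf

// Set these before the SparkContext is created; the numbers are just examples.
val conf = new SparkConf()
  .set("spark.executor.heartbeatInterval", "30000")            // executor -> driver heartbeat, in ms
  .set("spark.storage.blockManagerSlaveTimeoutMs", "300000")   // how long the driver waits before removing the BlockManager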
