Most likely that particular executor is stuck in a GC pause. What operation are you performing? If you see only one executor doing the work, try increasing the parallelism.
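To confirm the GC theory and to spread the load, something along these lines could work (a minimal sketch in Spark 1.2 style; the app name, input path, and partition count are placeholders, not taken from your job):

    import org.apache.spark.{SparkConf, SparkContext}

    // Turn on GC logging on the executors so long pauses show up in their
    // stderr, and raise the default parallelism so shuffles produce more,
    // smaller tasks.
    val conf = new SparkConf()
      .setAppName("gc-debug")                      // placeholder name
      .set("spark.executor.extraJavaOptions",
           "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
      .set("spark.default.parallelism", "200")     // illustrative value
    val sc = new SparkContext(conf)

    // Repartitioning an existing RDD also spreads the work across executors;
    // the input path and the count of 200 are hypothetical.
    val spread = sc.textFile("hdfs:///path/to/input").repartition(200)

If only one executor stays busy, the data is usually in too few partitions, or is skewed onto one key.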
Thanks
Best Regards

On Fri, Feb 27, 2015 at 11:39 AM, twinkle sachdeva <twinkle.sachd...@gmail.com> wrote:

> Hi,
>
> I am running a Spark application on YARN in cluster mode. One of my
> executors appears to hang for a long time and is finally killed by the
> driver.
>
> Unlike the other executors, it has not received a StopExecutor message
> from the driver.
>
> Here are the logs at the end of this container (C_1):
>
> --------------------------------------------------------------------------------
> 15/02/26 18:17:07 DEBUG storage.BlockManagerSlaveActor: Done removing broadcast 36, response is 2
> 15/02/26 18:17:07 DEBUG storage.BlockManagerSlaveActor: Sent response: 2 to Actor[akka.tcp://sparkDriver@TMO-DN73:37906/temp/$aB]
> 15/02/26 18:17:09 DEBUG ipc.Client: IPC Client (1206963429) connection to TMO-GCR70/192.168.162.70:9000 from admin: closed
> 15/02/26 18:17:09 DEBUG ipc.Client: IPC Client (1206963429) connection to TMO-GCR70/192.168.162.70:9000 from admin: stopped, remaining connections 0
> 15/02/26 18:17:32 DEBUG hdfs.LeaseRenewer: Lease renewer daemon for [] with renew id 1 executed
> 15/02/26 18:18:00 DEBUG hdfs.LeaseRenewer: Lease renewer daemon for [] with renew id 1 expired
> 15/02/26 18:18:00 DEBUG hdfs.LeaseRenewer: Lease renewer daemon for [] with renew id 1 exited
> 15/02/26 20:33:13 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
>
> NOTE that it has no logs for more than 2 hrs.
>
> Here are the logs at the end of a normal container (C_2):
>
> ------------------------------------------------------------------------------------
> 15/02/26 20:33:09 DEBUG storage.BlockManagerSlaveActor: Sent response: 2 to Actor[akka.tcp://sparkDriver@TMO-DN73:37906/temp/$D+b]
> 15/02/26 20:33:10 DEBUG executor.CoarseGrainedExecutorBackend: [actor] received message StopExecutor from Actor[akka.tcp://sparkDriver@TMO-DN73:37906/user/CoarseGrainedScheduler#160899257]
> 15/02/26 20:33:10 INFO executor.CoarseGrainedExecutorBackend: Driver commanded a shutdown
> 15/02/26 20:33:10 INFO storage.MemoryStore: MemoryStore cleared
> 15/02/26 20:33:10 INFO storage.BlockManager: BlockManager stopped
> 15/02/26 20:33:10 DEBUG executor.CoarseGrainedExecutorBackend: [actor] *handled message (181.499835 ms) StopExecutor* from Actor[akka.tcp://sparkDriver@TMO-DN73:37906/user/CoarseGrainedScheduler#160899257]
> 15/02/26 20:33:10 INFO remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
> 15/02/26 20:33:10 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
> 15/02/26 20:33:10 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
> 15/02/26 20:33:10 DEBUG ipc.Client: stopping client from cache: org.apache.hadoop.ipc.Client@76a68bd4
> 15/02/26 20:33:10 DEBUG ipc.Client: stopping client from cache: org.apache.hadoop.ipc.Client@76a68bd4
> 15/02/26 20:33:10 DEBUG ipc.Client: removing client from cache: org.apache.hadoop.ipc.Client@76a68bd4
> 15/02/26 20:33:10 DEBUG ipc.Client: stopping actual client because no more references remain: org.apache.hadoop.ipc.Client@76a68bd4
> 15/02/26 20:33:10 DEBUG ipc.Client: Stopping client
> 15/02/26 20:33:10 DEBUG storage.DiskBlockManager: Shutdown hook called
> 15/02/26 20:33:10 DEBUG util.Utils: Shutdown hook called
>
> On the driver side, I can see logs related to heartbeat messages from C_1
> until 20:05:00:
>
> ------------------------------------------------------------------------------------------
> 15/02/26 20:05:00 DEBUG spark.HeartbeatReceiver: [actor] received message Heartbeat(7,[Lscala.Tuple2;@151e5ce6,BlockManagerId(7, TMO-DN73, 34106)) from Actor[akka.tcp://sparkExecutor@TMO-DN73:43671/temp/$fn]
>
> After this, the driver continues to receive heartbeats from the other
> executors, but not from this one, and the following message is responsible
> for its SIGTERM:
>
> ------------------------------------------------------------------------------------------------------------
> 15/02/26 20:06:20 WARN storage.BlockManagerMasterActor: Removing BlockManager BlockManagerId(7, TMO-DN73, 34106) with no recent heart beats: 80515ms exceeds 45000ms
>
> I am using Spark 1.2.1.
>
> Any pointers?
>
> Thanks,
> Twinkle
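One more note on that WARN line: the 45000ms is the block manager timeout on the driver, so a long enough pause on the executor makes the driver remove it, after which the container gets the SIGTERM. If the executor is healthy but pausing, you can buy it some slack while you hunt down the actual stall. A sketch, assuming the Spark 1.2-era property name spark.storage.blockManagerSlaveTimeoutMs (please verify it against your version's docs):

    import org.apache.spark.SparkConf

    // Illustrative only: give slow-heartbeating executors more time before
    // the driver declares them dead. 300000 ms (5 min) is a placeholder.
    val conf = new SparkConf()
      .set("spark.storage.blockManagerSlaveTimeoutMs", "300000")

This treats the symptom rather than the cause, though: a gap of more than two hours in the executor log points to something more than an ordinary GC pause, so finding what stalls that executor is the real fix.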