Not quite sure, but you can try increasing spark.akka.threads; it may well be a YARN-related issue.
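For example, a minimal sketch (Spark 1.2.x; both properties are real, but the values are guesses you would need to tune for your cluster):

    import org.apache.spark.{SparkConf, SparkContext}

    // Raise the Akka actor thread count (spark.akka.threads, default 4) and
    // the Akka ask timeout (spark.akka.timeout, default 100 seconds).
    // The values below are examples only, not recommendations.
    val conf = new SparkConf()
      .setAppName("MyApp") // hypothetical application name
      .set("spark.akka.threads", "8")
      .set("spark.akka.timeout", "300")
    val sc = new SparkContext(conf)

The same properties can also be passed at submit time, e.g. --conf spark.akka.threads=8.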
Thanks
Best Regards

On Tue, Mar 3, 2015 at 3:38 PM, twinkle sachdeva <twinkle.sachd...@gmail.com> wrote:

> Hi,
>
> Operations are not very extensive, as this scenario is not always reproducible.
> One of the executors starts behaving in this manner. For this particular application, we are using 8 cores per executor, and in practice 4 executors are launched on one machine.
>
> This machine has a good configuration with respect to the number of cores.
>
> Somehow, it seems to me to be some Akka communication issue. If I try to take a thread dump of the executor once it appears to be in trouble, the request times out.
>
> Can it be something related to *spark.akka.threads*?
>
> On Fri, Feb 27, 2015 at 3:55 PM, Akhil Das <ak...@sigmoidanalytics.com> wrote:
>
>> Most likely that particular executor is stuck on a GC pause. What operation are you performing? You can try increasing the parallelism if you see that only one executor is doing the task.
>>
>> Thanks
>> Best Regards
>>
>> On Fri, Feb 27, 2015 at 11:39 AM, twinkle sachdeva <twinkle.sachd...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am running a Spark application on YARN in cluster mode.
>>> One of my executors appears to be in a hung state for a long time, and finally gets killed by the driver.
>>>
>>> Unlike the other executors, it has not received a StopExecutor message from the driver.
>>>
>>> Here are the logs at the end of this container (C_1):
>>>
>>> --------------------------------------------------------------------------------
>>> 15/02/26 18:17:07 DEBUG storage.BlockManagerSlaveActor: Done removing broadcast 36, response is 2
>>> 15/02/26 18:17:07 DEBUG storage.BlockManagerSlaveActor: Sent response: 2 to Actor[akka.tcp://sparkDriver@TMO-DN73:37906/temp/$aB]
>>> 15/02/26 18:17:09 DEBUG ipc.Client: IPC Client (1206963429) connection to TMO-GCR70/192.168.162.70:9000 from admin: closed
>>> 15/02/26 18:17:09 DEBUG ipc.Client: IPC Client (1206963429) connection to TMO-GCR70/192.168.162.70:9000 from admin: stopped, remaining connections 0
>>> 15/02/26 18:17:32 DEBUG hdfs.LeaseRenewer: Lease renewer daemon for [] with renew id 1 executed
>>> 15/02/26 18:18:00 DEBUG hdfs.LeaseRenewer: Lease renewer daemon for [] with renew id 1 expired
>>> 15/02/26 18:18:00 DEBUG hdfs.LeaseRenewer: Lease renewer daemon for [] with renew id 1 exited
>>> 15/02/26 20:33:13 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
>>>
>>> NOTE that it has no logs for more than 2 hrs.
>>>
>>> Here are the logs at the end of a normal container (C_2):
>>>
>>> ------------------------------------------------------------------------------------
>>> 15/02/26 20:33:09 DEBUG storage.BlockManagerSlaveActor: Sent response: 2 to Actor[akka.tcp://sparkDriver@TMO-DN73:37906/temp/$D+b]
>>> 15/02/26 20:33:10 DEBUG executor.CoarseGrainedExecutorBackend: [actor] received message StopExecutor from Actor[akka.tcp://sparkDriver@TMO-DN73:37906/user/CoarseGrainedScheduler#160899257]
>>> 15/02/26 20:33:10 INFO executor.CoarseGrainedExecutorBackend: Driver commanded a shutdown
>>> 15/02/26 20:33:10 INFO storage.MemoryStore: MemoryStore cleared
>>> 15/02/26 20:33:10 INFO storage.BlockManager: BlockManager stopped
>>> 15/02/26 20:33:10 DEBUG executor.CoarseGrainedExecutorBackend: [actor] *handled message (181.499835 ms) StopExecutor* from Actor[akka.tcp://sparkDriver@TMO-DN73:37906/user/CoarseGrainedScheduler#160899257]
>>> 15/02/26 20:33:10 INFO remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
>>> 15/02/26 20:33:10 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
>>> 15/02/26 20:33:10 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
>>> 15/02/26 20:33:10 DEBUG ipc.Client: stopping client from cache: org.apache.hadoop.ipc.Client@76a68bd4
>>> 15/02/26 20:33:10 DEBUG ipc.Client: stopping client from cache: org.apache.hadoop.ipc.Client@76a68bd4
>>> 15/02/26 20:33:10 DEBUG ipc.Client: removing client from cache: org.apache.hadoop.ipc.Client@76a68bd4
>>> 15/02/26 20:33:10 DEBUG ipc.Client: stopping actual client because no more references remain: org.apache.hadoop.ipc.Client@76a68bd4
>>> 15/02/26 20:33:10 DEBUG ipc.Client: Stopping client
>>> 15/02/26 20:33:10 DEBUG storage.DiskBlockManager: Shutdown hook called
>>> 15/02/26 20:33:10 DEBUG util.Utils: Shutdown hook called
>>>
>>> On the driver side, I can see logs related to heartbeat messages from C_1 until 20:05:00:
>>>
>>> ------------------------------------------------------------------------------------------
>>> 15/02/26 20:05:00 DEBUG spark.HeartbeatReceiver: [actor] received message Heartbeat(7,[Lscala.Tuple2;@151e5ce6,BlockManagerId(7, TMO-DN73, 34106)) from Actor[akka.tcp://sparkExecutor@TMO-DN73:43671/temp/$fn]
>>>
>>> After this, the driver continues to receive heartbeats from the other executors, but not from this one; the following message is responsible for its SIGTERM:
>>>
>>> ------------------------------------------------------------------------------------------------------------
>>> 15/02/26 20:06:20 WARN storage.BlockManagerMasterActor: Removing BlockManager BlockManagerId(7, TMO-DN73, 34106) with no recent heart beats: 80515ms exceeds 45000ms
>>>
>>> I am using Spark 1.2.1.
>>>
>>> Any pointers?
>>>
>>> Thanks,
>>> Twinkle
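One more note on that last warning: the 45000ms there should be the default of spark.storage.blockManagerSlaveTimeoutMs (at least in 1.2, as far as I can tell). If the executor is actually alive but frozen in a long GC pause, widening that window, along with the earlier parallelism suggestion, may keep the driver from killing it. A rough sketch; both values are placeholders only:

    import org.apache.spark.{SparkConf, SparkContext}

    // Give a GC-paused executor more than 45 s before the driver removes
    // its BlockManager. 120000 ms is an arbitrary example value.
    val conf = new SparkConf()
      .set("spark.storage.blockManagerSlaveTimeoutMs", "120000")
      // More partitions spread the work across executors, per the earlier
      // suggestion; 64 is a placeholder, size it to your data and cores.
      .set("spark.default.parallelism", "64")
    val sc = new SparkContext(conf)

This only buys time, of course; if the executor really hangs for over 2 hrs as in the C_1 logs, a thread dump (when it succeeds) is still the way to find the root cause.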