Re: One of the executor not getting StopExecutor message

2015-03-03 Thread twinkle sachdeva
Hi,

Operations are not very extensive, and this scenario is not always
reproducible.
One of the executors starts behaving in this manner. For this particular
application, we are using 8 cores per executor, and in practice 4
executors are launched on one machine.

This machine has a good configuration with respect to the number of cores.
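
For reference, a minimal sketch (not from this thread) of how such a layout
could be requested, assuming Spark 1.x on YARN honours spark.executor.cores
and spark.executor.instances; the application name and numbers below are
illustrative only:

  import org.apache.spark.SparkConf

  // Hypothetical layout sketch: 8 cores per executor, with enough executor
  // instances that roughly 4 of them land on each node.
  val conf = new SparkConf()
    .setAppName("executor-layout-example")   // illustrative name
    .set("spark.executor.cores", "8")        // cores per executor
    .set("spark.executor.instances", "16")   // e.g. ~4 per node on a 4-node cluster
    .set("spark.executor.memory", "8g")      // illustrative value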

Somehow, to me it seems to be some Akka communication issue. If I try to
take a thread dump of the executor once it appears to be in trouble, the
request times out.

Can it be something related to *spark.akka.threads*?



On Fri, Feb 27, 2015 at 3:55 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 Most likely that particular executor is stuck in a GC pause. What operation
 are you performing? You can try increasing the parallelism if you see that
 only one executor is doing the task.

 Thanks
 Best Regards

 On Fri, Feb 27, 2015 at 11:39 AM, twinkle sachdeva 
 twinkle.sachd...@gmail.com wrote:

 Hi,

 I am running a Spark application on YARN in cluster mode.
 One of my executors appears to be in a hung state for a long time, and
 finally gets killed by the driver.

 Unlike the other executors, it has not received the StopExecutor message
 from the driver.

 Here are the logs at the end of this container (C_1):

 
 15/02/26 18:17:07 DEBUG storage.BlockManagerSlaveActor: Done removing
 broadcast 36, response is 2
 15/02/26 18:17:07 DEBUG storage.BlockManagerSlaveActor: Sent response: 2
 to Actor[akka.tcp://sparkDriver@TMO-DN73:37906/temp/$aB]
 15/02/26 18:17:09 DEBUG ipc.Client: IPC Client (1206963429) connection to
 TMO-GCR70/192.168.162.70:9000 from admin: closed
 15/02/26 18:17:09 DEBUG ipc.Client: IPC Client (1206963429) connection to
 TMO-GCR70/192.168.162.70:9000 from admin: stopped, remaining connections
 0
 15/02/26 18:17:32 DEBUG hdfs.LeaseRenewer: Lease renewer daemon for []
 with renew id 1 executed
 15/02/26 18:18:00 DEBUG hdfs.LeaseRenewer: Lease renewer daemon for []
 with renew id 1 expired
 15/02/26 18:18:00 DEBUG hdfs.LeaseRenewer: Lease renewer daemon for []
 with renew id 1 exited
 15/02/26 20:33:13 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED
 SIGNAL 15: SIGTERM

 NOTE that it produced no logs for more than 2 hours.

 Here are the logs at the end of a normal container (C_2):

 
 15/02/26 20:33:09 DEBUG storage.BlockManagerSlaveActor: Sent response: 2
 to Actor[akka.tcp://sparkDriver@TMO-DN73:37906/temp/$D+b]
 15/02/26 20:33:10 DEBUG executor.CoarseGrainedExecutorBackend: [actor]
 received message StopExecutor from Actor[akka.tcp://sparkDriver@TMO-DN73
 :37906/user/CoarseGrainedScheduler#160899257]
 15/02/26 20:33:10 INFO executor.CoarseGrainedExecutorBackend: Driver
 commanded a shutdown
 15/02/26 20:33:10 INFO storage.MemoryStore: MemoryStore cleared
 15/02/26 20:33:10 INFO storage.BlockManager: BlockManager stopped
 15/02/26 20:33:10 DEBUG executor.CoarseGrainedExecutorBackend: [actor]
 *handled message (181.499835 ms) StopExecutor* from
 Actor[akka.tcp://sparkDriver@TMO-DN73:37906/user/CoarseGrainedScheduler#160899257]
 15/02/26 20:33:10 INFO remote.RemoteActorRefProvider$RemotingTerminator:
 Shutting down remote daemon.
 15/02/26 20:33:10 INFO remote.RemoteActorRefProvider$RemotingTerminator:
 Remote daemon shut down; proceeding with flushing remote transports.
 15/02/26 20:33:10 INFO remote.RemoteActorRefProvider$RemotingTerminator:
 Remoting shut down.
 15/02/26 20:33:10 DEBUG ipc.Client: stopping client from cache:
 org.apache.hadoop.ipc.Client@76a68bd4
 15/02/26 20:33:10 DEBUG ipc.Client: stopping client from cache:
 org.apache.hadoop.ipc.Client@76a68bd4
 15/02/26 20:33:10 DEBUG ipc.Client: removing client from cache:
 org.apache.hadoop.ipc.Client@76a68bd4
 15/02/26 20:33:10 DEBUG ipc.Client: stopping actual client because no
 more references remain: org.apache.hadoop.ipc.Client@76a68bd4
 15/02/26 20:33:10 DEBUG ipc.Client: Stopping client
 15/02/26 20:33:10 DEBUG storage.DiskBlockManager: Shutdown hook called
 15/02/26 20:33:10 DEBUG util.Utils: Shutdown hook called

 On the driver side, I can see the logs related to heartbeat messages
 from C_1 until 20:05:00:

 --
 15/02/26 20:05:00 DEBUG spark.HeartbeatReceiver: [actor] received message
 Heartbeat(7,[Lscala.Tuple2;@151e5ce6,BlockManagerId(7, TMO-DN73, 34106))
 from Actor[akka.tcp://sparkExecutor@TMO-DN73:43671/temp/$fn]

 After this, it continues to receive heartbeats from the other executors
 except this one, and here is the message responsible for its SIGTERM:


 

 15/02/26 20:06:20 WARN storage.BlockManagerMasterActor: Removing
 BlockManager BlockManagerId(7, TMO-DN73, 34106) with no recent heart beats:
 80515ms exceeds 45000ms


 I am using spark 1.2.1.
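
As an aside on the timeout in the warning above: the 45000ms matches the block
manager heartbeat timeout in Spark 1.x. A hedged sketch, assuming the
spark.storage.blockManagerSlaveTimeoutMs and spark.executor.heartbeatInterval
settings apply here; raising the timeout only masks the hang while the root
cause is investigated, it does not fix it:

  import org.apache.spark.SparkConf

  // Sketch only: relax the driver-side heartbeat timeout so a temporarily
  // unresponsive executor is not removed after 45s.
  val conf = new SparkConf()
    .set("spark.storage.blockManagerSlaveTimeoutMs", "300000") // 5 min instead of 45s
    .set("spark.executor.heartbeatInterval", "10000")          // keep the default 10s heartbeat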

Re: One of the executor not getting StopExecutor message

2015-03-03 Thread Akhil Das
Not quite sure, but you can try increasing spark.akka.threads; most likely
it is a YARN-related issue.
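
A minimal sketch of what that could look like, assuming Spark 1.x where
spark.akka.threads (default 4) sizes the Akka actor thread pool; the values
below are illustrative, not a recommendation:

  import org.apache.spark.SparkConf

  // Sketch: give the actor system more dispatcher threads, which may help
  // when many executors are sending heartbeats and block manager messages
  // to a single driver.
  val conf = new SparkConf()
    .set("spark.akka.threads", "8")    // default is 4
    .set("spark.akka.timeout", "120")  // seconds; default is 100, bumped for illustration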

Thanks
Best Regards

On Tue, Mar 3, 2015 at 3:38 PM, twinkle sachdeva twinkle.sachd...@gmail.com
 wrote:

 Hi,

 Operations are not very extensive, and this scenario is not always
 reproducible.
 One of the executors starts behaving in this manner. For this particular
 application, we are using 8 cores per executor, and in practice 4
 executors are launched on one machine.

 This machine has a good configuration with respect to the number of cores.

 Somehow, to me it seems to be some Akka communication issue. If I try to
 take a thread dump of the executor once it appears to be in trouble, the
 request times out.

 Can it be something related to *spark.akka.threads*?



 On Fri, Feb 27, 2015 at 3:55 PM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 Most likely that particular executor is stuck in a GC pause. What operation
 are you performing? You can try increasing the parallelism if you see that
 only one executor is doing the task.

 Thanks
 Best Regards

 On Fri, Feb 27, 2015 at 11:39 AM, twinkle sachdeva 
 twinkle.sachd...@gmail.com wrote:

 Hi,

 I am running a Spark application on YARN in cluster mode.
 One of my executors appears to be in a hung state for a long time, and
 finally gets killed by the driver.

 Unlike the other executors, it has not received the StopExecutor message
 from the driver.

 Here are the logs at the end of this container (C_1):

 
 15/02/26 18:17:07 DEBUG storage.BlockManagerSlaveActor: Done removing
 broadcast 36, response is 2
 15/02/26 18:17:07 DEBUG storage.BlockManagerSlaveActor: Sent response: 2
 to Actor[akka.tcp://sparkDriver@TMO-DN73:37906/temp/$aB]
 15/02/26 18:17:09 DEBUG ipc.Client: IPC Client (1206963429) connection
 to TMO-GCR70/192.168.162.70:9000 from admin: closed
 15/02/26 18:17:09 DEBUG ipc.Client: IPC Client (1206963429) connection
 to TMO-GCR70/192.168.162.70:9000 from admin: stopped, remaining
 connections 0
 15/02/26 18:17:32 DEBUG hdfs.LeaseRenewer: Lease renewer daemon for []
 with renew id 1 executed
 15/02/26 18:18:00 DEBUG hdfs.LeaseRenewer: Lease renewer daemon for []
 with renew id 1 expired
 15/02/26 18:18:00 DEBUG hdfs.LeaseRenewer: Lease renewer daemon for []
 with renew id 1 exited
 15/02/26 20:33:13 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED
 SIGNAL 15: SIGTERM

 NOTE that it produced no logs for more than 2 hours.

 Here are the logs at the end of a normal container (C_2):

 
 15/02/26 20:33:09 DEBUG storage.BlockManagerSlaveActor: Sent response: 2
 to Actor[akka.tcp://sparkDriver@TMO-DN73:37906/temp/$D+b]
 15/02/26 20:33:10 DEBUG executor.CoarseGrainedExecutorBackend: [actor]
 received message StopExecutor from Actor[akka.tcp://sparkDriver@TMO-DN73
 :37906/user/CoarseGrainedScheduler#160899257]
 15/02/26 20:33:10 INFO executor.CoarseGrainedExecutorBackend: Driver
 commanded a shutdown
 15/02/26 20:33:10 INFO storage.MemoryStore: MemoryStore cleared
 15/02/26 20:33:10 INFO storage.BlockManager: BlockManager stopped
 15/02/26 20:33:10 DEBUG executor.CoarseGrainedExecutorBackend: [actor]
 *handled message (181.499835 ms) StopExecutor* from
 Actor[akka.tcp://sparkDriver@TMO-DN73:37906/user/CoarseGrainedScheduler#160899257]
 15/02/26 20:33:10 INFO remote.RemoteActorRefProvider$RemotingTerminator:
 Shutting down remote daemon.
 15/02/26 20:33:10 INFO remote.RemoteActorRefProvider$RemotingTerminator:
 Remote daemon shut down; proceeding with flushing remote transports.
 15/02/26 20:33:10 INFO remote.RemoteActorRefProvider$RemotingTerminator:
 Remoting shut down.
 15/02/26 20:33:10 DEBUG ipc.Client: stopping client from cache:
 org.apache.hadoop.ipc.Client@76a68bd4
 15/02/26 20:33:10 DEBUG ipc.Client: stopping client from cache:
 org.apache.hadoop.ipc.Client@76a68bd4
 15/02/26 20:33:10 DEBUG ipc.Client: removing client from cache:
 org.apache.hadoop.ipc.Client@76a68bd4
 15/02/26 20:33:10 DEBUG ipc.Client: stopping actual client because no
 more references remain: org.apache.hadoop.ipc.Client@76a68bd4
 15/02/26 20:33:10 DEBUG ipc.Client: Stopping client
 15/02/26 20:33:10 DEBUG storage.DiskBlockManager: Shutdown hook called
 15/02/26 20:33:10 DEBUG util.Utils: Shutdown hook called

 On the driver side, I can see the logs related to heartbeat messages
 from C_1 until 20:05:00:

 --
 15/02/26 20:05:00 DEBUG spark.HeartbeatReceiver: [actor] received
 message Heartbeat(7,[Lscala.Tuple2;@151e5ce6,BlockManagerId(7,
 TMO-DN73, 34106)) from Actor[akka.tcp://sparkExecutor@TMO-DN73
 :43671/temp/$fn]

 After this, it continues to receive heartbeats from the other executors
 except this one, and here is the message responsible for its SIGTERM:


 

Re: One of the executor not getting StopExecutor message

2015-02-27 Thread Akhil Das
Most likely that particular executor is stuck in a GC pause. What operation
are you performing? You can try increasing the parallelism if you see that
only one executor is doing the task.
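
To illustrate the parallelism suggestion, a small sketch (assuming a plain
RDD word-count style job, spark-shell style; the input path and partition
counts are made up):

  import org.apache.spark.{SparkConf, SparkContext}

  // Sketch: raise the default shuffle parallelism and repartition the work
  // so it is spread over many tasks instead of piling onto one executor.
  val conf = new SparkConf()
    .setAppName("parallelism-example")        // illustrative name
    .set("spark.default.parallelism", "64")   // default partition count for shuffles
  val sc = new SparkContext(conf)

  val lines  = sc.textFile("hdfs:///tmp/input")   // hypothetical input path
  val counts = lines
    .flatMap(_.split("\\s+"))
    .map(word => (word, 1))
    .reduceByKey(_ + _, 64)                       // explicit partition count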

Thanks
Best Regards

On Fri, Feb 27, 2015 at 11:39 AM, twinkle sachdeva 
twinkle.sachd...@gmail.com wrote:

 Hi,

 I am running a Spark application on YARN in cluster mode.
 One of my executors appears to be in a hung state for a long time, and
 finally gets killed by the driver.

 Unlike the other executors, it has not received the StopExecutor message
 from the driver.

 Here are the logs at the end of this container (C_1):

 
 15/02/26 18:17:07 DEBUG storage.BlockManagerSlaveActor: Done removing
 broadcast 36, response is 2
 15/02/26 18:17:07 DEBUG storage.BlockManagerSlaveActor: Sent response: 2
 to Actor[akka.tcp://sparkDriver@TMO-DN73:37906/temp/$aB]
 15/02/26 18:17:09 DEBUG ipc.Client: IPC Client (1206963429) connection to
 TMO-GCR70/192.168.162.70:9000 from admin: closed
 15/02/26 18:17:09 DEBUG ipc.Client: IPC Client (1206963429) connection to
 TMO-GCR70/192.168.162.70:9000 from admin: stopped, remaining connections 0
 15/02/26 18:17:32 DEBUG hdfs.LeaseRenewer: Lease renewer daemon for []
 with renew id 1 executed
 15/02/26 18:18:00 DEBUG hdfs.LeaseRenewer: Lease renewer daemon for []
 with renew id 1 expired
 15/02/26 18:18:00 DEBUG hdfs.LeaseRenewer: Lease renewer daemon for []
 with renew id 1 exited
 15/02/26 20:33:13 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED
 SIGNAL 15: SIGTERM

 NOTE that it produced no logs for more than 2 hours.

 Here are the logs at the end of a normal container (C_2):

 
 15/02/26 20:33:09 DEBUG storage.BlockManagerSlaveActor: Sent response: 2
 to Actor[akka.tcp://sparkDriver@TMO-DN73:37906/temp/$D+b]
 15/02/26 20:33:10 DEBUG executor.CoarseGrainedExecutorBackend: [actor]
 received message StopExecutor from Actor[akka.tcp://sparkDriver@TMO-DN73
 :37906/user/CoarseGrainedScheduler#160899257]
 15/02/26 20:33:10 INFO executor.CoarseGrainedExecutorBackend: Driver
 commanded a shutdown
 15/02/26 20:33:10 INFO storage.MemoryStore: MemoryStore cleared
 15/02/26 20:33:10 INFO storage.BlockManager: BlockManager stopped
 15/02/26 20:33:10 DEBUG executor.CoarseGrainedExecutorBackend: [actor]
 *handled message (181.499835 ms) StopExecutor* from
 Actor[akka.tcp://sparkDriver@TMO-DN73:37906/user/CoarseGrainedScheduler#160899257]
 15/02/26 20:33:10 INFO remote.RemoteActorRefProvider$RemotingTerminator:
 Shutting down remote daemon.
 15/02/26 20:33:10 INFO remote.RemoteActorRefProvider$RemotingTerminator:
 Remote daemon shut down; proceeding with flushing remote transports.
 15/02/26 20:33:10 INFO remote.RemoteActorRefProvider$RemotingTerminator:
 Remoting shut down.
 15/02/26 20:33:10 DEBUG ipc.Client: stopping client from cache:
 org.apache.hadoop.ipc.Client@76a68bd4
 15/02/26 20:33:10 DEBUG ipc.Client: stopping client from cache:
 org.apache.hadoop.ipc.Client@76a68bd4
 15/02/26 20:33:10 DEBUG ipc.Client: removing client from cache:
 org.apache.hadoop.ipc.Client@76a68bd4
 15/02/26 20:33:10 DEBUG ipc.Client: stopping actual client because no more
 references remain: org.apache.hadoop.ipc.Client@76a68bd4
 15/02/26 20:33:10 DEBUG ipc.Client: Stopping client
 15/02/26 20:33:10 DEBUG storage.DiskBlockManager: Shutdown hook called
 15/02/26 20:33:10 DEBUG util.Utils: Shutdown hook called

 On the driver side, I can see the logs related to heartbeat messages
 from C_1 until 20:05:00:

 --
 15/02/26 20:05:00 DEBUG spark.HeartbeatReceiver: [actor] received message
 Heartbeat(7,[Lscala.Tuple2;@151e5ce6,BlockManagerId(7, TMO-DN73, 34106))
 from Actor[akka.tcp://sparkExecutor@TMO-DN73:43671/temp/$fn]

 After this, it continues to receive heartbeats from the other executors
 except this one, and here is the message responsible for its SIGTERM:


 

 15/02/26 20:06:20 WARN storage.BlockManagerMasterActor: Removing
 BlockManager BlockManagerId(7, TMO-DN73, 34106) with no recent heart beats:
 80515ms exceeds 45000ms


 I am using spark 1.2.1.

 Any pointer(s)?


 Thanks,

 Twinkle