[jira] [Commented] (YARN-4382) Container hierarchy in cgroup may remain for ever after the container have be terminated

2015-11-26 Thread lachisis (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15029363#comment-15029363
 ] 

lachisis commented on YARN-4382:


I have tested the "release_agent" feature and think it is suitable.
Jun Gong, are you working on the patch now? If not, I will assign the issue to myself and prepare one.

> Container hierarchy in cgroup may remain for ever after the container have be 
> terminated
> 
>
> Key: YARN-4382
> URL: https://issues.apache.org/jira/browse/YARN-4382
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Affects Versions: 2.5.2
> Reporter: lachisis
> Assignee: Jun Gong
>
> This problem may occur when LinuxContainerExecutor is used to run containers.
> In the common case, when a container starts, a corresponding hierarchy is 
> created in the cgroup directory. When the container terminates, the hierarchy 
> is deleted within a few seconds (the time can be configured with 
> yarn.nodemanager.linux-container-executor.cgroups.delete-delay-ms).
> In the code, I found that CgroupsLCEResource sends a signal to kill the 
> container process asynchronously and, at the same time, tries to delete the 
> container hierarchy within the configured "delete-delay-ms" time. 
> But if killing the container process takes longer than "delete-delay-ms", 
> the container hierarchy remains forever.
>   



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4382) Container hierarchy in cgroup may remain for ever after the container have be terminated

2015-11-23 Thread lachisis (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15023481#comment-15023481
 ] 

lachisis commented on YARN-4382:


Thanks for your reply, Jun Gong. 
I think it is a good idea to use "release_agent" to clean up the empty container 
hierarchies. But I am afraid the "release_agent" option may not be supported by 
all cgroup versions.
I have just tested the "release_agent" option and it does not work yet, though 
I may have made a mistake.
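For reference, on cgroup v1 the mechanism under discussion is driven by two control files: `release_agent`, which exists only in the root of a mounted hierarchy and holds the path of a program, and `notify_on_release`, which exists in every cgroup and switches the notification on. When the last task leaves a cgroup that has notification enabled and no child cgroups, the kernel runs the agent with the cgroup path (relative to the hierarchy root) as its argument. The sketch below only writes those two files. The function name, the mount point, the agent path and the `hadoop-yarn` group in the usage note are assumptions for illustration, not taken from any patch.

```python
import os


def enable_release_agent(hierarchy_root, agent_path, cgroup_rel_path):
    """Configure cgroup v1 release notification for one cgroup.

    hierarchy_root  -- mount point of the hierarchy (assumed, e.g. /sys/fs/cgroup/cpu)
    agent_path      -- executable the kernel runs when a cgroup becomes empty (assumed)
    cgroup_rel_path -- cgroup to watch, relative to hierarchy_root
    """
    # release_agent lives only in the root cgroup of the hierarchy.
    with open(os.path.join(hierarchy_root, "release_agent"), "w") as f:
        f.write(agent_path)
    # notify_on_release is per cgroup; cgroups created below it afterwards
    # start with the parent's current value.
    notify = os.path.join(hierarchy_root, cgroup_rel_path, "notify_on_release")
    with open(notify, "w") as f:
        f.write("1")
```

A call such as `enable_release_agent("/sys/fs/cgroup/cpu", "/usr/local/bin/yarn-cgroup-cleanup", "hadoop-yarn")` would need root privileges, and the agent itself would typically do little more than `rmdir` the path it is given.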




[jira] [Commented] (YARN-4382) Container hierarchy in cgroup may remain for ever after the container have be terminated

2015-11-22 Thread lachisis (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15021630#comment-15021630
 ] 

lachisis commented on YARN-4382:


If many container hierarchies are left behind, they keep the CPUs of this 
node busy even when no jobs are running.

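------------------------------------------------------------------------------
   PerfTop:  129889 irqs/sec  kernel:76.3% [10 cycles],  (all, 16 CPUs)
------------------------------------------------------------------------------

     samples    pcnt   kernel function
     _______   _____   _______________

   117166.00 - 59.1% : tg_shares_up
    35688.00 - 18.0% : _spin_lock_irqsave
    12045.00 -  6.1% : __set_se_shares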
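To gauge how many hierarchies have been left behind on a node, one can list the container cgroups whose `tasks` file is empty. The following is a minimal sketch. The layout it expects (children named `container_*` directly under the YARN cgroup directory, each with a cgroup v1 `tasks` file) is an assumption based on the default NodeManager settings and should be adjusted to the local configuration; the function name is hypothetical.

```python
import os


def find_empty_container_cgroups(yarn_cgroup_dir, prefix="container_"):
    """Return container cgroup directories that no longer hold any task."""
    stale = []
    for name in sorted(os.listdir(yarn_cgroup_dir)):
        path = os.path.join(yarn_cgroup_dir, name)
        tasks_file = os.path.join(path, "tasks")
        # Skip anything that is not a container cgroup with a tasks file.
        if not (name.startswith(prefix) and os.path.isfile(tasks_file)):
            continue
        with open(tasks_file) as f:
            # cgroup v1 lists one thread ID per line; no content means no tasks.
            if not f.read().strip():
                stale.append(path)
    return stale
```

On a node that uses the default hierarchy, `len(find_empty_container_cgroups("/sys/fs/cgroup/cpu/hadoop-yarn"))` (path assumed) gives the number of leftover hierarchies.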




[jira] [Created] (YARN-4382) Container hierarchy in cgroup may remain for ever after the container have be terminated

2015-11-22 Thread lachisis (JIRA)
lachisis created YARN-4382:
--

 Summary: Container hierarchy in cgroup may remain for ever after 
the container have be terminated
 Key: YARN-4382
 URL: https://issues.apache.org/jira/browse/YARN-4382
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.5.2
Reporter: lachisis


This problem may occur when LinuxContainerExecutor is used to run containers.
In the common case, when a container starts, a corresponding hierarchy is 
created in the cgroup directory. When the container terminates, the hierarchy 
is deleted within a few seconds (the time can be configured with 
yarn.nodemanager.linux-container-executor.cgroups.delete-delay-ms).

In the code, I found that CgroupsLCEResource sends a signal to kill the 
container process asynchronously and, at the same time, tries to delete the 
container hierarchy within the configured "delete-delay-ms" time. 
But if killing the container process takes longer than "delete-delay-ms", 
the container hierarchy remains forever.

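The race described above can be reproduced in miniature without any NodeManager code. In the sketch below an ordinary temporary directory stands in for the container cgroup and a placeholder file stands in for a process that is still alive: `rmdir` fails while the file exists, much as it fails on a real cgroup that still has tasks. All names and the millisecond values are assumptions chosen for illustration.

```python
import os
import tempfile
import threading
import time


def delete_cgroup_with_timeout(path, timeout_ms, poll_ms=20):
    """Retry rmdir until it succeeds or timeout_ms elapses; return True on success."""
    deadline = time.monotonic() + timeout_ms / 1000.0
    while True:
        try:
            os.rmdir(path)
            return True
        except OSError:
            # A real cgroup refuses rmdir while tasks remain; here the
            # directory is simply not empty yet.
            if time.monotonic() >= deadline:
                return False
            time.sleep(poll_ms / 1000.0)


def simulate(kill_latency_ms, timeout_ms):
    """Return (deleted, leftover) for one simulated container shutdown."""
    cgroup = tempfile.mkdtemp(prefix="container_")
    marker = os.path.join(cgroup, "task")  # stands in for a live process
    open(marker, "w").close()
    # The "kill" completes asynchronously after kill_latency_ms.
    killer = threading.Timer(kill_latency_ms / 1000.0, os.remove, args=(marker,))
    killer.start()
    deleted = delete_cgroup_with_timeout(cgroup, timeout_ms)
    killer.join()
    leftover = os.path.isdir(cgroup)
    if leftover:
        os.rmdir(cgroup)  # tidy up the simulation only
    return deleted, leftover
```

With `simulate(50, 2000)` the process exits before the deadline and the directory is removed. With `simulate(400, 100)` the deletion gives up first, and nothing retries once the process is finally gone, which is the leak this issue reports.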



  





[jira] [Created] (YARN-4378) FairScheduler could not support DRF policy in lead queue when parent queue is fair policy

2015-11-20 Thread lachisis (JIRA)
lachisis created YARN-4378:
--

 Summary: FairScheduler could not support DRF policy in lead queue 
when parent queue is fair policy
 Key: YARN-4378
 URL: https://issues.apache.org/jira/browse/YARN-4378
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 2.5.2
Reporter: lachisis


If I configure fair-scheduler.xml as follows, an application submitted 
to queue root.queueA.queueA1 stays in the ACCEPTED state. The resource 
requests of its tasks are never satisfied, because the queue 
root.queueA.queueA1 has zero CPU. 


  queueA
    schedulingPolicy: fair
    queueA1
      schedulingPolicy: drf


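An allocation file with the shape described in this report might look like the XML embedded below. Only the queue names (queueA, queueA1) and the two policies (fair on the parent, drf on the leaf) come from the description. The element names follow the FairScheduler allocation file format, and any other settings of the original file are unknown and left out. The helper function is a hypothetical check that flags queues in this situation; it is not part of Hadoop.

```python
import xml.etree.ElementTree as ET

# Assumed reconstruction of the reported layout, not the reporter's actual file.
ALLOCATIONS = """
<allocations>
  <queue name="queueA">
    <schedulingPolicy>fair</schedulingPolicy>
    <queue name="queueA1">
      <schedulingPolicy>drf</schedulingPolicy>
    </queue>
  </queue>
</allocations>
"""


def drf_under_fair(xml_text, default_policy="fair"):
    """Return full names of queues that use drf while their parent uses fair."""
    found = []

    def walk(elem, parent_name, parent_policy):
        for q in elem.findall("queue"):
            name = parent_name + "." + q.get("name")
            # A queue without an explicit policy falls back to the default.
            policy = (q.findtext("schedulingPolicy") or default_policy).strip().lower()
            if policy == "drf" and parent_policy == "fair":
                found.append(name)
            walk(q, name, policy)

    walk(ET.fromstring(xml_text), "root", default_policy)
    return found
```

Calling `drf_under_fair(ALLOCATIONS)` reports `root.queueA.queueA1`, the queue that this issue says ends up with zero CPU.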






[jira] [Commented] (YARN-3795) ZKRMStateStore crashes due to IOException: Broken pipe

2015-06-11 Thread lachisis (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581576#comment-14581576
 ] 

lachisis commented on YARN-3795:


Yes, I found a "Len error" in the ZooKeeper server log, as follows:
2015-06-05 06:06:52,976 [myid:2] - INFO 
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZookeeperServer@897] - auth success 
/134.41.33.88:49189
2015-06-05 06:06:53,007 [myid:2] - WARN 
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception 
causing close of session 0x34db2f72ac50c86 due to java.io.IOException: Len 
error 1113979
2015-06-05 06:06:53,008 [myid:2] - WARN 
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Close socket 
connection for client /134.41.33.88:49189 which had sessionid 0x34db2f72ac50c86 
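The number in the "Len error" message is the length prefix of the request that the server refused to read. ZooKeeper rejects a request whose declared length is negative or larger than the `jute.maxbuffer` limit, which defaults to 0xfffff (1048575) bytes. The sketch below mirrors that check in a simplified form. The function and constant names are hypothetical, and the exact comparison used by a particular ZooKeeper release is an assumption.

```python
# Default value of the jute.maxbuffer system property in ZooKeeper.
DEFAULT_JUTE_MAXBUFFER = 0xFFFFF  # 1048575 bytes


def is_len_error(packet_len, max_buffer=DEFAULT_JUTE_MAXBUFFER):
    """Return True if a request with this declared length would be rejected."""
    return packet_len < 0 or packet_len > max_buffer
```

For the value in the log above, `is_len_error(1113979)` is true: the request is 65404 bytes over the default limit.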

 ZKRMStateStore crashes due to IOException: Broken pipe
 --

 Key: YARN-3795
 URL: https://issues.apache.org/jira/browse/YARN-3795
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.5.0
Reporter: lachisis
Priority: Critical

 2015-06-05 06:06:54,848 INFO org.apache.zookeeper.ClientCnxn: Socket 
 connection established to dap88/134.41.33.88:2181, initiating session
 2015-06-05 06:06:54,876 INFO org.apache.zookeeper.ClientCnxn: Session 
 establishment complete on server dap88/134.41.33.88:2181, sessionid = 
 0x34db2f72ac50c86, negotiated timeout = 1
 2015-06-05 06:06:54,881 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
 Watcher event type: None with state:SyncConnected for path:null for Service 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in state 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: STARTED
 2015-06-05 06:06:54,881 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
 ZKRMStateStore Session connected
 2015-06-05 06:06:54,881 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
 ZKRMStateStore Session restored
 2015-06-05 06:06:54,881 WARN org.apache.zookeeper.ClientCnxn: Session 
 0x34db2f72ac50c86 for server dap88/134.41.33.88:2181, unexpected error, 
 closing socket connection and attempting reconnect
 java.io.IOException: Broken pipe
   at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
   at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
   at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:94)
   at sun.nio.ch.IOUtil.write(IOUtil.java:65)
   at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:450)
   at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:117)
   at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:355)
   at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1075)
 2015-06-05 06:06:54,986 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
 Watcher event type: None with state:Disconnected for path:null for Service 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in state 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: STARTED
 2015-06-05 06:06:54,986 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
 ZKRMStateStore Session disconnected
 2015-06-05 06:06:55,278 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
 connection to server dap87/134.41.33.87:2181. Will not attempt to 
 authenticate using SASL (unknown error)
 2015-06-05 06:06:55,278 INFO org.apache.zookeeper.ClientCnxn: Socket 
 connection established to dap87/134.41.33.87:2181, initiating session
 2015-06-05 06:06:55,330 INFO org.apache.zookeeper.ClientCnxn: Session 
 establishment complete on server dap87/134.41.33.87:2181, sessionid = 
 0x34db2f72ac50c86, negotiated timeout = 1
 2015-06-05 06:06:55,343 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
 Watcher event type: None with state:SyncConnected for path:null for Service 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in state 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: STARTED
 2015-06-05 06:06:55,343 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
 ZKRMStateStore Session connected
 2015-06-05 06:06:55,344 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
 ZKRMStateStore Session restored
 2015-06-05 06:06:55,345 WARN org.apache.zookeeper.ClientCnxn: Session 
 0x34db2f72ac50c86 for server dap87/134.41.33.87:2181, unexpected error, 
 closing socket connection and attempting reconnect
 java.io.IOException: Broken pipe
   at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
   at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)

[jira] [Commented] (YARN-3795) ZKRMStateStore crashes due to IOException: Broken pipe

2015-06-11 Thread lachisis (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581580#comment-14581580
 ] 

lachisis commented on YARN-3795:


But I do not think changing the jute.maxbuffer size is a good approach, 
because there is no large znode in ZKRMStateStore. The exception is caused 
by the large number of Watchers, 
and I think these Watchers are not necessary.
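One way to see how watchers alone could exceed the limit: when a client reconnects, it re-registers its watches in a single request that carries every watched path. The rough estimate below assumes the usual jute encoding (a 4-byte length before each string, a 4-byte count before each list) and a request made of an 8-byte header, an 8-byte zxid and three path lists. The function names and the path in the usage note are hypothetical, and the result is an approximation rather than an exact wire size.

```python
def jute_string_size(s):
    """Size of one jute-encoded string: 4-byte length plus UTF-8 bytes."""
    return 4 + len(s.encode("utf-8"))


def estimate_set_watches_size(data_watches, exist_watches=(), child_watches=()):
    """Approximate size in bytes of the request that re-registers all watches."""
    size = 8   # request header: xid (int) + type (int)
    size += 8  # relative zxid (long)
    for paths in (data_watches, exist_watches, child_watches):
        size += 4  # element count of this path list
        size += sum(jute_string_size(p) for p in paths)
    return size
```

Feeding it a list of long application znode paths (for instance several thousand entries shaped like `/rmstore/ZKRMStateRoot/RMAppRoot/application_<id>`, a made-up layout) shows how the total grows linearly with the number of watches, independent of the size of any single znode.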


[jira] [Commented] (YARN-3795) ZKRMStateStore crashes due to IOException: Broken pipe

2015-06-11 Thread lachisis (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581584#comment-14581584
 ] 

lachisis commented on YARN-3795:


Oh, I have checked YARN-3469. It seems to resolve the problem. 
One moment...



[jira] [Commented] (YARN-3795) ZKRMStateStore crashes due to IOException: Broken pipe

2015-06-11 Thread lachisis (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581589#comment-14581589
 ] 

lachisis commented on YARN-3795:


Hmm, could anyone tell me how to close the issue?
I find that YARN-3469 has resolved the problem.



[jira] [Commented] (YARN-3795) ZKRMStateStore crashes due to IOException: Broken pipe

2015-06-11 Thread lachisis (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581534#comment-14581534
 ] 

lachisis commented on YARN-3795:


It would be better if ZooKeeper fixed ZOOKEEPER-706. 



[jira] [Commented] (YARN-3795) ZKRMStateStore crashes due to IOException: Broken pipe

2015-06-11 Thread lachisis (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581537#comment-14581537
 ] 

lachisis commented on YARN-3795:


But I think most of these Watchers in ZKRMStateStore are not necessary.

 ZKRMStateStore crashes due to IOException: Broken pipe
 --

 Key: YARN-3795
 URL: https://issues.apache.org/jira/browse/YARN-3795
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.5.0
Reporter: lachisis
Priority: Critical






[jira] [Created] (YARN-3795) ZKRMStateStore crashes due to IOException: Broken pipe

2015-06-11 Thread lachisis (JIRA)
lachisis created YARN-3795:
--

 Summary: ZKRMStateStore crashes due to IOException: Broken pipe
 Key: YARN-3795
 URL: https://issues.apache.org/jira/browse/YARN-3795
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.5.0
Reporter: lachisis
Priority: Critical


2015-06-05 06:06:54,848 INFO org.apache.zookeeper.ClientCnxn: Socket connection 
established to dap88/134.41.33.88:2181, initiating session
2015-06-05 06:06:54,876 INFO org.apache.zookeeper.ClientCnxn: Session 
establishment complete on server dap88/134.41.33.88:2181, sessionid = 
0x34db2f72ac50c86, negotiated timeout = 1
2015-06-05 06:06:54,881 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Watcher 
event type: None with state:SyncConnected for path:null for Service 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in state 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: STARTED
2015-06-05 06:06:54,881 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
ZKRMStateStore Session connected
2015-06-05 06:06:54,881 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
ZKRMStateStore Session restored
2015-06-05 06:06:54,881 WARN org.apache.zookeeper.ClientCnxn: Session 
0x34db2f72ac50c86 for server dap88/134.41.33.88:2181, unexpected error, closing 
socket connection and attempting reconnect
java.io.IOException: Broken pipe
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:94)
at sun.nio.ch.IOUtil.write(IOUtil.java:65)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:450)
at 
org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:117)
at 
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:355)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1075)
2015-06-05 06:06:54,986 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Watcher 
event type: None with state:Disconnected for path:null for Service 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in state 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: STARTED
2015-06-05 06:06:54,986 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
ZKRMStateStore Session disconnected
2015-06-05 06:06:55,278 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
connection to server dap87/134.41.33.87:2181. Will not attempt to authenticate 
using SASL (unknown error)
2015-06-05 06:06:55,278 INFO org.apache.zookeeper.ClientCnxn: Socket connection 
established to dap87/134.41.33.87:2181, initiating session
2015-06-05 06:06:55,330 INFO org.apache.zookeeper.ClientCnxn: Session 
establishment complete on server dap87/134.41.33.87:2181, sessionid = 
0x34db2f72ac50c86, negotiated timeout = 1
2015-06-05 06:06:55,343 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Watcher 
event type: None with state:SyncConnected for path:null for Service 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in state 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: STARTED
2015-06-05 06:06:55,343 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
ZKRMStateStore Session connected
2015-06-05 06:06:55,344 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
ZKRMStateStore Session restored
2015-06-05 06:06:55,345 WARN org.apache.zookeeper.ClientCnxn: Session 
0x34db2f72ac50c86 for server dap87/134.41.33.87:2181, unexpected error, closing 
socket connection and attempting reconnect
java.io.IOException: Broken pipe
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:94)
at sun.nio.ch.IOUtil.write(IOUtil.java:65)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:450)
at 
org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:117)
at 
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:355)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1075)





[jira] [Commented] (YARN-3795) ZKRMStateStore crashes due to IOException: Broken pipe

2015-06-11 Thread lachisis (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581517#comment-14581517
 ] 

lachisis commented on YARN-3795:


This exception appeared two days ago on a YARN cluster.
There were more than 7000 historical jobs in the rmstore. At one point, the 
active ResourceManager detected a session expiry and transitioned to standby. 
Meanwhile, the standby ResourceManager started to transition to active, but 
threw the exception quoted in the issue description.

 ZKRMStateStore crashes due to IOException: Broken pipe
 --

 Key: YARN-3795
 URL: https://issues.apache.org/jira/browse/YARN-3795
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.5.0
Reporter: lachisis
Priority: Critical


[jira] [Commented] (YARN-3795) ZKRMStateStore crashes due to IOException: Broken pipe

2015-06-11 Thread lachisis (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581531#comment-14581531
 ] 

lachisis commented on YARN-3795:


I found ZOOKEEPER-706. It means that if the ZooKeeper server receives a 
request whose body is larger than 1 MB, the server rejects the request and the 
connection fails with a Broken pipe exception.
This limit is intended to cap the data size of a znode.

After scanning the ZooKeeper snapshot, I did not find any znode created by 
ZKRMStateStore with a large data size. 
Then, analyzing the code, I found that a large number of Watchers are set when 
loadRMAppState and loadApplicationAttemptState are called. 
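
To make the suspicion above concrete, here is a rough sketch (not ZooKeeper or YARN code) of how one might estimate the bytes needed to list every watched path in a single request and compare that with the 1 MB limit mentioned above. The znode path layout, the 4-byte length prefix per path, and the names WatchPayloadEstimate, estimateBytes and exceedsLimit are assumptions made only for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: estimates how large a request listing every watched
// path would be. Not taken from ZooKeeper or ZKRMStateStore.
final class WatchPayloadEstimate {
    // The 1 MB request limit discussed in the comment above.
    static final long LIMIT_BYTES = 1024L * 1024L;

    // Assumption: each path costs a 4-byte length prefix plus its UTF-8 bytes.
    static long estimateBytes(List<String> watchedPaths) {
        long total = 0;
        for (String path : watchedPaths) {
            total += 4 + path.getBytes(StandardCharsets.UTF_8).length;
        }
        return total;
    }

    static boolean exceedsLimit(List<String> watchedPaths) {
        return estimateBytes(watchedPaths) > LIMIT_BYTES;
    }

    public static void main(String[] args) {
        // Hypothetical layout: one app znode and one attempt znode per job.
        String root = "/rmstore/ZKRMStateRoot/RMAppRoot/";
        List<String> paths = new ArrayList<>();
        for (int i = 1; i <= 7000; i++) {
            String app = String.format("application_1430994493305_%04d", i);
            String attempt =
                String.format("appattempt_1430994493305_%04d_000001", i);
            paths.add(root + app);
            paths.add(root + app + "/" + attempt);
        }
        System.out.println("watched paths: " + paths.size());
        System.out.println("estimated bytes: " + estimateBytes(paths));
        System.out.println("exceeds 1 MB: " + exceedsLimit(paths));
    }
}
```

The estimate ignores any fixed request overhead, so it is only a lower bound under the stated assumptions. It is meant to show why the number of watched paths, rather than the size of any single znode, could be what pushes a request over the limit.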



 ZKRMStateStore crashes due to IOException: Broken pipe
 --

 Key: YARN-3795
 URL: https://issues.apache.org/jira/browse/YARN-3795
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.5.0
Reporter: lachisis
Priority: Critical


[jira] [Resolved] (YARN-3795) ZKRMStateStore crashes due to IOException: Broken pipe

2015-06-11 Thread lachisis (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lachisis resolved YARN-3795.

   Resolution: Duplicate
Fix Version/s: 2.7.1

 ZKRMStateStore crashes due to IOException: Broken pipe
 --

 Key: YARN-3795
 URL: https://issues.apache.org/jira/browse/YARN-3795
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.5.0
Reporter: lachisis
Priority: Critical
 Fix For: 2.7.1







[jira] [Updated] (YARN-3614) FileSystemRMStateStore throw exception when failed to remove application, that cause resourcemanager to crash

2015-05-20 Thread lachisis (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lachisis updated YARN-3614:
---
Attachment: YARN-3614-1.patch

 FileSystemRMStateStore throw exception when failed to remove application, 
 that cause resourcemanager to crash
 -

 Key: YARN-3614
 URL: https://issues.apache.org/jira/browse/YARN-3614
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.5.0, 2.7.0
Reporter: lachisis
Priority: Critical
  Labels: patch
 Fix For: 2.7.1

 Attachments: YARN-3614-1.patch


 FileSystemRMStateStore is only an auxiliary plug-in of the rmstore. 
 When it fails to remove an application, I think a warning is enough, but 
 currently the ResourceManager crashes.
 Recently, I configured 
 yarn.resourcemanager.state-store.max-completed-applications to limit the 
 number of applications in the rmstore. When the number of applications 
 exceeds the limit, some old applications are removed. If the removal fails, 
 the ResourceManager crashes.
 The following is the log: 
 2015-05-11 06:58:43,815 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Removing 
 info for app: application_1430994493305_0053
 2015-05-11 06:58:43,815 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
  Removing info for app: application_1430994493305_0053 at: 
 /hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1430994493305_0053
 2015-05-11 06:58:43,816 ERROR 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error 
 removing app: application_1430994493305_0053
 java.lang.Exception: Failed to delete 
 /hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1430994493305_0053
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.deleteFile(FileSystemRMStateStore.java:572)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.removeApplicationStateInternal(FileSystemRMStateStore.java:471)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:185)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:171)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:806)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:879)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:874)
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
 at java.lang.Thread.run(Thread.java:745)
 2015-05-11 06:58:43,819 FATAL 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a 
 org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
 STATE_STORE_OP_FAILED. Cause:
 java.lang.Exception: Failed to delete 
 /hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1430994493305_0053
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.deleteFile(FileSystemRMStateStore.java:572)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.removeApplicationStateInternal(FileSystemRMStateStore.java:471)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:185)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:171)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
 at 
 

[jira] [Commented] (YARN-3614) FileSystemRMStateStore throw exception when failed to remove application, that cause resourcemanager to crash

2015-05-12 Thread lachisis (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541287#comment-14541287
 ] 

lachisis commented on YARN-3614:


Yes, it is. But we need to configure 
yarn.resourcemanager.state-store.max-completed-applications to limit the 
number of applications in the rmstore. 
Before changing the configuration, it took ten minutes to switch to active 
with four thousand apps in the rmstore. That situation is not acceptable.
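
The behaviour this setting relies on, as described in this issue (old applications are removed once the limit is exceeded), can be sketched roughly as follows. This is a simplified stand-alone model, not the ResourceManager's implementation. The class CompletedAppsLimiter and its methods are hypothetical names chosen for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Illustrative only: keeps at most maxCompleted application IDs and reports
// which old ones should be removed from the store. Not the RM's actual code.
final class CompletedAppsLimiter {
    private final int maxCompleted;
    private final Deque<String> completed = new ArrayDeque<>();

    CompletedAppsLimiter(int maxCompleted) {
        if (maxCompleted < 0) {
            throw new IllegalArgumentException("maxCompleted must be >= 0");
        }
        this.maxCompleted = maxCompleted;
    }

    // Records a finished application and returns the evicted IDs, oldest first.
    List<String> onAppCompleted(String appId) {
        completed.addLast(appId);
        List<String> evicted = new ArrayList<>();
        while (completed.size() > maxCompleted) {
            evicted.add(completed.removeFirst());
        }
        return evicted;
    }

    int retained() {
        return completed.size();
    }

    public static void main(String[] args) {
        CompletedAppsLimiter limiter = new CompletedAppsLimiter(2);
        for (int i = 1; i <= 4; i++) {
            String id = "application_1430994493305_005" + i;
            System.out.println(id + " completed, evict: "
                + limiter.onAppCompleted(id));
        }
    }
}
```

Each evicted ID corresponds to one removal from the state store, which is the step that fails in this issue. A smaller limit means fewer applications to reload on failover, at the cost of more frequent removals.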

 FileSystemRMStateStore throw exception when failed to remove application, 
 that cause resourcemanager to crash
 -

 Key: YARN-3614
 URL: https://issues.apache.org/jira/browse/YARN-3614
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.5.0
Reporter: lachisis
Priority: Critical


[jira] [Commented] (YARN-3614) FileSystemRMStateStore throw exception when failed to remove application, that cause resourcemanager to crash

2015-05-11 Thread lachisis (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14537625#comment-14537625
 ] 

lachisis commented on YARN-3614:


Yes, it is ok to check the existence of the directory first.
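
As a rough illustration of the existence check discussed here, the following
sketch deletes a directory tree only if it is present and treats an
already-missing path as a non-error. It uses java.nio.file on a local path as
a stand-in; it is not the FileSystemRMStateStore code, and the names
SafeDelete and deleteIfPresent are hypothetical.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

// Hypothetical helper: illustrates "check existence first", nothing more.
final class SafeDelete {
    private SafeDelete() {}

    /**
     * Recursively deletes root if it exists.
     * Returns true if something was deleted, false if the path was already absent.
     */
    static boolean deleteIfPresent(Path root) throws IOException {
        if (!Files.exists(root)) {
            // Nothing to remove: report it to the caller instead of failing.
            return false;
        }
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
                    throws IOException {
                Files.delete(file);
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult postVisitDirectory(Path dir, IOException exc)
                    throws IOException {
                if (exc != null) {
                    throw exc;
                }
                // Children are gone by now, so the directory itself can be removed.
                Files.delete(dir);
                return FileVisitResult.CONTINUE;
            }
        });
        return true;
    }
}
```

A caller can then log the "already absent" case at a low level and reserve
errors for deletions that genuinely fail. Note that the check and the delete
are not atomic, so a concurrent removal can still surface as an IOException.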

 FileSystemRMStateStore throw exception when failed to remove application, 
 that cause resourcemanager to crash
 -

 Key: YARN-3614
 URL: https://issues.apache.org/jira/browse/YARN-3614
 Project: Hadoop YARN
 Issue Type: Bug
 Components: resourcemanager
 Affects Versions: 2.5.0
 Reporter: lachisis
 Priority: Critical

 FileSystemRMStateStore is only an auxiliary plug-in of the rmstore. 
 When it fails to remove an application, I think a warning is enough, but 
 currently the resourcemanager crashes.
 Recently, I configured 
 yarn.resourcemanager.state-store.max-completed-applications to limit the 
 number of applications in the rmstore. When the number of applications 
 exceeds the limit, some old applications are removed. If the removal fails, 
 the resourcemanager crashes.
 The following is the log: 
 2015-05-11 06:58:43,815 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Removing info for app: application_1430994493305_0053
 2015-05-11 06:58:43,815 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: Removing info for app: application_1430994493305_0053 at: /hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1430994493305_0053
 2015-05-11 06:58:43,816 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error removing app: application_1430994493305_0053
 java.lang.Exception: Failed to delete /hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1430994493305_0053
     at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.deleteFile(FileSystemRMStateStore.java:572)
     at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.removeApplicationStateInternal(FileSystemRMStateStore.java:471)
     at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:185)
     at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:171)
     at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
     at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
     at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
     at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
     at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:806)
     at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:879)
     at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:874)
     at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
     at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
     at java.lang.Thread.run(Thread.java:745)
 2015-05-11 06:58:43,819 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause:
 java.lang.Exception: Failed to delete /hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1430994493305_0053
     at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.deleteFile(FileSystemRMStateStore.java:572)
     at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.removeApplicationStateInternal(FileSystemRMStateStore.java:471)
     at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:185)
     at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:171)
     at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
     at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
     at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
     at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
     at

[jira] [Commented] (YARN-3614) FileSystemRMStateStore throw exception when failed to remove application, that cause resourcemanager to crash

2015-05-11 Thread lachisis (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14537628#comment-14537628
 ] 

lachisis commented on YARN-3614:


Yes, it is ok to check the existence of the directory first.


[jira] [Commented] (YARN-3614) FileSystemRMStateStore throw exception when failed to remove application, that cause resourcemanager to crash

2015-05-11 Thread lachisis (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14537624#comment-14537624
 ] 

lachisis commented on YARN-3614:


Yes, it is ok to check the existence of the directory first.


[jira] [Commented] (YARN-3614) FileSystemRMStateStore throw exception when failed to remove application, that cause resourcemanager to crash

2015-05-11 Thread lachisis (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14537626#comment-14537626
 ] 

lachisis commented on YARN-3614:


Yes, it is ok to check the existence of the directory first.


[jira] [Commented] (YARN-3614) FileSystemRMStateStore throw exception when failed to remove application, that cause resourcemanager to crash

2015-05-11 Thread lachisis (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14537640#comment-14537640
 ] 

lachisis commented on YARN-3614:


I use YARN HA for a stable service. 
Months later, I found that when the standby resourcemanager tries to 
transition to active, it takes more than ten minutes to load applications. So 
I backed up the rmstore in HDFS and changed the setting 
yarn.resourcemanager.state-store.max-completed-applications to limit the 
number of applications in the rmstore, and found that the transition then 
works well.
Later my colleague restored the backed-up rmstore and submitted a new 
application, and then found that the resourcemanager crashed.

I know that restoring a backed-up rmstore while the resourcemanager is running 
is not appropriate. But this also shows that the error handling in 
FileSystemRMStateStore is a little weak. So I suggest a small change here.
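
One way to picture the suggested change: the sketch below keeps at most a
fixed number of completed applications and, when evicting the oldest one
fails, logs a warning and carries on instead of propagating the failure.
Everything here (CompletedAppStore, Remover, failedRemovals) is a hypothetical
stand-in for illustration, not the RMStateStore implementation; whether
ignoring a failed removal is acceptable is a decision for the project.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.logging.Logger;

// Hypothetical model of a bounded store of completed applications.
final class CompletedAppStore {
    /** Hypothetical hook that removes one application's persisted state. */
    interface Remover {
        void remove(String appId) throws Exception;
    }

    private static final Logger LOG =
            Logger.getLogger(CompletedAppStore.class.getName());

    private final int maxCompleted;
    private final Remover remover;
    private final Deque<String> completed = new ArrayDeque<>();
    private final List<String> failedRemovals = new ArrayList<>();

    CompletedAppStore(int maxCompleted, Remover remover) {
        this.maxCompleted = maxCompleted;
        this.remover = remover;
    }

    /** Records a completed application and evicts the oldest ones over the limit. */
    void appCompleted(String appId) {
        completed.addLast(appId);
        while (completed.size() > maxCompleted) {
            String oldest = completed.removeFirst();
            try {
                remover.remove(oldest);
            } catch (Exception e) {
                // Warn and remember the failure rather than treating it as fatal.
                LOG.warning("Could not remove state for " + oldest + ": " + e.getMessage());
                failedRemovals.add(oldest);
            }
        }
    }

    int retained() {
        return completed.size();
    }

    List<String> failedRemovals() {
        return failedRemovals;
    }
}
```

Keeping the failed IDs around leaves room for a later retry or for an operator
to clean up by hand, which a bare warning would not.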
 




[jira] [Commented] (YARN-3614) FileSystemRMStateStore throw exception when failed to remove application, that cause resourcemanager to crash

2015-05-11 Thread lachisis (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14537632#comment-14537632
 ] 

lachisis commented on YARN-3614:


Sorry, terrible network. How can I delete the repeated replies?


[jira] [Commented] (YARN-3614) FileSystemRMStateStore throw exception when failed to remove application, that cause resourcemanager to crash

2015-05-11 Thread lachisis (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14537645#comment-14537645
 ] 

lachisis commented on YARN-3614:


Thanks for the chance to provide the patch.
I will submit the patch later.


[jira] [Created] (YARN-3614) FileSystemRMStateStore throw exception when failed to remove application, that cause resourcemanager to crash

2015-05-10 Thread lachisis (JIRA)
lachisis created YARN-3614:
--

 Summary: FileSystemRMStateStore throw exception when failed to 
remove application, that cause resourcemanager to crash
 Key: YARN-3614
 URL: https://issues.apache.org/jira/browse/YARN-3614
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.5.0
Reporter: lachisis
Priority: Critical


FileSystemRMStateStore is only an auxiliary plug-in of the rmstore. 
When it fails to remove an application, I think a warning is enough, but 
currently the resourcemanager crashes.

Recently, I configured 
yarn.resourcemanager.state-store.max-completed-applications to limit the 
number of applications in the rmstore. When the number of applications exceeds 
the limit, some old applications are removed. If the removal fails, the 
resourcemanager crashes.
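
As a rough illustration of the behaviour described above (not the actual ResourceManager code), the retention limit can be modelled as a bounded queue: once the number of completed applications exceeds the limit, the oldest entries are evicted, and each eviction triggers a removal from the store. The class CompletedAppTracker and the Remover interface are hypothetical names used only for this sketch.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Simplified, hypothetical model of a completed-application retention limit. */
public class CompletedAppTracker {
  /** Hypothetical stand-in for the state store's remove operation. */
  public interface Remover {
    void remove(String appId);
  }

  private final int maxCompleted;
  private final Remover remover;
  private final Deque<String> completed = new ArrayDeque<>();

  public CompletedAppTracker(int maxCompleted, Remover remover) {
    this.maxCompleted = maxCompleted;
    this.remover = remover;
  }

  /** Records a finished application and evicts the oldest ones over the limit. */
  public List<String> appCompleted(String appId) {
    completed.addLast(appId);
    List<String> evicted = new ArrayList<>();
    while (completed.size() > maxCompleted) {
      String oldest = completed.removeFirst();
      // If the remover throws, the exception propagates to the caller.
      // In the report above, the equivalent failure ends in a fatal event.
      remover.remove(oldest);
      evicted.add(oldest);
    }
    return evicted;
  }

  /** Number of completed applications currently retained. */
  public int size() {
    return completed.size();
  }
}
```

In this model a single failing removal aborts the whole eviction pass, which is the coupling the report argues against for a store that only holds records of completed applications.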
The following is the log: 

2015-05-11 06:58:43,815 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Removing 
info for app: application_1430994493305_0053
2015-05-11 06:58:43,815 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: 
Removing info for app: application_1430994493305_0053 at: 
/hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1430994493305_0053
2015-05-11 06:58:43,816 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error 
removing app: application_1430994493305_0053
java.lang.Exception: Failed to delete 
/hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1430994493305_0053
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.deleteFile(FileSystemRMStateStore.java:572)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.removeApplicationStateInternal(FileSystemRMStateStore.java:471)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:185)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:171)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:806)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:879)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:874)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
at java.lang.Thread.run(Thread.java:745)
2015-05-11 06:58:43,819 FATAL 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a 
org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
STATE_STORE_OP_FAILED. Cause:
java.lang.Exception: Failed to delete 
/hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1430994493305_0053
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.deleteFile(FileSystemRMStateStore.java:572)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.removeApplicationStateInternal(FileSystemRMStateStore.java:471)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:185)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:171)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:806)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:879)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:874)
at 

[jira] [Commented] (YARN-3614) FileSystemRMStateStore throw exception when failed to remove application, that cause resourcemanager to crash

2015-05-10 Thread lachisis (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14537499#comment-14537499
 ] 

lachisis commented on YARN-3614:


Thanks for your attention. 
I have downloaded 2.7.0 and reviewed the FileSystemRMStateStore.java 
implementation, 
but I don't think it fixes the issue I submitted.

The following is the code from 2.7.0. If fs.delete returns false, it still 
throws an Exception. I think a warning is enough here. Otherwise, if someone 
moves the application folder manually, the Exception propagates through 
deleteFile, deleteFileWithRetries, and removeApplicationStateInternal.

  @Override
  public synchronized void removeApplicationStateInternal(
      ApplicationStateData appState)
      throws Exception {
    ApplicationId appId =
        appState.getApplicationSubmissionContext().getApplicationId();
    Path nodeRemovePath = getAppDir(rmAppRoot, appId);
    LOG.info("Removing info for app: " + appId + " at: " + nodeRemovePath);
    deleteFileWithRetries(nodeRemovePath);
  }

  // Wraps deleteFile in a retrying action; any Exception still propagates.
  private void deleteFileWithRetries(final Path deletePath) throws Exception {
    new FSAction<Void>() {
      @Override
      public Void run() throws Exception {
        deleteFile(deletePath);
        return null;
      }
    }.runWithRetries();
  }

  // A false return from fs.delete is converted into an Exception here.
  private void deleteFile(Path deletePath) throws Exception {
    if (!fs.delete(deletePath, true)) {
      throw new Exception("Failed to delete " + deletePath);
    }
  }
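
The suggestion above (warn instead of throw when the delete returns false) could be sketched roughly as follows. This is only an illustration under stated assumptions, not a proposed patch: LenientDelete, Deleter and deleteQuietly are hypothetical names, and Deleter stands in for the file system's delete call so that the example is self-contained.

```java
import java.util.logging.Logger;

/** Hypothetical sketch: treat a failed delete as a warning rather than an error. */
public class LenientDelete {
  /** Hypothetical stand-in for the file system delete call used above. */
  public interface Deleter {
    boolean delete(String path, boolean recursive);
  }

  private static final Logger LOG =
      Logger.getLogger(LenientDelete.class.getName());

  /**
   * Attempts a recursive delete. A false return value (for example because
   * the path was already moved away) is logged as a warning and reported to
   * the caller instead of being turned into an exception.
   */
  public static boolean deleteQuietly(Deleter fs, String path) {
    boolean deleted = fs.delete(path, true);
    if (!deleted) {
      LOG.warning("Failed to delete " + path + "; continuing");
    }
    return deleted;
  }
}
```

With this shape, only exceptions thrown by the delete call itself would still reach the caller; a plain false result no longer escalates.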





 FileSystemRMStateStore throw exception when failed to remove application, 
 that cause resourcemanager to crash
 -

 Key: YARN-3614
 URL: https://issues.apache.org/jira/browse/YARN-3614
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.5.0
Reporter: lachisis
Priority: Critical

 FileSystemRMStateStore is only an auxiliary plug-in of the rmstore. 
 When it fails to remove an application, I think a warning is enough, but 
 currently the resourcemanager crashes.
 Recently, I configured 
 yarn.resourcemanager.state-store.max-completed-applications to limit the 
 number of applications in the rmstore. When the number of applications exceeds 
 the limit, some old applications are removed. If the removal fails, the 
 resourcemanager crashes.
 The following is the log: 
 2015-05-11 06:58:43,815 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Removing 
 info for app: application_1430994493305_0053
 2015-05-11 06:58:43,815 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
  Removing info for app: application_1430994493305_0053 at: 
 /hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1430994493305_0053
 2015-05-11 06:58:43,816 ERROR 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error 
 removing app: application_1430994493305_0053
 java.lang.Exception: Failed to delete 
 /hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1430994493305_0053
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.deleteFile(FileSystemRMStateStore.java:572)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.removeApplicationStateInternal(FileSystemRMStateStore.java:471)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:185)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:171)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:806)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:879)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:874)
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
 at java.lang.Thread.run(Thread.java:745)
 2015-05-11 06:58:43,819 FATAL 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a 
 org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of