[jira] [Commented] (YARN-4382) Container hierarchy in cgroup may remain for ever after the container have be terminated
[ https://issues.apache.org/jira/browse/YARN-4382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15029363#comment-15029363 ]

lachisis commented on YARN-4382:

I have tested the "release_agent" feature and think it is suitable. Jun Gong, are you working on the patch now? If not, I will assign the issue to myself and write it.

> Container hierarchy in cgroup may remain for ever after the container have be terminated
>
> Key: YARN-4382
> URL: https://issues.apache.org/jira/browse/YARN-4382
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Affects Versions: 2.5.2
> Reporter: lachisis
> Assignee: Jun Gong
>
> If LinuxContainerExecutor is used to execute containers, this problem may occur.
> In the common case, when a container runs, a corresponding hierarchy is created
> in the cgroup directory. When the container terminates, the hierarchy is deleted
> within some seconds (the time can be configured with
> yarn.nodemanager.linux-container-executor.cgroups.delete-delay-ms).
> In the code, I find that CgroupsLCEResource sends a signal to kill the container
> process asynchronously and, at the same time, tries to delete the container
> hierarchy within the configured "delete-delay-ms" period.
> But if killing the container process takes longer than "delete-delay-ms", the
> container hierarchy will remain forever.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
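For readers unfamiliar with the "release_agent" feature mentioned in this comment: in cgroup v1, a hierarchy root has a `release_agent` file naming a program, and each cgroup has a `notify_on_release` flag; when a flagged cgroup becomes empty, the kernel runs the agent with the cgroup's path relative to the root as its argument. The following is only a minimal sketch of writing those two control files, not the patch discussed in this issue. The function name, the example paths, and the agent script are hypothetical.

```python
import os


def enable_release_notification(hierarchy_root, cgroup_rel_path, agent_path):
    """Write the two cgroup v1 control files that enable release notification.

    hierarchy_root  -- mount point of one cgroup v1 hierarchy (hypothetical path)
    cgroup_rel_path -- cgroup directory relative to that root (hypothetical)
    agent_path      -- program the kernel should run when the cgroup empties
    """
    # release_agent exists only at the root of a mounted v1 hierarchy.
    with open(os.path.join(hierarchy_root, "release_agent"), "w") as f:
        f.write(agent_path)

    # notify_on_release is a per-cgroup flag; "1" enables the callback.
    cgroup_dir = os.path.join(hierarchy_root, cgroup_rel_path)
    with open(os.path.join(cgroup_dir, "notify_on_release"), "w") as f:
        f.write("1")
    return cgroup_dir
```

On a real node this would need root privileges and a mounted cgroup v1 hierarchy; the agent itself (for example a small script that removes the directory it is given) is not shown here.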
[jira] [Commented] (YARN-4382) Container hierarchy in cgroup may remain for ever after the container have be terminated
[ https://issues.apache.org/jira/browse/YARN-4382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15023481#comment-15023481 ]

lachisis commented on YARN-4382:

Thanks for your reply, Jun Gong. I think it is a good idea to use "release_agent" to clear the empty container hierarchies. But I am afraid: does the "release_agent" option suit all cgroup versions? I have just tested the "release_agent" option and, perhaps because of some mistake on my side, it does not work yet.
[jira] [Commented] (YARN-4382) Container hierarchy in cgroup may remain for ever after the container have be terminated
[ https://issues.apache.org/jira/browse/YARN-4382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021630#comment-15021630 ]

lachisis commented on YARN-4382:

If many container hierarchies remain, they keep the CPUs of the node busy even when no jobs are running.

PerfTop: 129889 irqs/sec kernel:76.3% [10 cycles], (all, 16 CPUs)

  samples      pcnt     kernel function
  _________    _____    _______________
  117166.00 -  59.1% :  tg_shares_up
   35688.00 -  18.0% :  _spin_lock_irqsave
   12045.00 -   6.1% :  __set_se_shares
[jira] [Created] (YARN-4382) Container hierarchy in cgroup may remain for ever after the container have be terminated
lachisis created YARN-4382:
--

Summary: Container hierarchy in cgroup may remain for ever after the container have be terminated
Key: YARN-4382
URL: https://issues.apache.org/jira/browse/YARN-4382
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager
Affects Versions: 2.5.2
Reporter: lachisis

If LinuxContainerExecutor is used to execute containers, this problem may occur.

In the common case, when a container runs, a corresponding hierarchy is created in the cgroup directory. When the container terminates, the hierarchy is deleted within some seconds (the time can be configured with yarn.nodemanager.linux-container-executor.cgroups.delete-delay-ms).

In the code, I find that CgroupsLCEResource sends a signal to kill the container process asynchronously and, at the same time, tries to delete the container hierarchy within the configured "delete-delay-ms" period.

But if killing the container process takes longer than "delete-delay-ms", the container hierarchy will remain forever.
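The race described in this report can be illustrated with a small simulation. This is a sketch under stated assumptions, not the NodeManager's code: `delete_within_delay` is a hypothetical name, and a non-empty ordinary directory stands in for a cgroup that still has live tasks (on a real cgroup filesystem, `rmdir` fails while tasks are attached). The function retries the removal until the delay expires and then gives up, which is the point at which a hierarchy would be left behind.

```python
import os
import time


def delete_within_delay(path, delete_delay_ms, poll_ms=20):
    """Try to remove `path` until `delete_delay_ms` has elapsed.

    Returns True if the directory was removed, False if the deadline passed
    first. After a False result nothing retries, so the directory stays.
    """
    deadline = time.monotonic() + delete_delay_ms / 1000.0
    while True:
        try:
            os.rmdir(path)  # fails while the directory is still "busy"
            return True
        except OSError:
            if time.monotonic() >= deadline:
                return False  # gave up: the hierarchy is leaked
            time.sleep(poll_ms / 1000.0)
```

If the process inside the container needs longer to exit than the configured delay, the call returns False and, in this simplified model, nothing ever removes the directory afterwards.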
[jira] [Created] (YARN-4378) FairScheduler could not support DRF policy in lead queue when parent queue is fair policy
lachisis created YARN-4378:
--

Summary: FairScheduler could not support DRF policy in lead queue when parent queue is fair policy
Key: YARN-4378
URL: https://issues.apache.org/jira/browse/YARN-4378
Project: Hadoop YARN
Issue Type: Bug
Components: fairscheduler
Affects Versions: 2.5.2
Reporter: lachisis

If I configure fair-scheduler.xml as follows, an application submitted to queue root.queueA.queueA1 stays in the ACCEPTED state. The resource requests of its tasks are never satisfied, because the queue root.queueA.queueA1 has zero CPU.

(fair-scheduler.xml markup not preserved; surviving scheduling policy values: fair, drf)
[jira] [Commented] (YARN-3795) ZKRMStateStore crashes due to IOException: Broken pipe
[ https://issues.apache.org/jira/browse/YARN-3795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14581576#comment-14581576 ]

lachisis commented on YARN-3795:

Yes, I have found a "Len error" in the ZooKeeper server log, as follows:

2015-06-05 06:06:52,976 [myid:2] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZookeeperServer@897] - auth success /134.41.33.88:49189
2015-06-05 06:06:53,007 [myid:2] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x34db2f72ac50c86 due to java.io.IoException: Len error 1113979
2015-06-05 06:06:53,008 [myid:2] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Close socket connection for client /134/41/33.88:49189 which bad sessionid 0x34db2f72ac50c86

ZKRMStateStore crashes due to IOException: Broken pipe
--

Key: YARN-3795
URL: https://issues.apache.org/jira/browse/YARN-3795
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.5.0
Reporter: lachisis
Priority: Critical

2015-06-05 06:06:54,848 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to dap88/134.41.33.88:2181, initiating session
2015-06-05 06:06:54,876 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server dap88/134.41.33.88:2181, sessionid = 0x34db2f72ac50c86, negotiated timeout = 1
2015-06-05 06:06:54,881 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Watcher event type: None with state:SyncConnected for path:null for Service org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in state org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: STARTED
2015-06-05 06:06:54,881 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected
2015-06-05 06:06:54,881 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored
2015-06-05 06:06:54,881 WARN org.apache.zookeeper.ClientCnxn: Session 0x34db2f72ac50c86 for server dap88/134.41.33.88:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Broken pipe
    at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
    at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
    at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:94)
    at sun.nio.ch.IOUtil.write(IOUtil.java:65)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:450)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:117)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:355)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1075)
2015-06-05 06:06:54,986 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Watcher event type: None with state:Disconnected for path:null for Service org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in state org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: STARTED
2015-06-05 06:06:54,986 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session disconnected
2015-06-05 06:06:55,278 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server dap87/134.41.33.87:2181. Will not attempt to authenticate using SASL (unknown error)
2015-06-05 06:06:55,278 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to dap87/134.41.33.87:2181, initiating session
2015-06-05 06:06:55,330 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server dap87/134.41.33.87:2181, sessionid = 0x34db2f72ac50c86, negotiated timeout = 1
2015-06-05 06:06:55,343 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Watcher event type: None with state:SyncConnected for path:null for Service org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in state org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: STARTED
2015-06-05 06:06:55,343 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected
2015-06-05 06:06:55,344 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored
2015-06-05 06:06:55,345 WARN org.apache.zookeeper.ClientCnxn: Session 0x34db2f72ac50c86 for server dap87/134.41.33.87:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Broken pipe
    at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
    at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
[jira] [Commented] (YARN-3795) ZKRMStateStore crashes due to IOException: Broken pipe
[ https://issues.apache.org/jira/browse/YARN-3795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14581580#comment-14581580 ]

lachisis commented on YARN-3795:

But I do not think changing the jute.maxbuffer size is a good approach, because there are no large znodes in ZKRMStateStore. This exception is caused by a large number of Watchers, and I think these Watchers are not necessary.
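To make the jute.maxbuffer discussion concrete, the length in the "Len error 1113979" server message from this thread can be compared with ZooKeeper's documented default for jute.maxbuffer (0xfffff, just under 1 MB). This is a rough sketch: it assumes the server in question ran with that default, which the thread does not state, and the helper names are hypothetical.

```python
import re

# ZooKeeper's documented default for jute.maxbuffer (0xfffff bytes).
# Assumption: the server in this report was using the default value.
DEFAULT_JUTE_MAXBUFFER = 0xFFFFF


def len_error_bytes(log_line):
    """Return the length reported in a 'Len error N' message, or None."""
    match = re.search(r"Len error (\d+)", log_line)
    return int(match.group(1)) if match else None


def exceeds_limit(length, max_buffer=DEFAULT_JUTE_MAXBUFFER):
    """True if a request of `length` bytes is larger than the buffer limit."""
    return length > max_buffer


line = "Exception causing close of session 0x34db2f72ac50c86 due to java.io.IoException: Len error 1113979"
length = len_error_bytes(line)
# Length, limit, and how many bytes the request is over the limit.
print(length, DEFAULT_JUTE_MAXBUFFER, length - DEFAULT_JUTE_MAXBUFFER)
```

Under that assumption the request is 65404 bytes over the limit, which is consistent with the server closing the connection and the client then seeing "Broken pipe".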
[jira] [Commented] (YARN-3795) ZKRMStateStore crashes due to IOException: Broken pipe
[ https://issues.apache.org/jira/browse/YARN-3795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14581584#comment-14581584 ]

lachisis commented on YARN-3795:

Oh, I checked YARN-3469. It seems to resolve the problem. One moment...

ZKRMStateStore crashes due to IOException: Broken pipe
--

Key: YARN-3795
URL: https://issues.apache.org/jira/browse/YARN-3795
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.5.0
Reporter: lachisis
Priority: Critical

2015-06-05 06:06:54,848 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to dap88/134.41.33.88:2181, initiating session
2015-06-05 06:06:54,876 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server dap88/134.41.33.88:2181, sessionid = 0x34db2f72ac50c86, negotiated timeout = 1
2015-06-05 06:06:54,881 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Watcher event type: None with state:SyncConnected for path:null for Service org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in state org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: STARTED
2015-06-05 06:06:54,881 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected
2015-06-05 06:06:54,881 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored
2015-06-05 06:06:54,881 WARN org.apache.zookeeper.ClientCnxn: Session 0x34db2f72ac50c86 for server dap88/134.41.33.88:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Broken pipe
    at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
    at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
    at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:94)
    at sun.nio.ch.IOUtil.write(IOUtil.java:65)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:450)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:117)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:355)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1075)
2015-06-05 06:06:54,986 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Watcher event type: None with state:Disconnected for path:null for Service org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in state org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: STARTED
2015-06-05 06:06:54,986 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session disconnected
2015-06-05 06:06:55,278 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server dap87/134.41.33.87:2181. Will not attempt to authenticate using SASL (unknown error)
2015-06-05 06:06:55,278 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to dap87/134.41.33.87:2181, initiating session
2015-06-05 06:06:55,330 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server dap87/134.41.33.87:2181, sessionid = 0x34db2f72ac50c86, negotiated timeout = 1
2015-06-05 06:06:55,343 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Watcher event type: None with state:SyncConnected for path:null for Service org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in state org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: STARTED
2015-06-05 06:06:55,343 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected
2015-06-05 06:06:55,344 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored
2015-06-05 06:06:55,345 WARN org.apache.zookeeper.ClientCnxn: Session 0x34db2f72ac50c86 for server dap87/134.41.33.87:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Broken pipe
    at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
    at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
    at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:94)
    at sun.nio.ch.IOUtil.write(IOUtil.java:65)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:450)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:117)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:355)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1075)
[jira] [Commented] (YARN-3795) ZKRMStateStore crashes due to IOException: Broken pipe
[ https://issues.apache.org/jira/browse/YARN-3795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14581589#comment-14581589 ]

lachisis commented on YARN-3795:

Hmm, could anyone tell me how to close the issue? I find that YARN-3469 has resolved the problem.
[jira] [Commented] (YARN-3795) ZKRMStateStore crashes due to IOException: Broken pipe
[ https://issues.apache.org/jira/browse/YARN-3795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14581534#comment-14581534 ]

lachisis commented on YARN-3795:

It would be better if ZooKeeper fixed ZOOKEEPER-706.
[jira] [Commented] (YARN-3795) ZKRMStateStore crashes due to IOException: Broken pipe
[ https://issues.apache.org/jira/browse/YARN-3795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581537#comment-14581537 ]

lachisis commented on YARN-3795:

However, I think most of these Watchers in ZKRMStateStore do not seem necessary.

ZKRMStateStore crashes due to IOException: Broken pipe
------------------------------------------------------

Key: YARN-3795
URL: https://issues.apache.org/jira/browse/YARN-3795
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.5.0
Reporter: lachisis
Priority: Critical

2015-06-05 06:06:54,848 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to dap88/134.41.33.88:2181, initiating session
2015-06-05 06:06:54,876 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server dap88/134.41.33.88:2181, sessionid = 0x34db2f72ac50c86, negotiated timeout = 1
2015-06-05 06:06:54,881 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Watcher event type: None with state:SyncConnected for path:null for Service org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in state org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: STARTED
2015-06-05 06:06:54,881 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected
2015-06-05 06:06:54,881 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored
2015-06-05 06:06:54,881 WARN org.apache.zookeeper.ClientCnxn: Session 0x34db2f72ac50c86 for server dap88/134.41.33.88:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Broken pipe
    at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
    at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
    at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:94)
    at sun.nio.ch.IOUtil.write(IOUtil.java:65)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:450)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:117)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:355)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1075)
2015-06-05 06:06:54,986 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Watcher event type: None with state:Disconnected for path:null for Service org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in state org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: STARTED
2015-06-05 06:06:54,986 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session disconnected
2015-06-05 06:06:55,278 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server dap87/134.41.33.87:2181. Will not attempt to authenticate using SASL (unknown error)
2015-06-05 06:06:55,278 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to dap87/134.41.33.87:2181, initiating session
2015-06-05 06:06:55,330 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server dap87/134.41.33.87:2181, sessionid = 0x34db2f72ac50c86, negotiated timeout = 1
2015-06-05 06:06:55,343 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Watcher event type: None with state:SyncConnected for path:null for Service org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in state org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: STARTED
2015-06-05 06:06:55,343 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected
2015-06-05 06:06:55,344 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored
2015-06-05 06:06:55,345 WARN org.apache.zookeeper.ClientCnxn: Session 0x34db2f72ac50c86 for server dap87/134.41.33.87:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Broken pipe
    at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
    at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
    at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:94)
    at sun.nio.ch.IOUtil.write(IOUtil.java:65)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:450)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:117)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:355)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1075)

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3795) ZKRMStateStore crashes due to IOException: Broken pipe
lachisis created YARN-3795:

Summary: ZKRMStateStore crashes due to IOException: Broken pipe
Key: YARN-3795
URL: https://issues.apache.org/jira/browse/YARN-3795
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.5.0
Reporter: lachisis
Priority: Critical
[jira] [Commented] (YARN-3795) ZKRMStateStore crashes due to IOException: Broken pipe
[ https://issues.apache.org/jira/browse/YARN-3795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581517#comment-14581517 ]

lachisis commented on YARN-3795:

This exception appeared two days ago on a YARN platform that holds about 7000+ historical jobs in the rmstore. At one point the active ResourceManager detected a session expiry and transitioned to standby. Meanwhile, the standby ResourceManager started its transition to active, but threw the exception shown in the issue description.
[jira] [Commented] (YARN-3795) ZKRMStateStore crashes due to IOException: Broken pipe
[ https://issues.apache.org/jira/browse/YARN-3795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581531#comment-14581531 ]

lachisis commented on YARN-3795:

I found ZOOKEEPER-706. It indicates that if the ZooKeeper server receives a request whose body is larger than 1 MB, the server rejects the request and the client sees a Broken pipe exception. This limit exists to cap the data size of a znode.
Scanning the ZooKeeper snapshot, I did not find any znode created by ZKRMStateStore with a large data size. Analyzing the code, I then found that a large number of Watchers are set when loadRMAppState and loadApplicationAttemptState are called.
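The reasoning in the comment above can be checked with rough arithmetic: if every application znode and every attempt znode carries a watch, the client has to send all watched paths back to the server when it reconnects. The following is only a sketch under stated assumptions, not code from ZKRMStateStore. The znode layout, the one-attempt-per-application ratio, and the 4-byte per-path overhead are illustrative assumptions; the limit used is ZooKeeper's default jute.maxbuffer of 0xfffff bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Back-of-the-envelope size of a watch re-registration payload. */
final class WatchPayloadEstimate {
    // ZooKeeper's default jute.maxbuffer: 0xfffff bytes, just under 1 MB.
    static final long DEFAULT_MAX_BUFFER = 0xfffff;

    /** Counts each path as a 4-byte length prefix plus its UTF-8 bytes (assumed overhead). */
    static long estimateBytes(List<String> watchedPaths) {
        long total = 0;
        for (String path : watchedPaths) {
            total += 4 + path.getBytes(StandardCharsets.UTF_8).length;
        }
        return total;
    }

    /** Hypothetical znode layout: one app znode plus one attempt znode per app. */
    static List<String> hypotheticalPaths(int apps) {
        List<String> paths = new ArrayList<>();
        for (int i = 0; i < apps; i++) {
            String id = String.format("1430994493305_%04d", i);
            String app = "/rmstore/ZKRMStateRoot/RMAppRoot/application_" + id;
            paths.add(app);
            paths.add(app + "/appattempt_" + id + "_000001");
        }
        return paths;
    }

    public static void main(String[] args) {
        // 7000 matches the "7000+ history jobs" figure reported in this issue.
        long bytes = estimateBytes(hypotheticalPaths(7000));
        System.out.println("estimated bytes: " + bytes);
        System.out.println("exceeds default limit: " + (bytes > DEFAULT_MAX_BUFFER));
    }
}
```

Under these assumptions the watched paths alone add up to more than the default limit, even though no single znode holds much data, which would be consistent with the snapshot scan finding nothing large.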
[jira] [Resolved] (YARN-3795) ZKRMStateStore crashes due to IOException: Broken pipe
[ https://issues.apache.org/jira/browse/YARN-3795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

lachisis resolved YARN-3795.

Resolution: Duplicate
Fix Version/s: 2.7.1
[jira] [Updated] (YARN-3614) FileSystemRMStateStore throw exception when failed to remove application, that cause resourcemanager to crash
[ https://issues.apache.org/jira/browse/YARN-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

lachisis updated YARN-3614:

Attachment: YARN-3614-1.patch

FileSystemRMStateStore throw exception when failed to remove application, that cause resourcemanager to crash
-------------------------------------------------------------------------------------------------------------

Key: YARN-3614
URL: https://issues.apache.org/jira/browse/YARN-3614
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.5.0, 2.7.0
Reporter: lachisis
Priority: Critical
Labels: patch
Fix For: 2.7.1
Attachments: YARN-3614-1.patch

FileSystemRMStateStore is only an auxiliary plug-in of the rmstore. When it fails to remove an application, I think a warning is enough, but currently the ResourceManager crashes.
Recently, I configured yarn.resourcemanager.state-store.max-completed-applications to limit the number of applications in the rmstore. When the number of applications exceeds the limit, some old applications are removed. If the removal fails, the ResourceManager crashes.
The following is the log:

2015-05-11 06:58:43,815 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Removing info for app: application_1430994493305_0053
2015-05-11 06:58:43,815 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: Removing info for app: application_1430994493305_0053 at: /hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1430994493305_0053
2015-05-11 06:58:43,816 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error removing app: application_1430994493305_0053
java.lang.Exception: Failed to delete /hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1430994493305_0053
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.deleteFile(FileSystemRMStateStore.java:572)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.removeApplicationStateInternal(FileSystemRMStateStore.java:471)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:185)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:171)
    at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
    at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
    at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:806)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:879)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:874)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
    at java.lang.Thread.run(Thread.java:745)
2015-05-11 06:58:43,819 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause:
java.lang.Exception: Failed to delete /hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1430994493305_0053
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.deleteFile(FileSystemRMStateStore.java:572)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.removeApplicationStateInternal(FileSystemRMStateStore.java:471)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:185)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:171)
    at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
    at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
    at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
    at
[jira] [Commented] (YARN-3614) FileSystemRMStateStore throw exception when failed to remove application, that cause resourcemanager to crash
[ https://issues.apache.org/jira/browse/YARN-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541287#comment-14541287 ]

lachisis commented on YARN-3614:

Yes, it is. But we need to configure yarn.resourcemanager.state-store.max-completed-applications to limit the number of applications in the rmstore. Before I changed this setting, it took ten minutes to switch to active with four thousand apps in the rmstore, which is not acceptable.
[jira] [Commented] (YARN-3614) FileSystemRMStateStore throw exception when failed to remove application, that cause resourcemanager to crash
[ https://issues.apache.org/jira/browse/YARN-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14537625#comment-14537625 ]

lachisis commented on YARN-3614:

Yes, it is OK to check the existence of the directory first.

> FileSystemRMStateStore throw exception when failed to remove application, that cause resourcemanager to crash
>
> Key: YARN-3614
> URL: https://issues.apache.org/jira/browse/YARN-3614
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.5.0
> Reporter: lachisis
> Priority: Critical
>
> FileSystemRMStateStore is only an accessory plug-in of rmstore. When it
> fails to remove an application, I think a warning is enough, but now the
> resourcemanager crashes.
> Recently, I configured
> yarn.resourcemanager.state-store.max-completed-applications to limit the
> number of applications in rmstore. When the number of applications exceeds
> the limit, some old applications are removed. If the removal fails, the
> resourcemanager crashes.
> The following is the log:
> 2015-05-11 06:58:43,815 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Removing info for app: application_1430994493305_0053
> 2015-05-11 06:58:43,815 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: Removing info for app: application_1430994493305_0053 at: /hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1430994493305_0053
> 2015-05-11 06:58:43,816 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error removing app: application_1430994493305_0053
> java.lang.Exception: Failed to delete /hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1430994493305_0053
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.deleteFile(FileSystemRMStateStore.java:572)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.removeApplicationStateInternal(FileSystemRMStateStore.java:471)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:185)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:171)
>     at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>     at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>     at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>     at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:806)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:879)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:874)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
>     at java.lang.Thread.run(Thread.java:745)
> 2015-05-11 06:58:43,819 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause:
> java.lang.Exception: Failed to delete /hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1430994493305_0053
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.deleteFile(FileSystemRMStateStore.java:572)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.removeApplicationStateInternal(FileSystemRMStateStore.java:471)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:185)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:171)
>     at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>     at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>     at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>     at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>     at
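
The existence check agreed on in the comment above could look roughly like the following. This is only a sketch under stated assumptions: it uses java.nio.file from the JDK as a stand-in for Hadoop's FileSystem so that it runs without a cluster, and the class name TolerantDelete and its boolean return value are made up for illustration. It is not the patch attached to the issue.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.logging.Logger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical stand-in for the state store's delete helper (JDK file API, not Hadoop's). */
final class TolerantDelete {
  private static final Logger LOG = Logger.getLogger(TolerantDelete.class.getName());

  /**
   * Recursively deletes deletePath.
   * Returns true if something was deleted, false if the path was already
   * absent; the absent case is logged as a warning instead of being thrown.
   */
  static boolean deleteFile(Path deletePath) throws IOException {
    if (!Files.exists(deletePath)) {
      LOG.warning("Nothing to delete, path is already gone: " + deletePath);
      return false;
    }
    // Sort in reverse order so that children are deleted before their parents.
    List<Path> paths;
    try (Stream<Path> walk = Files.walk(deletePath)) {
      paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
    }
    for (Path p : paths) {
      Files.delete(p); // a genuine I/O failure still propagates to the caller
    }
    return true;
  }
}
```

With this shape, a directory that was moved or removed by hand only produces a warning, while a delete that fails for another reason (for example, missing permissions) is still reported as an error.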
[jira] [Commented] (YARN-3614) FileSystemRMStateStore throw exception when failed to remove application, that cause resourcemanager to crash
[ https://issues.apache.org/jira/browse/YARN-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14537640#comment-14537640 ]

lachisis commented on YARN-3614:

I use YARN HA for a stable service. After some months, I found that when the
standby resourcemanager tries to transition to active, it takes more than ten
minutes to load the applications. So I backed up the rmstore in HDFS and
changed yarn.resourcemanager.state-store.max-completed-applications to limit
the number of applications in rmstore. After that, the transition worked well.
Later my partner restored the backed-up rmstore and submitted a new
application, and the resourcemanager crashed.
I know that restoring a backed-up rmstore while the resourcemanager is running
is not appropriate. But it also shows that the processing logic of
FileSystemRMStateStore is a little weak, so I suggest a small change here.
[jira] [Commented] (YARN-3614) FileSystemRMStateStore throw exception when failed to remove application, that cause resourcemanager to crash
[ https://issues.apache.org/jira/browse/YARN-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14537632#comment-14537632 ]

lachisis commented on YARN-3614:

Sorry, terrible network. How can I delete the repeated replies?
[jira] [Commented] (YARN-3614) FileSystemRMStateStore throw exception when failed to remove application, that cause resourcemanager to crash
[ https://issues.apache.org/jira/browse/YARN-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14537645#comment-14537645 ]

lachisis commented on YARN-3614:

Thanks for the chance to provide the patch. I will submit the patch later.
[jira] [Created] (YARN-3614) FileSystemRMStateStore throw exception when failed to remove application, that cause resourcemanager to crash
lachisis created YARN-3614:

Summary: FileSystemRMStateStore throw exception when failed to remove application, that cause resourcemanager to crash
Key: YARN-3614
URL: https://issues.apache.org/jira/browse/YARN-3614
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.5.0
Reporter: lachisis
Priority: Critical

FileSystemRMStateStore is only an auxiliary plug-in of rmstore. When it fails to remove an application, I think a warning is enough, but currently the resourcemanager crashes.

Recently I configured yarn.resourcemanager.state-store.max-completed-applications to limit the number of applications in rmstore. When the number of applications exceeds the limit, some old applications are removed. If the removal fails, the resourcemanager crashes.

The following is the log:

2015-05-11 06:58:43,815 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Removing info for app: application_1430994493305_0053
2015-05-11 06:58:43,815 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: Removing info for app: application_1430994493305_0053 at: /hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1430994493305_0053
2015-05-11 06:58:43,816 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error removing app: application_1430994493305_0053
java.lang.Exception: Failed to delete /hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1430994493305_0053
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.deleteFile(FileSystemRMStateStore.java:572)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.removeApplicationStateInternal(FileSystemRMStateStore.java:471)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:185)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:171)
    at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
    at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
    at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:806)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:879)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:874)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
    at java.lang.Thread.run(Thread.java:745)
2015-05-11 06:58:43,819 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause:
java.lang.Exception: Failed to delete /hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1430994493305_0053
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.deleteFile(FileSystemRMStateStore.java:572)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.removeApplicationStateInternal(FileSystemRMStateStore.java:471)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:185)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:171)
    at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
    at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
    at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:806)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:879)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:874)
    at
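The behaviour asked for in this report (log a warning instead of raising when a delete reports failure) could be sketched roughly as below. This is only an illustration under stated assumptions, not Hadoop code: the class name LenientDelete is hypothetical, and java.nio.file.Files.deleteIfExists stands in for the state store's file system delete because it likewise returns false when nothing was removed. It handles a single file only, and genuine I/O errors are still allowed to propagate.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.logging.Logger;

// Hypothetical stand-in for the state store; this is not Hadoop code.
final class LenientDelete {
    private static final Logger LOG = Logger.getLogger(LenientDelete.class.getName());

    /**
     * Deletes a single file. Returns true if something was removed and false
     * if the path was already absent; in the latter case a warning is logged
     * instead of an exception being thrown. Real I/O errors still propagate.
     */
    static boolean deleteFile(Path deletePath) throws IOException {
        boolean deleted = Files.deleteIfExists(deletePath);
        if (!deleted) {
            LOG.warning("Failed to delete " + deletePath + " (already absent), continuing");
        }
        return deleted;
    }

    public static void main(String[] args) throws IOException {
        // Simulate an application state file that is removed twice.
        Path appState = Files.createTempFile("application_", ".state");
        System.out.println("first delete removed file: " + deleteFile(appState));
        System.out.println("second delete removed file: " + deleteFile(appState));
    }
}
```

With this shape, a caller that only needs the entry to be gone can ignore the return value, while a caller that cares can still inspect it. Whether a missing entry should really be treated as harmless in the resourcemanager is a design question for the issue itself.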
[jira] [Commented] (YARN-3614) FileSystemRMStateStore throw exception when failed to remove application, that cause resourcemanager to crash
[ https://issues.apache.org/jira/browse/YARN-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14537499#comment-14537499 ]

lachisis commented on YARN-3614:

Thanks for your attention.
I have downloaded 2.7.0 and reviewed the FileSystemRMStateStore.java implementation, but I don't think it fixes the issue I submitted.

The following is the code from 2.7.0. If fs.delete returns false, it still throws an Exception. I think a warning is enough here. Otherwise, if someone moves this application folder manually, the Exception will propagate through deleteFile, deleteFileWithRetries, and removeApplicationStateInternal.

    @Override
    public synchronized void removeApplicationStateInternal(
        ApplicationStateData appState) throws Exception {
      ApplicationId appId =
          appState.getApplicationSubmissionContext().getApplicationId();
      Path nodeRemovePath = getAppDir(rmAppRoot, appId);
      LOG.info("Removing info for app: " + appId + " at: " + nodeRemovePath);
      deleteFileWithRetries(nodeRemovePath);
    }

    private void deleteFileWithRetries(final Path deletePath) throws Exception {
      new FSAction<Void>() {
        @Override
        public Void run() throws Exception {
          deleteFile(deletePath);
          return null;
        }
      }.runWithRetries();
    }

    private void deleteFile(Path deletePath) throws Exception {
      if (!fs.delete(deletePath, true)) {
        throw new Exception("Failed to delete " + deletePath);
      }
    }

FileSystemRMStateStore throw exception when failed to remove application, that cause resourcemanager to crash

Key: YARN-3614
URL: https://issues.apache.org/jira/browse/YARN-3614
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.5.0
Reporter: lachisis
Priority: Critical

FileSystemRMStateStore is only an auxiliary plug-in of rmstore. When it fails to remove an application, I think a warning is enough, but currently the resourcemanager crashes.

Recently I configured yarn.resourcemanager.state-store.max-completed-applications to limit the number of applications in rmstore.
When the number of applications exceeds the limit, some old applications are removed. If the removal fails, the resourcemanager crashes.

The following is the log:

2015-05-11 06:58:43,815 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Removing info for app: application_1430994493305_0053
2015-05-11 06:58:43,815 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: Removing info for app: application_1430994493305_0053 at: /hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1430994493305_0053
2015-05-11 06:58:43,816 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error removing app: application_1430994493305_0053
java.lang.Exception: Failed to delete /hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1430994493305_0053
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.deleteFile(FileSystemRMStateStore.java:572)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.removeApplicationStateInternal(FileSystemRMStateStore.java:471)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:185)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:171)
    at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
    at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
    at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:806)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:879)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:874)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
    at java.lang.Thread.run(Thread.java:745)
2015-05-11 06:58:43,819 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of