[jira] [Updated] (YARN-2816) NM fail to start with NPE during container recovery

2015-08-27 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-2816:
--
Fix Version/s: 2.6.1

Pulled this into 2.6.1. Ran compilation and TestNMLeveldbStateStoreService 
before the push. Patch applied cleanly.

 NM fail to start with NPE during container recovery
 ---

 Key: YARN-2816
 URL: https://issues.apache.org/jira/browse/YARN-2816
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.5.0
Reporter: zhihai xu
Assignee: zhihai xu
  Labels: 2.6.1-candidate
 Fix For: 2.7.0, 2.6.1

 Attachments: YARN-2816.000.patch, YARN-2816.001.patch, 
 YARN-2816.002.patch, leveldb_records.txt


 The NM fails to start with an NPE during container recovery.
 We saw the following crash:
 2014-10-30 22:22:37,211 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl failed in state INITED; cause: java.lang.NullPointerException
 java.lang.NullPointerException
   at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recoverContainer(ContainerManagerImpl.java:289)
   at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover(ContainerManagerImpl.java:252)
   at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:235)
   at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
   at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:250)
   at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:445)
   at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492)
 The cause is that some DB files used by NMLeveldbStateStoreService were 
 accidentally deleted from /tmp/hadoop-yarn/yarn-nm-recovery/yarn-nm-state to 
 save disk space. This left some incomplete container records that have no 
 CONTAINER_REQUEST_KEY_SUFFIX (startRequest) entry in the DB. When such a 
 container is recovered in ContainerManagerImpl#recoverContainer, 
 rcs.getStartRequest() returns null, and the NullPointerException thrown by 
 the following code shuts down the NM.
 {code}
 StartContainerRequest req = rcs.getStartRequest();
 ContainerLaunchContext launchContext = req.getContainerLaunchContext();
 {code}
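
 For illustration only, here is a minimal sketch of one way recovery could 
 tolerate a record whose start request is missing. It is not the attached 
 patch and does not use the Hadoop classes: RecoverySketch, 
 RecoveredContainer, load, recoverAll and the "/request" suffix value are all 
 hypothetical stand-ins, and the state store is modelled as a plain map.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration only; these are NOT the Hadoop classes.
class RecoverySketch {
    // Assumed value; it only plays the role of CONTAINER_REQUEST_KEY_SUFFIX here.
    static final String REQUEST_SUFFIX = "/request";

    /** Minimal recovered-container record: the start request may be missing. */
    static final class RecoveredContainer {
        final String containerId;
        final String startRequest; // null when the request entry was lost

        RecoveredContainer(String containerId, String startRequest) {
            this.containerId = containerId;
            this.startRequest = startRequest;
        }
    }

    /** Builds one record per container ID from a flat key/value store. */
    static List<RecoveredContainer> load(Map<String, String> db, List<String> ids) {
        List<RecoveredContainer> out = new ArrayList<>();
        for (String id : ids) {
            // Map.get returns null when the request entry is absent.
            out.add(new RecoveredContainer(id, db.get(id + REQUEST_SUFFIX)));
        }
        return out;
    }

    /** Returns the IDs that can be recovered; incomplete records are skipped. */
    static List<String> recoverAll(List<RecoveredContainer> records) {
        List<String> recovered = new ArrayList<>();
        for (RecoveredContainer rcs : records) {
            if (rcs.startRequest == null) {
                // Incomplete record: report and skip instead of dereferencing null.
                System.err.println("Skipping incomplete record for " + rcs.containerId);
                continue;
            }
            recovered.add(rcs.containerId);
        }
        return recovered;
    }

    public static void main(String[] args) {
        Map<String, String> db = new LinkedHashMap<>();
        db.put("container_1/request", "launch-context-1");
        // container_2 has no "/request" entry, as after the accidental deletion.
        db.put("container_2/diagnostics", "running");
        List<String> ids = Arrays.asList("container_1", "container_2");
        System.out.println(recoverAll(load(db, ids)));
    }
}
```

 Skipping keeps startup from aborting, at the cost of not recovering the 
 affected container. Whether to skip, clean up the stale record, or fail 
 fast is a policy choice this sketch does not settle.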



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2816) NM fail to start with NPE during container recovery

2015-07-17 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-2816:
--
Labels: 2.6.1-candidate  (was: )

 NM fail to start with NPE during container recovery
 ---

 Key: YARN-2816
 URL: https://issues.apache.org/jira/browse/YARN-2816
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.5.0
Reporter: zhihai xu
Assignee: zhihai xu
  Labels: 2.6.1-candidate
 Fix For: 2.7.0

 Attachments: YARN-2816.000.patch, YARN-2816.001.patch, 
 YARN-2816.002.patch, leveldb_records.txt




[jira] [Updated] (YARN-2816) NM fail to start with NPE during container recovery

2014-11-12 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-2816:

Attachment: YARN-2816.002.patch

 NM fail to start with NPE during container recovery
 ---

 Key: YARN-2816
 URL: https://issues.apache.org/jira/browse/YARN-2816
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.5.0
Reporter: zhihai xu
Assignee: zhihai xu
 Attachments: YARN-2816.000.patch, YARN-2816.001.patch, 
 YARN-2816.002.patch, leveldb_records.txt




[jira] [Updated] (YARN-2816) NM fail to start with NPE during container recovery

2014-11-07 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-2816:

Attachment: YARN-2816.001.patch

 NM fail to start with NPE during container recovery
 ---

 Key: YARN-2816
 URL: https://issues.apache.org/jira/browse/YARN-2816
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.5.0
Reporter: zhihai xu
Assignee: zhihai xu
 Attachments: YARN-2816.000.patch, YARN-2816.001.patch, 
 leveldb_records.txt




[jira] [Updated] (YARN-2816) NM fail to start with NPE during container recovery

2014-11-06 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-2816:

Attachment: leveldb_records.txt

 NM fail to start with NPE during container recovery
 ---

 Key: YARN-2816
 URL: https://issues.apache.org/jira/browse/YARN-2816
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.5.0
Reporter: zhihai xu
Assignee: zhihai xu
 Attachments: YARN-2816.000.patch, leveldb_records.txt




[jira] [Updated] (YARN-2816) NM fail to start with NPE during container recovery

2014-11-06 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-2816:

Priority: Major  (was: Critical)

 NM fail to start with NPE during container recovery
 ---

 Key: YARN-2816
 URL: https://issues.apache.org/jira/browse/YARN-2816
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.5.0
Reporter: zhihai xu
Assignee: zhihai xu
 Attachments: YARN-2816.000.patch, leveldb_records.txt




[jira] [Updated] (YARN-2816) NM fail to start with NPE during container recovery

2014-11-05 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-2816:

Attachment: YARN-2816.000.patch

 NM fail to start with NPE during container recovery
 ---

 Key: YARN-2816
 URL: https://issues.apache.org/jira/browse/YARN-2816
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.5.0
Reporter: zhihai xu
Assignee: zhihai xu
 Attachments: YARN-2816.000.patch




[jira] [Updated] (YARN-2816) NM fail to start with NPE during container recovery

2014-11-05 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-2816:

Priority: Critical  (was: Major)

 NM fail to start with NPE during container recovery
 ---

 Key: YARN-2816
 URL: https://issues.apache.org/jira/browse/YARN-2816
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.5.0
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Critical
 Attachments: YARN-2816.000.patch

