[
https://issues.apache.org/jira/browse/YARN-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14580965#comment-14580965
]
zhihai xu commented on YARN-3790:
---------------------------------
[~rohithsharma], thanks for updating the title.
The containers are recovered. {{rootMetrics}}'s used resource is also updated,
But {{rootMetrics}}'s available resource is not updated.
The following logs in the failed test proved it:
{code}
2015-06-09 22:55:42,964 INFO [ResourceManager Event Processor]
fair.FairScheduler (FairScheduler.java:addNode(855)) - Added node
127.0.0.1:1234 cluster capacity: <memory:8192, vCores:8>
2015-06-09 22:55:42,964 DEBUG [AsyncDispatcher event handler] rmapp.RMAppImpl
(RMAppImpl.java:handle(756)) - Processing event for
application_1433915736884_0001 of type NODE_UPDATE
2015-06-09 22:55:42,964 DEBUG [AsyncDispatcher event handler] rmapp.RMAppImpl
(RMAppImpl.java:processNodeUpdate(820)) - Received node update
event:NODE_USABLE for node:127.0.0.1:1234 with state:RUNNING
2015-06-09 22:55:42,964 DEBUG [FairSchedulerUpdateThread] fair.FSLeafQueue
(FSLeafQueue.java:updateDemand(287)) - The updated demand for root.default is
<memory:0, vCores:0>; the max is <memory:2147483647, vCores:2147483647>
2015-06-09 22:55:42,964 DEBUG [FairSchedulerUpdateThread] fair.FSLeafQueue
(FSLeafQueue.java:updateDemand(289)) - The updated fairshare for root.default
is <memory:0, vCores:0>
2015-06-09 22:55:42,964 DEBUG [FairSchedulerUpdateThread] fair.FSParentQueue
(FSParentQueue.java:updateDemand(163)) - Counting resource from root.default
<memory:0, vCores:0>; Total resource consumption for root now <memory:0,
vCores:0>
2015-06-09 22:55:42,964 DEBUG [FairSchedulerUpdateThread] fair.FSLeafQueue
(FSLeafQueue.java:updateDemandForApp(298)) - Counting resource from
application_1433915736884_0001 <memory:0, vCores:0>; Total resource consumption
for root.zxu now <memory:0, vCores:0>
2015-06-09 22:55:42,964 DEBUG [FairSchedulerUpdateThread] fair.FSLeafQueue
(FSLeafQueue.java:updateDemand(287)) - The updated demand for root.zxu is
<memory:0, vCores:0>; the max is <memory:2147483647, vCores:2147483647>
2015-06-09 22:55:42,964 DEBUG [FairSchedulerUpdateThread] fair.FSLeafQueue
(FSLeafQueue.java:updateDemand(289)) - The updated fairshare for root.zxu is
<memory:0, vCores:0>
2015-06-09 22:55:42,964 DEBUG [FairSchedulerUpdateThread] fair.FSParentQueue
(FSParentQueue.java:updateDemand(163)) - Counting resource from root.zxu
<memory:0, vCores:0>; Total resource consumption for root now <memory:0,
vCores:0>
2015-06-09 22:55:42,964 DEBUG [FairSchedulerUpdateThread] fair.FSParentQueue
(FSParentQueue.java:updateDemand(177)) - The updated demand for root is
<memory:0, vCores:0>; the max is <memory:2147483647, vCores:2147483647>
2015-06-09 22:55:42,964 DEBUG [FairSchedulerUpdateThread] fair.FSQueue
(FSQueue.java:setFairShare(196)) - The updated fairShare for root is
<memory:8192, vCores:8>
2015-06-09 22:55:42,965 INFO [ResourceManager Event Processor]
scheduler.AbstractYarnScheduler
(AbstractYarnScheduler.java:recoverContainersOnNode(349)) - Recovering
container container_id { app_attempt_id { application_id { id: 1
cluster_timestamp: 1433915736884 } attemptId: 1 } id: 1 } container_state:
C_RUNNING resource { memory: 1024 virtual_cores: 1 } priority { priority: 0 }
diagnostics: "recover container" container_exit_status: 0 creation_time: 0
nodeLabelExpression: ""
2015-06-09 22:55:42,965 DEBUG [ResourceManager Event Processor]
rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(382)) - Processing
container_1433915736884_0001_01_000001 of type RECOVER
2015-06-09 22:55:42,965 DEBUG [AsyncDispatcher event handler]
event.AsyncDispatcher (AsyncDispatcher.java:dispatch(167)) - Dispatching the
event
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppRunningOnNodeEvent.EventType:
APP_RUNNING_ON_NODE
2015-06-09 22:55:42,965 INFO [ResourceManager Event Processor]
rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(394)) -
container_1433915736884_0001_01_000001 Container Transitioned from NEW to
RUNNING
2015-06-09 22:55:42,965 DEBUG [AsyncDispatcher event handler] rmapp.RMAppImpl
(RMAppImpl.java:handle(756)) - Processing event for
application_1433915736884_0001 of type APP_RUNNING_ON_NODE
2015-06-09 22:55:42,965 INFO [ResourceManager Event Processor]
scheduler.SchedulerNode (SchedulerNode.java:allocateContainer(154)) - Assigned
container container_1433915736884_0001_01_000001 of capacity <memory:1024,
vCores:1> on host 127.0.0.1:1234, which has 1 containers, <memory:1024,
vCores:1> used and <memory:7168, vCores:7> available after allocation
2015-06-09 22:55:42,966 INFO [ResourceManager Event Processor]
scheduler.SchedulerApplicationAttempt
(SchedulerApplicationAttempt.java:recoverContainer(651)) - SchedulerAttempt
appattempt_1433915736884_0001_000001 is recovering container
container_1433915736884_0001_01_000001
2015-06-09 22:55:42,966 INFO [ResourceManager Event Processor]
scheduler.AbstractYarnScheduler
(AbstractYarnScheduler.java:recoverContainersOnNode(349)) - Recovering
container container_id { app_attempt_id { application_id { id: 1
cluster_timestamp: 1433915736884 } attemptId: 1 } id: 2 } container_state:
C_RUNNING resource { memory: 1024 virtual_cores: 1 } priority { priority: 0 }
diagnostics: "recover container" container_exit_status: 0 creation_time: 0
nodeLabelExpression: ""
2015-06-09 22:55:42,966 DEBUG [ResourceManager Event Processor]
rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(382)) - Processing
container_1433915736884_0001_01_000002 of type RECOVER
2015-06-09 22:55:42,966 DEBUG [AsyncDispatcher event handler]
event.AsyncDispatcher (AsyncDispatcher.java:dispatch(167)) - Dispatching the
event
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppRunningOnNodeEvent.EventType:
APP_RUNNING_ON_NODE
2015-06-09 22:55:42,966 INFO [ResourceManager Event Processor]
rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(394)) -
container_1433915736884_0001_01_000002 Container Transitioned from NEW to
RUNNING
2015-06-09 22:55:42,967 DEBUG [AsyncDispatcher event handler] rmapp.RMAppImpl
(RMAppImpl.java:handle(756)) - Processing event for
application_1433915736884_0001 of type APP_RUNNING_ON_NODE
2015-06-09 22:55:42,967 INFO [ResourceManager Event Processor]
scheduler.SchedulerNode (SchedulerNode.java:allocateContainer(154)) - Assigned
container container_1433915736884_0001_01_000002 of capacity <memory:1024,
vCores:1> on host 127.0.0.1:1234, which has 2 containers, <memory:2048,
vCores:2> used and <memory:6144, vCores:6> available after allocation
2015-06-09 22:55:42,967 INFO [ResourceManager Event Processor]
scheduler.SchedulerApplicationAttempt
(SchedulerApplicationAttempt.java:recoverContainer(651)) - SchedulerAttempt
appattempt_1433915736884_0001_000001 is recovering container
container_1433915736884_0001_01_000002
2015-06-09 22:55:42,967 INFO [ResourceManager Event Processor]
scheduler.AbstractYarnScheduler
(AbstractYarnScheduler.java:recoverContainersOnNode(349)) - Recovering
container container_id { app_attempt_id { application_id { id: 1
cluster_timestamp: 1433915736884 } attemptId: 1 } id: 3 } container_state:
C_COMPLETE resource { memory: 1024 virtual_cores: 1 } priority { priority: 0 }
diagnostics: "recover container" container_exit_status: 0 creation_time: 0
nodeLabelExpression: ""
2015-06-09 22:55:42,967 DEBUG [ResourceManager Event Processor]
rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(382)) - Processing
container_1433915736884_0001_01_000003 of type RECOVER
2015-06-09 22:55:42,967 DEBUG [AsyncDispatcher event handler]
event.AsyncDispatcher (AsyncDispatcher.java:dispatch(167)) - Dispatching the
event
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.event.RMAppAttemptContainerFinishedEvent.EventType:
CONTAINER_FINISHED
2015-06-09 22:55:42,967 INFO [ResourceManager Event Processor]
rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(394)) -
container_1433915736884_0001_01_000003 Container Transitioned from NEW to
COMPLETED
2015-06-09 22:55:42,967 DEBUG [AsyncDispatcher event handler]
attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(781)) - Processing event
for appattempt_1433915736884_0001_000001 of type CONTAINER_FINISHED
Container State is : RUNNING
Container State is : RUNNING
Received completed containers [ContainerStatus: [ContainerId:
container_1433915736884_0001_01_000003, State: COMPLETE, Diagnostics: recover
container, ExitStatus: 0, ]]
2015-06-09 22:55:43,266 DEBUG [Thread-645] fair.FSAppAttempt
(FSAppAttempt.java:getHeadroom(200)) - Headroom calculation for
application_1433915736884_0001:Min((queueFairShare=<memory:8192, vCores:8> -
queueUsage=<memory:2048, vCores:2>), maxAvailableResource=<memory:6144,
vCores:6>Headroom=<memory:6144, vCores:6>
{code}
The Headroom calculation log show queueUsage=<memory:2048, vCores:2>, the code
which update {{rootMetrics}}'s used resource is at
{{AppSchedulingInfo#recoverContainer}}
{code}
metrics.allocateResources(user, 1, rmContainer.getAllocatedResource(), false);
{code}
And we can find the {{update}}/{{updateDemand}} is called by UpdateThread
before {{recoverContainersOnNode}}.
The following is the log from the succeeded test.
{code}
2015-06-09 23:40:53,432 INFO [ResourceManager Event Processor]
fair.FairScheduler (FairScheduler.java:addNode(855)) - Added node
127.0.0.1:1234 cluster capacity: <memory:8192, vCores:8>
2015-06-09 23:40:53,432 DEBUG [AsyncDispatcher event handler] rmapp.RMAppImpl
(RMAppImpl.java:handle(756)) - Processing event for
application_1433918447351_0001 of type NODE_UPDATE
2015-06-09 23:40:53,432 DEBUG [AsyncDispatcher event handler] rmapp.RMAppImpl
(RMAppImpl.java:processNodeUpdate(820)) - Received node update
event:NODE_USABLE for node:127.0.0.1:1234 with state:RUNNING
2015-06-09 23:40:53,432 INFO [ResourceManager Event Processor]
scheduler.AbstractYarnScheduler
(AbstractYarnScheduler.java:recoverContainersOnNode(349)) - Recovering
container container_id { app_attempt_id { application_id { id: 1
cluster_timestamp: 1433918447351 } attemptId: 1 } id: 1 } container_state:
C_RUNNING resource { memory: 1024 virtual_cores: 1 } priority { priority: 0 }
diagnostics: "recover container" container_exit_status: 0 creation_time: 0
nodeLabelExpression: ""
2015-06-09 23:40:53,433 DEBUG [ResourceManager Event Processor]
rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(382)) - Processing
container_1433918447351_0001_01_000001 of type RECOVER
2015-06-09 23:40:53,433 INFO [ResourceManager Event Processor]
rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(394)) -
container_1433918447351_0001_01_000001 Container Transitioned from NEW to
RUNNING
2015-06-09 23:40:53,433 DEBUG [AsyncDispatcher event handler]
event.AsyncDispatcher (AsyncDispatcher.java:dispatch(167)) - Dispatching the
event
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppRunningOnNodeEvent.EventType:
APP_RUNNING_ON_NODE
2015-06-09 23:40:53,433 INFO [ResourceManager Event Processor]
scheduler.SchedulerNode (SchedulerNode.java:allocateContainer(154)) - Assigned
container container_1433918447351_0001_01_000001 of capacity <memory:1024,
vCores:1> on host 127.0.0.1:1234, which has 1 containers, <memory:1024,
vCores:1> used and <memory:7168, vCores:7> available after allocation
2015-06-09 23:40:53,433 DEBUG [AsyncDispatcher event handler] rmapp.RMAppImpl
(RMAppImpl.java:handle(756)) - Processing event for
application_1433918447351_0001 of type APP_RUNNING_ON_NODE
2015-06-09 23:40:53,434 INFO [ResourceManager Event Processor]
scheduler.SchedulerApplicationAttempt
(SchedulerApplicationAttempt.java:recoverContainer(651)) - SchedulerAttempt
appattempt_1433918447351_0001_000001 is recovering container
container_1433918447351_0001_01_000001
2015-06-09 23:40:53,435 INFO [ResourceManager Event Processor]
scheduler.AbstractYarnScheduler
(AbstractYarnScheduler.java:recoverContainersOnNode(349)) - Recovering
container container_id { app_attempt_id { application_id { id: 1
cluster_timestamp: 1433918447351 } attemptId: 1 } id: 2 } container_state:
C_RUNNING resource { memory: 1024 virtual_cores: 1 } priority { priority: 0 }
diagnostics: "recover container" container_exit_status: 0 creation_time: 0
nodeLabelExpression: ""
2015-06-09 23:40:53,435 DEBUG [ResourceManager Event Processor]
rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(382)) - Processing
container_1433918447351_0001_01_000002 of type RECOVER
2015-06-09 23:40:53,435 INFO [ResourceManager Event Processor]
rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(394)) -
container_1433918447351_0001_01_000002 Container Transitioned from NEW to
RUNNING
2015-06-09 23:40:53,435 DEBUG [AsyncDispatcher event handler]
event.AsyncDispatcher (AsyncDispatcher.java:dispatch(167)) - Dispatching the
event
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppRunningOnNodeEvent.EventType:
APP_RUNNING_ON_NODE
2015-06-09 23:40:53,435 INFO [ResourceManager Event Processor]
scheduler.SchedulerNode (SchedulerNode.java:allocateContainer(154)) - Assigned
container container_1433918447351_0001_01_000002 of capacity <memory:1024,
vCores:1> on host 127.0.0.1:1234, which has 2 containers, <memory:2048,
vCores:2> used and <memory:6144, vCores:6> available after allocation
2015-06-09 23:40:53,435 DEBUG [AsyncDispatcher event handler] rmapp.RMAppImpl
(RMAppImpl.java:handle(756)) - Processing event for
application_1433918447351_0001 of type APP_RUNNING_ON_NODE
2015-06-09 23:40:53,435 INFO [ResourceManager Event Processor]
scheduler.SchedulerApplicationAttempt
(SchedulerApplicationAttempt.java:recoverContainer(651)) - SchedulerAttempt
appattempt_1433918447351_0001_000001 is recovering container
container_1433918447351_0001_01_000002
2015-06-09 23:40:53,436 INFO [ResourceManager Event Processor]
scheduler.AbstractYarnScheduler
(AbstractYarnScheduler.java:recoverContainersOnNode(349)) - Recovering
container container_id { app_attempt_id { application_id { id: 1
cluster_timestamp: 1433918447351 } attemptId: 1 } id: 3 } container_state:
C_COMPLETE resource { memory: 1024 virtual_cores: 1 } priority { priority: 0 }
diagnostics: "recover container" container_exit_status: 0 creation_time: 0
nodeLabelExpression: ""
2015-06-09 23:40:53,436 DEBUG [ResourceManager Event Processor]
rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(382)) - Processing
container_1433918447351_0001_01_000003 of type RECOVER
2015-06-09 23:40:53,436 INFO [ResourceManager Event Processor]
rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(394)) -
container_1433918447351_0001_01_000003 Container Transitioned from NEW to
COMPLETED
2015-06-09 23:40:53,436 DEBUG [AsyncDispatcher event handler]
event.AsyncDispatcher (AsyncDispatcher.java:dispatch(167)) - Dispatching the
event
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.event.RMAppAttemptContainerFinishedEvent.EventType:
CONTAINER_FINISHED
2015-06-09 23:40:53,436 DEBUG [AsyncDispatcher event handler]
attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(781)) - Processing event
for appattempt_1433918447351_0001_000001 of type CONTAINER_FINISHED
2015-06-09 23:40:53,436 DEBUG [FairSchedulerUpdateThread] fair.FSLeafQueue
(FSLeafQueue.java:updateDemand(287)) - The updated demand for root.default is
<memory:0, vCores:0>; the max is <memory:2147483647, vCores:2147483647>
2015-06-09 23:40:53,436 DEBUG [FairSchedulerUpdateThread] fair.FSLeafQueue
(FSLeafQueue.java:updateDemand(289)) - The updated fairshare for root.default
is <memory:0, vCores:0>
2015-06-09 23:40:53,436 DEBUG [FairSchedulerUpdateThread] fair.FSParentQueue
(FSParentQueue.java:updateDemand(163)) - Counting resource from root.default
<memory:0, vCores:0>; Total resource consumption for root now <memory:0,
vCores:0>
2015-06-09 23:40:53,437 DEBUG [FairSchedulerUpdateThread] fair.FSLeafQueue
(FSLeafQueue.java:updateDemandForApp(298)) - Counting resource from
application_1433918447351_0001 <memory:2048, vCores:2>; Total resource
consumption for root.zxu now <memory:0, vCores:0>
2015-06-09 23:40:53,437 DEBUG [FairSchedulerUpdateThread] fair.FSLeafQueue
(FSLeafQueue.java:updateDemand(287)) - The updated demand for root.zxu is
<memory:2048, vCores:2>; the max is <memory:2147483647, vCores:2147483647>
2015-06-09 23:40:53,437 DEBUG [FairSchedulerUpdateThread] fair.FSLeafQueue
(FSLeafQueue.java:updateDemand(289)) - The updated fairshare for root.zxu is
<memory:0, vCores:0>
2015-06-09 23:40:53,437 DEBUG [FairSchedulerUpdateThread] fair.FSParentQueue
(FSParentQueue.java:updateDemand(163)) - Counting resource from root.zxu
<memory:2048, vCores:2>; Total resource consumption for root now <memory:0,
vCores:0>
2015-06-09 23:40:53,437 DEBUG [FairSchedulerUpdateThread] fair.FSParentQueue
(FSParentQueue.java:updateDemand(177)) - The updated demand for root is
<memory:2048, vCores:2>; the max is <memory:2147483647, vCores:2147483647>
2015-06-09 23:40:53,437 DEBUG [FairSchedulerUpdateThread] fair.FSQueue
(FSQueue.java:setFairShare(196)) - The updated fairShare for root is
<memory:8192, vCores:8>
Container State is : RUNNING
Container State is : RUNNING
Received completed containers [ContainerStatus: [ContainerId:
container_1433918447351_0001_01_000003, State: COMPLETE, Diagnostics: recover
container, ExitStatus: 0, ]]
2015-06-09 23:40:53,632 INFO [Thread-643]
resourcemanager.TestWorkPreservingRMRestart
(TestWorkPreservingRMRestart.java:testSchedulerRecovery(240)) -
testSchedulerRecovery end
2015-06-09 23:40:53,733 DEBUG [Thread-643] fair.FSAppAttempt
(FSAppAttempt.java:getHeadroom(200)) - Headroom calculation for
application_1433918447351_0001:Min((queueFairShare=<memory:8192, vCores:8> -
queueUsage=<memory:2048, vCores:2>), maxAvailableResource=<memory:6144,
vCores:6>Headroom=<memory:6144, vCores:6>
{code}
And we can find the {{update}}/{{updateDemand}} is called by UpdateThread after
{{recoverContainersOnNode}}. The test passed, because {{update}} set
{{rootMetrics}}'s available resource correctly after
{{recoverContainersOnNode}} updated {{rootMetrics}}'s used resource.
> TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in
> trunk for FS scheduler
> ------------------------------------------------------------------------------------------------
>
> Key: YARN-3790
> URL: https://issues.apache.org/jira/browse/YARN-3790
> Project: Hadoop YARN
> Issue Type: Bug
> Components: test
> Reporter: Rohith
> Assignee: zhihai xu
>
> Failure trace is as follows
> {noformat}
> Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 284.078 sec
> <<< FAILURE! - in
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart
> testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart)
> Time elapsed: 6.502 sec <<< FAILURE!
> java.lang.AssertionError: expected:<6144> but was:<8192>
> at org.junit.Assert.fail(Assert.java:88)
> at org.junit.Assert.failNotEquals(Assert.java:743)
> at org.junit.Assert.assertEquals(Assert.java:118)
> at org.junit.Assert.assertEquals(Assert.java:555)
> at org.junit.Assert.assertEquals(Assert.java:542)
> at
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853)
> at
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342)
> at
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241)
> {noformat}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)