[jira] [Commented] (YARN-3790) TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in trunk for FS scheduler

zhihai xu (JIRA) Wed, 10 Jun 2015 12:24:48 -0700

    [ 
https://issues.apache.org/jira/browse/YARN-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14580965#comment-14580965
 ]


zhihai xu commented on YARN-3790:
---------------------------------

[~rohithsharma], thanks for updating the title.
The containers are recovered. {{rootMetrics}}'s used resource is also updated, 
But {{rootMetrics}}'s available resource is not updated.
The following logs in the failed test proved it:
{code}
2015-06-09 22:55:42,964 INFO  [ResourceManager Event Processor] 
fair.FairScheduler (FairScheduler.java:addNode(855)) - Added node 
127.0.0.1:1234 cluster capacity: <memory:8192, vCores:8>
2015-06-09 22:55:42,964 DEBUG [AsyncDispatcher event handler] rmapp.RMAppImpl 
(RMAppImpl.java:handle(756)) - Processing event for 
application_1433915736884_0001 of type NODE_UPDATE
2015-06-09 22:55:42,964 DEBUG [AsyncDispatcher event handler] rmapp.RMAppImpl 
(RMAppImpl.java:processNodeUpdate(820)) - Received node update 
event:NODE_USABLE for node:127.0.0.1:1234 with state:RUNNING
2015-06-09 22:55:42,964 DEBUG [FairSchedulerUpdateThread] fair.FSLeafQueue 
(FSLeafQueue.java:updateDemand(287)) - The updated demand for root.default is 
<memory:0, vCores:0>; the max is <memory:2147483647, vCores:2147483647>
2015-06-09 22:55:42,964 DEBUG [FairSchedulerUpdateThread] fair.FSLeafQueue 
(FSLeafQueue.java:updateDemand(289)) - The updated fairshare for root.default 
is <memory:0, vCores:0>
2015-06-09 22:55:42,964 DEBUG [FairSchedulerUpdateThread] fair.FSParentQueue 
(FSParentQueue.java:updateDemand(163)) - Counting resource from root.default 
<memory:0, vCores:0>; Total resource consumption for root now <memory:0, 
vCores:0>
2015-06-09 22:55:42,964 DEBUG [FairSchedulerUpdateThread] fair.FSLeafQueue 
(FSLeafQueue.java:updateDemandForApp(298)) - Counting resource from 
application_1433915736884_0001 <memory:0, vCores:0>; Total resource consumption 
for root.zxu now <memory:0, vCores:0>
2015-06-09 22:55:42,964 DEBUG [FairSchedulerUpdateThread] fair.FSLeafQueue 
(FSLeafQueue.java:updateDemand(287)) - The updated demand for root.zxu is 
<memory:0, vCores:0>; the max is <memory:2147483647, vCores:2147483647>
2015-06-09 22:55:42,964 DEBUG [FairSchedulerUpdateThread] fair.FSLeafQueue 
(FSLeafQueue.java:updateDemand(289)) - The updated fairshare for root.zxu is 
<memory:0, vCores:0>
2015-06-09 22:55:42,964 DEBUG [FairSchedulerUpdateThread] fair.FSParentQueue 
(FSParentQueue.java:updateDemand(163)) - Counting resource from root.zxu 
<memory:0, vCores:0>; Total resource consumption for root now <memory:0, 
vCores:0>
2015-06-09 22:55:42,964 DEBUG [FairSchedulerUpdateThread] fair.FSParentQueue 
(FSParentQueue.java:updateDemand(177)) - The updated demand for root is 
<memory:0, vCores:0>; the max is <memory:2147483647, vCores:2147483647>
2015-06-09 22:55:42,964 DEBUG [FairSchedulerUpdateThread] fair.FSQueue 
(FSQueue.java:setFairShare(196)) - The updated fairShare for root is 
<memory:8192, vCores:8>
2015-06-09 22:55:42,965 INFO  [ResourceManager Event Processor] 
scheduler.AbstractYarnScheduler 
(AbstractYarnScheduler.java:recoverContainersOnNode(349)) - Recovering 
container container_id { app_attempt_id { application_id { id: 1 
cluster_timestamp: 1433915736884 } attemptId: 1 } id: 1 } container_state: 
C_RUNNING resource { memory: 1024 virtual_cores: 1 } priority { priority: 0 } 
diagnostics: "recover container" container_exit_status: 0 creation_time: 0 
nodeLabelExpression: ""
2015-06-09 22:55:42,965 DEBUG [ResourceManager Event Processor] 
rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(382)) - Processing 
container_1433915736884_0001_01_000001 of type RECOVER
2015-06-09 22:55:42,965 DEBUG [AsyncDispatcher event handler] 
event.AsyncDispatcher (AsyncDispatcher.java:dispatch(167)) - Dispatching the 
event 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppRunningOnNodeEvent.EventType:
 APP_RUNNING_ON_NODE
2015-06-09 22:55:42,965 INFO  [ResourceManager Event Processor] 
rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(394)) - 
container_1433915736884_0001_01_000001 Container Transitioned from NEW to 
RUNNING
2015-06-09 22:55:42,965 DEBUG [AsyncDispatcher event handler] rmapp.RMAppImpl 
(RMAppImpl.java:handle(756)) - Processing event for 
application_1433915736884_0001 of type APP_RUNNING_ON_NODE
2015-06-09 22:55:42,965 INFO  [ResourceManager Event Processor] 
scheduler.SchedulerNode (SchedulerNode.java:allocateContainer(154)) - Assigned 
container container_1433915736884_0001_01_000001 of capacity <memory:1024, 
vCores:1> on host 127.0.0.1:1234, which has 1 containers, <memory:1024, 
vCores:1> used and <memory:7168, vCores:7> available after allocation
2015-06-09 22:55:42,966 INFO  [ResourceManager Event Processor] 
scheduler.SchedulerApplicationAttempt 
(SchedulerApplicationAttempt.java:recoverContainer(651)) - SchedulerAttempt 
appattempt_1433915736884_0001_000001 is recovering container 
container_1433915736884_0001_01_000001
2015-06-09 22:55:42,966 INFO  [ResourceManager Event Processor] 
scheduler.AbstractYarnScheduler 
(AbstractYarnScheduler.java:recoverContainersOnNode(349)) - Recovering 
container container_id { app_attempt_id { application_id { id: 1 
cluster_timestamp: 1433915736884 } attemptId: 1 } id: 2 } container_state: 
C_RUNNING resource { memory: 1024 virtual_cores: 1 } priority { priority: 0 } 
diagnostics: "recover container" container_exit_status: 0 creation_time: 0 
nodeLabelExpression: ""
2015-06-09 22:55:42,966 DEBUG [ResourceManager Event Processor] 
rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(382)) - Processing 
container_1433915736884_0001_01_000002 of type RECOVER
2015-06-09 22:55:42,966 DEBUG [AsyncDispatcher event handler] 
event.AsyncDispatcher (AsyncDispatcher.java:dispatch(167)) - Dispatching the 
event 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppRunningOnNodeEvent.EventType:
 APP_RUNNING_ON_NODE
2015-06-09 22:55:42,966 INFO  [ResourceManager Event Processor] 
rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(394)) - 
container_1433915736884_0001_01_000002 Container Transitioned from NEW to 
RUNNING
2015-06-09 22:55:42,967 DEBUG [AsyncDispatcher event handler] rmapp.RMAppImpl 
(RMAppImpl.java:handle(756)) - Processing event for 
application_1433915736884_0001 of type APP_RUNNING_ON_NODE
2015-06-09 22:55:42,967 INFO  [ResourceManager Event Processor] 
scheduler.SchedulerNode (SchedulerNode.java:allocateContainer(154)) - Assigned 
container container_1433915736884_0001_01_000002 of capacity <memory:1024, 
vCores:1> on host 127.0.0.1:1234, which has 2 containers, <memory:2048, 
vCores:2> used and <memory:6144, vCores:6> available after allocation
2015-06-09 22:55:42,967 INFO  [ResourceManager Event Processor] 
scheduler.SchedulerApplicationAttempt 
(SchedulerApplicationAttempt.java:recoverContainer(651)) - SchedulerAttempt 
appattempt_1433915736884_0001_000001 is recovering container 
container_1433915736884_0001_01_000002
2015-06-09 22:55:42,967 INFO  [ResourceManager Event Processor] 
scheduler.AbstractYarnScheduler 
(AbstractYarnScheduler.java:recoverContainersOnNode(349)) - Recovering 
container container_id { app_attempt_id { application_id { id: 1 
cluster_timestamp: 1433915736884 } attemptId: 1 } id: 3 } container_state: 
C_COMPLETE resource { memory: 1024 virtual_cores: 1 } priority { priority: 0 } 
diagnostics: "recover container" container_exit_status: 0 creation_time: 0 
nodeLabelExpression: ""
2015-06-09 22:55:42,967 DEBUG [ResourceManager Event Processor] 
rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(382)) - Processing 
container_1433915736884_0001_01_000003 of type RECOVER
2015-06-09 22:55:42,967 DEBUG [AsyncDispatcher event handler] 
event.AsyncDispatcher (AsyncDispatcher.java:dispatch(167)) - Dispatching the 
event 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.event.RMAppAttemptContainerFinishedEvent.EventType:
 CONTAINER_FINISHED
2015-06-09 22:55:42,967 INFO  [ResourceManager Event Processor] 
rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(394)) - 
container_1433915736884_0001_01_000003 Container Transitioned from NEW to 
COMPLETED
2015-06-09 22:55:42,967 DEBUG [AsyncDispatcher event handler] 
attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(781)) - Processing event 
for appattempt_1433915736884_0001_000001 of type CONTAINER_FINISHED
Container State is : RUNNING
Container State is : RUNNING
Received completed containers [ContainerStatus: [ContainerId: 
container_1433915736884_0001_01_000003, State: COMPLETE, Diagnostics: recover 
container, ExitStatus: 0, ]]
2015-06-09 22:55:43,266 DEBUG [Thread-645] fair.FSAppAttempt 
(FSAppAttempt.java:getHeadroom(200)) - Headroom calculation for 
application_1433915736884_0001:Min((queueFairShare=<memory:8192, vCores:8> - 
queueUsage=<memory:2048, vCores:2>), maxAvailableResource=<memory:6144, 
vCores:6>Headroom=<memory:6144, vCores:6>
{code}
The Headroom calculation log show queueUsage=<memory:2048, vCores:2>, the code 
which update  {{rootMetrics}}'s used resource is at 
{{AppSchedulingInfo#recoverContainer}}
{code}
metrics.allocateResources(user, 1, rmContainer.getAllocatedResource(), false);
{code}
And we can find the {{update}}/{{updateDemand}} is called by UpdateThread 
before {{recoverContainersOnNode}}.

The following is the log from the succeeded test.
{code}
2015-06-09 23:40:53,432 INFO  [ResourceManager Event Processor] 
fair.FairScheduler (FairScheduler.java:addNode(855)) - Added node 
127.0.0.1:1234 cluster capacity: <memory:8192, vCores:8>
2015-06-09 23:40:53,432 DEBUG [AsyncDispatcher event handler] rmapp.RMAppImpl 
(RMAppImpl.java:handle(756)) - Processing event for 
application_1433918447351_0001 of type NODE_UPDATE
2015-06-09 23:40:53,432 DEBUG [AsyncDispatcher event handler] rmapp.RMAppImpl 
(RMAppImpl.java:processNodeUpdate(820)) - Received node update 
event:NODE_USABLE for node:127.0.0.1:1234 with state:RUNNING
2015-06-09 23:40:53,432 INFO  [ResourceManager Event Processor] 
scheduler.AbstractYarnScheduler 
(AbstractYarnScheduler.java:recoverContainersOnNode(349)) - Recovering 
container container_id { app_attempt_id { application_id { id: 1 
cluster_timestamp: 1433918447351 } attemptId: 1 } id: 1 } container_state: 
C_RUNNING resource { memory: 1024 virtual_cores: 1 } priority { priority: 0 } 
diagnostics: "recover container" container_exit_status: 0 creation_time: 0 
nodeLabelExpression: ""
2015-06-09 23:40:53,433 DEBUG [ResourceManager Event Processor] 
rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(382)) - Processing 
container_1433918447351_0001_01_000001 of type RECOVER
2015-06-09 23:40:53,433 INFO  [ResourceManager Event Processor] 
rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(394)) - 
container_1433918447351_0001_01_000001 Container Transitioned from NEW to 
RUNNING
2015-06-09 23:40:53,433 DEBUG [AsyncDispatcher event handler] 
event.AsyncDispatcher (AsyncDispatcher.java:dispatch(167)) - Dispatching the 
event 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppRunningOnNodeEvent.EventType:
 APP_RUNNING_ON_NODE
2015-06-09 23:40:53,433 INFO  [ResourceManager Event Processor] 
scheduler.SchedulerNode (SchedulerNode.java:allocateContainer(154)) - Assigned 
container container_1433918447351_0001_01_000001 of capacity <memory:1024, 
vCores:1> on host 127.0.0.1:1234, which has 1 containers, <memory:1024, 
vCores:1> used and <memory:7168, vCores:7> available after allocation
2015-06-09 23:40:53,433 DEBUG [AsyncDispatcher event handler] rmapp.RMAppImpl 
(RMAppImpl.java:handle(756)) - Processing event for 
application_1433918447351_0001 of type APP_RUNNING_ON_NODE
2015-06-09 23:40:53,434 INFO  [ResourceManager Event Processor] 
scheduler.SchedulerApplicationAttempt 
(SchedulerApplicationAttempt.java:recoverContainer(651)) - SchedulerAttempt 
appattempt_1433918447351_0001_000001 is recovering container 
container_1433918447351_0001_01_000001
2015-06-09 23:40:53,435 INFO  [ResourceManager Event Processor] 
scheduler.AbstractYarnScheduler 
(AbstractYarnScheduler.java:recoverContainersOnNode(349)) - Recovering 
container container_id { app_attempt_id { application_id { id: 1 
cluster_timestamp: 1433918447351 } attemptId: 1 } id: 2 } container_state: 
C_RUNNING resource { memory: 1024 virtual_cores: 1 } priority { priority: 0 } 
diagnostics: "recover container" container_exit_status: 0 creation_time: 0 
nodeLabelExpression: ""
2015-06-09 23:40:53,435 DEBUG [ResourceManager Event Processor] 
rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(382)) - Processing 
container_1433918447351_0001_01_000002 of type RECOVER
2015-06-09 23:40:53,435 INFO  [ResourceManager Event Processor] 
rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(394)) - 
container_1433918447351_0001_01_000002 Container Transitioned from NEW to 
RUNNING
2015-06-09 23:40:53,435 DEBUG [AsyncDispatcher event handler] 
event.AsyncDispatcher (AsyncDispatcher.java:dispatch(167)) - Dispatching the 
event 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppRunningOnNodeEvent.EventType:
 APP_RUNNING_ON_NODE
2015-06-09 23:40:53,435 INFO  [ResourceManager Event Processor] 
scheduler.SchedulerNode (SchedulerNode.java:allocateContainer(154)) - Assigned 
container container_1433918447351_0001_01_000002 of capacity <memory:1024, 
vCores:1> on host 127.0.0.1:1234, which has 2 containers, <memory:2048, 
vCores:2> used and <memory:6144, vCores:6> available after allocation
2015-06-09 23:40:53,435 DEBUG [AsyncDispatcher event handler] rmapp.RMAppImpl 
(RMAppImpl.java:handle(756)) - Processing event for 
application_1433918447351_0001 of type APP_RUNNING_ON_NODE
2015-06-09 23:40:53,435 INFO  [ResourceManager Event Processor] 
scheduler.SchedulerApplicationAttempt 
(SchedulerApplicationAttempt.java:recoverContainer(651)) - SchedulerAttempt 
appattempt_1433918447351_0001_000001 is recovering container 
container_1433918447351_0001_01_000002
2015-06-09 23:40:53,436 INFO  [ResourceManager Event Processor] 
scheduler.AbstractYarnScheduler 
(AbstractYarnScheduler.java:recoverContainersOnNode(349)) - Recovering 
container container_id { app_attempt_id { application_id { id: 1 
cluster_timestamp: 1433918447351 } attemptId: 1 } id: 3 } container_state: 
C_COMPLETE resource { memory: 1024 virtual_cores: 1 } priority { priority: 0 } 
diagnostics: "recover container" container_exit_status: 0 creation_time: 0 
nodeLabelExpression: ""
2015-06-09 23:40:53,436 DEBUG [ResourceManager Event Processor] 
rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(382)) - Processing 
container_1433918447351_0001_01_000003 of type RECOVER
2015-06-09 23:40:53,436 INFO  [ResourceManager Event Processor] 
rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(394)) - 
container_1433918447351_0001_01_000003 Container Transitioned from NEW to 
COMPLETED
2015-06-09 23:40:53,436 DEBUG [AsyncDispatcher event handler] 
event.AsyncDispatcher (AsyncDispatcher.java:dispatch(167)) - Dispatching the 
event 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.event.RMAppAttemptContainerFinishedEvent.EventType:
 CONTAINER_FINISHED
2015-06-09 23:40:53,436 DEBUG [AsyncDispatcher event handler] 
attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(781)) - Processing event 
for appattempt_1433918447351_0001_000001 of type CONTAINER_FINISHED
2015-06-09 23:40:53,436 DEBUG [FairSchedulerUpdateThread] fair.FSLeafQueue 
(FSLeafQueue.java:updateDemand(287)) - The updated demand for root.default is 
<memory:0, vCores:0>; the max is <memory:2147483647, vCores:2147483647>
2015-06-09 23:40:53,436 DEBUG [FairSchedulerUpdateThread] fair.FSLeafQueue 
(FSLeafQueue.java:updateDemand(289)) - The updated fairshare for root.default 
is <memory:0, vCores:0>
2015-06-09 23:40:53,436 DEBUG [FairSchedulerUpdateThread] fair.FSParentQueue 
(FSParentQueue.java:updateDemand(163)) - Counting resource from root.default 
<memory:0, vCores:0>; Total resource consumption for root now <memory:0, 
vCores:0>
2015-06-09 23:40:53,437 DEBUG [FairSchedulerUpdateThread] fair.FSLeafQueue 
(FSLeafQueue.java:updateDemandForApp(298)) - Counting resource from 
application_1433918447351_0001 <memory:2048, vCores:2>; Total resource 
consumption for root.zxu now <memory:0, vCores:0>
2015-06-09 23:40:53,437 DEBUG [FairSchedulerUpdateThread] fair.FSLeafQueue 
(FSLeafQueue.java:updateDemand(287)) - The updated demand for root.zxu is 
<memory:2048, vCores:2>; the max is <memory:2147483647, vCores:2147483647>
2015-06-09 23:40:53,437 DEBUG [FairSchedulerUpdateThread] fair.FSLeafQueue 
(FSLeafQueue.java:updateDemand(289)) - The updated fairshare for root.zxu is 
<memory:0, vCores:0>
2015-06-09 23:40:53,437 DEBUG [FairSchedulerUpdateThread] fair.FSParentQueue 
(FSParentQueue.java:updateDemand(163)) - Counting resource from root.zxu 
<memory:2048, vCores:2>; Total resource consumption for root now <memory:0, 
vCores:0>
2015-06-09 23:40:53,437 DEBUG [FairSchedulerUpdateThread] fair.FSParentQueue 
(FSParentQueue.java:updateDemand(177)) - The updated demand for root is 
<memory:2048, vCores:2>; the max is <memory:2147483647, vCores:2147483647>
2015-06-09 23:40:53,437 DEBUG [FairSchedulerUpdateThread] fair.FSQueue 
(FSQueue.java:setFairShare(196)) - The updated fairShare for root is 
<memory:8192, vCores:8>
Container State is : RUNNING
Container State is : RUNNING
Received completed containers [ContainerStatus: [ContainerId: 
container_1433918447351_0001_01_000003, State: COMPLETE, Diagnostics: recover 
container, ExitStatus: 0, ]]
2015-06-09 23:40:53,632 INFO  [Thread-643] 
resourcemanager.TestWorkPreservingRMRestart 
(TestWorkPreservingRMRestart.java:testSchedulerRecovery(240)) - 
testSchedulerRecovery end
2015-06-09 23:40:53,733 DEBUG [Thread-643] fair.FSAppAttempt 
(FSAppAttempt.java:getHeadroom(200)) - Headroom calculation for 
application_1433918447351_0001:Min((queueFairShare=<memory:8192, vCores:8> - 
queueUsage=<memory:2048, vCores:2>), maxAvailableResource=<memory:6144, 
vCores:6>Headroom=<memory:6144, vCores:6>
{code}
And we can find the {{update}}/{{updateDemand}} is called by UpdateThread after 
{{recoverContainersOnNode}}. The test passed, because {{update}} set 
{{rootMetrics}}'s available resource correctly after 
{{recoverContainersOnNode}} updated  {{rootMetrics}}'s used resource.

> TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in 
> trunk for FS scheduler
> ------------------------------------------------------------------------------------------------
>
>                 Key: YARN-3790
>                 URL: https://issues.apache.org/jira/browse/YARN-3790
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: test
>            Reporter: Rohith
>            Assignee: zhihai xu
>
> Failure trace is as follows
> {noformat}
> Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 284.078 sec 
> <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart
> testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart)
>   Time elapsed: 6.502 sec  <<< FAILURE!
> java.lang.AssertionError: expected:<6144> but was:<8192>
>       at org.junit.Assert.fail(Assert.java:88)
>       at org.junit.Assert.failNotEquals(Assert.java:743)
>       at org.junit.Assert.assertEquals(Assert.java:118)
>       at org.junit.Assert.assertEquals(Assert.java:555)
>       at org.junit.Assert.assertEquals(Assert.java:542)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (YARN-3790) TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in trunk for FS scheduler

Reply via email to