[ 
https://issues.apache.org/jira/browse/YARN-7748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16542923#comment-16542923
 ] 

Weiwei Yang edited comment on YARN-7748 at 7/13/18 11:33 AM:
-------------------------------------------------------------

I spent some time investigating this issue, since it affects us too. I can 
reproduce it reliably by adding a sleep right after the following call:
{code:java}
// Kill the application
cs.handle(new AppAttemptRemovedSchedulerEvent(am1.getApplicationAttemptId(),
        RMAppAttemptState.KILLED, false));
// Sleep a few seconds so that all events are
// handled before verifying the metrics
Thread.sleep(3000);{code}
As [~haibochen] mentioned, the failure occurs because 
{{LeafQueue#finishApplicationAttempt}} is called *twice* for the *same* 
app attempt, which incorrectly decrements the metric to *-1*:
{code:java}
default #user-pending-applications: -1
{code}
This is how it happens:
 # The UT kills the app attempt by triggering an 
{{AppAttemptRemovedSchedulerEvent}}, which causes the leaf queue to call 
{{removeApplicationAttempt}} immediately.
 # The scheduler then releases all containers, including the AM container.
 # When the AM container is killed, it triggers an 
{{RMAppAttemptContainerFinishedEvent}}, and 
{{LeafQueue#removeApplicationAttempt}} is called again.
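
To make the sequence concrete, here is a minimal toy model of the double removal. It is not the real {{LeafQueue}} code: the class {{DoubleRemovalSketch}}, its methods, and the map-based counter are all hypothetical stand-ins for the per-user pending-applications metric. It only shows how an unguarded removal that runs twice for one attempt drives the counter below zero.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model (hypothetical, not YARN code) of a per-user pending-apps counter. */
class DoubleRemovalSketch {
    // Stand-in for the "#user-pending-applications" metric, keyed by user.
    private final Map<String, Integer> pendingAppsByUser = new HashMap<>();

    /** Registers one app attempt for the user and increments the counter. */
    void submitAttempt(String user) {
        pendingAppsByUser.merge(user, 1, Integer::sum);
    }

    /**
     * Unguarded removal: decrements the counter on every call,
     * even if the attempt has already been removed.
     */
    void removeAttempt(String user) {
        pendingAppsByUser.merge(user, -1, Integer::sum);
    }

    /** Returns the current counter value for the user (0 if unknown). */
    int pendingApps(String user) {
        return pendingAppsByUser.getOrDefault(user, 0);
    }

    public static void main(String[] args) {
        DoubleRemovalSketch queue = new DoubleRemovalSketch();
        queue.submitAttempt("default");

        // Step 1: the kill event removes the attempt.
        queue.removeAttempt("default");
        // Step 3: the AM container finishing removes the same attempt again.
        queue.removeAttempt("default");

        // prints "default #user-pending-applications: -1"
        System.out.println("default #user-pending-applications: "
                + queue.pendingApps("default"));
    }
}
```

In the real scheduler the two calls arrive through asynchronous events, which is presumably why the extra sleep is needed to see the second one before the metrics are verified.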

To fix this, I posted a patch that adds a check before 
{{LeafQueue#removeApplicationAttempt}} to make sure the app attempt still 
exists at the time it is removed. The patch also disables app-attempt restart 
in the UT to avoid a separate issue (this UT is supposed to check resources 
for only one app attempt). With these two changes together, the UT should 
pass reliably.
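
The idea behind the check can be sketched as follows. This is only an illustration under my own assumptions, not the contents of the attached patch: {{GuardedRemovalSketch}}, the {{liveAttempts}} set, and the method names are hypothetical. The removal first verifies that the attempt is still tracked, so a second call for the same attempt becomes a no-op and the counter is left alone.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model (hypothetical, not YARN code) of an idempotent attempt removal. */
class GuardedRemovalSketch {
    // Attempts that are currently known to the queue.
    private final Set<String> liveAttempts = new HashSet<>();
    // Stand-in for the per-user pending-applications metric.
    private final Map<String, Integer> pendingAppsByUser = new HashMap<>();

    /** Tracks a new attempt; a duplicate submit does not count twice. */
    void submitAttempt(String attemptId, String user) {
        if (liveAttempts.add(attemptId)) {
            pendingAppsByUser.merge(user, 1, Integer::sum);
        }
    }

    /**
     * Removes the attempt only if it still exists.
     *
     * @return true if the attempt was removed, false if it was already gone
     */
    boolean removeAttempt(String attemptId, String user) {
        if (!liveAttempts.remove(attemptId)) {
            // Already removed by an earlier event: leave the metric untouched.
            return false;
        }
        pendingAppsByUser.merge(user, -1, Integer::sum);
        return true;
    }

    /** Returns the current counter value for the user (0 if unknown). */
    int pendingApps(String user) {
        return pendingAppsByUser.getOrDefault(user, 0);
    }

    public static void main(String[] args) {
        GuardedRemovalSketch queue = new GuardedRemovalSketch();
        queue.submitAttempt("attempt_1", "default");

        boolean first = queue.removeAttempt("attempt_1", "default");
        boolean second = queue.removeAttempt("attempt_1", "default");

        // prints "first=true second=false pending=0"
        System.out.println("first=" + first + " second=" + second
                + " pending=" + queue.pendingApps("default"));
    }
}
```

Returning a boolean is just a convenience for the sketch; the point is that the existence check and the metric update happen together, so whichever event arrives second finds nothing left to remove.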

[~snemeth], [~haibochen], please help review.

Thanks


> TestContainerResizing.testIncreaseContainerUnreservedWhenApplicationCompleted failed
> ------------------------------------------------------------------------------------
>
>                 Key: YARN-7748
>                 URL: https://issues.apache.org/jira/browse/YARN-7748
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>    Affects Versions: 3.0.0
>            Reporter: Haibo Chen
>            Assignee: Szilard Nemeth
>            Priority: Major
>         Attachments: YARN-7748.001.patch
>
>
> TestContainerResizing.testIncreaseContainerUnreservedWhenApplicationCompleted
> Failing for the past 1 build (Since Failed#19244 )
> Took 0.4 sec.
> *Error Message*
> expected null, but was:<org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.UsersManager$User@6193932a>
> *Stacktrace*
> {code}
> java.lang.AssertionError: expected null, but was:<org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.UsersManager$User@6193932a>
>       at org.junit.Assert.fail(Assert.java:88)
>       at org.junit.Assert.failNotNull(Assert.java:664)
>       at org.junit.Assert.assertNull(Assert.java:646)
>       at org.junit.Assert.assertNull(Assert.java:656)
>       at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestContainerResizing.testIncreaseContainerUnreservedWhenApplicationCompleted(TestContainerResizing.java:826)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:498)
>       at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
>       at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>       at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
>       at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>       at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>       at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
>       at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
>       at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
>       at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
>       at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
>       at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
>       at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
>       at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
>       at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
>       at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:369)
>       at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:275)
>       at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:239)
>       at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:160)
>       at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:373)
>       at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:334)
>       at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:119)
>       at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:407)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
