[ 
https://issues.apache.org/jira/browse/YARN-9311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16771209#comment-16771209
 ] 

Prabhu Joseph edited comment on YARN-9311 at 2/18/19 4:40 PM:
--------------------------------------------------------------

[~rohithsharma] [~sunilg] 
{{TestRMRestart#testRMStateStoreDispatcherDrainedOnRMStop}} spins in an infinite 
loop during {{MockRM}} start, in {{MemoryRMStateStore#handleStoreEvent}} at 
{{while (wait)}}. The spinning thread holds the lock taken in 
{{AbstractService}}, which blocks all other test cases when they start or stop 
{{MockRM}} and causes the timeout. I have fixed this by running the RM start in 
a separate thread. With that change, the {{MockRM}} stop at the end of the test 
hung waiting for the {{AbstractService}} lock. I fixed that by explicitly 
calling {{RMStateStore#close}} so that the while loop exits and releases the 
lock.
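
To illustrate the pattern, here is a minimal, self-contained sketch. It is not the patch and does not use the real {{MockRM}} or {{RMStateStore}} classes: {{SpinLockSketch}}, {{SpinningService}} and all method names below are hypothetical stand-ins, and the single lock object only approximates what {{AbstractService}} does. It shows a start that spins while holding a lock, run on a separate thread, and a close call that ends the loop before stop needs the same lock.

```java
public class SpinLockSketch {

    /** Hypothetical stand-in for a service whose start() spins while holding a lock. */
    static class SpinningService {
        private final Object stateChangeLock = new Object();
        // volatile so the spinning thread observes the update made by close()
        private volatile boolean wait = true;
        private volatile boolean started = false;
        private volatile boolean stopped = false;

        void start() {
            synchronized (stateChangeLock) {
                // Mirrors the "while (wait);" loop: the lock stays held while spinning.
                while (wait) {
                    Thread.yield();
                }
                started = true;
            }
        }

        /** Lets the spin loop exit so the lock is released. */
        void close() {
            wait = false;
        }

        void stop() {
            // Would block forever if start() were still spinning.
            synchronized (stateChangeLock) {
                stopped = true;
            }
        }
    }

    /** Starts on a separate thread, closes, then stops; returns true if nothing hung. */
    static boolean startAndStopWithoutHanging() {
        SpinningService service = new SpinningService();
        Thread starter = new Thread(service::start, "service-starter");
        starter.setDaemon(true);
        starter.start();          // the caller is not blocked by the spin loop

        // ... the body of the test would run here ...

        service.close();          // end the loop before asking for the lock
        try {
            starter.join(5000);   // wait up to 5 s for start() to release the lock
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for start", e);
        }
        service.stop();
        return service.started && service.stopped && !starter.isAlive();
    }

    public static void main(String[] args) {
        System.out.println(startAndStopWithoutHanging());
    }
}
```

In this sketch, omitting the {{close()}} call makes {{stop()}} block on the lock indefinitely, which is the same shape as the hang described above.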


Infinite Loop:
{code}
protected void handleStoreEvent(RMStateStoreEvent event) {
  if (!(event instanceof RMStateStoreAMRMTokenEvent)
      && !(event instanceof RMStateStoreRMDTEvent)
      && !(event instanceof RMStateStoreRMDTMasterKeyEvent)) {
    while (wait);
  }


"Time-limited test" #392 daemon prio=5 os_prio=31 tid=0x00007f8d2f041800 nid=0x11c13 runnable [0x0000700001887000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart$6.handleStoreEvent(TestRMRestart.java:1650)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.storeProxyCACert(RMStateStore.java:1363)
        at org.apache.hadoop.yarn.server.resourcemanager.security.ProxyCAManager.serviceStart(ProxyCAManager.java:63)
        at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
        - locked <0x000000079b8cb4b8> (a java.lang.Object)
        at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:914)
        at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
        - locked <0x000000079b7022c8> (a java.lang.Object)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1280)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1321)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1317)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1317)
        - locked <0x000000079b305528> (a org.apache.hadoop.yarn.server.resourcemanager.MockRM)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1368)
        at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
        - locked <0x000000079b305680> (a java.lang.Object)
        at org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMStateStoreDispatcherDrainedOnRMStop(TestRMRestart.java:1660)
{code}

The test results show no timeout and {{TestRMRestart}} runs fine. The reported 
failures are unrelated.

{code}
[INFO] Running org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart
[INFO] Tests run: 68, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 140.208 s - in org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:3.0.0-M1:test (default-test) on project hadoop-yarn-server-resourcemanager: There are test failures.
{code}

The previous runs always timed out and {{TestRMRestart}} never completed: 
https://builds.apache.org/job/PreCommit-YARN-Build/23422/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt

{code}

[INFO] Running org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart
[INFO] Running org.apache.hadoop.yarn.server.resourcemanager.TestMoveApplication
[INFO] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 4.468 s - in org.apache.hadoop.yarn.server.resourcemanager.TestMoveApplication

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:3.0.0-M1:test (default-test) on project hadoop-yarn-server-resourcemanager: There was a timeout or other error in the fork -> [Help 1]
{code}

Could you review this patch when you get time?








> TestRMRestart hangs due to a deadlock
> -------------------------------------
>
>                 Key: YARN-9311
>                 URL: https://issues.apache.org/jira/browse/YARN-9311
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Prabhu Joseph
>            Assignee: Prabhu Joseph
>            Priority: Major
>         Attachments: YARN-9311-001.patch, jstackdata, jstackdata1
>
>
> TestRMRestart deadlocks between 
> testRMStateStoreDispatcherDrainedOnRMStop#handleStoreEvent and tearDown. 
> The captured jstack output is attached.


