[
https://issues.apache.org/jira/browse/YARN-9311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16771209#comment-16771209
]
Prabhu Joseph edited comment on YARN-9311 at 2/18/19 4:40 PM:
--------------------------------------------------------------
[~rohithsharma] [~sunilg]
{{TestRMRestart#testRMStateStoreDispatcherDrainedOnRMStop}} spins in an infinite
loop during {{MockRM}} start: the test's {{MemoryRMStateStore#handleStoreEvent}}
override runs {{while (wait);}}. The spinning thread holds the
{{AbstractService}} lock and blocks all other test cases that start or stop
{{MockRM}}, which causes the timeout. I have fixed this by running the RM start
in a separate thread. After that change, the {{MockRM}} stop at the end of the
test hung waiting for the {{AbstractService}} lock. I fixed this by explicitly
calling {{RMStateStore#close}} so that the while loop exits and releases the
lock.
Infinite loop:
{code}
protected void handleStoreEvent(RMStateStoreEvent event) {
  if (!(event instanceof RMStateStoreAMRMTokenEvent)
      && !(event instanceof RMStateStoreRMDTEvent)
      && !(event instanceof RMStateStoreRMDTMasterKeyEvent)) {
    while (wait);
  }
{code}
Thread dump:
{code}
"Time-limited test" #392 daemon prio=5 os_prio=31 tid=0x00007f8d2f041800 nid=0x11c13 runnable [0x0000700001887000]
   java.lang.Thread.State: RUNNABLE
    at org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart$6.handleStoreEvent(TestRMRestart.java:1650)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.storeProxyCACert(RMStateStore.java:1363)
    at org.apache.hadoop.yarn.server.resourcemanager.security.ProxyCAManager.serviceStart(ProxyCAManager.java:63)
    at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
    - locked <0x000000079b8cb4b8> (a java.lang.Object)
    at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:914)
    at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
    - locked <0x000000079b7022c8> (a java.lang.Object)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1280)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1321)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1317)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1317)
    - locked <0x000000079b305528> (a org.apache.hadoop.yarn.server.resourcemanager.MockRM)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1368)
    at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
    - locked <0x000000079b305680> (a java.lang.Object)
    at org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMStateStoreDispatcherDrainedOnRMStop(TestRMRestart.java:1660)
{code}
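The shape of the fix can be sketched without any Hadoop classes. This is not the
attached patch: {{BlockingService}}, {{stateChangeLock}} and {{startCloseStop}}
are hypothetical stand-ins, assuming only what the thread dump shows, namely that
start holds a lock while the store handler spins on a flag. The start runs in
its own thread, close flips the flag so the loop exits, and stop can then take
the lock.

```java
// Hypothetical stand-in for a service whose start() spins while holding a lock.
// None of these names come from Hadoop; they only mirror the shape of the hang.
class BlockingService {
  private final Object stateChangeLock = new Object();
  private volatile boolean wait = true;     // volatile so close() is seen by the spinner
  private volatile boolean started = false;

  // Same shape as the test's handleStoreEvent override: spin until released.
  private void handleStoreEvent() {
    while (wait) {
      Thread.yield();
    }
  }

  void start() {
    synchronized (stateChangeLock) {        // lock is held for the whole spin
      handleStoreEvent();
      started = true;
    }
  }

  // Releases the spinning thread; needs no lock, so it cannot block.
  void close() {
    wait = false;
  }

  void stop() {
    synchronized (stateChangeLock) {        // would hang if start() were still spinning
      started = false;
    }
  }

  boolean isStarted() {
    return started;
  }
}

class StartInSeparateThreadSketch {
  // Returns true if start, close and stop all complete within the timeout.
  static boolean startCloseStop(long timeoutMs) {
    BlockingService service = new BlockingService();

    // Step 1: start in a separate thread so the caller is never stuck in start().
    Thread starter = new Thread(service::start, "service-start");
    starter.setDaemon(true);
    starter.start();

    // (The body of the test would run here.)

    // Step 2: close explicitly so the spin loop exits and the lock is released.
    service.close();
    try {
      starter.join(timeoutMs);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
    if (starter.isAlive()) {
      return false;                         // start() never returned
    }

    // Step 3: stop can now acquire the lock.
    service.stop();
    return !service.isStarted();
  }

  public static void main(String[] args) {
    System.out.println("clean shutdown: " + startCloseStop(5000));
  }
}
```

If the {{close()}} call is dropped, the starter thread never leaves the loop,
{{join}} times out and the method returns false, which corresponds to the hang
seen at {{MockRM}} stop.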
With the patch, the test results show no timeout and {{TestRMRestart}} runs
fine. The remaining failures are unrelated to this patch.
{code}
[INFO] Running org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart
[INFO] Tests run: 68, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 140.208 s - in org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:3.0.0-M1:test (default-test) on project hadoop-yarn-server-resourcemanager: There are test failures.
{code}
Previous runs always timed out and {{TestRMRestart}} never completed:
https://builds.apache.org/job/PreCommit-YARN-Build/23422/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
{code}
[INFO] Running org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart
[INFO] Running org.apache.hadoop.yarn.server.resourcemanager.TestMoveApplication
[INFO] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 4.468 s - in org.apache.hadoop.yarn.server.resourcemanager.TestMoveApplication
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:3.0.0-M1:test (default-test) on project hadoop-yarn-server-resourcemanager: There was a timeout or other error in the fork -> [Help 1]
{code}
Could you review this patch when you get time?
> TestRMRestart hangs due to a deadlock
> -------------------------------------
>
> Key: YARN-9311
> URL: https://issues.apache.org/jira/browse/YARN-9311
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Prabhu Joseph
> Assignee: Prabhu Joseph
> Priority: Major
> Attachments: YARN-9311-001.patch, jstackdata, jstackdata1
>
>
> TestRMRestart deadlocks between
> testRMStateStoreDispatcherDrainedOnRMStop#handleStoreEvent and tearDown.
> The captured jstack output is attached.