[jira] [Commented] (TEZ-3036) Tez AM can hang on startup with no indication of error
[ https://issues.apache.org/jira/browse/TEZ-3036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15109283#comment-15109283 ]

Siddharth Seth commented on TEZ-3036:
-------------------------------------

Is this mainly caused by the way AbstractService etc. handles exceptions? It just throws them back to the invoker without any kind of state notification. The lack of notification is consistent with the method name, since the state didn't actually change. Maybe YARN's AbstractService needs an ERROR state to handle this better?

The change invokes stateChanged in case of an error. One potential issue here is that AbstractService catches only Exceptions, so "dependency.getFailureCause" would end up being null in case of an Error. However, the ServiceThread would have caught the Throwable and invoked stateChanged. Is the intent to depend on the UncaughtExceptionHandler for this?

Not related to this patch: ServiceThread.error needs to be volatile.

> Tez AM can hang on startup with no indication of error
> ------------------------------------------------------
>
>                 Key: TEZ-3036
>                 URL: https://issues.apache.org/jira/browse/TEZ-3036
>             Project: Apache Tez
>          Issue Type: Bug
>   Affects Versions: 0.7.0
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Critical
>         Attachments: TEZ-3036.001.patch
>
>
> I've seen a couple of instances where the Tez AM fails to complete the
> startup sequence. It never gets around to registering with the
> ResourceManager, so the RM eventually times out the attempt and starts
> another. The subsequent attempts do the same. There are no indications in
> the logs that anything is wrong, rather it just seems to get stuck during
> startup then a bit over 10 minutes later is killed.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
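The two points raised in this comment (catching Throwable rather than only Exception, and keeping the stored error in a volatile field) can be sketched roughly as below. This is only an illustration under assumptions, not the Tez code: StartThreadSketch, runAndCollect and getError are hypothetical names, and the real ServiceThread may be structured differently.

```java
/** Hypothetical stand-in for a service-start thread; this is not the Tez ServiceThread. */
class StartThreadSketch extends Thread {
    // volatile so that a thread inspecting the field without joining sees the latest write
    private volatile Throwable error;
    private final Runnable startAction;

    StartThreadSketch(Runnable startAction) {
        this.startAction = startAction;
    }

    @Override
    public void run() {
        try {
            startAction.run();
        } catch (Throwable t) {
            // Throwable covers both Exception and Error (e.g. NoSuchMethodError)
            error = t;
        }
    }

    Throwable getError() {
        return error;
    }

    /** Runs the action on a new thread, joins it, and returns whatever it threw (or null). */
    static Throwable runAndCollect(Runnable startAction) {
        StartThreadSketch thread = new StartThreadSketch(startAction);
        thread.start();
        try {
            thread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while joining", e);
        }
        return thread.getError();
    }

    public static void main(String[] args) {
        Throwable captured = runAndCollect(() -> {
            throw new NoSuchMethodError("simulated");
        });
        System.out.println("captured: " + captured);
    }
}
```

A read that follows join() is already ordered after the thread's writes by join() itself; the volatile modifier matters for any reader that looks at the field without joining first.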
[jira] [Commented] (TEZ-3036) Tez AM can hang on startup with no indication of error
[ https://issues.apache.org/jira/browse/TEZ-3036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15109712#comment-15109712 ]

Siddharth Seth commented on TEZ-3036:
-------------------------------------

bq. Eventually the starts complete, and when the main thread joins with the service thread that caused the error we'll log it and exit. Not the most ideal solution, but it works in practice. I'm definitely open to other ideas on how to approach this.

Thanks for the explanation. I'm fine with the patch going in as is. Could you please add some comments around this before committing the patch, though? Also, please change ServiceThread.error to volatile.

I was thinking along the lines of skipping the stateChange from AbstractService altogether, since it appears to be broken, and implementing a custom interface that is invoked instead. That isn't a simple change though.

Fixing this in YARN itself would be ideal, with a change in stateChange behaviour or the addition of an error notification. However, that won't be usable for a while.
[jira] [Commented] (TEZ-3036) Tez AM can hang on startup with no indication of error
[ https://issues.apache.org/jira/browse/TEZ-3036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15102478#comment-15102478 ]

Jason Lowe commented on TEZ-3036:
---------------------------------

My apologies, I misread the heap dump info. NoSuchMethodError was being propagated up rather than NoSuchMethodException (which is the cause of the error). The issue occurs with no notification when an error is thrown rather than an exception. If an exception is thrown then it will log it but I think it will still hang.

Here's the relevant portions of the stacktrace when this occurs:

{noformat}
"ServiceThread:org.apache.tez.dag.app.rm.TaskSchedulerEventHandler" #34 prio=5 os_prio=0 tid=0x7f2be161c000 nid=0x65a4 in Object.wait() [0x7f2bb87b5000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0xf55ddd58> (a org.apache.tez.dag.app.DAGAppMaster$ServiceWithDependency)
        at org.apache.tez.dag.app.DAGAppMaster$ServiceWithDependency.start(DAGAppMaster.java:1655)
        - locked <0xf55ddd58> (a org.apache.tez.dag.app.DAGAppMaster$ServiceWithDependency)
        at org.apache.tez.dag.app.DAGAppMaster$ServiceThread.run(DAGAppMaster.java:1693)

"main" #1 prio=5 os_prio=0 tid=0x7f2be0019800 nid=0x653b in Object.wait() [0x7f2be57e1000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0xf6857998> (a org.apache.tez.dag.app.DAGAppMaster$ServiceThread)
        at java.lang.Thread.join(Thread.java:1245)
        - locked <0xf6857998> (a org.apache.tez.dag.app.DAGAppMaster$ServiceThread)
        at java.lang.Thread.join(Thread.java:1319)
        at org.apache.tez.dag.app.DAGAppMaster.startServices(DAGAppMaster.java:1730)
        at org.apache.tez.dag.app.DAGAppMaster.serviceStart(DAGAppMaster.java:1799)
        - locked <0xa0326928> (a org.apache.tez.dag.app.DAGAppMaster)
        at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
        - locked <0xa0326ae8> (a java.lang.Object)
        at org.apache.tez.dag.app.DAGAppMaster$6.run(DAGAppMaster.java:2369)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679)
        at org.apache.tez.dag.app.DAGAppMaster.initAndStartAppMaster(DAGAppMaster.java:2365)
        at org.apache.tez.dag.app.DAGAppMaster.main(DAGAppMaster.java:2173)
{noformat}
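The distinction drawn in this comment, that an Error takes a different path from an Exception, follows from the Java class hierarchy: NoSuchMethodError extends Error, so a catch (Exception) clause never sees it. A small self-contained sketch, with all names (ErrorVersusException, StartAction, startCatchingException) hypothetical and unrelated to the actual Tez or Hadoop classes:

```java
class ErrorVersusException {
    /** Hypothetical start hook that may throw checked exceptions. */
    interface StartAction {
        void start() throws Exception;
    }

    /** Reports whether the inner catch (Exception) clause handled what the action threw. */
    static String startCatchingException(StartAction action) {
        try {
            try {
                action.start();
                return "started";
            } catch (Exception e) {
                // Exceptions are handled (and could be logged) here
                return "handled: " + e.getClass().getSimpleName();
            }
        } catch (Throwable t) {
            // Errors skip the inner clause entirely and arrive here
            return "escaped: " + t.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(startCatchingException(() -> {
            throw new NoSuchMethodException("simulated");
        })); // prints "handled: NoSuchMethodException"
        System.out.println(startCatchingException(() -> {
            throw new NoSuchMethodError("simulated");
        })); // prints "escaped: NoSuchMethodError"
    }
}
```

Any failure handling or logging that lives only in the catch (Exception) clause is therefore bypassed when the root problem surfaces as an Error.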
[jira] [Commented] (TEZ-3036) Tez AM can hang on startup with no indication of error
[ https://issues.apache.org/jira/browse/TEZ-3036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15102770#comment-15102770 ]

TezQA commented on TEZ-3036:
----------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12782610/TEZ-3036.001.patch
  against master revision b0ba133.

    {color:green}+1 @author{color}. The patch does not contain any @author tags.

    {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests.
                        Please justify why no new tests are needed for this patch.
                        Also please list what manual steps were performed to verify this patch.

    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

    {color:green}+1 javadoc{color}. There were no new javadoc warning messages.

    {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 3.0.1) warnings.

    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

    {color:green}+1 core tests{color}. The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/1426//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/1426//console

This message is automatically generated.
[jira] [Commented] (TEZ-3036) Tez AM can hang on startup with no indication of error
[ https://issues.apache.org/jira/browse/TEZ-3036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15097090#comment-15097090 ]

Hitesh Shah commented on TEZ-3036:
----------------------------------

Any chance of getting thread dumps (as well as the logs) to shed more light on this?
[jira] [Commented] (TEZ-3036) Tez AM can hang on startup with no indication of error
[ https://issues.apache.org/jira/browse/TEZ-3036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15097107#comment-15097107 ]

Jason Lowe commented on TEZ-3036:
---------------------------------

In this particular instance the hang occurred because TaskSchedulerEventHandler was still waiting on its dependencies to finish before it could start, but one of the dependencies failed without any log message, backtrace, or any other indication that there was an issue.

There was a ClassNotFoundException thrown during the WebUIService startup due to a conflict with one of the jars provided by the user. When the error bubbled up through AbstractService, it looks like the state change listener callbacks were skipped. The exception was caught by ServiceThread but then silently stored away into a local, suppressing the error.

I think ServiceWithDependency relies on the fact that stateChanged will be called when one of the dependencies fails. Since that wasn't called, TaskSchedulerEventHandler hung around waiting for the dependencies to finish without realizing they had already finished. The error stored in ServiceThread was never shown because it isn't logged until all ServiceThreads have been joined. Since one of the threads was mistakenly waiting on an already failed service, the join never completes and the error will never be logged. The end result is a hang with no indication of any error.
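The failure mode described above can be reproduced in miniature. The following is a rough sketch under stated assumptions, not the DAGAppMaster code: DependencyWaitSketch, dependencyFinished, awaitDependency and simulate are hypothetical names, and the wait is deliberately bounded so that the example returns "timed out" where the real AM would keep waiting.

```java
/** Hypothetical miniature of a service that waits on a dependency; not the Tez implementation. */
class DependencyWaitSketch {
    private final Object lock = new Object();
    private boolean dependencyResolved; // guarded by lock
    private Throwable failure;          // guarded by lock

    /** Callback for the dependency's outcome; cause is null when it started successfully. */
    void dependencyFinished(Throwable cause) {
        synchronized (lock) {
            dependencyResolved = true;
            failure = cause;
            lock.notifyAll();
        }
    }

    /** Waits up to timeoutMillis for the callback and reports what happened. */
    String awaitDependency(long timeoutMillis) throws InterruptedException {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        synchronized (lock) {
            while (!dependencyResolved) {
                long remainingMillis = (deadline - System.nanoTime()) / 1_000_000L;
                if (remainingMillis <= 0) {
                    return "timed out";
                }
                lock.wait(remainingMillis);
            }
            return failure == null ? "started" : "failed: " + failure.getMessage();
        }
    }

    /** Simulates a dependency that always fails, with or without a failure notification. */
    static String simulate(boolean notifyOnFailure, long timeoutMillis) {
        DependencyWaitSketch dependent = new DependencyWaitSketch();
        Thread dependency = new Thread(() -> {
            try {
                throw new IllegalStateException("simulated startup failure");
            } catch (Throwable t) {
                if (notifyOnFailure) {
                    dependent.dependencyFinished(t);
                }
                // otherwise the failure is stored nowhere visible and the waiter is never told
            }
        });
        dependency.start();
        try {
            return dependent.awaitDependency(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        }
    }

    public static void main(String[] args) {
        System.out.println("without notification: " + simulate(false, 200));
        System.out.println("with notification:    " + simulate(true, 5000));
    }
}
```

Without the notification the waiter learns nothing and only the artificial timeout ends the wait; with it, the waiter wakes promptly and can surface the cause.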
[jira] [Commented] (TEZ-3036) Tez AM can hang on startup with no indication of error
[ https://issues.apache.org/jira/browse/TEZ-3036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15097108#comment-15097108 ]

Jason Lowe commented on TEZ-3036:
---------------------------------

Haven't verified this, but I suspect this can be replicated by simply hardcoding the WebUIService to throw an exception like ClassNotFoundException.