[jira] [Updated] (TEZ-3036) Tez AM can hang on startup with no indication of error
[ https://issues.apache.org/jira/browse/TEZ-3036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated TEZ-3036:
    Attachment: TEZ-3036.002.patch

Attaching patch version with comments and volatile ServiceError.thread. Thanks for the review! Committing this.

> Tez AM can hang on startup with no indication of error
> ------------------------------------------------------
>
>                 Key: TEZ-3036
>                 URL: https://issues.apache.org/jira/browse/TEZ-3036
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Critical
>         Attachments: TEZ-3036.001.patch, TEZ-3036.002.patch
>
>
> I've seen a couple of instances where the Tez AM fails to complete the
> startup sequence. It never gets around to registering with the
> ResourceManager, so the RM eventually times out the attempt and starts
> another. The subsequent attempts do the same. There are no indications in
> the logs that anything is wrong, rather it just seems to get stuck during
> startup then a bit over 10 minutes later is killed.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (TEZ-3036) Tez AM can hang on startup with no indication of error
[ https://issues.apache.org/jira/browse/TEZ-3036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated TEZ-3036:
    Attachment: TEZ-3036.001.patch

Attaching a prototype patch that seems to fix the issue. This has the ServiceThread invoke the state change callback for dependent services when starting the service throws. Still needs a unit test, but I manually tested by hardcoding WebUIService to throw an error when it starts.

Initially I thought of a simpler approach where it simply converts any exception caught by the ServiceThread into an error and lets the uncaught exception handler tear everything down. However, this also hangs because the DAGAppMaster shutdown hook ends up waiting for the lock being held during service startup.

> Tez AM can hang on startup with no indication of error
> ------------------------------------------------------
>
>                 Key: TEZ-3036
>                 URL: https://issues.apache.org/jira/browse/TEZ-3036
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Jason Lowe
>            Priority: Critical
>         Attachments: TEZ-3036.001.patch
>
>
> I've seen a couple of instances where the Tez AM fails to complete the
> startup sequence. It never gets around to registering with the
> ResourceManager, so the RM eventually times out the attempt and starts
> another. The subsequent attempts do the same. There are no indications in
> the logs that anything is wrong, rather it just seems to get stuck during
> startup then a bit over 10 minutes later is killed.
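For readers following the thread, the idea behind the prototype patch (the starter thread notifying dependent services even when start throws, so they stop waiting) could be sketched roughly as below. This is a minimal illustration, not the actual Tez code: the `Service` and `Dependent` types, the `dependencyStateChanged` callback, and the latch-based waiting are all assumptions made for the example. Only the `ServiceThread` name and the use of a volatile field for the recorded error echo the messages above.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class ServiceStartSketch {

    /** Hypothetical stand-in for a service whose start may throw. */
    interface Service {
        void start() throws Exception;
    }

    /** Hypothetical dependent that blocks until its dependency reports a state change. */
    static class Dependent {
        private final CountDownLatch dependencySettled = new CountDownLatch(1);
        private final AtomicReference<Throwable> dependencyError = new AtomicReference<>();

        // Callback invoked by the starter thread on success (null) and on failure (the error).
        void dependencyStateChanged(Throwable errorOrNull) {
            dependencyError.set(errorOrNull);
            dependencySettled.countDown();
        }

        // Returns true if the callback arrived within the timeout.
        boolean awaitDependency(long timeoutMs) throws InterruptedException {
            return dependencySettled.await(timeoutMs, TimeUnit.MILLISECONDS);
        }

        Throwable dependencyError() {
            return dependencyError.get();
        }
    }

    /** Starts one service and always notifies the dependent, even if start throws. */
    static class ServiceThread extends Thread {
        private final Service service;
        private final Dependent dependent;
        // volatile so other threads reading the error see the latest value
        private volatile Throwable error;

        ServiceThread(Service service, Dependent dependent) {
            this.service = service;
            this.dependent = dependent;
        }

        @Override
        public void run() {
            Throwable failure = null;
            try {
                service.start();
            } catch (Throwable t) {
                failure = t;
                error = t;
            } finally {
                // Without this notification on the failure path, the dependent waits forever.
                dependent.dependencyStateChanged(failure);
            }
        }

        Throwable error() {
            return error;
        }
    }

    /** Starts a failing service; reports whether the dependent was released with the error. */
    static boolean failingStartReleasesDependent() {
        Service failing = () -> {
            throw new IllegalStateException("simulated start failure");
        };
        Dependent dependent = new Dependent();
        ServiceThread thread = new ServiceThread(failing, dependent);
        thread.start();
        try {
            boolean notified = dependent.awaitDependency(5000);
            thread.join();
            return notified
                && dependent.dependencyError() instanceof IllegalStateException
                && thread.error() != null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("dependent released: " + failingStartReleasesDependent());
    }
}
```

The failure path does the notifying in a `finally` block, so a dependent is released whether start succeeds or throws. The sketch does not model the DAGAppMaster shutdown hook or the startup lock mentioned in the message.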