[jira] [Updated] (TEZ-3036) Tez AM can hang on startup with no indication of error

2016-01-21 Thread Jason Lowe (JIRA)

[ https://issues.apache.org/jira/browse/TEZ-3036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated TEZ-3036:

Attachment: TEZ-3036.002.patch

Attaching an updated patch that adds comments and makes ServiceError.thread
volatile.  Thanks for the review!  Committing this.

> Tez AM can hang on startup with no indication of error
> --
>
> Key: TEZ-3036
> URL: https://issues.apache.org/jira/browse/TEZ-3036
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.7.0
> Reporter: Jason Lowe
> Assignee: Jason Lowe
> Priority: Critical
> Attachments: TEZ-3036.001.patch, TEZ-3036.002.patch
>
>
> I've seen a couple of instances where the Tez AM fails to complete the 
> startup sequence.  It never gets around to registering with the 
> ResourceManager, so the RM eventually times out the attempt and starts 
> another.  The subsequent attempts do the same.  There are no indications in
> the logs that anything is wrong; the AM just seems to get stuck during
> startup and is killed a bit over 10 minutes later.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TEZ-3036) Tez AM can hang on startup with no indication of error

2016-01-15 Thread Jason Lowe (JIRA)

[ https://issues.apache.org/jira/browse/TEZ-3036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated TEZ-3036:

Attachment: TEZ-3036.001.patch

Attaching a prototype patch that seems to fix the issue.  With this patch, when
starting a service throws, the ServiceThread invokes the state change callback
for the dependent services.  It still needs a unit test, but I manually tested
it by hardcoding WebUIService to throw an error when it starts.
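
To illustrate the idea, here is a minimal sketch, not the code in the patch.
The names StartupFailureSketch, Service, ServiceStarter and awaitOutcome are
hypothetical and do not come from Tez.  The starter thread always signals its
waiters, whether start() succeeds or throws, so a dependent never blocks
forever on a dependency that failed.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class StartupFailureSketch {

    /** Hypothetical stand-in for a service whose start() may throw. */
    interface Service {
        void start() throws Exception;
    }

    /** Starts one service on its own thread and reports the outcome to waiters. */
    static class ServiceStarter extends Thread {
        private final Service service;
        private final CountDownLatch done = new CountDownLatch(1);
        private final AtomicReference<Throwable> error = new AtomicReference<>();

        ServiceStarter(String name, Service service) {
            super(name);
            this.service = service;
        }

        @Override
        public void run() {
            try {
                service.start();
            } catch (Throwable t) {
                // Record the failure instead of letting it vanish with the thread.
                error.set(t);
            } finally {
                // Wake dependents on success AND on failure.
                done.countDown();
            }
        }

        /** Waits for the outcome; returns the failure, or null if the service started. */
        Throwable awaitOutcome(long timeout, TimeUnit unit) throws InterruptedException {
            if (!done.await(timeout, unit)) {
                return new IllegalStateException("timed out waiting for " + getName());
            }
            return error.get();
        }
    }

    /** A dependent's view: wait for the dependency, then either proceed or fail fast. */
    static String startWithDependency(Service dependency) throws InterruptedException {
        ServiceStarter dep = new ServiceStarter("dependency", dependency);
        dep.start();
        Throwable failure = dep.awaitOutcome(5, TimeUnit.SECONDS);
        return failure == null ? "started" : "failed: " + failure.getMessage();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(startWithDependency(() -> { }));
        System.out.println(startWithDependency(() -> {
            throw new RuntimeException("web UI failed to start");
        }));
    }
}
```

If the countDown() call sat only on the success path, the second call in main
would wait out the full timeout instead of reporting the failure right away.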

Initially I thought of a simpler approach that converts any exception caught
by the ServiceThread into an error and lets the uncaught exception handler
tear everything down.  However, this also hangs because the DAGAppMaster
shutdown hook ends up waiting for the lock that is held during service
startup.
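
The lock problem can be shown in isolation with a small sketch.  It is an
assumption-laden illustration, not Tez code: ShutdownLockSketch, stateLock and
hookCanAcquireLock are hypothetical names, and a plain ReentrantLock stands in
for whatever lock the real startup path holds.  The "startup" thread holds the
lock while it waits, and the caller plays the shutdown hook, using a timed
tryLock so the example ends instead of hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class ShutdownLockSketch {

    /** Returns whether a hook-like caller can get the lock while startup still holds it. */
    static boolean hookCanAcquireLock(long waitMillis) throws InterruptedException {
        ReentrantLock stateLock = new ReentrantLock();
        CountDownLatch lockHeld = new CountDownLatch(1);
        CountDownLatch startupFinished = new CountDownLatch(1);

        // Stand-in for startup: takes the lock, then waits for something
        // that has not happened yet.
        Thread startup = new Thread(() -> {
            stateLock.lock();
            try {
                lockHeld.countDown();
                startupFinished.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                stateLock.unlock();
            }
        }, "startup");
        startup.start();
        lockHeld.await();

        // Stand-in for the shutdown hook: it needs the same lock to stop services.
        // A real hook using an untimed lock would block here indefinitely.
        boolean acquired = stateLock.tryLock(waitMillis, TimeUnit.MILLISECONDS);
        if (acquired) {
            stateLock.unlock();
        }

        // Let the startup thread finish so the example exits cleanly.
        startupFinished.countDown();
        startup.join();
        return acquired;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(hookCanAcquireLock(200)); // prints "false"
    }
}
```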


> Tez AM can hang on startup with no indication of error
> --
>
> Key: TEZ-3036
> URL: https://issues.apache.org/jira/browse/TEZ-3036
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.7.0
> Reporter: Jason Lowe
> Priority: Critical
> Attachments: TEZ-3036.001.patch
>
>
> I've seen a couple of instances where the Tez AM fails to complete the 
> startup sequence.  It never gets around to registering with the 
> ResourceManager, so the RM eventually times out the attempt and starts 
> another.  The subsequent attempts do the same.  There are no indications in
> the logs that anything is wrong; the AM just seems to get stuck during
> startup and is killed a bit over 10 minutes later.


