[jira] [Commented] (TEZ-3036) Tez AM can hang on startup with no indication of error

2016-01-20 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15109283#comment-15109283
 ] 

Siddharth Seth commented on TEZ-3036:
-

Is this mainly caused by the way AbstractService etc. handles exceptions? It 
just throws them back to the invoker without any kind of state notification. 
The lack of notification is consistent with the method name, since the state 
didn't actually change. Maybe YARN's AbstractService needs an ERROR state to 
handle this better?

The change invokes stateChanged in case of an error. One potential issue here 
is that AbstractService catches only Exceptions, so 
"dependency.getFailureCause" would end up being null in case of an Error. 
However, the ServiceThread would have caught the Throwable and invoked 
stateChanged. Is the intent to depend on the UncaughtExceptionHandler for this?
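
To illustrate the Exception-versus-Error concern above, here is a minimal, 
self-contained sketch. It does not use the real Hadoop or Tez classes: 
DemoService, startOnThread and CatchScopeDemo are hypothetical names, and the 
assumption is only that the service records a failure cause inside a 
catch (Exception) block while the surrounding thread catches Throwable.

```java
public class CatchScopeDemo {

    /** Hypothetical stand-in for a service that records only Exceptions as its failure cause. */
    static class DemoService {
        private volatile Throwable failureCause;
        private final Runnable startBody;

        DemoService(Runnable startBody) {
            this.startBody = startBody;
        }

        void start() {
            try {
                startBody.run();
            } catch (Exception e) {
                failureCause = e; // recorded only for Exception subclasses
                throw e;          // precise rethrow: only unchecked exceptions can reach here
            }
            // An Error (for example NoSuchMethodError) bypasses the catch block entirely.
        }

        Throwable getFailureCause() {
            return failureCause;
        }
    }

    /** Starts the service on its own thread and returns whatever Throwable escaped start(). */
    static Throwable startOnThread(DemoService service) {
        final Throwable[] escaped = new Throwable[1];
        Thread thread = new Thread(() -> {
            try {
                service.start();
            } catch (Throwable t) {
                escaped[0] = t; // the thread wrapper sees Errors as well as Exceptions
            }
        });
        thread.start();
        try {
            thread.join(); // join() makes the worker's write to escaped[0] visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while joining", e);
        }
        return escaped[0];
    }

    public static void main(String[] args) {
        DemoService withException = new DemoService(() -> { throw new IllegalStateException("boom"); });
        DemoService withError = new DemoService(() -> { throw new NoSuchMethodError("boom"); });

        System.out.println("escaped=" + startOnThread(withException)
                + " recorded=" + withException.getFailureCause());
        System.out.println("escaped=" + startOnThread(withError)
                + " recorded=" + withError.getFailureCause());
    }
}
```

In this sketch the Error case leaves the recorded cause unset even though the 
thread wrapper did see the Throwable, which is the gap the comment asks about.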

Not related to this patch: ServiceThread.error needs to be volatile.

> Tez AM can hang on startup with no indication of error
> --
>
> Key: TEZ-3036
> URL: https://issues.apache.org/jira/browse/TEZ-3036
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Jason Lowe
>Assignee: Jason Lowe
>Priority: Critical
> Attachments: TEZ-3036.001.patch
>
>
> I've seen a couple of instances where the Tez AM fails to complete the 
> startup sequence.  It never gets around to registering with the 
> ResourceManager, so the RM eventually times out the attempt and starts 
> another.  The subsequent attempts do the same.  There are no indications in 
> the logs that anything is wrong, rather it just seems to get stuck during 
> startup then a bit over 10 minutes later is killed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-3036) Tez AM can hang on startup with no indication of error

2016-01-20 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15109712#comment-15109712
 ] 

Siddharth Seth commented on TEZ-3036:
-

bq. Eventually the starts complete, and when the main thread joins with the 
service thread that caused the error we'll log it and exit. Not the most ideal 
solution, but it works in practice. I'm definitely open to other ideas on how 
to approach this.
Thanks for the explanation. I'm fine with the patch going in as is. Could you 
please add some comments around this before committing the patch, though? Also, 
please make ServiceThread.error volatile.

I was thinking along the lines of skipping the stateChange from AbstractService 
altogether since it appears to be broken - and implementing a custom interface 
that is invoked instead. That isn't a simple change though. Fixing this in 
YARN itself would be ideal - with a change in stateChange behaviour / addition 
of an error notification. However, that won't be usable for a while.
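
As a rough illustration of the custom-interface idea, the sketch below shows a 
callback that is invoked on both success and failure. Everything in it is 
hypothetical (ServiceOutcomeListener, startAndReport, RecordingListener and 
the "web-ui" name are made up for this example) and none of it is part of YARN 
or Tez. A real implementation would probably also rethrow the failure after 
notifying.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class OutcomeListenerDemo {

    /** Hypothetical callback; not an existing YARN or Tez interface. */
    interface ServiceOutcomeListener {
        void onStarted(String serviceName);
        void onFailed(String serviceName, Throwable cause);
    }

    /** Runs startBody and always reports exactly one outcome to every listener. */
    static void startAndReport(String serviceName, Runnable startBody,
                               List<? extends ServiceOutcomeListener> listeners) {
        Throwable failure = null;
        try {
            startBody.run();
        } catch (Throwable t) { // Throwable, so Errors are reported as well
            failure = t;
        }
        for (ServiceOutcomeListener listener : listeners) {
            if (failure == null) {
                listener.onStarted(serviceName);
            } else {
                listener.onFailed(serviceName, failure);
            }
        }
    }

    /** Simple listener that records the events it receives. */
    static class RecordingListener implements ServiceOutcomeListener {
        final List<String> events = new ArrayList<>();

        @Override
        public void onStarted(String serviceName) {
            events.add("started:" + serviceName);
        }

        @Override
        public void onFailed(String serviceName, Throwable cause) {
            events.add("failed:" + serviceName + ":" + cause.getClass().getSimpleName());
        }
    }

    public static void main(String[] args) {
        RecordingListener listener = new RecordingListener();
        startAndReport("web-ui", () -> { throw new NoSuchMethodError("simulated"); },
                Collections.singletonList(listener));
        System.out.println(listener.events);
    }
}
```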




[jira] [Commented] (TEZ-3036) Tez AM can hang on startup with no indication of error

2016-01-15 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15102478#comment-15102478
 ] 

Jason Lowe commented on TEZ-3036:
-

My apologies, I misread the heap dump info.  NoSuchMethodError was being 
propagated up rather than NoSuchMethodException (which is the cause of the 
error).  The issue occurs with no notification when an error is thrown rather 
than an exception.  If an exception is thrown then it will log it but I think 
it will still hang.

Here are the relevant portions of the stacktrace when this occurs:
{noformat}
"ServiceThread:org.apache.tez.dag.app.rm.TaskSchedulerEventHandler" #34 prio=5 os_prio=0 tid=0x7f2be161c000 nid=0x65a4 in Object.wait() [0x7f2bb87b5000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0xf55ddd58> (a org.apache.tez.dag.app.DAGAppMaster$ServiceWithDependency)
        at org.apache.tez.dag.app.DAGAppMaster$ServiceWithDependency.start(DAGAppMaster.java:1655)
        - locked <0xf55ddd58> (a org.apache.tez.dag.app.DAGAppMaster$ServiceWithDependency)
        at org.apache.tez.dag.app.DAGAppMaster$ServiceThread.run(DAGAppMaster.java:1693)

"main" #1 prio=5 os_prio=0 tid=0x7f2be0019800 nid=0x653b in Object.wait() [0x7f2be57e1000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0xf6857998> (a org.apache.tez.dag.app.DAGAppMaster$ServiceThread)
        at java.lang.Thread.join(Thread.java:1245)
        - locked <0xf6857998> (a org.apache.tez.dag.app.DAGAppMaster$ServiceThread)
        at java.lang.Thread.join(Thread.java:1319)
        at org.apache.tez.dag.app.DAGAppMaster.startServices(DAGAppMaster.java:1730)
        at org.apache.tez.dag.app.DAGAppMaster.serviceStart(DAGAppMaster.java:1799)
        - locked <0xa0326928> (a org.apache.tez.dag.app.DAGAppMaster)
        at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
        - locked <0xa0326ae8> (a java.lang.Object)
        at org.apache.tez.dag.app.DAGAppMaster$6.run(DAGAppMaster.java:2369)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679)
        at org.apache.tez.dag.app.DAGAppMaster.initAndStartAppMaster(DAGAppMaster.java:2365)
        at org.apache.tez.dag.app.DAGAppMaster.main(DAGAppMaster.java:2173)
{noformat}




[jira] [Commented] (TEZ-3036) Tez AM can hang on startup with no indication of error

2016-01-15 Thread TezQA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15102770#comment-15102770
 ] 

TezQA commented on TEZ-3036:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12782610/TEZ-3036.001.patch
  against master revision b0ba133.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 3.0.1) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: 
https://builds.apache.org/job/PreCommit-TEZ-Build/1426//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/1426//console

This message is automatically generated.



[jira] [Commented] (TEZ-3036) Tez AM can hang on startup with no indication of error

2016-01-13 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15097090#comment-15097090
 ] 

Hitesh Shah commented on TEZ-3036:
--

Any chance of getting thread dumps (as well as the logs) to shed more light 
on this?



[jira] [Commented] (TEZ-3036) Tez AM can hang on startup with no indication of error

2016-01-13 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15097107#comment-15097107
 ] 

Jason Lowe commented on TEZ-3036:
-

In this particular instance the hang occurred because TaskSchedulerEventHandler 
was still waiting on its dependencies to finish before it could start, but one 
of the dependencies failed without any log message, backtrace, or any 
indication there was an issue.  There was a ClassNotFoundException thrown 
during the WebUIService startup due to a conflict with one of the jars provided 
by the user.  When the error bubbled up through AbstractService it looks like 
the state change listener callbacks were skipped.  The exception was caught by 
ServiceThread but then silently stored away into a local, suppressing the error.

I think ServiceWithDependency relies on the fact that stateChanged will be 
called when one of the dependencies fails.  Since that wasn't called, 
TaskSchedulerEventHandler hung around waiting for the dependencies to finish 
without realizing they had already finished.  The error stored in 
ServiceThread was never shown because it isn't logged until all ServiceThreads 
have been joined.  Since one of the threads was mistakenly waiting on an 
already failed service, the join never completes and the error is never 
logged.  The end result is a hang with no indication of any error.
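
The sequence described above can be sketched with plain Java monitors. This is 
not the DAGAppMaster code: Dependency, Outcome, simulate and 
DependencyWaitDemo are hypothetical names, and the wait is bounded by a 
timeout here only so the example terminates instead of hanging. The 
assumption is that a dependent service blocks until its dependency reports 
either success or failure, and that the failure path may or may not notify.

```java
public class DependencyWaitDemo {

    enum Outcome { STARTED, FAILED, TIMED_OUT }

    /** Hypothetical dependency that waiters block on until it starts or fails. */
    static class Dependency {
        private boolean started;
        private Throwable failure;

        synchronized void markStarted() {
            started = true;
            notifyAll();
        }

        synchronized void markFailed(Throwable cause) {
            failure = cause;
            notifyAll(); // without this notification, waiters never learn of the failure
        }

        /** Waits for an outcome, giving up after timeoutMillis. */
        synchronized Outcome await(long timeoutMillis) {
            long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
            try {
                while (!started && failure == null) {
                    long remainingMillis = (deadline - System.nanoTime()) / 1_000_000L;
                    if (remainingMillis <= 0) {
                        return Outcome.TIMED_OUT;
                    }
                    wait(remainingMillis);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return Outcome.TIMED_OUT;
            }
            return failure != null ? Outcome.FAILED : Outcome.STARTED;
        }
    }

    /** Fails a dependency on its own thread and reports what a waiter observes. */
    static Outcome simulate(boolean notifyOnFailure, long waitMillis) {
        Dependency dependency = new Dependency();
        Thread starter = new Thread(() -> {
            try {
                throw new NoSuchMethodError("simulated startup failure");
            } catch (Throwable t) {
                if (notifyOnFailure) {
                    dependency.markFailed(t);
                }
                // Otherwise the failure is swallowed silently and nobody is told.
            }
        });
        starter.start();
        return dependency.await(waitMillis);
    }

    public static void main(String[] args) {
        System.out.println("no notification:   " + simulate(false, 200));
        System.out.println("with notification: " + simulate(true, 2000));
    }
}
```

Without the notification the waiter only returns because of the timeout added 
for this demo; with an unbounded wait it would block indefinitely, and a 
thread joining on it would block too.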



[jira] [Commented] (TEZ-3036) Tez AM can hang on startup with no indication of error

2016-01-13 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15097108#comment-15097108
 ] 

Jason Lowe commented on TEZ-3036:
-

Haven't verified this, but I suspect this can be replicated by simply 
hardcoding the WebUIService to throw an exception like ClassNotFoundException.
