[jira] [Commented] (YARN-1430) InvalidStateTransition exceptions are ignored in state machines

2013-11-22 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13830328#comment-13830328
 ] 

Karthik Kambatla commented on YARN-1430:


bq. But as of today, we are treating them inconsistently. An invalid event to 
the scheduler crashes the RM but an invalid event in RMNode doesn't. We need to 
be consistent.

I think it is reasonable to be inconsistent here. The rationale is that we 
should crash the RM only when there is no way to continue: only some 
InvalidStateTransitions (e.g. in the scheduler) affect everything on the 
cluster; others are specific to a node or an app. For localized damage, 
crashing the RM seems too aggressive. I agree we should bubble these up to the 
UI.
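
The scope-based rule above could be sketched roughly as follows. This is only an illustration under stated assumptions: `Scope` and `InvalidTransitionPolicy` are hypothetical names, not existing YARN classes, and the in-memory list stands in for whatever would eventually feed the UI.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical classification of how far an invalid transition reaches.
enum Scope { CLUSTER, NODE, APP }

// Hypothetical policy: only cluster-wide invalid transitions are fatal;
// localized ones are recorded so they can be surfaced (e.g. on a UI).
class InvalidTransitionPolicy {
    private final List<String> localized = new ArrayList<>();

    void onInvalidTransition(Scope scope, String detail) {
        if (scope == Scope.CLUSTER) {
            // Affects everything on the cluster: let the daemon go down.
            throw new IllegalStateException("Fatal invalid transition: " + detail);
        }
        // Damage limited to one node or app: keep running, remember the error.
        localized.add(scope + ": " + detail);
    }

    List<String> localizedErrors() {
        return localized;
    }
}
```

With this shape, an invalid event in something like RMNode would be passed in as `Scope.NODE` and recorded, while a scheduler-level one would be passed in as `Scope.CLUSTER` and propagate.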

 InvalidStateTransition exceptions are ignored in state machines
 ---

 Key: YARN-1430
 URL: https://issues.apache.org/jira/browse/YARN-1430
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Omkar Vinit Joshi
Assignee: Omkar Vinit Joshi

 All of our state machines ignore InvalidStateTransitions. These exceptions 
 are logged but do not crash the RM / NM. We should definitely crash the 
 daemon, as they move the system into an invalid / unacceptable state.
 * Places where we hide this exception:
 ** JobImpl
 ** TaskAttemptImpl
 ** TaskImpl
 ** NMClientAsyncImpl
 ** ApplicationImpl
 ** ContainerImpl
 ** LocalizedResource
 ** RMAppAttemptImpl
 ** RMAppImpl
 ** RMContainerImpl
 ** RMNodeImpl
 thoughts?



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-1430) InvalidStateTransition exceptions are ignored in state machines

2013-11-21 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13829400#comment-13829400
 ] 

Omkar Vinit Joshi commented on YARN-1430:
-

I think for now we should add assert statements so that invalid transitions 
always fail in test environments, making sure we are not missing any. 
YARN-1416 is one such example.

I agree with [~vinodkv] and [~jlowe]. We should probably be consistent 
everywhere and surface these system-critical errors somewhere without 
actually crashing the daemons.



[jira] [Commented] (YARN-1430) InvalidStateTransition exceptions are ignored in state machines

2013-11-21 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13829018#comment-13829018
 ] 

Jason Lowe commented on YARN-1430:
--

Before flipping the switch to change this, we need to carefully consider the 
consequences.  I'm all for making this a fatal error for unit tests, but I'm 
not convinced this is a good thing for production environments.

We have been running in production for quite some time now (0.23 instead of 
2.x, but the code is very similar in many of these areas).  We've seen invalid 
state transitions logged on our production machines and have filed quite a few 
JIRAs related to those.  However, I was often thankful the invalid state 
transition did not crash, because in the vast majority of these cases the 
system can continue to function in an acceptable manner.  Sure, we might leak 
some resources related to an application, fail to aggregate some log or 
something similar, but I'd rather take that pain with a potential workaround 
than the alternative of bringing down the entire cluster each and every time it 
occurs.

What I'm worried about here is a case where we don't see the error during 
testing but when we deploy to production some critical, frequent job 
consistently triggers an unhandled transition.  If that's always fatal, now 
we're stuck in a state where the cluster cannot stay up very long until we 
scramble to develop and deploy a fix or have to roll back, and we have 
guaranteed downtime when it occurs.  In almost all of these cases the invalid 
transition is going to be localized to just one app, one container, or one 
node.  I'm not sure that kind of error is worth taking down an entire cluster 
outside of a testing setup.  I feel this is similar to how most software 
products handle asserts -- they are fatal during development but not during 
production.
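
A minimal sketch of that "fatal in tests, tolerated in production" behaviour, assuming a single switch decides which mode is active. `InvalidTransitionException` and `TransitionGuard` are hypothetical stand-ins for illustration, not the actual Hadoop classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the exception a state machine throws on an
// unhandled (state, event) pair.
class InvalidTransitionException extends RuntimeException {
    InvalidTransitionException(String state, String event) {
        super("Invalid event " + event + " at state " + state);
    }
}

// Hypothetical wrapper around a transition: rethrows when failFast is set
// (test runs), otherwise records the error and lets the caller continue.
class TransitionGuard {
    private final boolean failFast;
    private final List<String> recorded = new ArrayList<>();

    TransitionGuard(boolean failFast) {
        this.failFast = failFast;
    }

    void run(Runnable transition) {
        try {
            transition.run();
        } catch (InvalidTransitionException e) {
            if (failFast) {
                throw e; // tests: surface the bug immediately
            }
            recorded.add(e.getMessage()); // production: keep the daemon up
        }
    }

    List<String> recorded() {
        return recorded;
    }
}
```

A unit test would construct the guard with `failFast` set to `true`; a production daemon would pass `false` and expose `recorded()` through logs or a UI.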



[jira] [Commented] (YARN-1430) InvalidStateTransition exceptions are ignored in state machines

2013-11-21 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13829229#comment-13829229
 ] 

Vinod Kumar Vavilapalli commented on YARN-1430:
---

There are pros and cons to both approaches.

If we completely ignore the errors, nobody knows about the problem. One 
solution is to have these invalid transitions bubble up to the UI, say on the 
RM UI, AM UI, etc., in wild, bold and red colors.

On the other side, I agree that crashing RM all the time is going to be more 
and more painful in production environments.

As for tests, I think we SHOULD clearly crash the tests, so that we can catch 
as many of these errors as quickly as possible.

But as of today, we are treating them inconsistently. An invalid event to the 
scheduler crashes the RM but an invalid event in RMNode doesn't. We need to be 
consistent.
