[jira] [Commented] (YARN-128) Resurrect RM Restart

2012-12-06 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13511501#comment-13511501
 ] 

Bikas Saha commented on YARN-128:
-

Yes, we need to. Many things, like failure tracking of AM attempts, job history, and log/debug information, are tied to attempts, so we cannot forget them.
Also, restarting everything is just the first step. We want to move towards a work-preserving restart (see the doc on this jira), and the current approach lays the groundwork for it.

 Resurrect RM Restart 
 -

 Key: YARN-128
 URL: https://issues.apache.org/jira/browse/YARN-128
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.0.0-alpha
Reporter: Arun C Murthy
Assignee: Bikas Saha
 Attachments: MR-4343.1.patch, restart-12-11-zkstore.patch, 
 restart-fs-store-11-17.patch, restart-zk-store-11-17.patch, 
 RM-recovery-initial-thoughts.txt, RMRestartPhase1.pdf, 
 YARN-128.full-code.3.patch, YARN-128.full-code-4.patch, 
 YARN-128.full-code.5.patch, YARN-128.new-code-added.3.patch, 
 YARN-128.new-code-added-4.patch, YARN-128.old-code-removed.3.patch, 
 YARN-128.old-code-removed.4.patch, YARN-128.patch


 We should resurrect 'RM Restart' which we disabled sometime during the RM 
 refactor.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-128) Resurrect RM Restart

2012-12-04 Thread Strahinja Lazetic (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13509663#comment-13509663
 ] 

Strahinja Lazetic commented on YARN-128:


Bikas, I have one question: since we reboot the NMs and terminate all running containers and AMs upon RM restart, why do we need to keep track of the previous applications' attempts? Couldn't we just start from scratch instead of generating the next attempt id based on the last running one?



[jira] [Commented] (YARN-128) Resurrect RM Restart

2012-11-28 Thread Arinto Murdopo (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13505615#comment-13505615
 ] 

Arinto Murdopo commented on YARN-128:
-

Tested the YARN-128.full-code.5.patch using the ZooKeeper store, and the result is positive. The ResourceManager resurrected properly after we killed it.
Experiment overview:
- ZK settings: a ZooKeeper ensemble of 3 nodes
- HDFS was in a single-node setting. YARN and HDFS were executed on the same node.
- Executed the bbp and pi examples from the generated hadoop distribution (we built and packaged the trunk and patch code)
- Killed the ResourceManager process while bbp or pi was executing (using the Linux kill command) and started a new RM 3 seconds after we killed it.



[jira] [Commented] (YARN-128) Resurrect RM Restart

2012-11-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13505656#comment-13505656
 ] 

Hadoop QA commented on YARN-128:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12554338/YARN-128.full-code.5.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 34 new 
or modified test files.

{color:red}-1 javac{color}.  The patch appears to cause the build to 
fail.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/183//console

This message is automatically generated.



[jira] [Commented] (YARN-128) Resurrect RM Restart

2012-11-28 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13505857#comment-13505857
 ] 

Bikas Saha commented on YARN-128:
-

Thanks for using it Arinto and posting results!



[jira] [Commented] (YARN-128) Resurrect RM Restart

2012-11-20 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13501172#comment-13501172
 ] 

Bikas Saha commented on YARN-128:
-

Attaching final patch with full changes for a test run. Can someone with access 
please trigger a test run on JIRA?
Changes
1) Completed handling of unmanaged AMs
2) Refactored ZK and FileSystem store classes to move common logic into the 
base class and also integrate with the RM
3) Test improvements
I have tested manually on a single node with both the ZK and FileSystem stores 
(using HDFS) and ran a wordcount job across a restart.

I will create sub-tasks of this jira to break the changes into logical pieces.



[jira] [Commented] (YARN-128) Resurrect RM Restart

2012-11-20 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13501221#comment-13501221
 ] 

Bikas Saha commented on YARN-128:
-

Done creating sub-tasks and attaching final patches for review and commit.



[jira] [Commented] (YARN-128) Resurrect RM Restart

2012-11-19 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13500434#comment-13500434
 ] 

Tom White commented on YARN-128:


I had a quick look at the new patches and FileSystemRMStateStore and 
ZKRMStateStore seem to be missing default constructors, which StoreFactory 
needs. You might change the tests to use StoreFactory to construct the store 
instances to test this code path.
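The constraint Tom describes can be illustrated with plain Java reflection: a factory that instantiates a store class by name can only invoke a public no-arg constructor. The class names below are hypothetical stand-ins for illustration, not the actual YARN classes.

```java
// Stand-in store classes; only GoodStore is instantiable by a reflective factory.
class GoodStore {
    public GoodStore() {}          // default constructor: reflection can call this
}

class BadStore {
    public BadStore(String uri) {} // only a parameterized constructor: reflection fails
}

public class StoreFactorySketch {
    // Mimics a reflective StoreFactory: it can only instantiate classes
    // that expose a public no-arg constructor.
    static Object create(Class<?> clazz) throws Exception {
        return clazz.getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(create(GoodStore.class) != null);   // true
        try {
            create(BadStore.class);
            System.out.println("unexpected");
        } catch (NoSuchMethodException e) {
            System.out.println("no default constructor");      // this path is taken
        }
    }
}
```

Having the tests construct stores through the factory, as suggested, would catch a missing default constructor at test time rather than at deployment.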



[jira] [Commented] (YARN-128) Resurrect RM Restart

2012-11-16 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13498726#comment-13498726
 ] 

Bikas Saha commented on YARN-128:
-

1) Unless I am mistaken, the test condition is correct. app1 is the app 
actually submitted while appState is the state retrieved from the store. By 
checking that both are the same, we are checking that the data that was 
supposed to be passed has actually been passed to the store and there is no bug 
in the transfer of that data. The assert will be false if the transfer does not 
happen or some other value gets passed by mistake. Does that help clarify?

3) Which resource value is this? The one that is stored in 
ApplicationSubmissionContext's ContainerLaunchContext? In the patch, the 
ApplicationSubmissionContext is stored at the very beginning to ensure 
that the client does not have to submit the job again. Hence, the Resource set 
by the client is saved. I am not sure what your project is saving after the 
scheduling is done. 
You are right. We don't want to store the updated value, since this updated 
value is a side-effect of the scheduler's policy.

I am not sure if this applies to your project. I will shortly be posting a 
ZooKeeper and HDFS state store that you could use, unless you are using your own 
storage mechanism.



[jira] [Commented] (YARN-128) Resurrect RM Restart

2012-11-16 Thread Arinto Murdopo (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13498800#comment-13498800
 ] 

Arinto Murdopo commented on YARN-128:
-

1) Yes, I agree with your clarification. It works as you state when we are 
using persistent storage (not MemStore, but ZK, MySQL, a file, or other 
persistent storage).
However, when we are using MemStore, the stored object (appState) and app1 
refer to the same instance, since our store is in memory. To test my 
argument, we can put a breakpoint on the assert statement that compares the 
ApplicationSubmissionContext, then use the IDE to change any value of 
appState's properties, e.g. the resource in the ApplicationSubmissionContext. 
The corresponding app1 value (in this case the resource in app1's 
ApplicationSubmissionContext) will also be updated to the same value.
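The aliasing argument above can be reproduced with a minimal stand-in for a memory-backed store; the class and field names below are illustrative, not YARN's actual types.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative submission context with a single mutable field.
class SubmissionContext {
    int memory;
    SubmissionContext(int memory) { this.memory = memory; }
}

public class MemStoreAliasing {
    public static void main(String[] args) {
        // A memory store "persists" by keeping a reference, not a serialized copy.
        Map<String, SubmissionContext> memStore = new HashMap<>();

        SubmissionContext app1 = new SubmissionContext(200);
        memStore.put("app_1", app1);

        SubmissionContext stored = memStore.get("app_1");
        stored.memory = 1024;                 // mutate the "stored" copy

        // Both names alias one object, so the original sees the change,
        // and an equality assert between them cannot catch a transfer bug.
        System.out.println(app1.memory);      // 1024
        System.out.println(stored == app1);   // true
    }
}
```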

3) Yes, it's the Resource in ApplicationSubmissionContext's 
ContainerLaunchContext.
If we save the original resource value requested by the client, then the 
assert statement that compares the ApplicationSubmissionContext will not pass. 
Let's say the client requests a memory resource with a value of 200. We store 
this in our persistent storage. After we store it, the scheduler updates the 
resource to a value of 1024. In this case, the resource in the app1 instance 
will be 1024, but the resource stored in our storage is 200. Hence, the 
comparison will not pass with the current assert statement. Maybe we need to 
keep storing the original resource request in the ApplicationSubmissionContext.

Looking forward to your ZK and HDFS state store. The state store in our project 
uses MySQL Cluster.



[jira] [Commented] (YARN-128) Resurrect RM Restart

2012-11-16 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13498825#comment-13498825
 ] 

Tom White commented on YARN-128:


Bikas, this looks good so far. Thanks for working on it. A few comments:

* Is there a race condition in ResourceManager#recover where RMAppImpl#recover 
is called after the StartAppAttemptTransition from resubmitting the app? The 
problem would be that the earlier app attempts (from before the restart) would 
not be the first ones since the new attempt would get in first.
* I think we need the concept of a 'killed' app attempt (when the system is at 
fault, not the app) as well as a 'failed' attempt, like we have in MR task 
attempts. Without the distinction a restart will count against the user's app 
attempts (default 1 retry) which is undesirable.
* Rather than change the ResourceManager constructor, you could read the 
recoveryEnabled flag from the configuration.



[jira] [Commented] (YARN-128) Resurrect RM Restart

2012-11-16 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13498876#comment-13498876
 ] 

Bikas Saha commented on YARN-128:
-

@Arinto
Thanks for using the code!
1) Yes. Both are the same object. But that is what the test is testing: that 
the context that got saved in the store is the same as the one the app was 
submitted with. We are doing this with an in-memory store, which lets us examine 
the stored data and compare it with the real data. A real store would persist 
the data elsewhere, so this comparison would not be possible.
3) Yes. It seems incorrect to store scheduler side-effects. E.g. upon restart, 
if the scheduler config makes the minimum container size 512, then again it will 
not match.
I am attaching a patch for a ZK store that you can try. It applies on top of 
the current full patch.

@Tom
Thanks for reviewing!
1) There is no race condition because the Dispatcher has not been started yet 
and hence the attempt start event has not been processed. There is a comment to 
that effect in the code.
2) I agree. I had thought about it too. But it looks like the current behavior 
(before this patch) does this because it does not differentiate killed/failed 
attempts when deciding that the attempt retry limit has been reached. So I 
thought about leaving it for a separate jira which would be unrelated to this. 
Once that is done this code could use it and not count the restarted attempt. 
This patch is already huge. Does that sound good?
3) Yes. That could be done. The constructor makes it easier to write tests 
without mangling configs.



[jira] [Commented] (YARN-128) Resurrect RM Restart

2012-11-16 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13498883#comment-13498883
 ] 

Tom White commented on YARN-128:


You are right about there being no race - I missed the comment! I opened 
YARN-218 for the killed/failed distinction as I agree it can be tackled 
separately.



[jira] [Commented] (YARN-128) Resurrect RM Restart

2012-11-13 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13496129#comment-13496129
 ] 

Bikas Saha commented on YARN-128:
-

Attaching rebased patches, plus a change to RMStateStore to throw an exception 
to notify about store errors.



[jira] [Commented] (YARN-128) Resurrect RM Restart

2012-11-11 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13494892#comment-13494892
 ] 

Bikas Saha commented on YARN-128:
-

Updating patches for new code and combined patch.
Changes
1) Code added to remove application data upon completion
2) All TODO's examined and removed/fixed.
3) Improved TestRMRestart and its readability
4) Added more tests for RMAppAttemptTransitions
5) Refactored RMStateStore into an abstract class so that it can implement 
common functionality to notify the app attempt about async store operation 
completion

The fix for the capacity scheduler bug is still in this patch because that bug 
blocks test completion. The issue is also tracked in YARN-209.




[jira] [Commented] (YARN-128) Resurrect RM Restart

2012-11-11 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13495097#comment-13495097
 ] 

Bikas Saha commented on YARN-128:
-

Attaching rebased patches



[jira] [Commented] (YARN-128) Resurrect RM Restart

2012-11-06 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13491635#comment-13491635
 ] 

Bikas Saha commented on YARN-128:
-

Devaraj, I think the current approach+code based on the zkstore (which 
YARN-128.patch builds on top of) has some significant issues wrt the 
perf/scalability of ZK and future HA. The design outline attached to this jira 
calls out some of the issues. The next proposal document will help clarify a 
bit more, I hope.



[jira] [Commented] (YARN-128) Resurrect RM Restart

2012-09-25 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13462832#comment-13462832
 ] 

Robert Joseph Evans commented on YARN-128:
--

The problem is that we cannot be truly backwards compatible when adding in this 
feature.  We have to better define the lifecycle of an AM for it to be well 
behaved and properly handle RM recovery.  I would say that if the client asks 
the AM to stop it should still pause on unregister until it can successfully 
unregister, or until it can mark itself as killed in a persistent way like 
with the job history log, so that when that AM is relaunched all it has to do 
is to check a file on HDFS and then unregister.  Perhaps the only way to be 
totally backwards compatible is for the AM to indicate when it registers if it 
supports RM recovery or not.  Or to avoid any race conditions when the client 
launches the AM it would indicate this.  If it does not (legacy AMs), then the 
RM will not try to relaunch it if the AM goes down while the RM is recovering.  
If it does, then the AM will always be relaunched when the RM goes down. 



[jira] [Commented] (YARN-128) Resurrect RM Restart

2012-09-24 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13462034#comment-13462034
 ] 

Vinod Kumar Vavilapalli commented on YARN-128:
--

Pasting notes from Bikas inline for easier discussion.

h4.Basic Idea:
Key idea is that the state of the cluster is its current state. So don't save 
all container info.
RM on startup sets a recovery flag on. Informs scheduler via API.
Re-create running AM info from persisted state. Running AM's will heartbeat to 
the RM and be asked to re-sync.
Re-start AM's that have been lost. What about AM's that completed during the 
restart? Re-running them should be a no-op.
Ask running and re-started AM's to re-send all pending container requests to 
re-create pending request state.
RM accepts new AM registrations and their requests.
Scheduling pass is not performed when recovery flag is on.
RM waits for nodes to heartbeat and give it container info.
RM passes container info to scheduler so that the scheduler can re-create 
current allocation state.
After recovery time threshold, reset recovery flag and start the scheduling 
pass. Normal from thereon.
Schedulers could save their state and recover previous allocation information 
from that saved state.

h4.What info comes in node heartbeats:
Handle sequence number mismatch during recovery. On heartbeat from a node, 
send a ReRegister command instead of Reboot. The NodeManager should continue 
running containers during this time.
The RM sends commands back to clean up containers/applications. Can orphans be 
left behind on nodes after RM restart? Will the NM be able to auto-clean 
containers?
The ApplicationAttemptId can be obtained from Container objects to map 
resources back to SchedulingApp.
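The ReRegister-instead-of-Reboot decision can be illustrated like this (the enum and method are illustrative, not YARN's actual heartbeat-handling code): a response-id mismatch normally means the node is out of sync and must reboot, but during recovery the mismatch is expected, so the node is asked to re-register and its containers survive.

```java
// Illustrative decision logic for a heartbeat whose response id does not
// match what the RM expects. Names are hypothetical, not the YARN API.
public class HeartbeatPolicy {
    public enum NodeAction { NORMAL, REREGISTER, REBOOT }

    public static NodeAction onHeartbeat(int expectedResponseId,
                                         int heartbeatResponseId,
                                         boolean rmRecovering) {
        if (heartbeatResponseId == expectedResponseId) {
            return NodeAction.NORMAL; // in-sync heartbeat, nothing special
        }
        // Sequence mismatch: after an RM restart this is expected, so keep
        // the node (and its running containers) alive and just re-sync state.
        return rmRecovering ? NodeAction.REREGISTER : NodeAction.REBOOT;
    }
}
```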

h4.How to pause scheduling pass:
The scheduling pass is triggered on NODE_UPDATE events that happen on node 
heartbeat. Easy to pause under the recovery flag.
YarnScheduler.allocate() is the API that needs to be changed.
How to handle container release messages that were lost while the RM was down? 
Will AMs get a delivery failure and continue to resend indefinitely?

h4.How to re-create scheduler allocation state:
On node re-register, RM passes container info to scheduler so that the 
scheduler can re-create current allocation state.
Use CsQueue.recoverContainer() to recover previous allocations from currently 
running containers.
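The re-creation step can be sketched as follows. This is a simplified model with plain string ids, not the CapacityScheduler/CsQueue API: a re-registering node reports its live containers, the RM maps each container back to its application attempt (which is recoverable from the Container object, per the note above), and the scheduler would then charge each group back to the right queue.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified model of rebuilding allocation state from node-reported
// containers. Real YARN types are replaced with strings for illustration.
public class AllocationRecovery {
    /**
     * Groups live containers (container id -> application attempt id, as
     * reported by a re-registering node) by attempt, so each group can be
     * handed to the scheduler's recover path for its queue.
     */
    public static Map<String, List<String>> groupByAttempt(
            Map<String, String> liveContainers) {
        Map<String, List<String>> perAttempt = new HashMap<>();
        for (Map.Entry<String, String> e : liveContainers.entrySet()) {
            perAttempt.computeIfAbsent(e.getValue(), k -> new ArrayList<>())
                      .add(e.getKey());
        }
        return perAttempt;
    }
}
```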

h4.How to re-synchronize pending requests with AM's:
Need a new AM-RM API to resend asks from the AM to the RM.
Keep accumulating asks from AMs, as currently happens when allocate() is 
called.
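Why re-sending asks is safe to accumulate can be shown with a small model (simplified key and types, not the RM's actual ask bookkeeping): if asks are stored keyed by (priority, location) and a resent ask replaces the stored value rather than adding to it, then an AM re-sending its full pending table after an RM restart simply rebuilds the request state idempotently.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of idempotent ask accumulation: each ask carries the
// absolute number of containers still wanted, so resends replace, not add.
public class PendingAsks {
    private final Map<String, Integer> asks = new HashMap<>(); // key -> #containers

    public void updateAsk(int priority, String location, int numContainers) {
        asks.put(priority + "/" + location, numContainers); // replace, don't add
    }

    public int pending(int priority, String location) {
        return asks.getOrDefault(priority + "/" + location, 0);
    }
}
```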

h4.How to persist AM state:
Store AM info in a persistent ZK node that uses version numbers to prevent 
out-of-order updates from other RMs. One ZK node per AM under a master RM ZK 
node. AM submission creates the ZK node. Start and restart update the ZK node. 
Completion clears the ZK node.
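The version-number protection works like ZooKeeper's conditional setData(path, data, expectedVersion): a writer must present the version it last read, and a stale RM instance holding an old version is rejected. The in-memory model below illustrates the idea only; it is not the ZooKeeper client API.

```java
// In-memory model of a version-numbered record, illustrating how out-of-order
// updates from a stale RM are rejected. Not the actual ZooKeeper API.
public class VersionedAmRecord {
    private byte[] data;
    private int version = 0;

    /** Conditional write: succeeds only if the caller read the latest version. */
    public synchronized boolean setData(byte[] newData, int expectedVersion) {
        if (expectedVersion != version) {
            return false;  // stale writer: reject the update
        }
        data = newData;
        version++;         // every successful write bumps the version
        return true;
    }

    public synchronized int getVersion() {
        return version;
    }
}
```

A failed-over RM would first read the node (learning the current version) before writing, so only the most recently active RM can make progress.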

h4.Metrics:
What needs to be done to maintain consistency across restarts? A new app 
attempt would be a new attempt, but what about recovered running apps?

h4.Security:
What information about keys and tokens should persist across restart so that 
existing secure containers continue to run with the new RM and new containers? 
The ZK nodes themselves should be secure.



[jira] [Commented] (YARN-128) Resurrect RM Restart

2012-09-24 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13462375#comment-13462375
 ] 

Thomas Graves commented on YARN-128:


{quote}
{quote}
What about AM's that completed during restart. Re-running them should be a 
no-op.
{quote}
AMs should not finish themselves while the RM is down or recovering. They 
should just spin.
{quote}
Doesn't the RM still need to handle this? The client could stop the AM at any 
point by talking directly to it. Or, since anyone can write an AM, it could 
simply finish on its own. Or there could be a timing issue on app finish. How 
does the RM tell the difference? We can have the MR client/AM handle this 
nicely, but even then there could be a bug or an expiry after so long. So 
perhaps if the AM is down it doesn't get restarted? That's probably not ideal 
if the app happens to go down at the same time as the RM, though (like a rack 
gets rebooted or something), but otherwise you have to handle all the restart 
issues, like Bobby mentioned above.


