[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart

2015-05-01 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14523973#comment-14523973
 ] 

Bikas Saha commented on YARN-556:
-

[~jianhe] [~adhoot] [~kasha] [~vinodkv] Should we resolve this jira as complete?

 RM Restart phase 2 - Work preserving restart
 

 Key: YARN-556
 URL: https://issues.apache.org/jira/browse/YARN-556
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Bikas Saha
 Attachments: Work Preserving RM Restart.pdf, 
 WorkPreservingRestartPrototype.001.patch, YARN-1372.prelim.patch


 YARN-128 covered storing the state needed for the RM to recover critical 
 information. This umbrella jira will track changes needed to recover the 
 running state of the cluster so that work can be preserved across RM restarts.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart

2014-10-02 Thread Santosh Marella (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157461#comment-14157461
 ] 

Santosh Marella commented on YARN-556:
--

Referencing YARN-2476 here to ensure the specific scenario mentioned there is 
fixed as part of this JIRA.

 RM Restart phase 2 - Work preserving restart
 

 Key: YARN-556
 URL: https://issues.apache.org/jira/browse/YARN-556
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Bikas Saha
 Attachments: Work Preserving RM Restart.pdf, 
 WorkPreservingRestartPrototype.001.patch, YARN-1372.prelim.patch


 YARN-128 covered storing the state needed for the RM to recover critical 
 information. This umbrella jira will track changes needed to recover the 
 running state of the cluster so that work can be preserved across RM restarts.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart

2014-05-15 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998513#comment-13998513
 ] 

Tsuyoshi OZAWA commented on YARN-556:
-

{code}
Oh. Forgot to mention that. Anubhav Dhoot offered to split up the prototype 
into multiple patches, one for each of the sub-tasks. If I understand right, 
his prototype covers almost all the sub-tasks already created.
{code}

[~adhoot], thanks for your great work. I noticed that you attached a patch on 
YARN-1367. I'll comment there about the patch.

 RM Restart phase 2 - Work preserving restart
 

 Key: YARN-556
 URL: https://issues.apache.org/jira/browse/YARN-556
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Bikas Saha
 Attachments: Work Preserving RM Restart.pdf, 
 WorkPreservingRestartPrototype.001.patch


 YARN-128 covered storing the state needed for the RM to recover critical 
 information. This umbrella jira will track changes needed to recover the 
 running state of the cluster so that work can be preserved across RM restarts.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart

2014-05-13 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13993895#comment-13993895
 ] 

Karthik Kambatla commented on YARN-556:
---

For the scheduler-related work itself, the offline sync up thought it would be 
best to move as much common code as possible to AbstractYarnScheduler. To 
unblock the restart work at the earliest, we should do it in two phases - the 
first phase that only pulls out stuff that would make it easier to handle the 
recovery, and a more comprehensive re-jig later.

 RM Restart phase 2 - Work preserving restart
 

 Key: YARN-556
 URL: https://issues.apache.org/jira/browse/YARN-556
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Bikas Saha
 Attachments: Work Preserving RM Restart.pdf, 
 WorkPreservingRestartPrototype.001.patch


 YARN-128 covered storing the state needed for the RM to recover critical 
 information. This umbrella jira will track changes needed to recover the 
 running state of the cluster so that work can be preserved across RM restarts.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart

2014-05-13 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996343#comment-13996343
 ] 

Tsuyoshi OZAWA commented on YARN-556:
-

Good point, Bikas. Created YARN-2052 for tracking container id discussion. 
[~adhoot], let's discuss there.

 RM Restart phase 2 - Work preserving restart
 

 Key: YARN-556
 URL: https://issues.apache.org/jira/browse/YARN-556
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Bikas Saha
 Attachments: Work Preserving RM Restart.pdf, 
 WorkPreservingRestartPrototype.001.patch


 YARN-128 covered storing the state needed for the RM to recover critical 
 information. This umbrella jira will track changes needed to recover the 
 running state of the cluster so that work can be preserved across RM restarts.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart

2014-05-12 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13994893#comment-13994893
 ] 

Vinod Kumar Vavilapalli commented on YARN-556:
--

Also, if there is a general agreement on how patches should go in which order, 
please create that ordering through JIRA dependencies. Thanks.

 RM Restart phase 2 - Work preserving restart
 

 Key: YARN-556
 URL: https://issues.apache.org/jira/browse/YARN-556
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Bikas Saha
 Attachments: Work Preserving RM Restart.pdf, 
 WorkPreservingRestartPrototype.001.patch


 YARN-128 covered storing the state needed for the RM to recover critical 
 information. This umbrella jira will track changes needed to recover the 
 running state of the cluster so that work can be preserved across RM restarts.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart

2014-05-12 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13994886#comment-13994886
 ] 

Vinod Kumar Vavilapalli commented on YARN-556:
--

Tx for the community update, Karthik.

Also, Jian/Abhinav, can you both please file all the known sub-tasks and assign 
things to yourselves according as you are working on them rightaway? Other 
folks like [~ozawa] and [~rohithsharma] have been requesting repeatedly 
expressed interest to work on this feature. It'll be great to find stuff for 
everyone instead of creating all tickets and assigning them to the two of you. 
Thanks.

[~ozawa] and [~rohithsharma], let others know what you specifically want to 
work on, if you have something in mind.

bq.  6. clustertimestamp is added to containerId so that containerId after RM 
restart do not clash with containerId before (as the containerId counter resets 
to zero in memory)
I totally missed this line item. Can you throw more detail on what the problem 
is and what the proposal is? What is done in the prototype patch is a major 
compatibility issue - I'd like to avoid it if we can.

 RM Restart phase 2 - Work preserving restart
 

 Key: YARN-556
 URL: https://issues.apache.org/jira/browse/YARN-556
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Bikas Saha
 Attachments: Work Preserving RM Restart.pdf, 
 WorkPreservingRestartPrototype.001.patch


 YARN-128 covered storing the state needed for the RM to recover critical 
 information. This umbrella jira will track changes needed to recover the 
 running state of the cluster so that work can be preserved across RM restarts.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart

2014-05-12 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995165#comment-13995165
 ] 

Karthik Kambatla commented on YARN-556:
---

Oh. Forgot to mention that. [~adhoot] offered to split up the prototype into 
multiple patches, one for each of the sub-tasks. If I understand right, his 
prototype covers almost all the sub-tasks already created.

 RM Restart phase 2 - Work preserving restart
 

 Key: YARN-556
 URL: https://issues.apache.org/jira/browse/YARN-556
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Bikas Saha
 Attachments: Work Preserving RM Restart.pdf, 
 WorkPreservingRestartPrototype.001.patch


 YARN-128 covered storing the state needed for the RM to recover critical 
 information. This umbrella jira will track changes needed to recover the 
 running state of the cluster so that work can be preserved across RM restarts.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart

2014-05-12 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995328#comment-13995328
 ] 

Anubhav Dhoot commented on YARN-556:


bq. clustertimestamp is added to containerId so that containerId after RM 
restart do not clash with containerId before (as the containerId counter resets 
to zero in memory). 

The problem is the containerId currently is composed of  ApplicationAttemptId + 
int. The int part comes from a in memory containerIdCounter from 
AppSchedulingInfo. This gets reset after a RM restart. Without any changes the 
containerIds for containers allocated after restart would clash with existing 
containerIds. 
The prototype proposal is to make it ApplicationAttemptId + uniqueid + int 
where the uniqueid can be a timestamp set by RM. I feel containerId should be 
an opaque string that YARN app developers don't take a dependency on. Also if 
we used protobuf serialization/deserialization rules everywhere we could deal 
with compatibility changes of different YARN code versions. 

 RM Restart phase 2 - Work preserving restart
 

 Key: YARN-556
 URL: https://issues.apache.org/jira/browse/YARN-556
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Bikas Saha
 Attachments: Work Preserving RM Restart.pdf, 
 WorkPreservingRestartPrototype.001.patch


 YARN-128 covered storing the state needed for the RM to recover critical 
 information. This umbrella jira will track changes needed to recover the 
 running state of the cluster so that work can be preserved across RM restarts.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart

2014-05-12 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995511#comment-13995511
 ] 

Tsuyoshi OZAWA commented on YARN-556:
-

If we can break the compatibility about the container id, I think Anubhav's 
approach has no problem.
If we cannot do this as [~jianhe] mentioned on YARN-2001, I think epoch idea 
[described 
here|https://issues.apache.org/jira/browse/YARN-2001?focusedCommentId=13995213page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13995213]
 might be used.

 RM Restart phase 2 - Work preserving restart
 

 Key: YARN-556
 URL: https://issues.apache.org/jira/browse/YARN-556
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Bikas Saha
 Attachments: Work Preserving RM Restart.pdf, 
 WorkPreservingRestartPrototype.001.patch


 YARN-128 covered storing the state needed for the RM to recover critical 
 information. This umbrella jira will track changes needed to recover the 
 running state of the cluster so that work can be preserved across RM restarts.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart

2014-05-12 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995878#comment-13995878
 ] 

Bikas Saha commented on YARN-556:
-

Folks please take the discussion for container id to its own jira. Spreading it 
in the main jira will make it harder to track.

 RM Restart phase 2 - Work preserving restart
 

 Key: YARN-556
 URL: https://issues.apache.org/jira/browse/YARN-556
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Bikas Saha
 Attachments: Work Preserving RM Restart.pdf, 
 WorkPreservingRestartPrototype.001.patch


 YARN-128 covered storing the state needed for the RM to recover critical 
 information. This umbrella jira will track changes needed to recover the 
 running state of the cluster so that work can be preserved across RM restarts.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart

2014-05-12 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995829#comment-13995829
 ] 

Bikas Saha commented on YARN-556:
-

bq. After the configurable wait-time, the RM starts accepting RPCs from both 
new AMs and already existing AMs.
This is not needed. The AM can be allowed to re-sync after state is recovered 
from the store. Allocations to the AM may not occur until the threshold 
elapses. In fact, we want to re-sync the AM's asap so that they dont give up on 
the RM.

bq. Existing AMs are expected to resync with the RM, which essentially 
translates to register followed by an allocate call
We should keep the option open to use a new API called resync that does exactly 
that. It may help to make this operation atomic





 RM Restart phase 2 - Work preserving restart
 

 Key: YARN-556
 URL: https://issues.apache.org/jira/browse/YARN-556
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Bikas Saha
 Attachments: Work Preserving RM Restart.pdf, 
 WorkPreservingRestartPrototype.001.patch


 YARN-128 covered storing the state needed for the RM to recover critical 
 information. This umbrella jira will track changes needed to recover the 
 running state of the cluster so that work can be preserved across RM restarts.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart

2014-04-29 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13985148#comment-13985148
 ] 

Jian He commented on YARN-556:
--

Hi Anubhav,
Looked at the prototype patch. Regarding the approach, it’s better to have a 
scheduler-agnostic recovery mechanism with no or minimum  scheduler-specific 
changes, instead of implementing each scheduler specifically. YARN-1368 can be 
renamed to accommodate  the necessary common changes for all schedulers.Also, 
adding cluster timestamp to the container Id doesn’t  seem right and that’ll 
also break compatibility. 


 RM Restart phase 2 - Work preserving restart
 

 Key: YARN-556
 URL: https://issues.apache.org/jira/browse/YARN-556
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Bikas Saha
 Attachments: Work Preserving RM Restart.pdf, 
 WorkPreservingRestartPrototype.001.patch


 YARN-128 covered storing the state needed for the RM to recover critical 
 information. This umbrella jira will track changes needed to recover the 
 running state of the cluster so that work can be preserved across RM restarts.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart

2014-04-21 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13975541#comment-13975541
 ] 

Tsuyoshi OZAWA commented on YARN-556:
-

[~adhoot], I glanced over your patch.

1. Can you split your code into each subtasks? Your patch includes overall 
changes of this task. We should discuss small points on each subtask JIRA.
2. IMO, prototype is enough to validate the design. Do you have any additional 
comments about design docs?

I'd like to include this feature in 2.5.0(maybe May - June?), so let's work 
togather :-)

 RM Restart phase 2 - Work preserving restart
 

 Key: YARN-556
 URL: https://issues.apache.org/jira/browse/YARN-556
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Bikas Saha
 Attachments: Work Preserving RM Restart.pdf, 
 WorkPreservingRestartPrototype.001.patch


 YARN-128 covered storing the state needed for the RM to recover critical 
 information. This umbrella jira will track changes needed to recover the 
 running state of the cluster so that work can be preserved across RM restarts.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart

2014-04-18 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13974724#comment-13974724
 ] 

Tsuyoshi OZAWA commented on YARN-556:
-

Anubhav, Thank you for sharing the prototype. I will try it this weekend. 

 RM Restart phase 2 - Work preserving restart
 

 Key: YARN-556
 URL: https://issues.apache.org/jira/browse/YARN-556
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Bikas Saha
 Attachments: Work Preserving RM Restart.pdf, 
 WorkPreservingRestartPrototype.001.patch


 YARN-128 covered storing the state needed for the RM to recover critical 
 information. This umbrella jira will track changes needed to recover the 
 running state of the cluster so that work can be preserved across RM restarts.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart

2014-03-24 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13945437#comment-13945437
 ] 

Bikas Saha commented on YARN-556:
-

Please align with the design doc while prototyping. If the design needs changes 
then please update the document. The sub-tasks need to follow the design doc so 
that other folks can follow even if they are not writing the code.

Some pieces of this are already underway in trunk (eg. RM not killing the 
containers on app attempt exit). The scheduler changes are the most complex 
piece. But they can come in the end. Working on trunk allows breaks/bugs to be 
caught quicker and forces us to be more methodical in our approach. A branch is 
useful when its not clear what approach to take or when we know the code is 
going to be broken across commits. So I would prefer we do this on trunk.

 RM Restart phase 2 - Work preserving restart
 

 Key: YARN-556
 URL: https://issues.apache.org/jira/browse/YARN-556
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Bikas Saha
 Attachments: Work Preserving RM Restart.pdf


 YARN-128 covered storing the state needed for the RM to recover critical 
 information. This umbrella jira will track changes needed to recover the 
 running state of the cluster so that work can be preserved across RM restarts.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart

2014-03-24 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13945480#comment-13945480
 ] 

Karthik Kambatla commented on YARN-556:
---

bq. Please align with the design doc while prototyping. If the design needs 
changes then please update the document. The sub-tasks need to follow the 
design doc so that other folks can follow even if they are not writing the code.
Yes, that is the idea. The prototype should be mostly ready by end of the week. 
Will update the document with any minor changes we see are required, along with 
a prototype. 

bq. The scheduler changes are the most complex piece. But they can come in the 
end.
Without the scheduler changes, I am concerned the remaining patches would only 
break things. The alternative is to have a config to enable work-preserving 
restart and guard all changes by that config - I am not yet fully convinced of 
this approach, would we want to leave this config even after the feature is 
complete? 

 RM Restart phase 2 - Work preserving restart
 

 Key: YARN-556
 URL: https://issues.apache.org/jira/browse/YARN-556
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Bikas Saha
 Attachments: Work Preserving RM Restart.pdf


 YARN-128 covered storing the state needed for the RM to recover critical 
 information. This umbrella jira will track changes needed to recover the 
 running state of the cluster so that work can be preserved across RM restarts.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart

2014-03-24 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13945587#comment-13945587
 ] 

Vinod Kumar Vavilapalli commented on YARN-556:
--

I don't see the value of a prototype given we have a mostly concrete design. 
It's fine to do it, but let's make sure we are not taking shortcuts in the 
interest of getting a quick  dirty version up.

 RM Restart phase 2 - Work preserving restart
 

 Key: YARN-556
 URL: https://issues.apache.org/jira/browse/YARN-556
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Bikas Saha
 Attachments: Work Preserving RM Restart.pdf


 YARN-128 covered storing the state needed for the RM to recover critical 
 information. This umbrella jira will track changes needed to recover the 
 running state of the cluster so that work can be preserved across RM restarts.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart

2014-03-24 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13945650#comment-13945650
 ] 

Karthik Kambatla commented on YARN-556:
---

We think the prototype would be a validation of the design. Individual 
sub-tasks will go through the same rigor of unit tests and code review. It 
would help to add further details to the design or evaluate any minor changes 
required before committing the sub-tasks. 

 RM Restart phase 2 - Work preserving restart
 

 Key: YARN-556
 URL: https://issues.apache.org/jira/browse/YARN-556
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Bikas Saha
 Attachments: Work Preserving RM Restart.pdf


 YARN-128 covered storing the state needed for the RM to recover critical 
 information. This umbrella jira will track changes needed to recover the 
 running state of the cluster so that work can be preserved across RM restarts.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart

2014-03-23 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13944730#comment-13944730
 ] 

Tsuyoshi OZAWA commented on YARN-556:
-

[~jianhe], your approach looks good to me. We can test new features with the 
updated protocol. About the NM side, we can choose switch on/off the NM resync 
by using configuration.

[~kkambatl] and [~adhoot], can you attach prototype source code to JIRAs? I'd 
like to contribute this JIRA and work with you.

 RM Restart phase 2 - Work preserving restart
 

 Key: YARN-556
 URL: https://issues.apache.org/jira/browse/YARN-556
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Bikas Saha
 Attachments: Work Preserving RM Restart.pdf


 YARN-128 covered storing the state needed for the RM to recover critical 
 information. This umbrella jira will track changes needed to recover the 
 running state of the cluster so that work can be preserved across RM restarts.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart

2014-03-22 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13943943#comment-13943943
 ] 

Jian He commented on YARN-556:
--

IMO, I would prefer work from the protocol changes first, RM can choose to 
ignore the container statuses reports for the time being. It's not able to test 
on a real cluster if we make scheduler changes only, since there are no real 
entities to report the container statuses. If possible, I'd like this happen on 
trunk since this can be deeply coupled inside RM, we can catch bugs as early as 
possible and also avoid the merge nightmare. Thoughts?

 RM Restart phase 2 - Work preserving restart
 

 Key: YARN-556
 URL: https://issues.apache.org/jira/browse/YARN-556
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Bikas Saha
 Attachments: Work Preserving RM Restart.pdf


 YARN-128 covered storing the state needed for the RM to recover critical 
 information. This umbrella jira will track changes needed to recover the 
 running state of the cluster so that work can be preserved across RM restarts.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart

2014-03-22 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13943944#comment-13943944
 ] 

Jian He commented on YARN-556:
--

Or we can work from a branch first and then move to trunk once it's in a good 
shape.

 RM Restart phase 2 - Work preserving restart
 

 Key: YARN-556
 URL: https://issues.apache.org/jira/browse/YARN-556
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Bikas Saha
 Attachments: Work Preserving RM Restart.pdf


 YARN-128 covered storing the state needed for the RM to recover critical 
 information. This umbrella jira will track changes needed to recover the 
 running state of the cluster so that work can be preserved across RM restarts.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart

2014-03-21 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13943789#comment-13943789
 ] 

Karthik Kambatla commented on YARN-556:
---

Thanks for posting the design doc, [~bikassaha]. [~adhoot] and I have been 
working on this for the past few days towards an initial prototype, so we get a 
handle on all the items required.

In terms of actual work-items (JIRAs), I wonder if it makes sense to work in a 
branch. Making the AM, NM resync changes without the scheduler changes would 
break things. We can work on the scheduler changes first, so there is no caller 
and add resync later, but I suppose that would make it hard to test outside of 
unit tests.

Thoughts? 

 RM Restart phase 2 - Work preserving restart
 

 Key: YARN-556
 URL: https://issues.apache.org/jira/browse/YARN-556
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Bikas Saha
 Attachments: Work Preserving RM Restart.pdf


 YARN-128 covered storing the state needed for the RM to recover critical 
 information. This umbrella jira will track changes needed to recover the 
 running state of the cluster so that work can be preserved across RM restarts.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart

2013-10-30 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13808815#comment-13808815
 ] 

Bikas Saha commented on YARN-556:
-

Added some coarse grained tasks based on the attached proposal. More tasks may 
be added as details get dissected.

 RM Restart phase 2 - Work preserving restart
 

 Key: YARN-556
 URL: https://issues.apache.org/jira/browse/YARN-556
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Bikas Saha
 Attachments: Work Preserving RM Restart.pdf


 YARN-128 covered storing the state needed for the RM to recover critical 
 information. This umbrella jira will track changes needed to recover the 
 running state of the cluster so that work can be preserved across RM restarts.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart

2013-10-08 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13789862#comment-13789862
 ] 

Tsuyoshi OZAWA commented on YARN-556:
-

Hi Bikas, can you share the current state about this JIRA?

 RM Restart phase 2 - Work preserving restart
 

 Key: YARN-556
 URL: https://issues.apache.org/jira/browse/YARN-556
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Bikas Saha
 Attachments: Work Preserving RM Restart.pdf


 YARN-128 covered storing the state needed for the RM to recover critical 
 information. This umbrella jira will track changes needed to recover the 
 running state of the cluster so that work can be preserved across RM restarts.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart

2013-10-08 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13789919#comment-13789919
 ] 

Bikas Saha commented on YARN-556:
-

Thanks for the reminder. Based on the attached proposal, I am going to create 
sub-tasks of this jira. Contributors are free to pick up those tasks.

 RM Restart phase 2 - Work preserving restart
 

 Key: YARN-556
 URL: https://issues.apache.org/jira/browse/YARN-556
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Bikas Saha
 Attachments: Work Preserving RM Restart.pdf


 YARN-128 covered storing the state needed for the RM to recover critical 
 information. This umbrella jira will track changes needed to recover the 
 running state of the cluster so that work can be preserved across RM restarts.



--
This message was sent by Atlassian JIRA
(v6.1#6144)