[jira] [Commented] (YARN-1489) [Umbrella] Work-preserving ApplicationMaster restart

2016-03-03 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15178653#comment-15178653
 ] 

Vinod Kumar Vavilapalli commented on YARN-1489:
---

bq. That and the "Old running containers don't know where the new AM is 
running." issue is big enough that we shouldn't close this umbrella as done.
Just filed YARN-4758.

> [Umbrella] Work-preserving ApplicationMaster restart
> 
>
> Key: YARN-1489
> URL: https://issues.apache.org/jira/browse/YARN-1489
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Vinod Kumar Vavilapalli
> Assignee: Vinod Kumar Vavilapalli
> Attachments: Work preserving AM restart.pdf
>
>
> Today if AMs go down,
>  - RM kills all the containers of that ApplicationAttempt
>  - New ApplicationAttempt doesn't know where the previous containers are 
> running
>  - Old running containers don't know where the new AM is running.
> We need to fix this to enable work-preserving AM restart. The latter two 
> potentially can be done at the app level, but it is good to have a common 
> solution for all apps wherever possible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1489) [Umbrella] Work-preserving ApplicationMaster restart

2016-02-25 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167132#comment-15167132
 ] 

Junping Du commented on YARN-1489:
--

Another big problem is that we don't actually notify a restarted AM of the 
"finished-in-the-interim" containers. In YARN-1041, the RM only reports the 
running containers to the new AM attempt, so if a container finishes during 
this time, the new AM attempt has no way to know about it. This leads to 
different behaviors across AMs: MR relies on the job history log to recover 
tasks, while Distributed Shell will launch these (finished) containers again. 
I think we should enhance this.
I will file a separate JIRA if we don't have one yet.
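
The gap can be illustrated with a small sketch. It assumes, hypothetically, 
that the AM keeps its own record of the containers the previous attempt 
launched (for example in a recovery log) and that the RM reports only the 
still-running ones at re-registration. The function name and container IDs 
are made up for illustration and are not YARN APIs.

```python
def reconcile(previously_launched, reported_running, known_completed):
    """Split the previous attempt's containers into three groups.

    previously_launched: container IDs the old attempt had launched
    reported_running:    IDs the RM reports as still running at re-register
    known_completed:     IDs the AM already knew had finished before it died
    """
    launched = set(previously_launched)
    running = launched & set(reported_running)
    completed = launched & set(known_completed)
    # Whatever is neither running nor known-completed ended while no AM was
    # listening; the AM cannot tell success from failure for these.
    unknown = launched - running - completed
    return sorted(running), sorted(completed), sorted(unknown)


running, completed, unknown = reconcile(
    previously_launched=["c1", "c2", "c3", "c4"],
    reported_running=["c1", "c3"],
    known_completed=["c2"],
)
print(running, completed, unknown)
```

The `unknown` group is exactly the "finished-in-the-interim" set: without 
help from the RM, each AM has to invent its own policy for it.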




[jira] [Commented] (YARN-1489) [Umbrella] Work-preserving ApplicationMaster restart

2016-02-24 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15162794#comment-15162794
 ] 

Junping Du commented on YARN-1489:
--

bq. That and the "Old running containers don't know where the new AM is 
running." issue is big enough that we shouldn't close this umbrella as done.
I don't think we have an open JIRA under this umbrella to track this issue. Is 
this a specific issue for MR (like we discussed on MAPREDUCE-6608) or a generic 
issue for other frameworks (Spark, etc.) too? YARN-4602 was filed to track it 
as a generic problem of passing messages between containers.



[jira] [Commented] (YARN-1489) [Umbrella] Work-preserving ApplicationMaster restart

2016-02-23 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15160061#comment-15160061
 ] 

Vinod Kumar Vavilapalli commented on YARN-1489:
---

That and the "Old running containers don't know where the new AM is running." 
issue is big enough that we shouldn't close this umbrella as done.



[jira] [Commented] (YARN-1489) [Umbrella] Work-preserving ApplicationMaster restart

2016-01-13 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15096879#comment-15096879
 ] 

Karthik Kambatla commented on YARN-1489:


The two unassigned open JIRAs still seem very valid. One of them might be 
related to YARN-1815. I am comfortable with converting them to issues and 
closing the umbrella JIRA.



[jira] [Commented] (YARN-1489) [Umbrella] Work-preserving ApplicationMaster restart

2016-01-13 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15096860#comment-15096860
 ] 

Junping Du commented on YARN-1489:
--

Hi [~vinodkv], [~jianhe] and [~ka...@cloudera.com], is this feature already 
completed? If so, maybe we can close it as fixed?



[jira] [Commented] (YARN-1489) [Umbrella] Work-preserving ApplicationMaster restart

2014-05-15 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13993962#comment-13993962
 ] 

Karthik Kambatla commented on YARN-1489:


Created a couple of sub-tasks based on an offline discussion with Anubhav, 
Bikas, Jian and Vinod.



[jira] [Commented] (YARN-1489) [Umbrella] Work-preserving ApplicationMaster restart

2014-01-11 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13868891#comment-13868891
 ] 

Bikas Saha commented on YARN-1489:
--

We need to come to a conclusion on how to allow the containers to also find out 
about the new AMs.
Something we have discussed in the past:
1) The new AM, upon registering, provides a payload to the RM.
2) The RM syncs the payload with the NMs on heartbeat. RM and NM already sync 
on running application state, so this payload could piggyback on that.
3) A container on an NM could query the NM about its own AM's payload. This 
local API could be secured by a local token and available only to containers 
running on the local node.
4) The containers would use this payload to reconnect with the AM (in case 
systems don't use external solutions like ZooKeeper for such tracking).

This sounds reasonably lightweight, scalable and self-contained. All the 
interested parties would be informed within a 2*(NmHeartbeat) time interval.
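
The four steps can be sketched as a toy in-process model. This is only an 
illustration under stated assumptions: the class and method names 
(`ResourceManager`, `NodeManager`, `register_am`, `node_heartbeat`, 
`query_am_payload`) and the payload value are hypothetical, heartbeats are 
plain method calls, and the local token check from step 3 is left out.

```python
class ResourceManager:
    def __init__(self):
        self.am_payloads = {}  # app_id -> opaque bytes from the latest AM attempt

    def register_am(self, app_id, payload):
        # Step 1: the new AM hands its payload to the RM at registration.
        self.am_payloads[app_id] = payload

    def node_heartbeat(self, running_app_ids):
        # Step 2: the heartbeat response carries payloads for apps on that node.
        return {a: self.am_payloads[a]
                for a in running_app_ids if a in self.am_payloads}


class NodeManager:
    def __init__(self, rm, running_app_ids):
        self.rm = rm
        self.running_app_ids = set(running_app_ids)
        self.payloads = {}

    def heartbeat(self):
        self.payloads.update(self.rm.node_heartbeat(self.running_app_ids))

    def query_am_payload(self, app_id):
        # Step 3: local query by a container; returns None until a heartbeat
        # has delivered the payload. (Token check omitted in this sketch.)
        return self.payloads.get(app_id)


rm = ResourceManager()
nm = NodeManager(rm, ["app_1"])
rm.register_am("app_1", b"am-host:1234")  # restarted AM registers
before = nm.query_am_payload("app_1")     # not yet propagated
nm.heartbeat()
after = nm.query_am_payload("app_1")      # step 4: container can now reconnect
print(before, after)
```

In this model a container sees the new payload only after the next NM 
heartbeat, so how quickly it learns of the new AM depends on the heartbeat 
interval and on how often the container polls its NM.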



[jira] [Commented] (YARN-1489) [Umbrella] Work-preserving ApplicationMaster restart

2014-01-07 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13864212#comment-13864212
 ] 

Steve Loughran commented on YARN-1489:
--

Regarding the rebinding problem, YARN-913 proposes a registry in which we 
restrict the names of services and apps, and require uniqueness. This lets us 
register something like (hoya, stevel, accumulo5) and then let a client app 
look it up.

Today we have the list of running apps, and you can find and bind to one, but
# there's nothing to stop a single user having more than one instance with the 
same name
# there's no way for an AM to enumerate this, as the list operation isn't in 
the AMRM protocol
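
A minimal sketch of the uniqueness rule, assuming a registry keyed by the 
(service type, user, instance name) triple from the example above. The 
`ServiceRegistry` class, its methods and the endpoint strings are 
hypothetical and do not reflect the YARN-913 design.

```python
class ServiceRegistry:
    """Maps (service_type, user, name) to an endpoint, enforcing uniqueness."""

    def __init__(self):
        self._entries = {}

    def register(self, service_type, user, name, endpoint):
        key = (service_type, user, name)
        if key in self._entries:
            raise ValueError("already registered: %r" % (key,))
        self._entries[key] = endpoint

    def lookup(self, service_type, user, name):
        # Returns None when nothing is registered under that triple.
        return self._entries.get((service_type, user, name))


registry = ServiceRegistry()
registry.register("hoya", "stevel", "accumulo5", "host1:4000")
try:
    registry.register("hoya", "stevel", "accumulo5", "host2:4000")
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
print(registry.lookup("hoya", "stevel", "accumulo5"), duplicate_rejected)
```

The sketch leaves open how a restarted AM would replace its own stale entry, 
which any real design would need to address.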





[jira] [Commented] (YARN-1489) [Umbrella] Work-preserving ApplicationMaster restart

2014-01-07 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13864365#comment-13864365
 ] 

Steve Loughran commented on YARN-1489:
--

Actually, the simplest way for an AM to work with a restarted cluster would be 
if there were a blocking operation to list active containers. At startup it 
could get that list and use it to init its data structures; on a first start 
the list would be empty.

Alternatively, the restart information could be passed down in 
{{RegisterApplicationMasterResponse}}, which would avoid adding any new RPC 
calls.
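
The second option can be sketched as follows. The `RegisterResponse` type 
and its `running_containers` field are hypothetical stand-ins for whatever 
restart information the real response would carry; the point is only that 
first start and restart share one code path.

```python
from dataclasses import dataclass, field


@dataclass
class RegisterResponse:
    # Hypothetical: (container_id, node) pairs still running for this app.
    running_containers: list = field(default_factory=list)


class AppMaster:
    def __init__(self):
        self.containers = {}  # container_id -> node

    def on_registered(self, response):
        # Same code path for first start and restart: an empty list simply
        # leaves the table empty.
        for container_id, node in response.running_containers:
            self.containers[container_id] = node


first = AppMaster()
first.on_registered(RegisterResponse())

restarted = AppMaster()
restarted.on_registered(RegisterResponse([("c1", "node-a"), ("c2", "node-b")]))
print(first.containers, restarted.containers)
```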



[jira] [Commented] (YARN-1489) [Umbrella] Work-preserving ApplicationMaster restart

2014-01-07 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13864382#comment-13864382
 ] 

Bikas Saha commented on YARN-1489:
--

The POR (plan of record) is for the attempt's AMRM register RPC to return the 
currently running containers for that app. So when the attempt makes its 
initial sync with the RM, it will get all that info.



[jira] [Commented] (YARN-1489) [Umbrella] Work-preserving ApplicationMaster restart

2014-01-04 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13862359#comment-13862359
 ] 

Bikas Saha commented on YARN-1489:
--

Here is an idea:
The RM allows the app to send it some data during registration. This data could 
include the AM port information etc. The RM could then sync this data with the 
NMs during NM heartbeats. The NMs maintain per-app-attempt info anyway, and this 
data would be added to that. The containers running on an NM could query it for 
this attempt data and get the information about the new app attempt. This would 
be a scalable and efficient solution.
The data per NM will be small, since it would be size-checked and 
proportional to the number of app attempts. The NM could give access to an 
attempt's data only to the containers that belong to that attempt. Only local 
containers should be able to communicate with their NM for such information. 
This could be done via a local access token that the NM supplies whenever it 
launches a container.
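
The size check and the token-scoped access could look roughly like this. 
Everything here is an assumption made for illustration: the 
`NodeManagerStore` class, the 1024-byte limit, and scoping by application ID 
(a simplification of the per-attempt scoping described above).

```python
import hmac
import secrets

MAX_PAYLOAD_BYTES = 1024  # assumed limit; the thread does not give a number


class NodeManagerStore:
    def __init__(self):
        self._payloads = {}  # app_id -> bytes
        self._tokens = {}    # container_id -> (app_id, token)

    def launch_container(self, container_id, app_id):
        # The NM hands each container a local token at launch.
        token = secrets.token_hex(16)
        self._tokens[container_id] = (app_id, token)
        return token

    def store_payload(self, app_id, payload):
        if len(payload) > MAX_PAYLOAD_BYTES:
            raise ValueError("payload too large")
        self._payloads[app_id] = payload

    def get_payload(self, container_id, token, app_id):
        owner, expected = self._tokens.get(container_id, (None, ""))
        # Constant-time comparison avoids leaking the token through timing.
        if owner != app_id or not hmac.compare_digest(expected, token):
            raise PermissionError("container may not read this app's data")
        return self._payloads.get(app_id)


store = NodeManagerStore()
token_c1 = store.launch_container("c1", "app_1")
token_c2 = store.launch_container("c2", "app_2")
store.store_payload("app_1", b"am-host:1234")
own = store.get_payload("c1", token_c1, "app_1")
try:
    store.get_payload("c2", token_c2, "app_1")
    cross_app_denied = False
except PermissionError:
    cross_app_denied = True
print(own, cross_app_denied)
```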



[jira] [Commented] (YARN-1489) [Umbrella] Work-preserving ApplicationMaster restart

2013-12-30 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13859156#comment-13859156
 ] 

Zhijie Shen commented on YARN-1489:
---

Thanks Vinod for the proposal. One thought when I read the following point.

bq. In case of apps like MapReduce where containers need to communicate 
directly with AMs, the old running-containers don’t know where the new 
ApplicationMaster is running and how to reach it (service addresses).

While the AM is restarting, the containers of some applications may try to 
send messages to the AM, and these messages may get lost. Would it be good to 
buffer the outstanding messages and send them to the AM when rebinding?
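
One possible shape for such buffering on the container side is sketched 
below. The `BufferingSender` class, its bounded queue and the message 
strings are hypothetical; the AM is modelled as a plain callable that 
receives one message.

```python
from collections import deque


class BufferingSender:
    """Queues messages while the AM is unreachable and flushes them on rebind."""

    def __init__(self, max_buffered=1000):
        self._am = None  # callable that delivers one message to the AM
        # Bounded queue: when full, the oldest buffered message is dropped.
        self._pending = deque(maxlen=max_buffered)

    def send(self, message):
        if self._am is None:
            self._pending.append(message)
        else:
            self._am(message)

    def unbind(self):
        self._am = None

    def rebind(self, am):
        self._am = am
        # Flush in the original order before any new message is sent.
        while self._pending:
            am(self._pending.popleft())


received = []
sender = BufferingSender()
sender.rebind(received.append)
sender.send("status-1")
sender.unbind()                 # AM goes down
sender.send("status-2")         # buffered
sender.send("status-3")         # buffered
sender.rebind(received.append)  # new attempt is reachable
print(received)
```

The bound on the queue is a design choice: an unbounded buffer would 
preserve every message but could grow without limit during a long AM outage.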



[jira] [Commented] (YARN-1489) [Umbrella] Work-preserving ApplicationMaster restart

2013-12-11 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13846072#comment-13846072
 ] 

Vinod Kumar Vavilapalli commented on YARN-1489:
---

bq. Would be good to see an overall design document..
Yup, writing something up..



[jira] [Commented] (YARN-1489) [Umbrella] Work-preserving ApplicationMaster restart

2013-12-10 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845124#comment-13845124
 ] 

Bikas Saha commented on YARN-1489:
--

Would be good to see an overall design document, especially for the tricky 
pieces like reconnecting existing running containers to new app attempts.
