[jira] [Updated] (MAPREDUCE-6726) YARN Registry based AM discovery with retry and in-flight task persistent via JHS

2016-11-15 Thread Srikanth Sampath (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Srikanth Sampath updated MAPREDUCE-6726:

Attachment: MAPREDUCE-6726-MAPREDUCE-6608.003.patch

Incorporating comments from [~jianhe].  I will split the JvmId changes out into 
a separate patch.

> YARN Registry based AM discovery with retry and in-flight task persistent via 
> JHS
> -
>
> Key: MAPREDUCE-6726
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6726
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: applicationmaster
>Reporter: Junping Du
>Assignee: Srikanth Sampath
> Attachments: MAPREDUCE-6726-MAPREDUCE-6608.001.patch, 
> MAPREDUCE-6726-MAPREDUCE-6608.001.patch, 
> MAPREDUCE-6726-MAPREDUCE-6608.002.patch, 
> MAPREDUCE-6726-MAPREDUCE-6608.003.patch, WorkPreservingMRAppMaster.pdf
>
>
> Several tasks will be achieved in this JIRA, based on the demo patch in 
> MAPREDUCE-6608:
> 1. AM discovery based on the YARN registry service. This could be replaced by 
> YARN-4758 later because of scale-up issues.
> 2. Retry logic for the TaskUmbilicalProtocol RPC connection.
> 3. In-flight task recovery after AM restart via JHS.
> 4. Configuration that keeps the behavior compatible with previous releases 
> when this feature is not enabled (the default).
> All security-related issues and other concerns discussed in MAPREDUCE-6608 
> will be addressed in follow-up JIRAs.
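
Items 1 and 4 above can be illustrated with a small sketch. It is not taken from
the attached patches: the in-memory registry, the class names and the
configuration key are all hypothetical, and the real YARN registry API is not
used. It only shows the intended shape, where a disabled flag (the default)
keeps the launch-time AM address and an enabled flag prefers the address
published by the latest AM attempt.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory stand-in for a registry mapping an application id to its current AM address. */
final class AmRegistrySketch {
    private final Map<String, String> records = new ConcurrentHashMap<>();

    /** A new AM attempt publishes (or overwrites) the address tasks should connect to. */
    void publish(String appId, String hostPort) {
        records.put(appId, hostPort);
    }

    Optional<String> lookup(String appId) {
        return Optional.ofNullable(records.get(appId));
    }
}

/** Resolves the AM address a task should use, honoring an off-by-default feature flag. */
final class AmAddressResolver {
    // Hypothetical key for illustration; not an actual MapReduce configuration property.
    static final String ENABLED_KEY = "example.mr.am.discovery.enabled";

    private final boolean enabled;
    private final AmRegistrySketch registry;
    private final String launchTimeAddress;

    AmAddressResolver(Map<String, String> conf, AmRegistrySketch registry, String launchTimeAddress) {
        // A missing key means "disabled", which preserves the previous behavior.
        this.enabled = Boolean.parseBoolean(conf.getOrDefault(ENABLED_KEY, "false"));
        this.registry = registry;
        this.launchTimeAddress = launchTimeAddress;
    }

    String resolve(String appId) {
        if (!enabled) {
            return launchTimeAddress; // old code path: address fixed at task launch
        }
        // New code path: prefer the address published by the latest AM attempt.
        return registry.lookup(appId).orElse(launchTimeAddress);
    }
}
```

Falling back to the launch-time address when no record exists is a design
choice of this sketch, not something the issue specifies.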



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Updated] (MAPREDUCE-6726) YARN Registry based AM discovery with retry and in-flight task persistent via JHS

2016-11-15 Thread Srikanth Sampath (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Srikanth Sampath updated MAPREDUCE-6726:

Attachment: MAPREDUCE-6726-MAPREDUCE-6608.003.patch

Addressing comments by [~jianhe].  I will split the JvmId changes out into a 
separate patch.

> YARN Registry based AM discovery with retry and in-flight task persistent via 
> JHS
> -
>
> Key: MAPREDUCE-6726
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6726
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: applicationmaster
>Reporter: Junping Du
>Assignee: Srikanth Sampath
> Attachments: MAPREDUCE-6726-MAPREDUCE-6608.001.patch, 
> MAPREDUCE-6726-MAPREDUCE-6608.001.patch, 
> MAPREDUCE-6726-MAPREDUCE-6608.002.patch, 
> MAPREDUCE-6726-MAPREDUCE-6608.003.patch, WorkPreservingMRAppMaster.pdf
>
>



[jira] [Updated] (MAPREDUCE-6726) YARN Registry based AM discovery with retry and in-flight task persistent via JHS

2016-11-15 Thread Srikanth Sampath (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Srikanth Sampath updated MAPREDUCE-6726:

Attachment: (was: MAPREDUCE-6726-MAPREDUCE-6608.003.patch)

> YARN Registry based AM discovery with retry and in-flight task persistent via 
> JHS
> -
>
> Key: MAPREDUCE-6726
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6726
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: applicationmaster
>Reporter: Junping Du
>Assignee: Srikanth Sampath
> Attachments: MAPREDUCE-6726-MAPREDUCE-6608.001.patch, 
> MAPREDUCE-6726-MAPREDUCE-6608.001.patch, 
> MAPREDUCE-6726-MAPREDUCE-6608.002.patch, WorkPreservingMRAppMaster.pdf
>
>



[jira] [Issue Comment Deleted] (MAPREDUCE-6726) YARN Registry based AM discovery with retry and in-flight task persistent via JHS

2016-11-15 Thread Srikanth Sampath (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Srikanth Sampath updated MAPREDUCE-6726:

Comment: was deleted

(was: Incorporating comments from [~jianhe].  Will cull out the JvmId changes 
into a separate patch)

> YARN Registry based AM discovery with retry and in-flight task persistent via 
> JHS
> -
>
> Key: MAPREDUCE-6726
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6726
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: applicationmaster
>Reporter: Junping Du
>Assignee: Srikanth Sampath
> Attachments: MAPREDUCE-6726-MAPREDUCE-6608.001.patch, 
> MAPREDUCE-6726-MAPREDUCE-6608.001.patch, 
> MAPREDUCE-6726-MAPREDUCE-6608.002.patch, 
> MAPREDUCE-6726-MAPREDUCE-6608.003.patch, WorkPreservingMRAppMaster.pdf
>
>



[jira] [Commented] (MAPREDUCE-6726) YARN Registry based AM discovery with retry and in-flight task persistent via JHS

2016-09-26 Thread Srikanth Sampath (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15525160#comment-15525160
 ] 

Srikanth Sampath commented on MAPREDUCE-6726:
-

Thanks [~jianhe] for your comments.  I will make the requested changes.

> YARN Registry based AM discovery with retry and in-flight task persistent via 
> JHS
> -
>
> Key: MAPREDUCE-6726
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6726
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: applicationmaster
>Reporter: Junping Du
>Assignee: Srikanth Sampath
> Attachments: MAPREDUCE-6726-MAPREDUCE-6608.001.patch, 
> MAPREDUCE-6726-MAPREDUCE-6608.001.patch, 
> MAPREDUCE-6726-MAPREDUCE-6608.002.patch, WorkPreservingMRAppMaster.pdf
>
>



[jira] [Updated] (MAPREDUCE-6726) YARN Registry based AM discovery with retry and in-flight task persistent via JHS

2016-09-21 Thread Srikanth Sampath (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Srikanth Sampath updated MAPREDUCE-6726:

Attachment: MAPREDUCE-6726-MAPREDUCE-6608.002.patch

Refactored, extending the JvmId to include the application attempt ID.

> YARN Registry based AM discovery with retry and in-flight task persistent via 
> JHS
> -
>
> Key: MAPREDUCE-6726
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6726
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: applicationmaster
>Reporter: Junping Du
>Assignee: Srikanth Sampath
> Attachments: MAPREDUCE-6726-MAPREDUCE-6608.001.patch, 
> MAPREDUCE-6726-MAPREDUCE-6608.001.patch, 
> MAPREDUCE-6726-MAPREDUCE-6608.002.patch, WorkPreservingMRAppMaster.pdf
>
>



[jira] [Commented] (MAPREDUCE-6754) Container Ids for an yarn application should be monotonically increasing in the scope of the application

2016-08-25 Thread Srikanth Sampath (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436757#comment-15436757
 ] 

Srikanth Sampath commented on MAPREDUCE-6754:
-

Thanks much [~jlowe], [~vinodkv], [~jianhe].  Given the discussion above, we 
can just add the attemptId to the JvmId for now.  I will consider removing 
JvmId altogether as a separate change.

> Container Ids for an yarn application should be monotonically increasing in 
> the scope of the application
> 
>
> Key: MAPREDUCE-6754
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6754
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>Reporter: Srikanth Sampath
>Assignee: Srikanth Sampath
>
> Currently, container IDs are reused across application attempts.  The 
> container ID is stored in AppSchedulingInfo and is reinitialized with every 
> application attempt, so the containerId scope is limited to the application 
> attempt.
> In the MR framework, it is important to note that the containerId is used as 
> part of the JvmId.  The JvmId has 3 components, the last of which is the 
> containerId.  The JvmId is used in data structures in TaskAttemptListener and 
> is passed between the AppMaster and the individual tasks.  Within an 
> application attempt, no two tasks have the same JvmId.
> This works well currently, since in-flight tasks get killed if the AppMaster 
> goes down.  However, if we want to enable work-preserving behavior for the AM, 
> containers (and hence containerIds) live across application attempts.  If we 
> recycle containerIds across attempts, then two independent tasks (one in 
> flight from a previous attempt and another new in a succeeding attempt) can 
> have the same JvmId and cause havoc.
> This can be solved in two ways:
> *Approach A*: Include the attempt ID as part of the JvmId. This is a viable 
> solution; however, it changes the format of the JvmId. Changing something 
> that has existed for so long for the sake of an optional feature is not 
> persuasive.
> *Approach B*: Keep the container ID monotonically increasing for the life of 
> an application, so that container IDs are not reused across application 
> attempts; containers should be able to outlive an application attempt. This 
> is the preferred approach as it is clean, logical and backwards compatible. 
> Nothing changes for existing applications or the internal workings.  
> *How this is achieved:*
> Currently, we maintain the latest containerId only per application attempt 
> and reinitialize it for new attempts.  With this approach, we retrieve the 
> latest containerId from the just-failed attempt and initialize the new 
> attempt with that containerId (instead of 0).  I can provide the patch if it 
> helps.  It currently exists in MAPREDUCE-6726.






[jira] [Updated] (MAPREDUCE-6754) Container Ids for an yarn application should be monotonically increasing in the scope of the application

2016-08-12 Thread Srikanth Sampath (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Srikanth Sampath updated MAPREDUCE-6754:

Summary: Container Ids for an yarn application should be monotonically 
increasing in the scope of the application  (was: Container Ids for a yarn 
application should be monotonically increasing in the scope of the application)

> Container Ids for an yarn application should be monotonically increasing in 
> the scope of the application
> 
>
> Key: MAPREDUCE-6754
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6754
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>Reporter: Srikanth Sampath
>Assignee: Srikanth Sampath
>



[jira] [Created] (MAPREDUCE-6754) Container Ids for a yarn application should be monotonically increasing in the scope of the application

2016-08-12 Thread Srikanth Sampath (JIRA)
Srikanth Sampath created MAPREDUCE-6754:
---

 Summary: Container Ids for a yarn application should be 
monotonically increasing in the scope of the application
 Key: MAPREDUCE-6754
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6754
 Project: Hadoop Map/Reduce
  Issue Type: Sub-task
Reporter: Srikanth Sampath
Assignee: Srikanth Sampath


Currently, container IDs are reused across application attempts.  The container 
ID is stored in AppSchedulingInfo and is reinitialized with every application 
attempt, so the containerId scope is limited to the application attempt.

In the MR framework, it is important to note that the containerId is used as 
part of the JvmId.  The JvmId has 3 components, the last of which is the 
containerId.  The JvmId is used in data structures in TaskAttemptListener and 
is passed between the AppMaster and the individual tasks.  Within an 
application attempt, no two tasks have the same JvmId.

This works well currently, since in-flight tasks get killed if the AppMaster 
goes down.  However, if we want to enable work-preserving behavior for the AM, 
containers (and hence containerIds) live across application attempts.  If we 
recycle containerIds across attempts, then two independent tasks (one in flight 
from a previous attempt and another new in a succeeding attempt) can have the 
same JvmId and cause havoc.

This can be solved in two ways:

*Approach A*: Include the attempt ID as part of the JvmId. This is a viable 
solution; however, it changes the format of the JvmId. Changing something that 
has existed for so long for the sake of an optional feature is not persuasive.

*Approach B*: Keep the container ID monotonically increasing for the life of an 
application, so that container IDs are not reused across application attempts; 
containers should be able to outlive an application attempt. This is the 
preferred approach as it is clean, logical and backwards compatible. Nothing 
changes for existing applications or the internal workings.  

*How this is achieved:*
Currently, we maintain the latest containerId only per application attempt and 
reinitialize it for new attempts.  With this approach, we retrieve the latest 
containerId from the just-failed attempt and initialize the new attempt with 
that containerId (instead of 0).  I can provide the patch if it helps.  It 
currently exists in MAPREDUCE-6726.
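
The seeding described under "How this is achieved" can be sketched as follows.
This is a minimal illustration, not the code in the patch: the
ContainerIdAllocator class and its method names are hypothetical and merely
stand in for the per-attempt counter. The first attempt starts from 0, and each
later attempt continues from the latest ID of the attempt that just failed, so
an ID handed out in one attempt is never handed out again in a later one.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical per-attempt allocator; stands in for the counter kept per application attempt. */
final class ContainerIdAllocator {
    private final AtomicLong counter;

    private ContainerIdAllocator(long seed) {
        this.counter = new AtomicLong(seed);
    }

    /** First attempt: counting starts from 0, as it does today. */
    static ContainerIdAllocator forFirstAttempt() {
        return new ContainerIdAllocator(0L);
    }

    /** Later attempt: continue from where the just-failed attempt stopped. */
    static ContainerIdAllocator forNextAttempt(ContainerIdAllocator failedAttempt) {
        return new ContainerIdAllocator(failedAttempt.latest());
    }

    /** Hands out the next container id for this attempt. */
    long next() {
        return counter.incrementAndGet();
    }

    /** The most recently allocated id, or the seed if nothing was allocated yet. */
    long latest() {
        return counter.get();
    }
}
```

With this seeding, a container still running from the previous attempt and a
container started by the new attempt cannot share an ID, so their JvmIds differ
without any change to the JvmId format.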






[jira] [Updated] (MAPREDUCE-6726) YARN Registry based AM discovery with retry and in-flight task persistent via JHS

2016-08-07 Thread Srikanth Sampath (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Srikanth Sampath updated MAPREDUCE-6726:

Attachment: WorkPreservingMRAppMaster.pdf

Design document that outlines some of the issues, alternatives and solutions.

> YARN Registry based AM discovery with retry and in-flight task persistent via 
> JHS
> -
>
> Key: MAPREDUCE-6726
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6726
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: applicationmaster
>Reporter: Junping Du
>Assignee: Srikanth Sampath
> Attachments: MAPREDUCE-6726-MAPREDUCE-6608.001.patch, 
> WorkPreservingMRAppMaster.pdf
>
>



[jira] [Updated] (MAPREDUCE-6726) YARN Registry based AM discovery with retry and in-flight task persistent via JHS

2016-07-31 Thread Srikanth Sampath (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Srikanth Sampath updated MAPREDUCE-6726:

Attachment: MAPREDUCE-6726-MAPREDUCE-6608.001.patch

SS_DEBUG - Log statements that will be retained until the implementation and 
stabilization are complete.  They will then either be removed or converted to 
regular log messages.
SS_FIXME - Known issues that will be addressed in subsequent patches.

I will update the design doc on MAPREDUCE-6608

> YARN Registry based AM discovery with retry and in-flight task persistent via 
> JHS
> -
>
> Key: MAPREDUCE-6726
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6726
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: applicationmaster
>Reporter: Junping Du
>Assignee: Srikanth Sampath
> Attachments: MAPREDUCE-6726-MAPREDUCE-6608.001.patch
>
>



[jira] [Issue Comment Deleted] (MAPREDUCE-6726) YARN Registry based AM discovery with retry and in-flight task persistent via JHS

2016-07-31 Thread Srikanth Sampath (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Srikanth Sampath updated MAPREDUCE-6726:

Comment: was deleted

(was: SS_DEBUG - Log statements that will be retained until the implementation 
and stabilization are complete.  They will then either be removed or converted 
to regular log messages.
SS_FIXME - Known issues that will be addressed in subsequent patches.

I will update the design doc on MAPREDUCE-6608)

> YARN Registry based AM discovery with retry and in-flight task persistent via 
> JHS
> -
>
> Key: MAPREDUCE-6726
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6726
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: applicationmaster
>Reporter: Junping Du
>Assignee: Srikanth Sampath
> Attachments: MAPREDUCE-6726-MAPREDUCE-6608.001.patch
>
>



[jira] [Updated] (MAPREDUCE-6726) YARN Registry based AM discovery with retry and in-flight task persistent via JHS

2016-07-31 Thread Srikanth Sampath (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Srikanth Sampath updated MAPREDUCE-6726:

Status: Patch Available  (was: Open)

SS_DEBUG - Log statements that will be retained until the implementation and 
stabilization are complete.  They will then either be removed or converted to 
regular log messages.
SS_FIXME - Known issues that will be addressed in subsequent patches.

I will update the design doc on MAPREDUCE-6608

> YARN Registry based AM discovery with retry and in-flight task persistent via 
> JHS
> -
>
> Key: MAPREDUCE-6726
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6726
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: applicationmaster
>Reporter: Junping Du
>Assignee: Srikanth Sampath
>



[jira] [Commented] (MAPREDUCE-6608) Work Preserving AM Restart for MapReduce

2016-04-17 Thread Srikanth Sampath (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15244727#comment-15244727
 ] 

Srikanth Sampath commented on MAPREDUCE-6608:
-

Thanks [~vinodkv] for your comments.  I will dig deeper into these and respond. 
 

> Work Preserving AM Restart for MapReduce
> 
>
> Key: MAPREDUCE-6608
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6608
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Srikanth Sampath
>Assignee: Srikanth Sampath
> Attachments: Patch1.patch, WorkPreservingMRAppMaster-1.pdf, 
> WorkPreservingMRAppMaster-2.pdf, WorkPreservingMRAppMaster.pdf
>
>
> A framework for work-preserving AM restart is provided by 
> [YARN-1489|https://issues.apache.org/jira/browse/YARN-1489].  We would like 
> to take advantage of this for MapReduce (MR) applications.  There are some 
> challenges, which are described in the attached document along with a few 
> options.  We solicit feedback from the community.





[jira] [Commented] (MAPREDUCE-6608) Work Preserving AM Restart for MapReduce

2016-02-25 Thread Srikanth Sampath (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15167140#comment-15167140
 ] 

Srikanth Sampath commented on MAPREDUCE-6608:
-

Thanks [~djp].  A few points:
1. The reads will only be for the in-flight tasks of the large MR job.  That 
said, the number can still be large, for example when multiple AMs fail.

2. The read path in this case is required for communication between the task 
containers and the AM (not between task containers).  So it is indeed a subset 
of the cases addressed in 
[YARN-4602|https://issues.apache.org/jira/browse/YARN-4602].

3. We need more details on how 
[YARN-4602|https://issues.apache.org/jira/browse/YARN-4602] will be addressed.  
What is the latency for the payload to make it from the new AM to the registry 
(RM) and then to the NMs?  How will the task containers fetch the new address?  
Should we still keep the registry-based read path as a fallback?

I will be very happy to work with you in parallel on this.  

> Work Preserving AM Restart for MapReduce
> 
>
> Key: MAPREDUCE-6608
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6608
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Srikanth Sampath
>Assignee: Srikanth Sampath
> Attachments: Patch1.patch, WorkPreservingMRAppMaster-1.pdf, 
> WorkPreservingMRAppMaster-2.pdf, WorkPreservingMRAppMaster.pdf
>
>


[jira] [Commented] (MAPREDUCE-6608) Work Preserving AM Restart for MapReduce

2016-02-24 Thread Srikanth Sampath (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15162887#comment-15162887
 ] 

Srikanth Sampath commented on MAPREDUCE-6608:
-

Thanks much [~djp] for your review and comments.  Appreciate it very much. 

*Issue 1*
{quote}+1 on Vinod's proposal of separating write and read path.{quote}
I agree and will log a separate YARN JIRA.  Do you think that effort should be 
linked to this work, or can it be done separately and incorporated later?  With 
your suggested optimization of using the service record only for attempts 
other than the first, there will be far fewer reads.  

*Issue 2*
{quote} We can involve a new MR config to switch on/off this feature (off by 
default). However, I didn't see any implementation on this in demo patch {quote}
Yes, not in the demo patch.  I will add it in the next version and also 
maintain the old code path when the configuration is off (the default).

{quote} Beside we need to replace the read path of registry service, another 
point is we don't necessary to keep the first attempt AM info which could 
saving most of overhead we are adding here as most applications are expected to 
end with single attempt. Isn't it? {quote}
Yes.  That's correct.  Very good suggestion.

{quote}Agree that named argument sounds better. However, this way has work for 
a long time for MapReduce project and we won't prefer to change unless we find 
some issue or bug. For path to service record, we need keep consistent with our 
decision on read path. {quote}
I think named arguments are better.  If we end up changing the interface of 
YarnChild, I think we should make that change as well.  It depends on what we 
decide on *Issue 1*.

{quote}UmbilicalWithRetries should follow other existing practice (for RPC 
client retry during service down time) that to create a RetryProxy with 
FailoverProxyProvider (may be call it as MRAMProxy) for task attempt to contact 
with new attempt instance for AM.{quote}
Thanks much for this very useful suggestion.  I will incorporate it.
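
To make the retry-and-failover idea concrete, here is a minimal sketch of its
general shape. It deliberately does not use Hadoop's RetryProxy or
FailoverProxyProvider; it uses a plain java.lang.reflect.Proxy instead. The
UmbilicalSketch interface, the RetryingUmbilical class and the connector
supplier are hypothetical names introduced only for illustration. On an
IOException the handler drops the current connection and asks the connector for
a fresh one, which is where a task would look up the address of the new AM
attempt.

```java
import java.io.IOException;
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.function.Supplier;

/** Hypothetical, cut-down umbilical interface; the real protocol has many more methods. */
interface UmbilicalSketch {
    boolean ping(String taskAttemptId) throws IOException;
}

/** Retries a failed call against a freshly resolved AM connection. */
final class RetryingUmbilical implements InvocationHandler {
    private final Supplier<UmbilicalSketch> connector; // re-resolves the current AM and connects
    private final int maxAttempts;
    private UmbilicalSketch current;

    private RetryingUmbilical(Supplier<UmbilicalSketch> connector, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        this.connector = connector;
        this.maxAttempts = maxAttempts;
    }

    static UmbilicalSketch create(Supplier<UmbilicalSketch> connector, int maxAttempts) {
        return (UmbilicalSketch) Proxy.newProxyInstance(
                UmbilicalSketch.class.getClassLoader(),
                new Class<?>[] {UmbilicalSketch.class},
                new RetryingUmbilical(connector, maxAttempts));
    }

    @Override
    public synchronized Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
        IOException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                if (current == null) {
                    current = connector.get(); // fail over to whichever AM is current now
                }
                return method.invoke(current, args);
            } catch (InvocationTargetException e) {
                if (!(e.getCause() instanceof IOException)) {
                    throw e.getCause(); // only connection-level failures are retried
                }
                lastFailure = (IOException) e.getCause();
                current = null; // force a fresh lookup on the next attempt
                // A real client would wait or back off here before retrying.
            }
        }
        throw lastFailure;
    }
}
```

A task would wrap its umbilical once, for example with
RetryingUmbilical.create(connector, 3), and keep calling it as before; the retry
count of 3 is an arbitrary value chosen for the example.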

Please let me know your recommendation on *Issue 1*

> Work Preserving AM Restart for MapReduce
> 
>
> Key: MAPREDUCE-6608
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6608
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Srikanth Sampath
>Assignee: Srikanth Sampath
> Attachments: Patch1.patch, WorkPreservingMRAppMaster-1.pdf, 
> WorkPreservingMRAppMaster-2.pdf, WorkPreservingMRAppMaster.pdf
>
>
> A framework for work-preserving AM restart is provided by 
> [YARN-1489|https://issues.apache.org/jira/browse/YARN-1489].  We would like 
> to take advantage of this for MapReduce (MR) applications.  The attached 
> document describes some of the challenges and discusses a few options.  We 
> solicit feedback from the community.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAPREDUCE-6608) Work Preserving AM Restart for MapReduce

2016-02-07 Thread Srikanth Sampath (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Srikanth Sampath updated MAPREDUCE-6608:

Attachment: WorkPreservingMRAppMaster-2.pdf

Updated high level design



[jira] [Commented] (MAPREDUCE-6608) Work Preserving AM Restart for MapReduce

2016-02-07 Thread Srikanth Sampath (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15136279#comment-15136279
 ] 

Srikanth Sampath commented on MAPREDUCE-6608:
-

I have attached a design patch, 
[Patch1|https://issues.apache.org/jira/secure/attachment/12786705/Patch1.patch], 
that outlines the implementation approach at a high level.  The 
[Design|https://issues.apache.org/jira/secure/attachment/12786706/WorkPreservingMRAppMaster-2.pdf]
 document gives the high-level design.

*Notes:*
1. This is a patch against Apache Hadoop 2.6.1.
2. It works for the example Hadoop sleep job: I killed the AM at random points 
and the in-flight tasks continued running.
3. SS_DEBUG in the patch marks debug statements that I find helpful. Some of 
these will be removed eventually.
4. SS_FIXME in the patch tags known issues, described in the accompanying 
comments, that I still need to fix.  I will clean these up before the next 
submission.

I solicit comments on the high level design and the approach I have taken in 
the patch.

*Next Steps:*
1. I will iron out the known issues (all SS_FIXME), clean up the interfaces, 
make the code compliant with Apache coding standards, rebase the code against 
trunk, and test it thoroughly.  I will factor in the comments and suggestions 
made on the design doc and design patch.
2. I will identify the components and issues involved and raise sub-tasks.



[jira] [Updated] (MAPREDUCE-6608) Work Preserving AM Restart for MapReduce

2016-02-07 Thread Srikanth Sampath (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Srikanth Sampath updated MAPREDUCE-6608:

Attachment: Patch1.patch

> Work Preserving AM Restart for MapReduce
> 
>
> Key: MAPREDUCE-6608
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6608
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Reporter: Srikanth Sampath
> Assignee: Srikanth Sampath
> Attachments: Patch1.patch, WorkPreservingMRAppMaster-1.pdf, 
> WorkPreservingMRAppMaster.pdf


[jira] [Commented] (MAPREDUCE-6608) Work Preserving AM Restart for MapReduce

2016-02-02 Thread Srikanth Sampath (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15128785#comment-15128785
 ] 

Srikanth Sampath commented on MAPREDUCE-6608:
-

I have attached the high-level approach used in the solution that is in the 
works.  I will follow up with a design patch and a detailed design soon.

> Work Preserving AM Restart for MapReduce
> 
>
> Key: MAPREDUCE-6608
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6608
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Reporter: Srikanth Sampath
> Assignee: Srikanth Sampath
> Attachments: WorkPreservingMRAppMaster-1.pdf, 
> WorkPreservingMRAppMaster.pdf


[jira] [Updated] (MAPREDUCE-6608) Work Preserving AM Restart for MapReduce

2016-02-02 Thread Srikanth Sampath (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Srikanth Sampath updated MAPREDUCE-6608:

Attachment: WorkPreservingMRAppMaster-1.pdf

High Level Approach 



[jira] [Commented] (MAPREDUCE-6608) Work Preserving AM Restart for MapReduce

2016-01-18 Thread Srikanth Sampath (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106254#comment-15106254
 ] 

Srikanth Sampath commented on MAPREDUCE-6608:
-

Thanks [~djp] for your comments.  Yes, there are indeed more issues with it.  I 
have a working version that uses the registry.  I have been in discussion with 
[~vvasudev].  There are some loose ends, which I will tie up before uploading a 
patch and an updated design next week.


> Work Preserving AM Restart for MapReduce
> 
>
> Key: MAPREDUCE-6608
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6608
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Reporter: Srikanth Sampath
> Assignee: Srikanth Sampath
> Attachments: WorkPreservingMRAppMaster.pdf