[jira] [Commented] (YARN-5357) Timeline service v2 integration with Federation

2023-08-17 Thread Jiandan Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-5357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17755773#comment-17755773
 ] 

Jiandan Yang  commented on YARN-5357:
-

Hello, [~abmodi] [~vrushalic]

I've noticed that this JIRA hasn't seen any updates in over four years. I'm
interested in continuing the development on this issue. Are you still
following or working on this topic?

I'd like to build upon the discussions that have taken place here. If anyone 
involved in the previous discussions could provide or point me to the proposed 
solutions or strategies discussed earlier, that would be immensely helpful.

I look forward to your insights and feedback as I work on moving this issue
forward.

Thank you for your time and consideration.

> Timeline service v2 integration with Federation 
> 
>
> Key: YARN-5357
> URL: https://issues.apache.org/jira/browse/YARN-5357
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Vrushali C
>Assignee: Abhishek Modi
>Priority: Major
>
> Jira to note the discussion points from an initial chat about integrating 
> Timeline Service v2 with Federation (YARN-2915).
> cc [~subru] [~curino] 
> For Federation:
> - all entities that belong to the same flow run should have the same cluster 
> name
> - app id in the same flow run strongly ordered in time
> - need a logical cluster name and physical cluster name
> - a possibility to implement the Application TimelineCollector as an 
> interceptor in the AMRMProxyService.
> For Timeline Service:
> - need to store physical cluster id and logical cluster id so that we don't 
> lose information at any level (flow/app/entity etc)
> - add a new app id to cluster mapping table
> - need a different entity table/some table to store node level metrics for 
> physical cluster stats. Once we get to node-level rollup, we probably have to 
> store something in a dc, cluster, rack, node hierarchy. In that case a 
> physical cluster makes sense, but we'd still need some way to tie physical 
> and logical together in order to make automatic error detection etc that 
> we're envisioning feasible within a federated setup.
> For the Cluster Naming convention:
> - three situations for cluster name:
> > app submitted to router should take federated (aka logical) cluster name
> > app submitted directly to RM should take physical cluster name
> > Info about the physical cluster  in entities?
> - suggestion to set the cluster name as yarn tag at the router level (in the 
> app submission context) 
> Other points to note:
> - for federation to work smoothly in environments that use HDFS some 
> additional considerations are needed, and possibly some solution like what is 
> being used at Twitter with the nFly approach.
> Email thread context:
> {code}
> -- Forwarded message --
> From: Joep Rottinghuis 
> Date: Fri, Jul 8, 2016 at 1:22 PM
> Subject: Re: Federation -Timeline Service meeting notes
> To: Subramaniam Venkatraman Krishnan 
> Cc: Sangjin Lee, Vrushali Channapattan , Carlo Curino
> Thanks for the notes.
> I think that for federation to work smoothly in environments that use HDFS 
> some additional considerations are needed, and possibly some solution like 
> what we're using at Twitter with our nFly approach.
> bq. - need a different entity table/some table to store node level metrics 
> for physical cluster stats
> Once we get to node-level rollup, we probably have to store something in a 
> dc, cluster, rack, node hierarchy. In that case a physical cluster makes 
> sense, but we'd still need some way to tie physical and logical together in 
> order to make automatic error detection etc that we're envisioning feasible 
> within a federated setup.
> Cheers,
> Joep
> On Fri, Jul 8, 2016 at 1:00 PM, Subramaniam Venkatraman Krishnan  wrote:
> Thanks Vrushali for crisply capturing the essentials from our rambling
> discussion :).
>  
> Sangjin, I just want to add one comment to yours: we want to retain the
> physical cluster name (possibly as a new entity type) so that we don’t lose
> information & we can do cluster-level rollups even if they are not efficient.
>  
> Additionally, based on the walkthrough of Federation design:
> - There was general agreement with the proposed approach.
> - There is a possibility to implement the Application TimelineCollector as
> an interceptor in the AMRMProxyService.
> - Joep raised the concern that it would be better if the RMs obtain the
> epoch from FederationStateStore. This is not currently in the roadmap of
> our MVP but we definitely plan to address this in future.
>  
> Regards,
> Subru
>  
> From: Sangjin Lee

[jira] [Commented] (YARN-5357) Timeline service v2 integration with Federation

2019-01-29 Thread Vrushali C (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-5357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755751#comment-16755751
 ] 

Vrushali C commented on YARN-5357:
--

Notes from our discussion around this. Abhishek, Sushil, Prabha, Rohith and I 
attended the discussion call. 

Current situation:
The Timeline Service collector's address & port are sent from the RM to the
NMs that run containers for this app, so that each NM can publish its system
metrics. In federated YARN clusters, an unmanaged AM (UAM) is used to
coordinate with containers running in foreign sub-clusters, so the NMs in
those sub-clusters do not have the timeline service collector address.

This is not specific to federation but is a design item to be solved for any 
unmanaged AM. 

We discussed a few approaches. One was to allow the AM itself to send the
timeline service address & port to the foreign subclusters' RMs, which would
then propagate it to their NMs. But this has some issues. A malicious AM might
try to "game" the system by sending a dummy collector address to some NMs.
Those NMs' system metrics would then silently disappear before being reported,
and chargeback & accountability would therefore become incorrect.

Another idea, proposed by Abhishek, was to pick a random NM on which to start
a timeline-service collector when the AM is unmanaged. He will work on the
design further and discuss it with his teammates, and we can all review it
once he has consolidated his thoughts.



> Timeline service v2 integration with Federation 
> 
>
> Key: YARN-5357
> URL: https://issues.apache.org/jira/browse/YARN-5357
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Vrushali C
>Assignee: Prabha Manepalli
>Priority: Major
>

[jira] [Commented] (YARN-5357) Timeline service v2 integration with Federation

2019-01-03 Thread Abhishek Modi (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-5357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733883#comment-16733883
 ] 

Abhishek Modi commented on YARN-5357:
-

One more approach that we discussed was using AMRMProxyService to update
foreign RMs about the timeline collector info for application masters running
on this NodeManager. The flow would be something like this:
1. The AM gets launched on the node. The NM starts the timeline collector
service and updates the home subcluster RM and the AM with the collector info.
2. When the AM requests a container from a foreign subcluster,
AMRMProxyService will create a UAM and register it with the foreign RM.
3. As part of the registration request, AMRMProxyService will send the
collector info for this application.
4. The foreign RM will update the set of collector info it maintains and will
send the collector info to the NMs where containers for this application
start.
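
The four steps can be sketched as a toy simulation. This is only an illustration of the proposed flow, not YARN code: the class names, the register_uam and start_container methods, the application id and the collector address are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    name: str
    # app id -> collector address this NM has learned from its RM
    collectors: dict = field(default_factory=dict)


@dataclass
class ResourceManager:
    subcluster: str
    # app id -> collector address (the "set of collector info" in step 4)
    collectors: dict = field(default_factory=dict)

    def register_uam(self, app_id, collector_addr):
        # Step 3: the UAM registration request carries the collector info.
        self.collectors[app_id] = collector_addr

    def start_container(self, app_id, nm):
        # Step 4: pass the collector info to the NM that runs the container.
        if app_id in self.collectors:
            nm.collectors[app_id] = self.collectors[app_id]


home_rm = ResourceManager("A")
foreign_rm = ResourceManager("B")
foreign_nm = NodeManager("nm-b1")

app_id = "application_1_0001"   # hypothetical application id
collector_addr = "nm-a1:12345"  # hypothetical collector host:port

# Step 1: the home NM starts the collector and informs the home RM.
home_rm.collectors[app_id] = collector_addr
# Steps 2 and 3: AMRMProxyService registers a UAM with the foreign RM,
# sending the collector info along with the registration.
foreign_rm.register_uam(app_id, collector_addr)
# Step 4: the foreign RM forwards the info when a container starts.
foreign_rm.start_container(app_id, foreign_nm)
```

In this sketch the foreign NM only learns the collector address through its own RM, which is the property the proposal relies on.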



[jira] [Commented] (YARN-5357) Timeline service v2 integration with Federation

2018-12-30 Thread Abhishek Modi (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-5357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16731210#comment-16731210
 ] 

Abhishek Modi commented on YARN-5357:
-

I checked this further and found that not all ATSv2 events are published in a
federated setup. Right now, the NM where the AM is launched tells the RM about
the collector info in its heartbeat, and the RM then sends this to all the NMs
where containers are launched for that application. With federation, the NM
only informs the home subcluster RM, but containers for the application can
also be launched in other subclusters. NMs in those subclusters never learn
the collector info of the application and thus cannot publish their events. We
need to figure out some workaround for this problem. One option is for the AM
to tell the RM about the collector info in the AMRM communication.
Thoughts?

cc [~vrushalic] [~rohithsharma]


[jira] [Commented] (YARN-5357) Timeline service v2 integration with Federation

2018-08-03 Thread Vrushali C (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-5357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568085#comment-16568085
 ] 

Vrushali C commented on YARN-5357:
--

Update:
[~prabham] [~Sushil-K-S] [~abmodi] [~rohithsharma] and I have been discussing 
federation integration with timeline service in the recent community calls.

Here is a summary of our current assumptions, understanding and direction:
- There exists a managed AM which we will refer to as 'original AM' for this
discussion
- There exist sub cluster ids which are physical cluster ids defined in
yarn-site.xml
- Unmanaged AMs which coordinate containers in other subclusters run in the JVM
context of the Node Manager which manages the original AM.
- The same app id is used on all sub clusters
- Containers launched on other sub clusters will have different epoch ids.

Current thought process:
- As per our current understanding, the flow context is initialized once and 
then used by all containers launched as part of that application.
- Say an application is launched on sub Cluster A and runs containers on sub 
clusters A, B and C.
- Entities from containers launched on the sub cluster A will have the cluster 
id as sub cluster A in their row key
- The containers that run on different sub clusters B and C will also have the 
cluster id as sub cluster A.
- The containers will be updated to emit to timeline storage the actual
physical sub cluster that they run on. For instance, containers that run on sub
clusters B and C will have sub cluster A in their row key but will store sub
cluster B (or C, as the case may be) in their info column family.
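
A minimal sketch of this scheme, assuming a simplified row key and a made-up info key. The "!" separator, the reduced key layout and the PHYSICAL_SUBCLUSTER name are assumptions for illustration and do not reflect the actual Timeline Service schema.

```python
def make_entity_row(home_subcluster, physical_subcluster, app_id, entity_id):
    """Return (row_key, info) for one entity under the scheme described above."""
    # The row key always carries the sub cluster where the application was
    # launched, so all entities of one application share one cluster id.
    row_key = "!".join([home_subcluster, app_id, entity_id])
    # The sub cluster the container actually ran on is stored in info.
    info = {"PHYSICAL_SUBCLUSTER": physical_subcluster}  # hypothetical key
    return row_key, info


# Application launched on sub cluster A, container running on sub cluster B.
row_key, info = make_entity_row("A", "B", "application_1_0001", "container_02")
```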

This ensures that all entities belonging to one application have one cluster
identifier (sub cluster A). Storing sub cluster B or C for entities emitted
from containers that run on those sub clusters enables answering
federation-related queries such as: How many containers from this application
id ran on subCluster B? What metrics are associated with entities run on
subCluster C? What were the average (or min/max/median) runtimes for entities
of this application on sub clusters A, B and C?

Adding the physical sub cluster as a column / value in info also helps 
differentiate between application entities that belong to applications that run 
on just one sub cluster versus entities that belong to applications that run on 
multiple sub clusters. For entities that belong to applications that run on 
exactly one subcluster, this field will have just one physical cluster id which 
will be the same as the cluster id in the row key. 
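
As a rough illustration of the queries this would enable, here is a sketch over in-memory records. The record layout, the PHYSICAL_SUBCLUSTER key and the sample data are hypothetical; a real reader would query timeline storage instead.

```python
from collections import Counter

# Hypothetical entities: "cluster" is the id from the row key, while
# info["PHYSICAL_SUBCLUSTER"] is where the container actually ran.
entities = [
    {"cluster": "A", "app": "app_1", "id": "c1", "info": {"PHYSICAL_SUBCLUSTER": "A"}},
    {"cluster": "A", "app": "app_1", "id": "c2", "info": {"PHYSICAL_SUBCLUSTER": "B"}},
    {"cluster": "A", "app": "app_1", "id": "c3", "info": {"PHYSICAL_SUBCLUSTER": "B"}},
    {"cluster": "A", "app": "app_1", "id": "c4", "info": {"PHYSICAL_SUBCLUSTER": "C"}},
    {"cluster": "A", "app": "app_2", "id": "c5", "info": {"PHYSICAL_SUBCLUSTER": "A"}},
]


def containers_per_subcluster(records, app_id):
    # How many containers of this application ran on each sub cluster?
    return Counter(
        r["info"]["PHYSICAL_SUBCLUSTER"] for r in records if r["app"] == app_id
    )


def spans_multiple_subclusters(records, app_id):
    # True if the application's entities ran on more than one sub cluster.
    return len(containers_per_subcluster(records, app_id)) > 1
```

With this sample data, app_1 spans sub clusters A, B and C, while app_2 ran only on its home sub cluster A.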

We also need to store the logical federated cluster name, which will be a new
config variable added in YARN-5358.
