Zhijie Shen commented on YARN-3044:

Before screening the patch details, I have some high-level comments:

bq. IIUC you meant we will have RMContainerEntity having type as 
YARN_RM_CONTAINER and NMContainerEntity having type as YARN_NM_CONTAINER right ?

Can we use a single ContainerEntity? The events from the RM would be 
RM_XXXX_EVENT, and those from the NM would be NM_XXXX_EVENT.

bq. I'm very much concerned about the volume of writes that the RM collector 
would need to do,
bq. I fully understand the concern from Sangjin Lee that RM may not afford tens 
of thousands containers in large size cluster.

I also think publishing all container lifecycle events from the NM is likely to 
be costly in total, but I'd like to offer another point of view. Say we have a 
big cluster that can run 5,000 concurrent containers. The RM has to maintain 
the lifecycle of these 5K containers, and I don't think a less powerful server 
can manage that, right? Assuming we have such a powerful server to run the RM 
of a big cluster, will publishing lifecycle events be a big deal for that 
server? I'm not sure, but I can provide some hints. Right now each container 
writes 2 events per lifecycle; perhaps in the future we will want to record 
each state transition, resulting in ~10 events per lifecycle. That gives 10 * 
5K = 50K lifecycle events, and they won't be written at the same moment because 
containers' lifecycles are usually asynchronous. Let's assume each container 
runs for 1h and lifecycle events are uniformly distributed: then there will be 
only around 14 writes per second (50K events / 3,600s), on a powerful server.
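
To make the estimate above easy to re-run with other cluster sizes, here is a minimal sketch of the same arithmetic. The class and method names are hypothetical (they are not part of YARN), and the inputs are only the illustrative figures from this comment: 5,000 concurrent containers, 2 or ~10 events per lifecycle, and a 1h lifetime with uniformly distributed events.

```java
// Back-of-envelope estimate only; not YARN code. All inputs are the
// illustrative figures used in the comment above.
final class LifecycleWriteRate {

    /**
     * Average timeline writes per second, assuming every container emits
     * eventsPerLifecycle events spread uniformly over lifetimeSeconds.
     */
    static double writesPerSecond(int concurrentContainers,
                                  int eventsPerLifecycle,
                                  long lifetimeSeconds) {
        if (lifetimeSeconds <= 0) {
            throw new IllegalArgumentException("lifetimeSeconds must be positive");
        }
        long totalEvents = (long) concurrentContainers * eventsPerLifecycle;
        return (double) totalEvents / lifetimeSeconds;
    }

    public static void main(String[] args) {
        // Current behaviour: 2 events per container lifecycle.
        System.out.printf("2 events/lifecycle:  %.1f writes/s%n",
                writesPerSecond(5_000, 2, 3_600));
        // Possible future: one event per state transition, ~10 per lifecycle.
        System.out.printf("10 events/lifecycle: %.1f writes/s%n",
                writesPerSecond(5_000, 10, 3_600));
    }
}
```

With these inputs the rate works out to roughly 2.8 writes per second today and roughly 13.9 with ~10 events per lifecycle. Shorter-lived containers raise the rate proportionally, so the 1h lifetime is the assumption worth checking first.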

I think we may be overestimating the performance impact of writing NM 
lifecycles. Perhaps a more reasonable performance metric is {{cost of writing 
lifecycle events per container / cost of managing lifecycle per container * 
100%}}. For example, if it is 2%, I guess that would probably be acceptable.
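
As a rough sketch of how that ratio could be computed once both costs have been measured: the class name, method name, and the sample costs below are hypothetical placeholders, not measurements from any cluster.

```java
// Sketch of the proposed metric; not YARN code. The costs can be in any
// unit (CPU time, wall-clock time, ...) as long as both use the same one.
final class PublishOverhead {

    /** Cost of writing lifecycle events relative to managing the lifecycle, in percent. */
    static double overheadPercent(double publishCostPerContainer,
                                  double manageCostPerContainer) {
        if (manageCostPerContainer <= 0) {
            throw new IllegalArgumentException("manageCostPerContainer must be positive");
        }
        return publishCostPerContainer / manageCostPerContainer * 100.0;
    }

    public static void main(String[] args) {
        // Hypothetical numbers: 0.2 ms spent publishing vs 10 ms spent managing.
        System.out.println(overheadPercent(0.2, 10.0) + "% overhead");
    }
}
```

The hypothetical inputs above correspond to the 2% case mentioned in the comment.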

bq. all configs will not be set as part of this so was there more planned for 
this from the framework side or each application needs to take care of this on 
their own to populate configuration information ?
bq. In that sense, how about letting frameworks (namely AMs) write the 
configuration instead of RM?

I'm not sure I understand this part correctly, but I'm inclined to think that 
system timeline data (RM/NM) should be controlled by cluster config, per 
cluster, while application data is controlled by the framework or even by 
per-application config. There may be a problem if the user is able to change 
the former config. For example, they could hide their application's 
information from the cluster admin.

bq. I have also incorporated the changes to support RMContainer metrics based 
on configuration (Junping's comments).

Do you mean we should keep 
{{yarn.resourcemanager.system-metrics-publisher.enabled}} to control RM SMP, 
and create {{yarn.nodemanager.system-metrics-publisher.enabled}} to control 

> [Event producers] Implement RM writing app lifecycle events to ATS
> ------------------------------------------------------------------
>                 Key: YARN-3044
>                 URL: https://issues.apache.org/jira/browse/YARN-3044
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Sangjin Lee
>            Assignee: Naganarasimha G R
>         Attachments: YARN-3044.20150325-1.patch, YARN-3044.20150406-1.patch
> Per design in YARN-2928, implement RM writing app lifecycle events to ATS.
