Junping Du commented on YARN-3816:

Thanks [~sjlee0], [~varun_saxena] and Li for the comments. I am rebasing the 
patch on YARN-4356 and incorporating your comments above. Some quick responses 
to your major comments, for more feedback:
bq. It appears that the current code will aggregate metrics from all types of 
entities to the application. This seems problematic to me. The main goal of 
this aggregation is to roll up metrics from individual containers to the 
application. But just by having the same metric id, any entity can have its 
metric aggregated by this (incorrectly). For example, any arbitrary entity can 
simply declare a metric named "MEMORY". By virtue of that, it would get 
aggregated and added to the application-level value. There can be variations of 
this: for example, the same metrics can be reported by the container entity, 
app attempt entity, and so on. Then the values may be aggregated double or 
triple. I think we should ensure strongly that the aggregation happens only 
along the path of YARN container entities to application to prevent these 
accidental cases.
That sounds like a reasonable concern. I agree that we should keep system 
metrics and applications' metrics from getting mixed up. However, I think our 
goal here is not just to aggregate/accumulate container metrics, but also to 
provide an aggregation service for applications' metrics (other than MR), 
isn't it? If so, maybe a better way is to aggregate metrics not only by metric 
name but also by originating entity type (so memory metrics for 
ContainerEntity won't be aggregated with memory metrics from Application 
Entity). [~sjlee0], what do you think?
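
A rough sketch of the idea, not code from the patch: totals are keyed by 
entity type plus metric id rather than by metric id alone. The classes 
ReportedMetric and TypedAggregator and the entity type strings are 
hypothetical and only serve the illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a reported metric; not the real TimelineMetric.
final class ReportedMetric {
    final String entityType; // type of the entity that reported the metric
    final String metricId;   // e.g. "MEMORY"
    final long value;

    ReportedMetric(String entityType, String metricId, long value) {
        this.entityType = entityType;
        this.metricId = metricId;
        this.value = value;
    }
}

// Hypothetical aggregator: sums values per (entity type, metric id) pair.
final class TypedAggregator {
    static Map<String, Long> aggregate(List<ReportedMetric> metrics) {
        Map<String, Long> totals = new HashMap<>();
        for (ReportedMetric m : metrics) {
            // The entity type is part of the key, so "MEMORY" reported by a
            // container never mixes with "MEMORY" reported by another type.
            String key = m.entityType + "/" + m.metricId;
            totals.merge(key, m.value, Long::sum);
        }
        return totals;
    }
}
```

With this keying, an arbitrary entity that declares a metric named "MEMORY" 
ends up in its own bucket instead of inflating the container total.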

bq. On a semi-related note, what happens if clients send metrics directly at 
the application entity level? We should expect most framework-specific AMs to 
do that. For example, MR AM already has all the job-level counters, and it can 
(and should) report those job-level counters as metrics at the YARN application 
entity. Is that case handled correctly, or will we end up getting incorrect 
values (double counting) in that situation?
That's why we need the toAggregate() API in TimelineMetric. Metrics that are 
already aggregated (like the MR AM's counters) should set it to false to avoid 
double counting. Sounds good?
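
A minimal sketch of how such a flag could prevent double counting. 
SketchMetric and AppLevelRollup are hypothetical stand-ins, not the real 
TimelineMetric or the aggregation code in the patch; only the toAggregate() 
name comes from the discussion above.

```java
import java.util.List;

// Hypothetical stand-in for TimelineMetric with only the fields needed here.
final class SketchMetric {
    final String id;
    final long value;
    final boolean aggregate; // false when the reporter already rolled it up

    SketchMetric(String id, long value, boolean aggregate) {
        this.id = id;
        this.value = value;
        this.aggregate = aggregate;
    }

    boolean toAggregate() {
        return aggregate;
    }
}

// Hypothetical roll-up: sums only metrics that opted in to aggregation.
final class AppLevelRollup {
    static long rollUp(String metricId, List<SketchMetric> metrics) {
        long total = 0;
        for (SketchMetric m : metrics) {
            // Skip metrics flagged as already aggregated, such as job-level
            // counters an AM reports at the application entity.
            if (m.id.equals(metricId) && m.toAggregate()) {
                total += m.value;
            }
        }
        return total;
    }
}
```

Here two container-level values of 100 and 200 plus an AM-reported job-level 
value of 300 flagged false roll up to 300, not 600.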

bq. calculating area under the curve along the time dimension, would it be 
useful by itself? Average based on this area under the curve seems more useful.
Yes. Both the overall and the average values are useful, from different 
standpoints. The former represents how much resource the application actually 
consumed, which is very useful for billing in cloud services, etc. We can 
extend this to more values later if we think it is worthwhile. Varun, does 
that make sense?
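
To illustrate the two values, here is a sketch that assumes the area is 
computed with the trapezoidal rule over (timestamp, value) samples; the patch 
may accumulate differently. MetricCurve and its methods are hypothetical.

```java
// Hypothetical helper; timestamps must be in ascending order.
final class MetricCurve {
    // Overall value: trapezoidal area under the sampled curve,
    // in (value unit) x (time unit), e.g. MB-seconds.
    static double area(long[] timestamps, double[] values) {
        double area = 0.0;
        for (int i = 1; i < timestamps.length; i++) {
            double meanHeight = (values[i - 1] + values[i]) / 2.0;
            area += meanHeight * (timestamps[i] - timestamps[i - 1]);
        }
        return area;
    }

    // Average value: the area divided by the covered time span.
    static double average(long[] timestamps, double[] values) {
        if (timestamps.length == 0) {
            return 0.0;
        }
        long span = timestamps[timestamps.length - 1] - timestamps[0];
        if (span == 0) {
            return values[0]; // single point in time, nothing to integrate
        }
        return area(timestamps, values) / span;
    }
}
```

For samples of 100, 200 and 200 at times 0, 10 and 20, the area is 3500 and 
the average is 175.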

bq. There are 3 types of aggregation basis, but only application aggregation 
has its own entity type. How do we represent the result entity of the other 2 
I don't quite understand the question here. Li, are you suggesting we should 
remove the application aggregation entity type, add flow/queue aggregation 
entity types, or keep them consistent?

> [Aggregation] App-level aggregation and accumulation for YARN system metrics
> ----------------------------------------------------------------------------
>                 Key: YARN-3816
>                 URL: https://issues.apache.org/jira/browse/YARN-3816
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Junping Du
>            Assignee: Junping Du
>              Labels: yarn-2928-1st-milestone
>         Attachments: Application Level Aggregation of Timeline Data.pdf, 
> YARN-3816-YARN-2928-v1.patch, YARN-3816-YARN-2928-v2.1.patch, 
> YARN-3816-YARN-2928-v2.2.patch, YARN-3816-YARN-2928-v2.3.patch, 
> YARN-3816-YARN-2928-v2.patch, YARN-3816-YARN-2928-v3.1.patch, 
> YARN-3816-YARN-2928-v3.patch, YARN-3816-YARN-2928-v4.patch, 
> YARN-3816-feature-YARN-2928.v4.1.patch, YARN-3816-poc-v1.patch, 
> YARN-3816-poc-v2.patch
> We need application level aggregation of Timeline data:
> - To present end user aggregated states for each application, include: 
> resource (CPU, Memory) consumption across all containers, number of 
> containers launched/completed/failed, etc. We need this for apps while they 
> are running as well as when they are done.
> - Also, framework specific metrics, e.g. HDFS_BYTES_READ, should be 
> aggregated to show details of states in framework level.
> - Other levels of aggregation (Flow/User/Queue) can be more efficient when 
> based on application-level aggregations rather than raw entity-level data, 
> as far fewer rows need to be scanned (after filtering out non-aggregated 
> entities such as events, configurations, etc.).
