[
https://issues.apache.org/jira/browse/YARN-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635513#comment-14635513
]
Junping Du commented on YARN-3816:
----------------------------------
Thanks [~sjlee0] for review and comments!
bq. If I understand correctly, this patch basically does a time integral of a
given metric, or "the area under the curve" for the metric as a function of
time. For example, if the underlying metric is a container CPU usage, the
"aggregated" metric according to TimelineMetric.aggregateTo() would be a
cumulative CPU usage over time for that container (in the units of CPU-millis).
That's correct. As a poc patch for app aggregation, it picks up only a few
metrics and aggregates them in one particular way, to demonstrate the overall
end-to-end flow. I understand there could be more important aggregated metrics,
and I will try to add more in following patches.
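To make the "area under the curve" idea concrete, here is a minimal sketch in
Java. It is not the code in the patch and not the actual logic of
TimelineMetric.aggregateTo(); the class GaugeArea, the method areaUnderCurve,
and the choice of a left-Riemann sum (each sample is assumed to hold until the
next timestamp) are all assumptions made for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of the YARN-3816 patch.
class GaugeArea {
  /**
   * Approximates the time integral of a gauge with a left-Riemann sum.
   * Keys are timestamps in milliseconds, values are gauge readings.
   * The result is in (value unit) x millis.
   */
  static long areaUnderCurve(TreeMap<Long, Long> samples) {
    long area = 0;
    Map.Entry<Long, Long> prev = null;
    for (Map.Entry<Long, Long> cur : samples.entrySet()) {
      if (prev != null) {
        // The previous reading is assumed constant until the current timestamp.
        area += prev.getValue() * (cur.getKey() - prev.getKey());
      }
      prev = cur;
    }
    return area;
  }
}
```

With readings of 2 at t=0, 4 at t=1000 and 1 at t=3000, the sketch yields
2*1000 + 4*2000. The last reading contributes nothing because no later
timestamp bounds it.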
bq. While this is certainly a useful number to keep track of, this was not the
app-level aggregation I had in mind. IMO, the app-level aggregation (or any
aggregation for that matter) is all about rolling metrics up from child
entities to the parent entity. I would have thought that it would be the first
thing we want to get to. It looks, however, as though that aggregation is not
done in this patch. I don't see any code that rolls up values from containers
to the application. Are you planning to introduce that soon?
Yes. I plan to add that part in the poc v2 patch, taking a "snapshot" of
resource consumption for an application. The previous area value is also kept,
for a different purpose (resource billing/charging, etc.).
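As a rough illustration of the container-to-application roll-up discussed
here, the sketch below sums the latest reading of one metric across all
containers of an app to form a "snapshot". The class AppRollup, the method
sumAcrossContainers, and the use of plain string container IDs are
hypothetical and are not taken from the patch.

```java
import java.util.Map;

// Hypothetical helper, not part of the YARN-3816 patch.
class AppRollup {
  /**
   * Rolls one metric up from child entities (containers) to the parent
   * (application) by summing the latest value reported by each container.
   * Keys are container IDs, values are that container's latest reading.
   */
  static long sumAcrossContainers(Map<String, Long> latestByContainer) {
    long total = 0;
    for (long value : latestByContainer.values()) {
      total += value;
    }
    return total;
  }
}
```

The same sum would also apply to a counter-like metric such as CPU time
millis, where each container's value is already cumulative.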
bq. This type of time integral works only if the underlying metric is a gauge.
For example, for any counter-like metric (e.g. HDFS bytes read) which is
cumulative in nature, the time integral does not make sense. We will need to
introduce another type dimension to the metrics that signifies whether it is a
counter or a gauge, but this is just to note that the time integral works only
for gauges.
I agree that we should differentiate counters from gauges. For the former, we
focus more on the cumulative property, while for the latter we focus more on
the "snapshot". However, in practice, there are cases where an aggregated
metric has both properties, like the "area" value here: we do need its
cumulative value, and we could also be interested in its value within a given
time interval. Isn't that so?
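One way to serve both needs from a single cumulative series is to store the
running total per timestamp and derive an interval value as the difference of
two totals. The sketch below assumes exactly that; CumulativeSeries, record,
valueAt and delta are hypothetical names, and this storage layout is not
something the patch or the design document specifies.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of the YARN-3816 patch.
class CumulativeSeries {
  // timestamp (ms) -> cumulative value recorded at that time
  private final TreeMap<Long, Long> points = new TreeMap<>();

  void record(long timestamp, long cumulativeValue) {
    points.put(timestamp, cumulativeValue);
  }

  /** Cumulative value at or before the timestamp, or 0 if nothing was recorded yet. */
  long valueAt(long timestamp) {
    Map.Entry<Long, Long> entry = points.floorEntry(timestamp);
    return entry == null ? 0L : entry.getValue();
  }

  /** Amount accumulated between the two timestamps. */
  long delta(long from, long to) {
    return valueAt(to) - valueAt(from);
  }
}
```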
bq. Also, this is pretty similar to what we talked about during the offline
meeting as "average/max" for gauges, except that it's not divided over time. We
discussed that we want to introduce time averages and maxes for gauges (see
"time average & max" in
https://issues.apache.org/jira/secure/attachment/12743390/aggregation-design-discussion.pdf).
Are we thinking of replacing that with this?
No. Nothing has changed in the design since our last discussion. The average
and max are also important, but I just haven't had the bandwidth to add them at
the poc stage, as adding what already exists was more straightforward. I will
add them later.
bq. In the specific case of container CPU usage, it seems to me that emitting
the actual CPU time millis directly would be a far easier and more accurate way
to capture this info. I believe it's readily available, and it would be a
counter-like metric instead of a gauge. Therefore the time integral doesn't
apply (as it already is one). But all you need to do at the app-level
aggregation for it is just to sum it up. I recognize that this time integral
would be useful for other things, but just wanted to point that out.
Thanks for pointing that out. I agree this is more precise and will update
this in a following patch.
> [Aggregation] App-level Aggregation for YARN system metrics
> -----------------------------------------------------------
>
> Key: YARN-3816
> URL: https://issues.apache.org/jira/browse/YARN-3816
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: timelineserver
> Reporter: Junping Du
> Assignee: Junping Du
> Attachments: Application Level Aggregation of Timeline Data.pdf,
> YARN-3816-poc-v1.patch, YARN-3816-poc-v2.patch
>
>
> We need application level aggregation of Timeline data:
> - To present end users with aggregated states for each application, including:
> resource (CPU, Memory) consumption across all containers, number of
> containers launched/completed/failed, etc. We need this for apps while they
> are running as well as when they are done.
> - Also, framework specific metrics, e.g. HDFS_BYTES_READ, should be
> aggregated to show details of states at the framework level.
> - Aggregation at other levels (Flow/User/Queue) can be more efficient when
> based on application-level aggregations rather than raw entity-level data, as
> far fewer rows need to be scanned (non-aggregated entities such as events,
> configurations, etc. are filtered out).