[ https://issues.apache.org/jira/browse/YARN-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14643337#comment-14643337 ]
Sangjin Lee commented on YARN-3816:
-----------------------------------
Thanks [~djp] for updating the POC patch and providing answers to the questions
I had. I've looked over the new patch and also gone through your answers. Some
follow-up thoughts and observations are below.
(1)
I think there is some confusion on the types of metrics in relation to this.
Here is how I look at the metric types. See if it squares with your
understanding. There are basically *2 independent* dimensions of metric types:
- single value vs. time series
- counter vs. gauge
Single value vs. time series purely concerns *storage*. It determines whether
only the latest value is stored or the entire time series is stored (subject to
TTL).
On the other hand, the counter vs. gauge dimension deals with *what type of
mathematical functions/operations* apply to them. Counters are metrics that are
time-cumulative in their nature, and are always monotonically increasing with
time (e.g. HDFS bytes written). Gauges can fluctuate up and down over time
(e.g. CPU usage). The time integral that's being done in this patch applies
only to gauges. It does not make sense for counters.
These are two independent dimensions in principle. For example, a gauge can be
a single value. A counter can be a time series. Whether or not every
combination is useful, all of them are possible in principle.
I propose to introduce the second dimension to the metrics explicitly. This
second dimension maps roughly to "toAggregate" (and/or the REP/SUM distinction)
in your patch. But I think it's probably better to introduce the metric types
explicitly, either as another enum or by subclassing {{TimelineMetric}}. Let me
know what you think.
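To make the two dimensions concrete, here is a minimal sketch of the enum
variant. All names in it ({{MetricTypes}}, {{Storage}}, {{Kind}},
{{MetricDescriptor}}, {{supportsTimeIntegral}}) are hypothetical and are not
taken from the patch or from the existing {{TimelineMetric}} class. It only
illustrates that the storage choice and the counter/gauge choice can be set
independently, and that the time integral is gated on the second one.

```java
/** Illustrative only: hypothetical names, not part of the YARN-3816 patch. */
public class MetricTypes {

    /** Storage dimension: keep only the latest value, or the whole series. */
    public enum Storage { SINGLE_VALUE, TIME_SERIES }

    /** Math dimension: which operations make sense for the metric. */
    public enum Kind {
        COUNTER, // time-cumulative, monotonically increasing (e.g. HDFS bytes written)
        GAUGE    // can go up and down over time (e.g. CPU usage)
    }

    /** A metric description that carries both dimensions independently. */
    public static final class MetricDescriptor {
        public final String id;
        public final Storage storage;
        public final Kind kind;

        public MetricDescriptor(String id, Storage storage, Kind kind) {
            this.id = id;
            this.storage = storage;
            this.kind = kind;
        }

        /** The time integral applies only to gauges, whatever the storage is. */
        public boolean supportsTimeIntegral() {
            return kind == Kind.GAUGE;
        }
    }

    public static void main(String[] args) {
        // All four combinations are representable.
        for (Storage s : Storage.values()) {
            for (Kind k : Kind.values()) {
                MetricDescriptor m = new MetricDescriptor("example", s, k);
                System.out.println(s + " x " + k
                    + " -> time integral allowed: " + m.supportsTimeIntegral());
            }
        }
    }
}
```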
(2)
I'm still very confused by the usage of the word "aggregate". In this patch,
"aggregate" really means accumulating values of a metric along the time
dimension, which is completely different from the notion of aggregation we have
used all along. The aggregation has always been about rolling up values from
children to parents. Can we choose a different word to describe this aspect of
accumulating values along the time dimension, and avoid using "aggregation" for
this? "Accumulate"? "Cumulative"? Any suggestion?
On a related note,
{quote}
However, in practice, there are cases that some aggregated metrics has both
properties, like "area" value here - we do need its cumulative values and also
could be interested in getting values within a given time interval. Isn't it?
{quote}
My statement was that a time-integral (or accumulation along the time
dimension) does not make sense for counters. For example, consider HDFS bytes
written. The time accumulation is already built into it (see (1)). If you
further accumulate this along the time dimension, it becomes quadratic (doubly
integrated) in time. I don't see how that can be useful. Another way to see
this is that a counter is basically a time integral of another gauge. For
example, the HDFS bytes written counter (in the unit of bytes) is a time
integral of HDFS bytes written per time (in the unit of bytes/sec). If I
misunderstood what you meant, could you kindly clarify it?
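A small numeric sketch of this point, under simplifying assumptions: a constant
write rate of 10 bytes/sec sampled once per second, and a trapezoidal sum as the
time integral. The class and method names are made up for illustration and are
not from the patch.

```java
/** Illustrative only: hypothetical helper, not part of the YARN-3816 patch. */
public class TimeAccumulation {

    /** Trapezoidal time integral of sampled values (times in seconds). */
    public static double integrate(double[] t, double[] v) {
        if (t.length != v.length) {
            throw new IllegalArgumentException("t and v must have the same length");
        }
        double area = 0.0;
        for (int i = 1; i < t.length; i++) {
            area += 0.5 * (v[i] + v[i - 1]) * (t[i] - t[i - 1]);
        }
        return area;
    }

    public static void main(String[] args) {
        double[] t = {0, 1, 2, 3, 4};
        double[] rate = {10, 10, 10, 10, 10};   // gauge: bytes/sec
        double[] counter = {0, 10, 20, 30, 40}; // counter: bytes written so far

        // Integrating the gauge reproduces the counter's final value (40 bytes).
        System.out.println(integrate(t, rate));    // 40.0

        // Integrating the counter again grows as 0.5 * 10 * t^2 (byte-seconds).
        System.out.println(integrate(t, counter)); // 80.0
    }
}
```

Integrating the rate gives back the counter, while integrating the counter
gives a quantity in byte-seconds that grows quadratically with time.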
(3)
{quote}
No. Nothing get changed on the design since our last discussions. The average
and max is also important but I just haven't get bandwidth to add in poc stage
as adding existing things could be more straight-forward. I will add it later.
{quote}
The average/max we discussed in the offline discussion is actually very similar
to the aggregated (accumulated) metrics here. The only difference is that the
average is further divided by the duration. Otherwise, it's basically the same
derived property. I would suggest that we do one or the other, but not both. I
think it would be OK to do this (the time accumulation) and not the average/max
from the previous discussion. I'd like to hear what others think about this.
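As a rough illustration of how close the two are, the sketch below derives the
accumulated value, the average, and the max from the same pass over a gauge's
samples. The {{GaugeSummary}} name, the trapezoidal rule, and the sample CPU
numbers are assumptions made for the example, not something in the patch.

```java
/** Illustrative only: hypothetical class, not part of the YARN-3816 patch. */
public class GaugeSummary {
    public final double area;    // time-accumulated value (value x seconds)
    public final double average; // area divided by the covered duration
    public final double max;     // largest sampled value

    private GaugeSummary(double area, double average, double max) {
        this.area = area;
        this.average = average;
        this.max = max;
    }

    /** Computes all three derived values in one pass over the samples. */
    public static GaugeSummary of(double[] tSec, double[] v) {
        if (tSec.length != v.length || tSec.length < 2) {
            throw new IllegalArgumentException("need at least two matching samples");
        }
        double duration = tSec[tSec.length - 1] - tSec[0];
        if (duration <= 0) {
            throw new IllegalArgumentException("samples must span a positive duration");
        }
        double area = 0.0;
        double max = v[0];
        for (int i = 1; i < v.length; i++) {
            area += 0.5 * (v[i] + v[i - 1]) * (tSec[i] - tSec[i - 1]);
            max = Math.max(max, v[i]);
        }
        // The average is just the accumulated value divided by the duration.
        return new GaugeSummary(area, area / duration, max);
    }

    public static void main(String[] args) {
        double[] t = {0, 10, 20, 30};     // seconds
        double[] cpu = {20, 40, 40, 20};  // CPU usage in percent (made-up data)
        GaugeSummary s = GaugeSummary.of(t, cpu);
        System.out.println("area=" + s.area + " average=" + s.average + " max=" + s.max);
    }
}
```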
(4)
Can we introduce a configuration that disables this time accumulation feature?
As we discussed, some may not want to have this feature enabled and are
perfectly happy with simple aggregation (from children to parents). It would be
good to isolate this part and be able to enable/disable it.
> [Aggregation] App-level Aggregation for YARN system metrics
> -----------------------------------------------------------
>
> Key: YARN-3816
> URL: https://issues.apache.org/jira/browse/YARN-3816
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: timelineserver
> Reporter: Junping Du
> Assignee: Junping Du
> Attachments: Application Level Aggregation of Timeline Data.pdf,
> YARN-3816-poc-v1.patch, YARN-3816-poc-v2.patch
>
>
> We need application level aggregation of Timeline data:
> - To present end user aggregated states for each application, include:
> resource (CPU, Memory) consumption across all containers, number of
> containers launched/completed/failed, etc. We need this for apps while they
> are running as well as when they are done.
> - Also, framework specific metrics, e.g. HDFS_BYTES_READ, should be
> aggregated to show details of states at the framework level.
> - Aggregation at other levels (Flow/User/Queue) can be more efficient when
> based on application-level aggregations rather than raw entity-level data,
> since far fewer rows need to be scanned (non-aggregated entities such as
> events, configurations, etc. are filtered out).