[
https://issues.apache.org/jira/browse/YARN-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14876928#comment-14876928
]
Sangjin Lee commented on YARN-3816:
-----------------------------------
My apologies for truly belated review comments. I just had time to go over this
in some depth after working on YARN-4074. I think the latest patch is much more
aligned with the overall design, and thanks much for working on that patiently
[~djp].
First off, this overlaps with YARN-4074 and YARN-4075 that are getting wrapped
up. So it would be good if this goes in after those 2 JIRAs. Let me know if
you're OK with that.
Also, I do have some basic questions and issues to discuss, and I'll mention
them here. But I'm comfortable with having follow-on JIRAs after this one to
address some of these that turn out to be major changes.
*(aggregating metrics from all types of entities to application)*
It appears that the current code will aggregate metrics from all types of
entities to the application. This seems problematic to me.
The main goal of this aggregation is to roll up metrics from individual
*containers* to the application. But just by having the same metric id, any
entity can have its metric aggregated by this (incorrectly). For example, any
arbitrary entity can simply declare a metric named "MEMORY". By virtue of that,
it would get aggregated and added to the application-level value. There can be
variations of this: for example, the same metrics can be reported by the
container entity, app attempt entity, and so on. Then the values may be
aggregated double or triple.
I think we should ensure strongly that the aggregation happens only along the
path of YARN container entities to application to prevent these accidental
cases.
On a semi-related note, what happens if clients send metrics directly at the
application entity level? We should expect most framework-specific AMs to do
that. For example, MR AM already has all the job-level counters, and it can
(and should) report those job-level counters as metrics at the YARN application
entity. Is that case handled correctly, or will we end up getting incorrect
values (double counting) in that situation?
On to individual files:
(TimelineMetric.java)
- l.122: Although the method name is {{accumulateTo()}}, most of the variables
and comments say "aggregate". Can we clean them up to say "accumulate"?
(TimelineMetricCalculator.java)
- we should add the annotations (public? unstable?)
- l.34: if {{n1 == null}}, shouldn't we return {{-n2}}?
- for both {{sub()}} and {{sum()}}, would it be simpler just to handle the
arithmetic as longs even if they're integers?
(yarn-default.xml)
- The default defined in YarnConfiguration is true, but in yarn-default.xml it
is false; which is correct? We should reconcile them.
(NMTimelinePublisher.java)
- Shouldn't these metrics set {{toAggregate}} to true (because the default is
false)? These metrics are *THE* main ones we want to aggregate from containers
to application, right? For that matter, should the default itself for
{{toAggregate}} on {{TimelineMetric}} be true? I feel we should aggregate
unless specified otherwise, not the other way around. Thoughts?
(TimelineCollector.java)
- l.124: nit: you can simply call {{aggregateMetrics()}} instead of
{{TimelineCollector.aggregateMetrics()}}
- l.130: the same for {{appendAggregatedMetricsToEntities()}}
- l.212: What is the point of nulling out the value for metric id in
{{perIdAggregatedNum}}? It doesn't seem necessary.
(TimelineReaderWebServices.java)
- I'm not so sure if we need a separate REST end point for "aggregates". If I
understand correctly, they are all stored in the same application table under
the same app id. What does it mean to have a separate REST URL for aggregates?
Can we query for the application and be done?
(HBaseTimelineWriterImpl.java)
- I see that you're appending the {{toAggregate flag}} to the column name. I
think it is fine for now, but we will need to look at this again, as there are
other dimensions of metrics that need to be persisted. Some examples include
single value v. time series, long v. float (possibly), and so on. We will need
to arrive at a conclusion on how to encode them all cleanly and efficiently. We
can address this later together with [~varun_saxena] as he's dealing with a
related JIRA.
(HBaseTimelineReaderImpl.java)
- l.506: nit: it can just be
{code}
boolean toAggregate = toAggregateStr.equals("1");
{code}
> [Aggregation] App-level aggregation and accumulation for YARN system metrics
> ----------------------------------------------------------------------------
>
> Key: YARN-3816
> URL: https://issues.apache.org/jira/browse/YARN-3816
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: timelineserver
> Reporter: Junping Du
> Assignee: Junping Du
> Attachments: Application Level Aggregation of Timeline Data.pdf,
> YARN-3816-YARN-2928-v1.patch, YARN-3816-YARN-2928-v2.1.patch,
> YARN-3816-YARN-2928-v2.2.patch, YARN-3816-YARN-2928-v2.3.patch,
> YARN-3816-YARN-2928-v2.patch, YARN-3816-YARN-2928-v3.1.patch,
> YARN-3816-YARN-2928-v3.patch, YARN-3816-YARN-2928-v4.patch,
> YARN-3816-poc-v1.patch, YARN-3816-poc-v2.patch
>
>
> We need application level aggregation of Timeline data:
> - To present end user aggregated states for each application, include:
> resource (CPU, Memory) consumption across all containers, number of
> containers launched/completed/failed, etc. We need this for apps while they
> are running as well as when they are done.
> - Also, framework specific metrics, e.g. HDFS_BYTES_READ, should be
> aggregated to show details of states in framework level.
> - Other level (Flow/User/Queue) aggregation can be more efficient to be based
> on Application-level aggregations rather than raw entity-level data as much
> less raws need to scan (with filter out non-aggregated entities, like:
> events, configurations, etc.).
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)