[ 
https://issues.apache.org/jira/browse/YARN-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14791012#comment-14791012
 ] 

Junping Du commented on YARN-3816:
----------------------------------

Thanks Naga, Varun and Li for review and comments! Let me address them one by 
one. 
First for Naga's comments:
bq. Following are not completely achieved right? number of containers 
launched/completed/failed, framework specific metrics, e.g. HDFS_BYTES_READ, 
should be aggregated to show details of states in framework level.
We are almost there. number of containers should be an existing info which get 
addressed in YARN-3880. Also, framework specific metrics is another topic and 
we were still discussing different requirements for MapReduce and other apps 
which is out of scope of this JIRA - that's why we have YARN system metrics in 
the title.

bq. In the doc, ApplicationState Table (aggregated from 
AppLevelTimelineCollector​) has Container Aggregate metrics (allocated: 0 
preempted:0 failed: 0 reuse: 0 ) is this req @ AppLevelTimelineCollector​ felt 
it should be only @ aggregated from ​RMTimelineCollector. Also time(start: 
last_modification: avg_execution ) is required as metric? may be i misread the 
table description?
Like said above, YARN-3880 is supposed to track container number metrics. May 
be we can move discussion there?

bq. In the doc aggregation-design-discussion.pdf, you had mentioned that time 
average & max is what will be considered, but in the patch it seems more like 
only SUM is supported neither avg or max, so is sum more imp than the other(or 
am i missing something) ? Also would like to know the significance of this 
measurement as i felt per‐container average more helpful as it can be useful 
for calibrating RM.
We had a previous discussion before and we choose SUM as the first operation to 
support on aggregating metrics. There are definetely other operations that are 
useful that we could add and extend later.

bq. IIUC Based on the current design aggregation seems to be happening @ the 
collector end. in that case do we require 
TimelineWriter.aggregate(TimelineEntity data, TimelineAggregationTrack track) ? 
Is there any idea to push some logic to writer for aggregation?
No. App aggregation is per collector but not per writer as currently we are 
sharing a single writer on NM for all app collector. I would prefer to make 
each collector thread to maintain their own states and calculation.

bq. TimelineAggregationBasis doesnt have value for queue, as this is used in 
TimelineReaderWebServices, inst it required for reader?
If my understanding is correct, queue info is not a must for app entity I 
think. We only require flow info, etc. However, I will do double check on 
reader side for this.

bq. will it be required to accumulate time series data with single value data 
and viceversa ? would accumulation need to be done on same type ? if not some 
real scenarios where it can be possibly happen.
In toAccumulate, we support accumulate time series data on a single value data 
(basis data) because we can assume basis data is always single value data which 
comes from last time accumulation result. If there are scenarios that we want 
accumulated result to be time series data, then we can have a separated method 
to extend later. Make sense?

bq. Would it be better to have set of operation which can be performed in 
TimelineMetric so that accumulateTo automatically detect and accumulate for 
diff operations ? currently it seems like statically set to SUM in 
TimelineCollecor.
We support SUM and REP (replace) already. Like above comments, we can add more 
operations later with more specific requirement.

bq. Currently for each putEntity call in collector we are not only aggregating 
& invoking accumulateTo but also sending it to be written to the writer, but in 
the doc its mentioned that it will cache for 15 seconds and then update right?
No. We were choosing to aggregate and accumulate (can be disabled by 
configuration) immediately like current implementation. The previous concern is 
for performance delay but it sounds unnecessary now. We can rethink on this if 
we meet perf bottleneck for this in future.

bq. Not sure earlier why was pid added for a container cpu and mem usage metric 
and not sure why we are removing it. But seems like for a given container we do 
not req pid to be appended as it will be unique to it. is that the reason we 
are removing it?
Pid is added wrongly previously as this info is useless: The outer side of 
TimelineEnity (container entity) already have container id which make this 
metrics unique enough. And we need metric ID to keep the same type (CPU, 
Memory, etc) for aggregation and accumulation.

bq. do we need to set aggregateTo to true for container metrics(cputotalCore% & 
pmemUsage) to ? also we are currently not capturing vmemUsage do we need to 
capture it?
We choose to record these two metrics only in previous JIRAs (like YARN-3045). 
May be we can keep to follow this and add more metrics later if necessary?

bq. In the Doc its mentioned we are going to split the table "ApplicationState 
table" into 2 It can be split into two tables by aggregated from 
RMTimelineCollector or AppLevelTimelineCollector, is it req?
I think this may not be a must requirement that we have to split into two 
tables. But I suggest we can revisit this in YARN-3880 for putting some info 
from RMTimelineCollector. Here we don't have to worry as all aggregated info 
are from AppLevelTimelineCollector.

bq. yarn.timeline-service.aggregation.accumulation.enabled can have default 
value to be explicitly set as true in yarn-default.xml as per the default value 
in yarn config.
Ok. will do.

bq. in TestTimelineMetric.testAccumulationOnTimelineMetrics assertEquals 
expected value should come as first arg and the actual expression as next. when 
it fails exception msg will come wrong. also unused import in that class
Nice catch! Will fix it in next patch.

bq. 2 static methods of TimelineCollector.aggregateMetrics(TimelineEntities) 
are public are they planned to be used some other class? if not we can make it 
private. Also aggregateMetrics returns a map, can it be a List/Set which would 
suffice for appendAggregatedMetricsToEntities
Sounds good. Let's narrow the visibility and make it a SET instead.  

bq. EntityColumnPrefix.AGGREGATED_METRICS is not used anywhere, is it req?
Will remove it.

bq. Trying to create a setup and test the patch in the cluster, if i come 
across more queries will inform.
Cool. That would helps. Thanks!

I will reply Varun and Li's comments in next segment as now it is pretty long 
enough.

> [Aggregation] App-level aggregation and accumulation for YARN system metrics
> ----------------------------------------------------------------------------
>
>                 Key: YARN-3816
>                 URL: https://issues.apache.org/jira/browse/YARN-3816
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Junping Du
>            Assignee: Junping Du
>         Attachments: Application Level Aggregation of Timeline Data.pdf, 
> YARN-3816-YARN-2928-v1.patch, YARN-3816-YARN-2928-v2.1.patch, 
> YARN-3816-YARN-2928-v2.2.patch, YARN-3816-YARN-2928-v2.3.patch, 
> YARN-3816-YARN-2928-v2.patch, YARN-3816-poc-v1.patch, YARN-3816-poc-v2.patch
>
>
> We need application level aggregation of Timeline data:
> - To present end user aggregated states for each application, include: 
> resource (CPU, Memory) consumption across all containers, number of 
> containers launched/completed/failed, etc. We need this for apps while they 
> are running as well as when they are done.
> - Also, framework specific metrics, e.g. HDFS_BYTES_READ, should be 
> aggregated to show details of states in framework level.
> - Other level (Flow/User/Queue) aggregation can be more efficient to be based 
> on Application-level aggregations rather than raw entity-level data as much 
> less raws need to scan (with filter out non-aggregated entities, like: 
> events, configurations, etc.).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to