[ 
https://issues.apache.org/jira/browse/YARN-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14714416#comment-14714416
 ] 

Junping Du commented on YARN-3816:
----------------------------------

Thanks [~varun_saxena] for review and comments!
bq. If we use same scheme for long or double, we may end up with 4 ORs' for a 
single metric. Maybe we can use cell tags for aggregation.
That's good point! When I was doing poc patch a few weeks ago, YARN-4053 
haven't been bring out to discussion so I thought it was a little overkill to 
use cell tag for specifying the only boolean value. Now it seems to be a good 
way, but I would prefer to defer this decision to YARN-4053 to address while 
there are other priority comments to address here so we can move faster. What 
do you think?

bq. Maybe in TimelineCollector#aggregateMetrics, we should do aggregation only 
if the flag is enabled.
That's true. That's part of reason why aggregation flag is added to metric. 
Will add check in next patch.

bq. In TimelineCollector#appendAggregatedMetricsToEntities any reason we are 
creating separate TimelineEntity objects for each metric ? Maybe create a 
single entity containing a set of metrics.
Nice catch.

bq. 3 new maps have been introduced in TimelineCollector and these are used as 
base to calculate aggregated value. What if the daemon crashes?
For RM, it could persistent maps to RMStateStore. For NM, it may not be enough 
as NM could be lost also. We need a mechanism that if TimelineCollector is 
relaunched somewhere else, it will read raw metrics and recover the maps before 
start to working. This will be part of failed over JIRAs like: YARN-3115, 
YARN-3359, etc.

bq. In TimelineMetricCalculator some functions have duplicate if conditions for 
long.
Fixed.

bq. In TimelineMetricCalculator#sum, to avoid negative values due to overflow, 
we can change conditions like below...
Like above comments, the overflow case will be handled in next patch.

bq. In TimelineMetric#aggregateTo, maybe use getValues instead of getValuesJAXB?
I would prefer to use TreeMap because it sort key (timestamp) when accessing 
it. aggregateTo() algorithm assume metrics are sorted by timestamp.

bq. Also I was wondering if TimelineMetric#aggregateTo should be moved to some 
util class. TimelineMetric is part of object model and exposed to client. And 
IIUC aggregateTo wont be called by client.
As Li's mentioning below, it is a bit tricky to have utility class for any 
classes in API, because it would mislead user to use it which is not our 
intension, at least for now. aggregateTo is not straighfoward and generic 
useful like methods in TimelineMetricCalculator, so let's hold on to expose it 
as utility class for now. Make it static sounds good though.

bq. What is EntityColumnPrefix#AGGREGATED_METRICS meant for?
It is something developed at poc stage a few weeks ago, and it should be 
removed after we moving to ApplicationTable.

> [Aggregation] App-level Aggregation for YARN system metrics
> -----------------------------------------------------------
>
>                 Key: YARN-3816
>                 URL: https://issues.apache.org/jira/browse/YARN-3816
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Junping Du
>            Assignee: Junping Du
>         Attachments: Application Level Aggregation of Timeline Data.pdf, 
> YARN-3816-YARN-2928-v1.patch, YARN-3816-poc-v1.patch, YARN-3816-poc-v2.patch
>
>
> We need application level aggregation of Timeline data:
> - To present end user aggregated states for each application, include: 
> resource (CPU, Memory) consumption across all containers, number of 
> containers launched/completed/failed, etc. We need this for apps while they 
> are running as well as when they are done.
> - Also, framework specific metrics, e.g. HDFS_BYTES_READ, should be 
> aggregated to show details of states in framework level.
> - Other level (Flow/User/Queue) aggregation can be more efficient to be based 
> on Application-level aggregations rather than raw entity-level data as much 
> less raws need to scan (with filter out non-aggregated entities, like: 
> events, configurations, etc.).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to