[ https://issues.apache.org/jira/browse/YARN-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593890#comment-14593890 ]

Joep Rottinghuis commented on YARN-3815:
----------------------------------------

For flow-level aggregates I'll separately write up ideas about how to do that.
In short, we need to focus on write performance, plus the fact that we need to 
aggregate increments from still-running applications into the aggregates. This 
makes it tricky to do correctly, specifically when apps (and ATS writers) can 
crash and need to restart; we'd have to keep track of the last values written. 
Initially I thought that using a coprocessor to do this server-side solves the 
problem. The challenge is that it will be invoked in the write path of 
individual stats, so slow writes to a second region server (hosting the agg 
table/row) can have a rippling effect on many writes. Even worse, we can end 
up with a deadlock under load when the agg table/row happens to be hosted on 
the same region server: the current write is blocked on the completion of the 
coprocessor, which needs to write but is itself blocked on a full queue on its 
own region server.

I think the solution will be to do something in the spirit of readless 
increments as used in Tephra. Similarly, we'd collapse values only when 
flushes or compactions happen, so aggregation is restricted to a single row 
that can be locked without issues. On reads we collapse the pre-aggregated 
values plus the values from currently running jobs. The significant 
difference is that we can compact only when jobs are complete. I'll try to 
write up a more detailed design for this, but a rough sketch of the idea is 
below.
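
To illustrate (just a minimal sketch, not a design: the table/column names 
and the unique-timestamp scheme are assumptions I made up), each increment is 
written as its own cell version with no read-modify-write, and readers sum 
whatever versions are present:

{code:java}
import java.io.IOException;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ReadlessIncrementSketch {
  // Hypothetical metric column family for the aggregate table.
  private static final byte[] FAMILY = Bytes.toBytes("m");

  // Write path: record a delta as a new cell version keyed by a unique
  // timestamp. No read of the current value is needed, so there is no
  // cross-region-server round trip in the write path of individual stats.
  static void increment(Table aggTable, byte[] row, byte[] qualifier,
      long delta, long uniqueTs) throws IOException {
    Put put = new Put(row);
    put.addColumn(FAMILY, qualifier, uniqueTs, Bytes.toBytes(delta));
    aggTable.put(put);
  }

  // Read path: collapse the pre-aggregated value plus the deltas from
  // still-running apps by summing every stored cell version.
  static long readSum(Table aggTable, byte[] row, byte[] qualifier)
      throws IOException {
    Get get = new Get(row);
    get.addColumn(FAMILY, qualifier);
    get.setMaxVersions(); // fetch all versions, not just the latest
    long sum = 0;
    for (Cell cell : aggTable.get(get).rawCells()) {
      sum += Bytes.toLong(CellUtil.cloneValue(cell));
    }
    return sum;
  }
}
{code}

A preFlush/preCompact hook could then do the same summing server-side and 
rewrite the result as a single cell, but only for apps known to be complete, 
which is where this would differ from Tephra.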

If we follow [~sjlee0]'s suggestion to make all the other aggregates periodic, 
then we can use mapreduce for those. The big advantage is that we can then use 
control records, like we do in hRaven, to efficiently keep track of what we 
have already aggregated. The tricky ones will be the long-running apps we 
have to keep coming back to. Ideally we should be able to read the raw values 
once and then "spray" them out to the various aggregate tables (cluster, 
queue, user) per time period; otherwise we end up scanning over the raw 
values over and over again.
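
Roughly along these lines (a hypothetical sketch: the column names, key 
format, and metric are all made up for illustration), a single TableMapper 
pass over the raw application data could emit to all three aggregate 
groupings at once, with reducers summing per aggregate key:

{code:java}
import java.io.IOException;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class SprayAggregationMapper extends TableMapper<Text, LongWritable> {
  private static final byte[] INFO = Bytes.toBytes("i");

  @Override
  protected void map(ImmutableBytesWritable row, Result result,
      Context context) throws IOException, InterruptedException {
    byte[] metric = result.getValue(INFO, Bytes.toBytes("mb_millis"));
    byte[] queue = result.getValue(INFO, Bytes.toBytes("queue"));
    byte[] user = result.getValue(INFO, Bytes.toBytes("user"));
    if (metric == null || queue == null || user == null) {
      return; // skip incomplete raw rows
    }
    LongWritable delta = new LongWritable(Bytes.toLong(metric));
    String period = period(row);
    // One read of the raw row feeds cluster, queue, and user aggregates,
    // so we never rescan the raw table once per aggregate level.
    context.write(new Text("cluster!" + period), delta);
    context.write(new Text("queue!" + Bytes.toString(queue) + "!" + period),
        delta);
    context.write(new Text("user!" + Bytes.toString(user) + "!" + period),
        delta);
  }

  // Placeholder: a real implementation would derive the aggregation time
  // period from the row key or a timestamp column.
  private String period(ImmutableBytesWritable row) {
    return "fixed-period";
  }
}
{code}

A control-record table, as in hRaven, would then record the last period 
successfully aggregated so that reruns only scan the delta.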

> [Aggregation] Application/Flow/User/Queue Level Aggregations
> ------------------------------------------------------------
>
>                 Key: YARN-3815
>                 URL: https://issues.apache.org/jira/browse/YARN-3815
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Junping Du
>            Assignee: Junping Du
>            Priority: Critical
>         Attachments: Timeline Service Nextgen Flow, User, Queue Level 
> Aggregations (v1).pdf
>
>
> Per previous discussions in some design documents for YARN-2928, the basic 
> scenario is that the query for stats can happen at:
> - Application level, expected return: an application with aggregated stats
> - Flow level, expected return: aggregated stats for a flow_run, 
> flow_version and flow
> - User level, expected return: aggregated stats for applications submitted 
> by a user
> - Queue level, expected return: aggregated stats for applications within 
> the queue
> Application state is the basic building block for all other levels of 
> aggregation. We can provide Flow/User/Queue level aggregated statistics 
> based on application state (a dedicated table for application state is 
> needed, which is missing from previous design documents like the 
> HBase/Phoenix schema design).


