Li Lu commented on YARN-3817:

I did a rough estimate of the resource consumption of flow and user level 
time-based aggregation. Suppose our aggregation interval is one hour. For 
large, active clusters (a few thousand active flows per hour), we may generate 
a few thousand timeline entity reads against the aggregated application table. 
Metrics will take up the majority of the storage space. Each application may 
have fewer than 100 metrics (system metrics plus customized metrics), so each 
aggregated entity may take ~50 KB of space (400 bytes for metric name and a 
few KB for the data). So in total we may generate, say, 5k * 50 KB = 250 MB of 
read traffic per interval, and write back 25 MB-250 MB of aggregated data 
(depending on the granularity of the flows) for flow level aggregation. 
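
To make the arithmetic above easy to re-run with other numbers, here is a 
minimal Python sketch. The function name and parameters are hypothetical, 
decimal units (1 KB = 1000 bytes) are assumed, and the 10%-100% write-back 
range is just my reading of the 25 MB-250 MB figure relative to the 250 MB of 
reads. 

```python
KB = 1000          # assumption: decimal units, matching the rough "50k" figure
MB = 1000 * KB


def estimate_hourly_traffic(active_flows=5_000,
                            entity_bytes=50 * KB,
                            write_percent_range=(10, 100)):
    """Rough per-interval traffic estimate for flow level aggregation.

    active_flows:        aggregated entities read per interval ("5k")
    entity_bytes:        size of one aggregated entity ("~50k")
    write_percent_range: assumed write-back volume as a percentage of reads
    """
    read_bytes = active_flows * entity_bytes
    low_pct, high_pct = write_percent_range
    return {
        "read_bytes": read_bytes,
        "write_bytes_low": read_bytes * low_pct // 100,
        "write_bytes_high": read_bytes * high_pct // 100,
    }


estimate = estimate_hourly_traffic()
for name, value in estimate.items():
    print(f"{name}: {value / MB:.0f} MB")
```

Changing `active_flows` or `entity_bytes` shows how quickly the totals move, 
for example when more than one cluster writes to the same backend. 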

Similarly, if we assume a few hundred cluster users, user level aggregation 
generates traffic on a similar scale. 

One risk of using HBase coprocessors is that they run inside the region 
servers, so a coprocessor failure can take the region server down. Given that 
we're planning to scale timeline v2 to more than one cluster, the traffic 
generated by time-based aggregation may easily increase tenfold in the future. 
That said, we may want to implement the offline aggregations as map-reduce 
jobs as our first attempt. Afterwards, if there is a need to implement 
aggregation in endpoint coprocessors, we can easily reuse the "core" part of 
the mapreduce 

> [Aggregation] Flow and User level aggregation on Application States table
> -------------------------------------------------------------------------
>                 Key: YARN-3817
>                 URL: https://issues.apache.org/jira/browse/YARN-3817
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Junping Du
>            Assignee: Junping Du
>         Attachments: Detail Design for Flow and User Level Aggregation.pdf
> We need flow/user level aggregation to present flow/user related states to 
> end users.
> Flow level aggregation involves three levels of aggregation:
> - The first level is Flow_run level which represents one execution of a flow 
> and shows the exact aggregated data for one run of a flow.
> - The 2nd level is Flow_version level which represents summary info of a 
> version of flow.
> - The 3rd level is Flow level which represents summary info of a specific 
> flow.
> User level aggregation represents summary info of a specific user. It should 
> include accumulated summary info and statistical means (at two levels: 
> application and flow), such as: number of flows, applications, resource 
> consumption, resource means per app or flow, etc. 
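
The levels in the description above can be pictured as the same roll-up 
applied with progressively coarser grouping keys. The sketch below is only an 
illustration: the record layout, field names, and the choice of summing every 
metric are assumptions, not the schema or aggregation operations from the 
attached design. 

```python
from collections import defaultdict


def aggregate(app_records, key_fields):
    """Sum per-application metrics grouped by the given key fields."""
    totals = defaultdict(lambda: defaultdict(int))
    for rec in app_records:
        key = tuple(rec[field] for field in key_fields)
        for name, value in rec["metrics"].items():
            totals[key][name] += value
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {key: dict(metrics) for key, metrics in totals.items()}


# Hypothetical application-level records (one per finished application).
apps = [
    {"user": "alice", "flow": "etl", "version": "v1", "run": 1,
     "metrics": {"cpu": 10, "mem": 4}},
    {"user": "alice", "flow": "etl", "version": "v1", "run": 1,
     "metrics": {"cpu": 5, "mem": 2}},
    {"user": "alice", "flow": "etl", "version": "v1", "run": 2,
     "metrics": {"cpu": 7, "mem": 1}},
    {"user": "bob", "flow": "etl", "version": "v2", "run": 3,
     "metrics": {"cpu": 3, "mem": 3}},
]

flow_run = aggregate(apps, ("flow", "version", "run"))   # Flow_run level
flow_version = aggregate(apps, ("flow", "version"))      # Flow_version level
flow = aggregate(apps, ("flow",))                        # Flow level
user = aggregate(apps, ("user",))                        # User level

print(flow_run[("etl", "v1", 1)])
print(flow[("etl",)])
```

Means per app or per flow would need a count kept alongside each sum; that is 
left out here to keep the sketch short. 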
