Li Lu updated YARN-3817:
    Attachment: YARN-3817-poc-v1.patch

I'm attaching the first POC patch of our Phoenix-based offline aggregator. The 
current patch adds a MapReduce-based offline aggregator that gathers 
information from our HBase storage, performs the flow- and user-based 
aggregation, and writes the aggregated data back to Phoenix. Generally, the 
expected input to the offline aggregator is a list of flows (the active flows 
of the past time period, or a specially created list of flows within a given 
time window). The offline aggregator first aggregates all flow run data for 
each flow in both the mapper and the reducer, then writes the results back 
into Phoenix. Meanwhile, the aggregated data is passed along to the user level 
aggregation. The user level aggregation performs aggregations similar to the 
flow aggregations. There is a TimelineEntityWritable class to transfer 
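
To make the two aggregation levels described above concrete, here is a minimal 
plain-Java sketch. It is not the code in the patch: it uses no MapReduce, HBase 
or Phoenix APIs, and the FlowRun class, its fields and the cpuMillis metric are 
hypothetical stand-ins. It only illustrates the idea that flow run data is 
first summed per flow, and that the user level then consumes the flow-level 
totals rather than the raw runs.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: plain-Java stand-in for the two aggregation levels. */
final class OfflineAggregationSketch {

    /** Hypothetical flow-run record; real timeline entities carry more fields. */
    static final class FlowRun {
        final String user;
        final String flow;
        final long cpuMillis;

        FlowRun(String user, String flow, long cpuMillis) {
            this.user = user;
            this.flow = flow;
            this.cpuMillis = cpuMillis;
        }
    }

    /** Flow level: sum the metric over all runs of each (user, flow) pair. */
    static Map<String, Map<String, Long>> aggregateByFlow(List<FlowRun> runs) {
        Map<String, Map<String, Long>> perFlow = new TreeMap<>();
        for (FlowRun run : runs) {
            perFlow.computeIfAbsent(run.user, u -> new TreeMap<>())
                   .merge(run.flow, run.cpuMillis, Long::sum);
        }
        return perFlow;
    }

    /** User level: consume the flow-level totals rather than the raw runs. */
    static Map<String, Long> aggregateByUser(Map<String, Map<String, Long>> perFlow) {
        Map<String, Long> perUser = new TreeMap<>();
        for (Map.Entry<String, Map<String, Long>> entry : perFlow.entrySet()) {
            long total = 0L;
            for (long flowTotal : entry.getValue().values()) {
                total += flowTotal;
            }
            perUser.put(entry.getKey(), total);
        }
        return perUser;
    }

    public static void main(String[] args) {
        List<FlowRun> runs = Arrays.asList(
            new FlowRun("alice", "etl", 10L),
            new FlowRun("alice", "etl", 20L),
            new FlowRun("alice", "report", 5L),
            new FlowRun("bob", "etl", 7L));
        Map<String, Map<String, Long>> perFlow = aggregateByFlow(runs);
        System.out.println("per flow: " + perFlow);
        System.out.println("per user: " + aggregateByUser(perFlow));
    }
}
```

In a MapReduce setting the per-flow sums would be split between the mapper and 
the reducer, as described above; the sketch collapses that into one in-memory 
pass for readability.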

Some TODOs:
1. Centralize some of the HBase reader related code for both the aggregation 
HBase reader and the HBase reader. 
2. Create a "trigger" to launch the aggregator on a schedule or ad hoc. 
3. Separate configs. 
4. Support aggregation on a specific time period. 
5. More tests. 

Future TODOs: 
Reorganize our storage package and unit tests.

Some extra work performed in this patch:
1. No longer store info fields in the Phoenix writer. 
2. Escape special characters in the Phoenix writer by quoting all column names 
(per the Phoenix team's suggestion). 
3. Centralize tests for aggregation and Phoenix. 
4. Remove the unused TestTimelineWriterUtil. 
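
As a rough illustration of item 2 above (again, not the code in the patch), 
quoting a column name could look like the sketch below. The class and method 
names are hypothetical. It assumes that Phoenix treats double-quoted 
identifiers as literal, case-sensitive names, and it further assumes the 
standard SQL convention of doubling an embedded double quote, which should be 
checked against the Phoenix grammar before relying on it.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Illustrative only: hypothetical helper for quoting column names. */
final class ColumnQuotingSketch {

    /**
     * Wraps a column name in double quotes so characters such as '.' or '-'
     * are taken literally. Doubling embedded quotes is an assumption based
     * on standard SQL, not something confirmed in this thread.
     */
    static String quote(String columnName) {
        return "\"" + columnName.replace("\"", "\"\"") + "\"";
    }

    /** Builds a comma-separated list of quoted names for a SQL statement. */
    static String columnList(List<String> columnNames) {
        return columnNames.stream()
                .map(ColumnQuotingSketch::quote)
                .collect(Collectors.joining(", "));
    }
}
```

Quoting every name unconditionally keeps the writer simple, since it never has 
to decide which names contain special characters.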

> [Aggregation] Flow and User level aggregation on Application States table
> -------------------------------------------------------------------------
>                 Key: YARN-3817
>                 URL: https://issues.apache.org/jira/browse/YARN-3817
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Junping Du
>            Assignee: Li Lu
>         Attachments: Detail Design for Flow and User Level Aggregation.pdf, 
> YARN-3817-poc-v1.patch
> We need time-based flow/user level aggregation to present flow/user related 
> states to end users.
> Flow level represents summary info of a specific flow. User level aggregation 
> represents summary info of a specific user, it should include summary info of 
> accumulated and statistic means (by two levels: application and flow), like: 
> number of Flows, applications, resource consumption, resource means per app 
> or flow, etc. 

This message was sent by Atlassian JIRA