Li Lu commented on YARN-3901:

Hi [~vrushalic], thanks for the work! I looked at the current patch and have 
the following comments/questions:
- The name Attribute seems quite general. Maybe we want something more 
specific? From my understanding, Attribute acts as the "command" (in the 
design-pattern sense) of the aggregation? 

- storeInFlowActivityTable: why two levels of indirection? The te parameter is 
never used in the first wrapper. 
- Move the few static helper methods into TimelineEntity? 
- Since both AggregationCompactionDimension and AggregationOperations may 
generate an Attribute, maybe it would help to distinguish the Attributes from 
these two sources by their names? A name like attribute1 is not very helpful. 

getIncomingAttributes in TimelineWriterUtils may need some more comments on its 

This may conflict with YARN-4102. I'm fine with committing them in either order. 

Are we assuming there will be at most two attributes for each column prefix? In 
FlowScanner we're only dealing with two attributes, one from compaction and one 
from operations. But in FlowActivityColumnPrefix we're assuming there's a list 
of attributes? 

Maybe I'm missing something, but why are we converting HBase attributes into 
tags in FlowRunCoprocessor, but not doing the same thing in 
FlowActivityCoprocessor? Or, what does FlowActivityCoprocessor aggregate on?

What is our plan for FlowActivityColumnPrefix#IN_PROGRESS_TIME? 

In FlowScanner, after aggregation (in nextInternal) we're simply adding the 
aggregated data as a Cell. However, I haven't found where we guarantee that the 
new cell is not aggregated again (creating yet another new cell for the 
aggregation result). Are we doing this deliberately, or am I missing anything? 

- l.149, l.310, HBaseTimelineWriterImpl: indentation problems?
- Some lines are longer than 80 characters. 
- l.64 AggregationOperations, wrong indentation with tab. 
- FlowRunColumnPrefix, FlowActivityColumnPrefix (maybe somewhere else): in 
Hadoop a common practice is to only have one space before the name of member 
variables. We don't really need to make all of them start in the same column. 
- Just curious, how did you choose the numbers associated with different 

It's a big patch, so I may find more tomorrow. Sorry about that, but I 
just don't want to block the whole review process. 

> Populate flow run data in the flow_run & flow activity tables
> -------------------------------------------------------------
>                 Key: YARN-3901
>                 URL: https://issues.apache.org/jira/browse/YARN-3901
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Vrushali C
>            Assignee: Vrushali C
>         Attachments: YARN-3901-YARN-2928.1.patch, 
> YARN-3901-YARN-2928.2.patch, YARN-3901-YARN-2928.3.patch, 
> YARN-3901-YARN-2928.WIP.2.patch, YARN-3901-YARN-2928.WIP.patch
> As per the schema proposed in YARN-3815 in 
> https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf
> filing jira to track creation and population of data in the flow run table. 
> Some points that are being considered:
> - Stores per flow run information aggregated across applications, flow version
> - RM’s collector writes to it on app creation and app completion
> - Per App collector writes to it for metric updates at a slower frequency 
> than the metric updates to application table
> - Primary key: cluster ! user ! flow ! flow run id
> - Only the latest version of flow-level aggregated metrics will be kept, even 
> if the entity and application level keep a timeseries.
> - The running_apps column will be incremented on app creation, and 
> decremented on app completion.
> - For min_start_time the RM writer will simply write a value with the tag for 
> the applicationId. A coprocessor will return the min value of all written 
> values.
> - Upon flush and compactions, the min value between all the cells of this 
> column will be written to the cell without any tag (empty tag) and all the 
> other cells will be discarded.
> - Ditto for the max_end_time, but then the max will be kept.
> - Tags are represented as #type:value. The type can be not set (0), or can 
> indicate running (1) or complete (2). In those cases (for metrics) only 
> complete app metrics are collapsed on compaction.
> - The m! values are aggregated (summed) upon read. Only when applications are 
> completed (indicated by tag type 2) can the values be collapsed.
> - The application ids that have completed and been aggregated into the flow 
> numbers are retained in a separate column for historical tracking: we don’t 
> want to re-aggregate for those upon replay
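
To make the tag semantics in the description above concrete, here is a minimal 
Java sketch. It is an illustration only: FlowCellCollapseSketch, TaggedValue, 
TagType and all method names are hypothetical stand-ins, not the HBase Cell/Tag 
API and not classes from the patch. It also assumes that, for metrics, 
compaction folds the cells of completed apps into a single untagged cell, by 
analogy with the min_start_time bullet.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the tagged cells in one flow run column.
// None of these names come from the patch or from the HBase API.
class FlowCellCollapseSketch {

  // Tag types as listed in the issue: not set (0), running (1), complete (2).
  enum TagType { NOT_SET, RUNNING, COMPLETE }

  static final class TaggedValue {
    final TagType type;
    final String appId; // null for an untagged (already collapsed) cell
    final long value;

    TaggedValue(TagType type, String appId, long value) {
      this.type = type;
      this.appId = appId;
      this.value = value;
    }
  }

  // Read path for min_start_time: the minimum of all written values.
  static long minOnRead(List<TaggedValue> cells) {
    if (cells.isEmpty()) {
      throw new IllegalArgumentException("no cells to aggregate");
    }
    long min = Long.MAX_VALUE;
    for (TaggedValue c : cells) {
      min = Math.min(min, c.value);
    }
    return min;
  }

  // Flush/compaction for min_start_time: keep a single untagged cell that
  // holds the min; all other cells are discarded.
  static List<TaggedValue> compactMin(List<TaggedValue> cells) {
    List<TaggedValue> out = new ArrayList<>();
    out.add(new TaggedValue(TagType.NOT_SET, null, minOnRead(cells)));
    return out;
  }

  // Read path for an m! metric: values are summed across all cells.
  static long sumOnRead(List<TaggedValue> cells) {
    long sum = 0;
    for (TaggedValue c : cells) {
      sum += c.value;
    }
    return sum;
  }

  // Compaction for a metric. Assumption: cells of completed apps, plus any
  // earlier untagged cell, are folded into one untagged cell, while cells of
  // running apps are kept as they are.
  static List<TaggedValue> compactMetric(List<TaggedValue> cells) {
    List<TaggedValue> out = new ArrayList<>();
    long collapsed = 0;
    boolean anyCollapsed = false;
    for (TaggedValue c : cells) {
      if (c.type == TagType.RUNNING) {
        out.add(c);
      } else {
        collapsed += c.value;
        anyCollapsed = true;
      }
    }
    if (anyCollapsed) {
      out.add(0, new TaggedValue(TagType.NOT_SET, null, collapsed));
    }
    return out;
  }

  public static void main(String[] args) {
    List<TaggedValue> cells = new ArrayList<>();
    cells.add(new TaggedValue(TagType.COMPLETE, "app_1", 10L));
    cells.add(new TaggedValue(TagType.COMPLETE, "app_2", 5L));
    cells.add(new TaggedValue(TagType.RUNNING, "app_3", 7L));
    System.out.println("sum before compaction: " + sumOnRead(cells));
    System.out.println("sum after compaction:  "
        + sumOnRead(compactMetric(cells)));
  }
}
```

In this model the value seen on read is the same before and after compaction, 
and compacting an already compacted list does not count anything twice, which 
is the property the re-aggregation question in the review comment is about.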
