[ 
https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14729511#comment-14729511
 ] 

Vrushali C commented on YARN-3901:
----------------------------------


Thanks [~gtCarrera9] for the review. Let me try to give some explanations to 
your questions above. 

bq. Name of Attribute seems to be quite general. Maybe we want something more 
specific? From my understanding, Attribute acts as the "command" (as the 
meaning in design pattern) of the aggregation?

Yes, attribute indicates what action needs to be taken in the aggregation 
step/reading step. What name do you recommend? 

bq. TimelineSchemaCreator May conflict with YARN-4102. I'm fine with either 
order to put them in.
Yes, I thought about that, but the patch in YARN-4102 is not committed yet, so 
could not rebase. I am good with rebasing the YARN-3901 patch if YARN-4102 gets 
in.

bq. Are we assuming there will be at most two attributes for each column 
prefix? In FlowScanner we're only dealing with two attributes, one from 
compaction one from operations. But in FlowActivityColumnPrefix we're assuming 
there's a list of attributes?
No, there can be any number of attributes for a column prefix. Currently MIN, 
MAX and SUM happen to be exclusive in the sense, if you want a min for start 
time, it's unlikely that you want to be SUMing  up the start times. FlowScanner 
looks for application id from the aggregation compaction dimensions and for the 
MIN/MAX/SUM from the aggregation operations. This FlowScanner class is very 
different from FlowActivityColumnPrefix. The FlowActivityColumnPrefix or any 
class that generates a Put will deal with a list of attributes. 

bq. What is our plan on FlowActivityColumnPrefix#IN_PROGRESS_TIME? 
Yes, this is timestamp that needs to be put into the flow activity table for 
all (long) running applications. If an application in a flow starts on say Day1 
and runs through day 2, day 3 and ends on day4, then the flow activity table 
needs to have an entry for this flow for day2 and day3. This is the in progress 
time of that application, it is the TBD part being thought over in YARN-4069. 
We need to think if we want the RM to write it, or the App master or something 
else offline. 

bq. In FlowScanner, after aggregation (in nextInternal) we're simply adding 
aggregated data as a Cell. However I haven't found where we're guaranteeing the 
new node is not aggregate again (and we create another new cell for the 
aggregation result). Are we doing this deliberately or I'm missing anything 
here?
Hmm, not sure I got the question but let me try to explain what the FlowScanner 
should be doing. It will read each cell one by one. Say for start time column, 
it reads the cells. Now for a flow, we want that value which is the lowest for 
the start time of the flow. Hence these cells have a tag of MIN. So, the 
nextInternal will return one cell with the min value for the column start time. 
Similarly for max and for SUM, it sums up the cell values. Hope this helps.

Also, I will double check the formatting related comments and update as 
necessary. Appreciate the review! 



> Populate flow run data in the flow_run & flow activity tables
> -------------------------------------------------------------
>
>                 Key: YARN-3901
>                 URL: https://issues.apache.org/jira/browse/YARN-3901
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Vrushali C
>            Assignee: Vrushali C
>         Attachments: YARN-3901-YARN-2928.1.patch, 
> YARN-3901-YARN-2928.2.patch, YARN-3901-YARN-2928.3.patch, 
> YARN-3901-YARN-2928.WIP.2.patch, YARN-3901-YARN-2928.WIP.patch
>
>
> As per the schema proposed in YARN-3815 in 
> https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf
> filing jira to track creation and population of data in the flow run table. 
> Some points that are being  considered:
> - Stores per flow run information aggregated across applications, flow version
> RM’s collector writes to on app creation and app completion
> - Per App collector writes to it for metric updates at a slower frequency 
> than the metric updates to application table
> primary key: cluster ! user ! flow ! flow run id
> - Only the latest version of flow-level aggregated metrics will be kept, even 
> if the entity and application level keep a timeseries.
> - The running_apps column will be incremented on app creation, and 
> decremented on app completion.
> - For min_start_time the RM writer will simply write a value with the tag for 
> the applicationId. A coprocessor will return the min value of all written 
> values. - 
> - Upon flush and compactions, the min value between all the cells of this 
> column will be written to the cell without any tag (empty tag) and all the 
> other cells will be discarded.
> - Ditto for the max_end_time, but then the max will be kept.
> - Tags are represented as #type:value. The type can be not set (0), or can 
> indicate running (1) or complete (2). In those cases (for metrics) only 
> complete app metrics are collapsed on compaction.
> - The m! values are aggregated (summed) upon read. Only when applications are 
> completed (indicated by tag type 2) can the values be collapsed.
> - The application ids that have completed and been aggregated into the flow 
> numbers are retained in a separate column for historical tracking: we don’t 
> want to re-aggregate for those upon replay
> 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to