[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run table

Joep Rottinghuis (JIRA) Mon, 17 Aug 2015 14:23:10 -0700

    [ 
https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700255#comment-14700255
 ]


Joep Rottinghuis commented on YARN-3901:
----------------------------------------

After discussing with [~vrushalic] we concluded the following:
- Let's keep tag as an implementation detail in the coprocessor.
- Let's add a Map<String, byte[]> attributes argument to the store method of 
columns (and column prefixes) in order to pass values along.
- Columns themselves know how to add additional attributes, namely the 
operation if needed: MIN, MAX, AGG.
- Coprocessor will map these values to tags and store them.
- Given that pre-put is evaluated for multiple items in a batch, reading during 
pre-put will yield an incorrect result (even though it appears safe with a flush 
of the BufferedMutator). Therefore we need to switch to just adding a tag to a 
cell in pre-put and collapsing min and max during read (flush and compactions).
- Add an attribute Compact in order to indicate that an app is done (therefore 
separating whether a value can be aggregated or not). Write this only for the 
last write, so that we don't store tags for default/common values, which keeps 
storage smaller.
- We don't need TimelineWriterUtils.join
- We don't need TimelineWriterUtils.ONE_IN_BYTES
- Collapse the WIP storeWithTags into simply store.
- Coprocessor needs to detect if it is going from one column qualifier to the 
next. The peek method just ensures that the iteration stays within the row. 
We need to think through how to do that most cleanly, perhaps with peek only 
showing cells of the same column, based on an argument?
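
To make the attributes idea above concrete, here is a minimal sketch in plain 
Java with no HBase dependency. Everything in it is hypothetical and not taken 
from the patch: the AggregationOperation enum, the AttributeSketch class, the 
"operation" attribute key, and the choice to treat AGG as a sum. It only shows 
a column adding its operation to the attributes map, and the read side 
collapsing the values of one column qualifier instead of reading in pre-put.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical names throughout; this is a sketch, not the actual patch.
enum AggregationOperation { MIN, MAX, AGG }

class AttributeSketch {
  // Assumed attribute key under which a column records its operation.
  static final String OPERATION_KEY = "operation";

  // What a column might do before storing: copy the attributes passed by
  // the caller and add its own operation to them.
  static Map<String, byte[]> withOperation(Map<String, byte[]> attributes,
      AggregationOperation op) {
    Map<String, byte[]> combined = new HashMap<>();
    if (attributes != null) {
      combined.putAll(attributes);
    }
    combined.put(OPERATION_KEY, op.name().getBytes(StandardCharsets.UTF_8));
    return combined;
  }

  // What the read side might do: collapse all values seen for one column
  // qualifier according to the operation. AGG as a sum is an assumption.
  static long collapse(List<Long> values, AggregationOperation op) {
    if (values.isEmpty()) {
      throw new IllegalArgumentException("no cells to collapse");
    }
    long result = values.get(0);
    for (int i = 1; i < values.size(); i++) {
      long v = values.get(i);
      switch (op) {
        case MIN: result = Math.min(result, v); break;
        case MAX: result = Math.max(result, v); break;
        default:  result += v; break; // AGG
      }
    }
    return result;
  }
}
```

In the real coprocessor the attribute would be mapped to a cell tag rather 
than kept as a map entry; the map here only stands in for that hand-off.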

> Populate flow run data in the flow_run table
> --------------------------------------------
>
>                 Key: YARN-3901
>                 URL: https://issues.apache.org/jira/browse/YARN-3901
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Vrushali C
>            Assignee: Vrushali C
>         Attachments: YARN-3901-YARN-2928.WIP.patch
>
>
> As per the schema proposed in YARN-3815 in 
> https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf
> filing this jira to track creation and population of data in the flow run table. 
> Some points that are being considered:
> - Stores per flow run information aggregated across applications, flow version.
> - RM’s collector writes to it on app creation and app completion.
> - Per App collector writes to it for metric updates at a slower frequency 
> than the metric updates to the application table.
> - Primary key: cluster ! user ! flow ! flow run id
> - Only the latest version of flow-level aggregated metrics will be kept, even 
> if the entity and application level keep a timeseries.
> - The running_apps column will be incremented on app creation, and 
> decremented on app completion.
> - For min_start_time the RM writer will simply write a value with the tag for 
> the applicationId. A coprocessor will return the min value of all written 
> values.
> - Upon flush and compactions, the min value between all the cells of this 
> column will be written to the cell without any tag (empty tag) and all the 
> other cells will be discarded.
> - Ditto for the max_end_time, but then the max will be kept.
> - Tags are represented as #type:value. The type can be not set (0), or can 
> indicate running (1) or complete (2). In those cases (for metrics) only 
> complete app metrics are collapsed on compaction.
> - The m! values are aggregated (summed) upon read. Only when applications are 
> completed (indicated by tag type 2) can the values be collapsed.
> - The application ids that have completed and been aggregated into the flow 
> numbers are retained in a separate column for historical tracking: we don’t 
> want to re-aggregate for those upon replay
> 
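
The tag handling in the description can be sketched as follows. This is an 
illustration under stated assumptions, not the coprocessor's code: TaggedCell 
and MetricCompaction are hypothetical classes, the tag value is assumed to be 
an application id, the ids used are made up, and an untagged cell is assumed to 
hold a previously collapsed total. It shows metric values summed on read, and 
only non-running cells folded into a single untagged cell on compaction.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in for a cell carrying a "#type:value" tag.
class TaggedCell {
  static final int NOT_SET = 0, RUNNING = 1, COMPLETE = 2;

  final int type;
  final String tagValue;
  final long metric;

  TaggedCell(String tag, long metric) {
    if (tag.isEmpty()) {
      // Empty tag: treated here as an already collapsed value.
      this.type = NOT_SET;
      this.tagValue = "";
    } else {
      int colon = tag.indexOf(':');
      if (!tag.startsWith("#") || colon < 0) {
        throw new IllegalArgumentException("expected #type:value, got " + tag);
      }
      this.type = Integer.parseInt(tag.substring(1, colon));
      this.tagValue = tag.substring(colon + 1);
    }
    this.metric = metric;
  }
}

class MetricCompaction {
  // Read path: every cell counts, whether running, complete, or collapsed.
  static long sumOnRead(List<TaggedCell> cells) {
    long sum = 0;
    for (TaggedCell c : cells) {
      sum += c.metric;
    }
    return sum;
  }

  // Compaction path: cells of running apps are kept as they are; all other
  // cells are folded into one untagged cell placed first in the result.
  static List<TaggedCell> compact(List<TaggedCell> cells) {
    List<TaggedCell> kept = new ArrayList<>();
    long collapsed = 0;
    boolean anyCollapsed = false;
    for (TaggedCell c : cells) {
      if (c.type == TaggedCell.RUNNING) {
        kept.add(c);
      } else {
        collapsed += c.metric;
        anyCollapsed = true;
      }
    }
    if (anyCollapsed) {
      kept.add(0, new TaggedCell("", collapsed));
    }
    return kept;
  }
}
```

The property worth checking in such a sketch is that sumOnRead gives the same 
answer before and after compact, so compaction never changes what a reader sees.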



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
