[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables

Sangjin Lee (JIRA) Tue, 08 Sep 2015 12:44:36 -0700

    [ https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735479#comment-14735479 ]


Sangjin Lee commented on YARN-3901:
-----------------------------------

I think something like the following would work:

{code}
long currentMinValue = ((Number) GenericObjectMapper.read(CellUtil
    .cloneValue(currentMinCell))).longValue();
long currentCellValue = ((Number) GenericObjectMapper.read(CellUtil
    .cloneValue(cell))).longValue();
{code}
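
To illustrate why going through {{Number}} may be the safer route, here is a
minimal self-contained sketch. It assumes the deserialized value can come back
as either an Integer or a Long depending on its magnitude; the class and helper
names are hypothetical, and plain boxed objects stand in for whatever
{{GenericObjectMapper.read}} actually returns.

```java
public class NumberCastSketch {
  // Hypothetical helper: widens any boxed numeric value to a long.
  public static long toLong(Object deserialized) {
    return ((Number) deserialized).longValue();
  }

  // Picks the smaller of two deserialized values, as a min comparison would.
  public static long min(Object a, Object b) {
    return Math.min(toLong(a), toLong(b));
  }

  public static void main(String[] args) {
    // Stand-ins for values read back from cells (assumed types, not real cells).
    Object small = Integer.valueOf(42);
    Object large = Long.valueOf(1441741476000L);

    System.out.println(min(small, large));

    try {
      long broken = (Long) small; // a direct cast fails when the value is an Integer
      System.out.println(broken);
    } catch (ClassCastException e) {
      System.out.println("direct (Long) cast failed");
    }
  }
}
```

With the {{Number}} route the comparison does not depend on which boxed type
happened to be produced for a given value.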

bq. I am thinking I will need this when the flush/compaction scanner is added 
in. If you'd like, I can move it in as a non-public class for now and then move 
it out if needed.

+1.

bq. I actually needed this in the unit test while checking the 
FlowActivityTable contents, if you want I can take it out and you can add that 
test case in when you add in the RowKey changes?

If it is to help your unit test, it's fine to include it here (as long as it's 
identical to what we have in YARN-4074; that would help my rebasing).

bq. Yeah, I was thinking about that too. Right now, metrics will get their own 
timestamps. For other columns, we'd be using the nanoseconds. I am trying to 
see if we can just use milliseconds.

We do need the timestamps that are generated here to be on the nanosecond 
scale, since they are multiplied by a factor of 1 million in 
{{TimestampGenerator}}. They cannot be converted to milliseconds, or it would 
defeat the purpose of using {{TimestampGenerator}}. The comment was about the 
concern of always being able to distinguish these two types of "timestamps" 
without confusion.
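
As a rough sketch of the idea (not the actual {{TimestampGenerator}} from the
patch), the following shows how scaling milliseconds by 1 million leaves room
to hand out distinct values within the same millisecond. The class name, method
names, and the assumption that uniqueness is the goal are mine, for
illustration only.

```java
import java.util.concurrent.atomic.AtomicLong;

public class UniqueTimestampSketch {
  // Factor of 1 million mentioned above; the constant name is illustrative.
  public static final long MULTIPLIER = 1_000_000L;

  private final AtomicLong last = new AtomicLong();

  // Returns a strictly increasing value on the scaled (millis * 1 million) range.
  public long next(long currentMillis) {
    long prev;
    long candidate;
    do {
      prev = last.get();
      candidate = Math.max(prev + 1, currentMillis * MULTIPLIER);
    } while (!last.compareAndSet(prev, candidate));
    return candidate;
  }

  // Recovers the millisecond part; the low-order digits only disambiguate writes.
  public static long toMillis(long scaled) {
    return scaled / MULTIPLIER;
  }

  public static void main(String[] args) {
    UniqueTimestampSketch gen = new UniqueTimestampSketch();
    long now = System.currentTimeMillis();
    long a = gen.next(now);
    long b = gen.next(now); // same millisecond, still a distinct value
    System.out.println(a != b);
    System.out.println(toMillis(a) == toMillis(b));
  }
}
```

Truncating such values to milliseconds before writing would collapse {{a}} and
{{b}} into the same timestamp, which is why the two kinds of "timestamps" need
to stay distinguishable.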

> Populate flow run data in the flow_run & flow activity tables
> -------------------------------------------------------------
>
>                 Key: YARN-3901
>                 URL: https://issues.apache.org/jira/browse/YARN-3901
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Vrushali C
>            Assignee: Vrushali C
>         Attachments: YARN-3901-YARN-2928.1.patch, 
> YARN-3901-YARN-2928.2.patch, YARN-3901-YARN-2928.3.patch, 
> YARN-3901-YARN-2928.4.patch
>
>
> As per the schema proposed in YARN-3815 in 
> https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf
> filing jira to track creation and population of data in the flow run table. 
> Some points that are being considered:
> - Stores per flow run information aggregated across applications, flow version
> - RM’s collector writes to it on app creation and app completion
> - Per App collector writes to it for metric updates at a slower frequency 
> than the metric updates to application table
> - primary key: cluster ! user ! flow ! flow run id
> - Only the latest version of flow-level aggregated metrics will be kept, even 
> if the entity and application level keep a timeseries.
> - The running_apps column will be incremented on app creation, and 
> decremented on app completion.
> - For min_start_time the RM writer will simply write a value with the tag for 
> the applicationId. A coprocessor will return the min value of all written 
> values.
> - Upon flush and compactions, the min value between all the cells of this 
> column will be written to the cell without any tag (empty tag) and all the 
> other cells will be discarded.
> - Ditto for the max_end_time, but then the max will be kept.
> - Tags are represented as #type:value. The type can be not set (0), or can 
> indicate running (1) or complete (2). In those cases (for metrics) only 
> complete app metrics are collapsed on compaction.
> - The m! values are aggregated (summed) upon read. Only when applications are 
> completed (indicated by tag type 2) can the values be collapsed.
> - The application ids that have completed and been aggregated into the flow 
> numbers are retained in a separate column for historical tracking: we don’t 
> want to re-aggregate for those upon replay
> 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
