[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables

Joep Rottinghuis (JIRA) Mon, 14 Sep 2015 14:37:08 -0700

    [ 
https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14744320#comment-14744320
 ]


Joep Rottinghuis commented on YARN-3901:
----------------------------------------

[~sjlee0] 
bq.  I remember Joep Rottinghuis mentioning that an unset timestamp is 
equivalent to Cell.getTimestamp() returning Long.MAX_VALUE. Joep Rottinghuis?

Yes. The org.apache.hadoop.hbase.client.Put#addColumn method without a 
timestamp argument uses the timestamp from the Put itself. A Put can be 
constructed with or without a timestamp. When none is given, it uses 
HConstants.LATEST_TIMESTAMP, which is defined as:
{code}
  /**
   * Timestamp to use when we want to refer to the latest cell.
   * This is the timestamp sent by clients when no timestamp is specified on
   * commit.
   */
  public static final long LATEST_TIMESTAMP = Long.MAX_VALUE;
{code}

That is then used (indirectly through LATEST_TIMESTAMP_BYTES) in the KeyValue 
class in the #isLatestTimestamp method, which in turn is used in the 
KeyValue#updateLatestStamp that sets it to "now" on the server side.
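
To make that sequence concrete, here is a minimal, self-contained sketch of the 
placeholder logic. It is not the HBase source: LatestTimestampSketch and 
SketchCell are hypothetical names, and the timestamp is modelled as a plain 
long rather than the byte comparison against LATEST_TIMESTAMP_BYTES that 
KeyValue performs. It only illustrates that a cell written without an explicit 
timestamp carries Long.MAX_VALUE until something stamps it with "now".

```java
public class LatestTimestampSketch {
    // Same value as HConstants.LATEST_TIMESTAMP in the snippet above.
    static final long LATEST_TIMESTAMP = Long.MAX_VALUE;

    // Hypothetical stand-in for a cell; models only the timestamp field.
    static final class SketchCell {
        private long timestamp;

        // No timestamp given: fall back to the placeholder.
        SketchCell() {
            this(LATEST_TIMESTAMP);
        }

        // Explicit timestamp supplied by the client.
        SketchCell(long timestamp) {
            this.timestamp = timestamp;
        }

        long getTimestamp() {
            return timestamp;
        }

        // True while the cell still carries the placeholder value.
        boolean isLatestTimestamp() {
            return timestamp == LATEST_TIMESTAMP;
        }

        // Replaces the placeholder with "now"; explicit timestamps are left alone.
        // Returns true only if the timestamp was actually changed.
        boolean updateLatestStamp(long now) {
            if (isLatestTimestamp()) {
                timestamp = now;
                return true;
            }
            return false;
        }
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();

        SketchCell unset = new SketchCell();
        System.out.println("before stamping, placeholder: " + unset.isLatestTimestamp());
        unset.updateLatestStamp(now);
        System.out.println("after stamping, placeholder: " + unset.isLatestTimestamp());

        SketchCell explicit = new SketchCell(42L);
        System.out.println("explicit timestamp changed: " + explicit.updateLatestStamp(now));
    }
}
```

The sketch says nothing about when HBase applies the stamp relative to the 
coprocessor hooks, which is the part that still needs testing.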

I'm not 100% sure (we need to test this), but I assume that the 
LATEST_TIMESTAMP placeholder is replaced either after the coprocessors run or 
never at all. In the latter case the cells might just be written with the 
latest timestamp, and one might not be able to ask what the row looked like at 
any particular time. I thought the timestamp might be overwritten on the 
server side, but I can't find that code now.

> Populate flow run data in the flow_run & flow activity tables
> -------------------------------------------------------------
>
>                 Key: YARN-3901
>                 URL: https://issues.apache.org/jira/browse/YARN-3901
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Vrushali C
>            Assignee: Vrushali C
>         Attachments: YARN-3901-YARN-2928.1.patch, 
> YARN-3901-YARN-2928.2.patch, YARN-3901-YARN-2928.3.patch, 
> YARN-3901-YARN-2928.4.patch, YARN-3901-YARN-2928.5.patch, 
> YARN-3901-YARN-2928.6.patch, YARN-3901-YARN-2928.7.patch
>
>
> As per the schema proposed in YARN-3815 in 
> https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf
> filing jira to track creation and population of data in the flow run table. 
> Some points that are being considered:
> - Stores per flow run information aggregated across applications, flow version
> - RM’s collector writes to it on app creation and app completion
> - Per App collector writes to it for metric updates at a slower frequency 
> than the metric updates to application table
> - primary key: cluster ! user ! flow ! flow run id
> - Only the latest version of flow-level aggregated metrics will be kept, even 
> if the entity and application level keep a timeseries.
> - The running_apps column will be incremented on app creation, and 
> decremented on app completion.
> - For min_start_time the RM writer will simply write a value with the tag for 
> the applicationId. A coprocessor will return the min value of all written 
> values.
> - Upon flush and compactions, the min value between all the cells of this 
> column will be written to the cell without any tag (empty tag) and all the 
> other cells will be discarded.
> - Ditto for the max_end_time, but then the max will be kept.
> - Tags are represented as #type:value. The type can be not set (0), or can 
> indicate running (1) or complete (2). In those cases (for metrics) only 
> complete app metrics are collapsed on compaction.
> - The m! values are aggregated (summed) upon read. Only when applications are 
> completed (indicated by tag type 2) can the values be collapsed.
> - The application ids that have completed and been aggregated into the flow 
> numbers are retained in a separate column for historical tracking: we don’t 
> want to re-aggregate for those upon replay
> 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
