[jira] [Commented] (YARN-3391) Clearly define flow ID/ flow run / flow version in API and storage

Sangjin Lee (JIRA) Wed, 01 Apr 2015 09:12:17 -0700

    [ 
https://issues.apache.org/jira/browse/YARN-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390900#comment-14390900
 ]


Sangjin Lee commented on YARN-3391:
-----------------------------------

Hi [~djp],

The flow id identifies a distinct flow application that can be run repeatedly 
over time. The flow run id identifies one instance (or specific execution) of 
that flow. Finally, the flow version keeps track of the changes made to the 
flow (e.g. changes to the source code).

Let me give you a concrete example. Suppose you have a pig script you run 
repeatedly, named "tracking.pig". The flow id in this case may be 
"tracking.pig" (or a user-qualified form of that name, to denote the fact that 
user "alice" runs this script).

The "tracking.pig" script will be run many times. If I run it today, that 
specific run may have the flow run id of "1427846400" (the timestamp when the 
pig script started). If I run it again tomorrow, the run id of that run would 
be "1427932800", and so on. Multiple run ids for the same flow id represent a 
series of runs of the same script.
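
To make the relationship concrete, here is a minimal sketch in Python. The 
FlowRun class and start_run function are hypothetical names for illustration 
only, not part of any YARN API, and the sketch assumes the run id is the 
start time in epoch seconds (UTC) stored as a number, which is one of the 
points still open on this issue.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class FlowRun:
    """One execution of a flow (hypothetical type, for illustration only)."""
    flow_id: str      # identifies the flow, e.g. the script name
    flow_run_id: int  # identifies this run; assumed here to be epoch seconds


def start_run(flow_id: str, started_at: datetime) -> FlowRun:
    # Assumption: the run id is derived from the start timestamp of the run.
    return FlowRun(flow_id, int(started_at.timestamp()))


# Two runs of the same flow on consecutive days share one flow id
# and differ only in their run ids.
today = start_run("tracking.pig", datetime(2015, 4, 1, tzinfo=timezone.utc))
tomorrow = start_run("tracking.pig", datetime(2015, 4, 2, tzinfo=timezone.utc))
```

Here today.flow_run_id is 1427846400 and tomorrow.flow_run_id is 1427932800, 
while both runs carry the flow id "tracking.pig".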

The flow version identifies changes made to the flow (user application). One 
scheme may be to use some kind of a hash of the pig script. Another scheme may 
be to use the git commit hash, or real version numbers if the user application 
has well-defined versions.
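
As a rough illustration of the first scheme, the sketch below derives a flow 
version from the contents of the script. The function name, the choice of 
SHA-256, the 12-character truncation, and the sample script text are all 
assumptions made for the example; nothing here is prescribed by YARN.

```python
import hashlib


def flow_version_from_script(script_text: str) -> str:
    """Return a short content hash to use as the flow version (hypothetical)."""
    digest = hashlib.sha256(script_text.encode("utf-8")).hexdigest()
    # Truncated for readability; the length is an arbitrary choice.
    return digest[:12]


# Made-up script contents, before and after an edit.
v1 = flow_version_from_script("A = LOAD 'events';\nDUMP A;\n")
v2 = flow_version_from_script("A = LOAD 'events';\nB = LIMIT A 10;\nDUMP B;\n")
```

Unchanged script text always maps to the same version, so repeated runs of an 
unmodified script share a flow version, while an edit to the script produces 
a different one.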

A flow run is *NOT* a subset of YARN apps run inside a flow. A flow is a 
template of runs, if you will, and a flow run is an actual run instance of that 
flow. These are described in some detail in the original design doc in 
YARN-2928.

I hope this helps.

> Clearly define flow ID/ flow run / flow version in API and storage
> ------------------------------------------------------------------
>
>                 Key: YARN-3391
>                 URL: https://issues.apache.org/jira/browse/YARN-3391
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Zhijie Shen
>            Assignee: Zhijie Shen
>         Attachments: YARN-3391.1.patch
>
>
> To continue the discussion in YARN-3040, let's figure out the best way to 
> describe the flow.
> Some key issues that we need to conclude on:
> - How do we include the flow version in the context so that it gets passed 
> into the collector and to the storage eventually?
> - Flow run id should be a number as opposed to a generic string?
> - Default behavior for the flow run id if it is missing (i.e. client did not 
> set it)
> - How do we handle flow attributes in case of nested levels of flows?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

