[ https://issues.apache.org/jira/browse/YARN-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390900#comment-14390900 ]
Sangjin Lee commented on YARN-3391:
-----------------------------------
Hi [~djp],
The flow id identifies a distinct flow application that can be run repeatedly
over time. The flow run id identifies one instance (or specific execution) of
that flow. Finally, the flow version keeps track of the changes made to the
flow (e.g. changes to the source code).
Let me give you a concrete example. Suppose you have a pig script you run
repeatedly, named "tracking.pig". The flow id in this case may be
"tracking.pig" (or "[email protected]" to denote the fact that user "alice"
runs this script).
The "tracking.pig" script will be run many times. If I run it today,
that specific run may have the flow run id of "1427846400" (timestamp when the
pig script started). If I run it again tomorrow, the run id of that run would
be "1427932800", and so on. Multiple run ids for the same flow id form a series
of runs of the same script.
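To make the run id arithmetic concrete, here is a rough sketch in Python. It is not taken from the patch; the helper name flow_run_id is hypothetical, and it assumes the run id is simply the start time in whole seconds since the Unix epoch, in UTC.

```python
from datetime import datetime, timezone


def flow_run_id(started_at: datetime) -> int:
    """Hypothetical helper: use the start time (epoch seconds) as the run id."""
    return int(started_at.timestamp())


# Two runs of the same flow, started one day apart (midnight UTC).
today = flow_run_id(datetime(2015, 4, 1, tzinfo=timezone.utc))
tomorrow = flow_run_id(datetime(2015, 4, 2, tzinfo=timezone.utc))

print(today, tomorrow)  # 1427846400 1427932800
```

Both runs share the flow id "tracking.pig"; only the run id differs.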
The flow version identifies changes made to the flow (user application). One
scheme may be to use some kind of a hash of the pig script. Another scheme may
be to use the git commit hash. Or, if the user application has well-defined
versions, those real version numbers can be used.
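The first scheme (a hash of the script) could look something like the sketch below. This is only an illustration: the function name flow_version and the choice of SHA-256 are my assumptions, not something the design prescribes.

```python
import hashlib


def flow_version(script_text: str) -> str:
    """Hypothetical helper: derive a flow version from the script contents."""
    return hashlib.sha256(script_text.encode("utf-8")).hexdigest()


# Made-up script contents, used only to show that an edit changes the version.
v1 = flow_version("A = LOAD 'events'; DUMP A;")
v2 = flow_version("A = LOAD 'events'; B = LIMIT A 10; DUMP B;")

print(v1 == v2)  # False: the script changed, so the version changes
```

Runs of an unchanged script keep the same flow version, so the version changes only when the flow itself changes.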
A flow run is *NOT* a subset of YARN apps run inside a flow. A flow is a
template of runs, if you will, and a flow run is an actual run instance of that
flow. These are described in some detail in the original design doc in
YARN-2928.
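One way to picture the relationship is the sketch below. The FlowRun class and its field names are hypothetical and are not the API or storage schema under discussion, and the version strings are made up. Each record is one run; the flow is just the key that the runs share.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowRun:
    """Hypothetical record for one execution of a flow."""
    flow_id: str       # which flow this is a run of, e.g. "tracking.pig"
    flow_run_id: int   # identifies this specific execution
    flow_version: str  # which revision of the flow was executed


runs = [
    FlowRun("tracking.pig", 1427846400, "v1"),
    FlowRun("tracking.pig", 1427932800, "v1"),
]

# Group run ids under the flow they are instances of.
runs_by_flow = defaultdict(list)
for run in runs:
    runs_by_flow[run.flow_id].append(run.flow_run_id)

print(dict(runs_by_flow))  # {'tracking.pig': [1427846400, 1427932800]}
```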
I hope this helps.
> Clearly define flow ID/ flow run / flow version in API and storage
> ------------------------------------------------------------------
>
> Key: YARN-3391
> URL: https://issues.apache.org/jira/browse/YARN-3391
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: timelineserver
> Reporter: Zhijie Shen
> Assignee: Zhijie Shen
> Attachments: YARN-3391.1.patch
>
>
> To continue the discussion in YARN-3040, let's figure out the best way to
> describe the flow.
> Some key issues that we need to conclude on:
> - How do we include the flow version in the context so that it gets passed
> into the collector and to the storage eventually?
> - Flow run id should be a number as opposed to a generic string?
> - Default behavior for the flow run id if it is missing (i.e. client did not
> set it)
> - How do we handle flow attributes in case of nested levels of flows?