[
https://issues.apache.org/jira/browse/YARN-3040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372412#comment-14372412
]
Sangjin Lee commented on YARN-3040:
-----------------------------------
[~zjshen], thanks for your updated patch and prompt answers! I'll go over the
new patch in some more detail, and get back to you. I haven't looked at the
patch just yet, and therefore I might be saying something dumb, but I thought
I'd reply to some of your points. Hopefully this will move things forward.
bq. RM will have all the above context info. When constructing and starting RM
collector, we should make sure it is set up.
Since RM's collector will handle multiple applications, there is no one-to-one
relationship between flow/flow-run/app and an instance of the RM collector. RM
will just have to retain that information in memory for multiple apps, and pass
that along on a per-call basis to the storage.
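To make the per-call idea concrete, here is a minimal Java sketch. It assumes a
collector shared by many apps that keeps each app's context in memory and hands
it to the storage layer on every write. All names (AppContext,
ContextAwareWriter, SharedCollector) are hypothetical and not taken from the
patch.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical holder for one app's context; fields are illustrative only.
final class AppContext {
    final String clusterId;
    final String userId;
    final String flowId;
    final String flowVersion;
    final long flowRunId;

    AppContext(String clusterId, String userId, String flowId,
               String flowVersion, long flowRunId) {
        this.clusterId = clusterId;
        this.userId = userId;
        this.flowId = flowId;
        this.flowVersion = flowVersion;
        this.flowRunId = flowRunId;
    }
}

// Hypothetical storage interface: the context travels with every call.
interface ContextAwareWriter {
    void write(AppContext context, String appId, String entity);
}

// One collector instance serving many apps, so context is looked up per app.
final class SharedCollector {
    private final Map<String, AppContext> contexts = new ConcurrentHashMap<>();
    private final ContextAwareWriter writer;

    SharedCollector(ContextAwareWriter writer) {
        this.writer = writer;
    }

    void registerApp(String appId, AppContext context) {
        contexts.put(appId, context);
    }

    void unregisterApp(String appId) {
        contexts.remove(appId);
    }

    void putEntity(String appId, String entity) {
        AppContext context = contexts.get(appId);
        if (context == null) {
            throw new IllegalStateException("no context registered for " + appId);
        }
        // The context is passed along on each call rather than bound to the collector.
        writer.write(context, appId, entity);
    }
}
```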
bq. Personally, I prefer to use ID uniformly across all the context
properties. ID indicates it can be used to identify a flow.
I'm OK with "flow id" if it increases consistency.
bq. I thought version is part of flow id. I think we can revisit it once the
schema is done, and we finalized the generic description about the flow
structure and the notation. So far I'd like to keep it as what it is now.
Thoughts?
Hmm, I didn't think of the version as part of the flow id. Here we're thinking
a bit ahead to the storage and query aspects, but it's perfectly feasible to
ask questions like "give me the latest 10 runs of the flow named 'foo.pig'".
Note that those latest 10 runs can have different versions. This implies there
needs to be a semantic differentiation between the flow id (name) and the flow
version. Namely, in this query the flow version is *not* used to retrieve the
last 10 runs. So I would advocate having a field/attribute named "flow
version" that is separate from "flow id".
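A rough Java sketch of that query, assuming an in-memory list of runs and
assuming run ids increase over time (for example, timestamps). FlowRun and
FlowRunQueries are hypothetical names used only for illustration. The flow id
selects the runs, and the version is just an attribute that comes back with
each run.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical record of one flow run; field names are illustrative only.
final class FlowRun {
    final String flowId;      // flow name, e.g. "foo.pig"
    final String flowVersion; // stored alongside the id, not part of it
    final long runId;         // assumed to grow over time

    FlowRun(String flowId, String flowVersion, long runId) {
        this.flowId = flowId;
        this.flowVersion = flowVersion;
        this.runId = runId;
    }
}

final class FlowRunQueries {
    private FlowRunQueries() {}

    // Returns up to n runs of the given flow, newest first, across all versions.
    static List<FlowRun> latestRuns(List<FlowRun> runs, String flowId, int n) {
        List<FlowRun> matches = new ArrayList<>();
        for (FlowRun run : runs) {
            if (run.flowId.equals(flowId)) { // the version is deliberately ignored
                matches.add(run);
            }
        }
        Collections.sort(matches, new Comparator<FlowRun>() {
            @Override
            public int compare(FlowRun a, FlowRun b) {
                return Long.compare(b.runId, a.runId); // descending by run id
            }
        });
        int limit = Math.max(0, Math.min(n, matches.size()));
        return new ArrayList<>(matches.subList(0, limit));
    }
}
```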
As for the run id being numeric, as Li alluded to, there is a significant
advantage in having run ids as numbers (longs really), as it lends itself to
super-easy sorting. It's a little bit of a storage concern leaking into the
higher level abstraction, but it's a strong reason to qualify it as a number
IMO.
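A small Java sketch of the sorting difference. RunIdOrdering is a hypothetical
helper, and how the storage layer actually encodes keys is outside its scope.
It only contrasts sorting the same ids as longs with sorting them as text.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class RunIdOrdering {
    private RunIdOrdering() {}

    // Sorts run ids by numeric value.
    static List<Long> sortNumeric(List<Long> runIds) {
        List<Long> sorted = new ArrayList<>(runIds);
        Collections.sort(sorted);
        return sorted;
    }

    // Sorts the same ids after converting them to strings (lexicographic order).
    static List<String> sortAsText(List<Long> runIds) {
        List<String> sorted = new ArrayList<>();
        for (Long id : runIds) {
            sorted.add(String.valueOf(id));
        }
        Collections.sort(sorted);
        return sorted;
    }
}
```

For the ids 9, 10 and 100, sortNumeric yields 9, 10, 100, while sortAsText
yields "10", "100", "9", which no longer matches the order of the runs.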
bq. It makes sense, but when RM restarts we use the new start time of RM to
identify the app instead of the one before. In the current approach,
cluster_xyz will contain application_xyz_123. This was my rationale before.
And this default cluster id construction is only used in the case where the
user didn't specify the cluster id in the config file. In production, users
should specify one. I'll think about the question again.
I'm still not sure why it would make sense to have different logical cluster
ids every time the RM/cluster restarts. Logically, a single cluster should be
identified by a long-lived name. For example, UIs will be built on questions
like "give me top 10 flows on cluster ABC". Queries like that surely wouldn't
care about cluster restarts.
As for the default value, in fact I would imagine most use cases would not set
the cluster id (just assuming the cluster default would be filled in). That
would be the norm, not the exception.
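A minimal Java sketch of a restart-stable default, assuming configuration is
available as a simple map. The key "example.timeline.cluster-id" and the value
"default_cluster" are made up for illustration and are not real YARN settings.
The point is only that the fallback is a fixed name and does not depend on the
RM start time.

```java
import java.util.Map;

final class ClusterIds {
    // Hypothetical config key and default; not actual YARN property names.
    static final String CLUSTER_ID_KEY = "example.timeline.cluster-id";
    static final String DEFAULT_CLUSTER_ID = "default_cluster";

    private ClusterIds() {}

    // Returns the configured cluster id, or a fixed default that survives restarts.
    static String resolve(Map<String, String> conf) {
        String configured = conf.get(CLUSTER_ID_KEY);
        if (configured == null || configured.trim().isEmpty()) {
            return DEFAULT_CLUSTER_ID;
        }
        return configured.trim();
    }
}
```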
Hope these help...
> [Data Model] Make putEntities operation be aware of the app's context
> ---------------------------------------------------------------------
>
> Key: YARN-3040
> URL: https://issues.apache.org/jira/browse/YARN-3040
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: timelineserver
> Reporter: Sangjin Lee
> Assignee: Zhijie Shen
> Attachments: YARN-3040.1.patch, YARN-3040.2.patch
>
>
> Per design in YARN-2928, implement client-side API for handling *flows*.
> Frameworks should be able to define and pass in all attributes of flows and
> flow runs to YARN, and they should be passed into ATS writers.
> YARN tags were discussed as a way to handle this piece of information.