Sangjin Lee commented on YARN-3040:

[~zjshen], thanks for your updated patch and prompt answers! I'll go over the 
new patch in some more detail, and get back to you. I haven't looked at the 
patch just yet, and therefore I might be saying something dumb, but I thought 
I'd reply to some of your points. Hopefully this will move things forward.

bq. RM will have all the above context info. When constructing and starting the RM 
collector, we should make sure it is set up.
Since RM's collector will handle multiple applications, there is no one-to-one 
relationship between flow/flow-run/app and an instance of the RM collector. RM 
will just have to retain that information in memory for multiple apps, and pass 
that along on a per-call basis to the storage.
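
A minimal sketch of that arrangement, assuming a single collector that keeps a per-application context map and resolves it on every call. All class, field, and method names here are hypothetical and are not taken from the patch:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical holder for the context of one application (not from the patch).
final class AppContext {
    final String clusterId;
    final String userId;
    final String flowId;
    final long flowRunId;

    AppContext(String clusterId, String userId, String flowId, long flowRunId) {
        this.clusterId = clusterId;
        this.userId = userId;
        this.flowId = flowId;
        this.flowRunId = flowRunId;
    }
}

// Hypothetical collector: one instance serves many applications.
final class RmCollectorSketch {
    // Context is retained in memory, keyed by application id.
    private final Map<String, AppContext> contexts = new ConcurrentHashMap<>();

    void registerApp(String appId, AppContext ctx) {
        contexts.put(appId, ctx);
    }

    void unregisterApp(String appId) {
        contexts.remove(appId);
    }

    // Looks up the context per call and returns the fully qualified key
    // that a storage writer could receive alongside the entity.
    String putEntity(String appId, String entityId) {
        AppContext ctx = contexts.get(appId);
        if (ctx == null) {
            throw new IllegalStateException("no context registered for " + appId);
        }
        return String.join("/", ctx.clusterId, ctx.userId, ctx.flowId,
            Long.toString(ctx.flowRunId), appId, entityId);
    }
}
```

The point is only that the context travels with each call rather than being fixed when the collector is constructed.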

bq. Personally, I prefer to use ID to be uniform among all the context 
properties. ID indicates it can be used to identify a flow.
I'm OK with "flow id" if it increases consistency.

bq. I thought version is part of flow id. I think we can revisit it once the 
schema is done, and we finalized the generic description about the flow 
structure and the notation. So far I'd like to keep it as what it is now. 
Hmm, I didn't think of the version as part of the flow id. Here we're thinking a bit 
ahead to the storage and query aspects of it, but it's perfectly feasible to 
ask questions like "give me the latest 10 runs of the flow named 'foo.pig'". 
Note that those latest 10 runs can have different versions. This implies there 
needs to be a semantic differentiation between the flow id (name) and the flow 
version. Namely, in this query the flow version is *not* used to retrieve the 
last 10 runs. So I would advocate having a field/attribute named "flow 
version" that is separate from "flow id".
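
To illustrate the distinction, here is a rough in-memory sketch of that query. The FlowRun type and the latestRuns helper are hypothetical and say nothing about the eventual storage schema; the flow version is carried on each run but plays no part in the selection:

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical record of one flow run; flow id and flow version are separate fields.
final class FlowRun {
    final String flowId;
    final String flowVersion;
    final long runId;

    FlowRun(String flowId, String flowVersion, long runId) {
        this.flowId = flowId;
        this.flowVersion = flowVersion;
        this.runId = runId;
    }
}

final class FlowQuerySketch {
    // Returns up to `limit` runs of the named flow, newest run id first.
    // Only the flow id is used as a filter; the version is ignored.
    static List<FlowRun> latestRuns(List<FlowRun> all, String flowId, int limit) {
        return all.stream()
            .filter(r -> r.flowId.equals(flowId))
            .sorted(Comparator.comparingLong((FlowRun r) -> r.runId).reversed())
            .limit(limit)
            .collect(Collectors.toList());
    }
}
```

If the version were folded into the flow id, the same query would need to enumerate every version of 'foo.pig' first.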

As for the run id being numeric, as Li alluded to, there is a significant 
advantage in having run ids as numbers (longs really) as it lends itself to 
super-easy sorting. It's a little bit of storage concern leaking to the higher 
level abstraction, but it's a strong reason to qualify it as a number IMO.
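
A small sketch of the sorting difference, using made-up run ids. The helper names are hypothetical; the comparison itself is just standard Java string versus long ordering:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class RunIdOrderingSketch {
    // Sorts run ids stored as strings: the order is lexicographic.
    static List<String> sortAsStrings(List<String> ids) {
        List<String> copy = new ArrayList<>(ids);
        Collections.sort(copy);
        return copy;
    }

    // Sorts run ids stored as longs: the order is numeric.
    static List<Long> sortAsLongs(List<Long> ids) {
        List<Long> copy = new ArrayList<>(ids);
        Collections.sort(copy);
        return copy;
    }
}
```

For the ids 9, 10 and 100, the string sort yields "10", "100", "9", while the long sort yields 9, 10, 100. String ids would need zero-padding or a custom comparator to get the numeric order.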

bq. It makes sense, but when RM restarts we use the new start time of RM to 
identify the app instead of the one before. In current way, cluster_xyz will 
contain the application_xyz_123. This was my rationale before. And this default 
cluster id construction is only used in the case the user didn't specify the 
cluster id in config file. In production, user should specify one. I'll think 
about the question again.
I'm still not sure why it would make sense to have different logical cluster 
ids every time the RM/cluster restarts. Logically, a single cluster should be 
identified by a long-lived name. For example, UIs will be built on questions 
like "give me top 10 flows on cluster ABC". Queries like that surely wouldn't 
care about cluster restarts.

As for the default value, in fact I would imagine most use cases would not set 
the cluster id (just assuming the cluster default would be filled in). That 
would be the norm, not the exception.
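
One way to picture a restart-stable default is sketched below. The config key and the default value are hypothetical placeholders, not existing YARN settings; the only point is that the fallback is a fixed name rather than something derived from the RM start time:

```java
import java.util.Properties;

final class ClusterIdSketch {
    // Hypothetical key and default; neither comes from the patch or from YARN.
    static final String CLUSTER_ID_KEY = "example.timeline.cluster-id";
    static final String DEFAULT_CLUSTER_ID = "default-cluster";

    // Returns the configured cluster id, or a fixed default when none is set.
    // Nothing here depends on when the RM was started, so the result is the
    // same before and after a restart.
    static String resolveClusterId(Properties conf) {
        String configured = conf.getProperty(CLUSTER_ID_KEY);
        if (configured == null || configured.trim().isEmpty()) {
            return DEFAULT_CLUSTER_ID;
        }
        return configured.trim();
    }
}
```

With this shape, a query such as "top 10 flows on cluster ABC" keeps matching the same cluster id across restarts, whether or not the id was set explicitly.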

Hope these help...

> [Data Model] Make putEntities operation be aware of the app's context
> ---------------------------------------------------------------------
>                 Key: YARN-3040
>                 URL: https://issues.apache.org/jira/browse/YARN-3040
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Sangjin Lee
>            Assignee: Zhijie Shen
>         Attachments: YARN-3040.1.patch, YARN-3040.2.patch
> Per design in YARN-2928, implement client-side API for handling *flows*. 
> Frameworks should be able to define and pass in all attributes of flows and 
> flow runs to YARN, and they should be passed into ATS writers.
> YARN tags were discussed as a way to handle this piece of information.
