[ 
https://issues.apache.org/jira/browse/YARN-3699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555104#comment-14555104
 ] 

Joep Rottinghuis commented on YARN-3699:
----------------------------------------

The question really seems to be whether version is an identifying feature of a 
flow.
By and large the version increases linearly. In hRaven I think we accomplished 
that by having a version table keyed on whatever identifies the version, i.e. a 
hash of the Pig script or a hash of the class that runs the Scalding code. The 
timestamp at which we first see a new version then becomes the version 
identifier.
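
As a rough illustration of the version-table idea (a sketch only, not hRaven's 
actual code; the class name, the method name and the choice of SHA-256 are 
assumptions made for the example):

```python
import hashlib


class VersionTable:
    """Hypothetical in-memory stand-in for a version lookup table."""

    def __init__(self):
        # content hash -> timestamp at which that hash was first seen
        self._first_seen = {}

    def version_for(self, script_text, now_ms):
        """Return the version id (first-seen timestamp) for a script."""
        digest = hashlib.sha256(script_text.encode("utf-8")).hexdigest()
        # setdefault keeps the original timestamp for a hash we already know
        return self._first_seen.setdefault(digest, now_ms)


table = VersionTable()
v1 = table.version_for("A = LOAD 'in'; STORE A INTO 'out';", now_ms=1000)
v1_again = table.version_for("A = LOAD 'in'; STORE A INTO 'out';", now_ms=2000)
v2 = table.version_for("A = LOAD 'in2'; STORE A INTO 'out';", now_ms=3000)
```

Rerunning an unchanged script keeps its original version id, while a changed 
script gets a newer (larger) one, which is what makes the version increase over 
time.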

In any case, we can ask: give me the last 10 runs of this flow. Often you don't 
care about the versions. Leaving the version out of the rowkey allows a quick 
linear range scan over 10 values.
If version is part of the key, and if it is possible for runs of two versions 
to overlap (a user submits a run of the old version, then a newer version, then 
the older version again; more complex scenarios are possible), then one cannot 
do a quick range scan over just the 10 runs starting with the latest.
One has to scan over all the runs up to the next flow, because in theory any 
version could still have had a recent run.
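
To make the difference concrete, here is a small simulation using sorted Python 
tuples in place of HBase rowkeys. The key layouts, the inverted-timestamp trick 
and all names are assumptions for illustration, not the actual schema under 
discussion:

```python
import bisect

MAX_TS = 10**13  # subtract from this so that newer runs sort first


def inv(ts):
    return MAX_TS - ts


def scan_prefix(sorted_keys, prefix, limit=None):
    """Return keys that start with `prefix`, in key order, up to `limit`."""
    start = bisect.bisect_left(sorted_keys, prefix)
    out = []
    for key in sorted_keys[start:]:
        if key[:len(prefix)] != prefix:
            break  # reached the next flow
        out.append(key)
        if limit is not None and len(out) == limit:
            break
    return out


# One flow with two versions whose runs interleave in time (40 runs total).
runs = [("v1", t) for t in range(1, 41, 2)] + [("v2", t) for t in range(2, 41, 2)]

# Layout A: version is NOT in the key -> (flow, inverted run timestamp)
no_version = sorted([("flowA", inv(ts)) for _, ts in runs] + [("flowB", inv(5))])
latest10 = scan_prefix(no_version, ("flowA",), limit=10)  # reads 10 rows

# Layout B: version IS in the key -> (flow, version, inverted run timestamp)
with_version = sorted(
    [("flowA", v, inv(ts)) for v, ts in runs] + [("flowB", "v1", inv(5))]
)
all_rows = scan_prefix(with_version, ("flowA",))  # must read every row of the flow
latest10_v = sorted(all_rows, key=lambda k: k[2])[:10]
```

Both layouts return the same 10 runs, but layout A touches 10 rows while layout 
B has to read all 40 rows of the flow and sort them on the client side.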

The reverse is true if one makes version a column. In that case getting the 
last 3 runs of a version becomes more expensive. In hRaven the use case for 
this is reducer estimation. We want to query back the last 3 runs of a flow and 
compare inputs and outputs of the various jobs in a flow. 
However, if one changes the Pig script or Scalding code then the DAG changes 
and one can no longer safely use inputs and outputs from those previous runs 
(with a different version). In theory we could scan the entire flow to see if 
there were ever any runs for this version. Particularly for the reducer 
estimation this is not needed. We can simply add a limit and say, give me only 
the last 3 runs in the last week. Even if we set a limit of let's say 100 runs 
that is fine. The query then becomes, give me the last 3 runs of this flow of 
this specific version out of the last 100 runs (of any version).
We would err on the side of returning fewer runs. Specifically for the reducer 
estimation case, that is essentially no worse than the situation during the 
first three runs of any flow version, so we simply fall back on the sampling 
based estimator or a size based heuristic.
Even if we do have to get an accurate answer, this is still just a larger range 
scan over all the runs of a flow with an additional column value filter.
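
The bounded query described above could look roughly like this. It is a sketch 
over a plain Python list; the function name, the dict fields and the default 
limits are made up for the example:

```python
def last_runs_of_version(runs_newest_first, version, want=3, scan_limit=100):
    """Return up to `want` runs of `version` among the newest `scan_limit` runs.

    May return fewer than `want` runs (or none); the caller is expected to
    fall back to another estimator in that case.
    """
    matches = []
    for i, run in enumerate(runs_newest_first):
        if i >= scan_limit:
            break  # give up rather than scan the whole flow
        if run["version"] == version:  # plays the role of a column value filter
            matches.append(run)
            if len(matches) == want:
                break
    return matches


def version_of(i):
    # Newest 2 runs are v2, then v1, and the oldest runs (120+) are v0.
    if i < 2:
        return "v2"
    return "v1" if i < 120 else "v0"


# Index 0 is the newest run of the flow.
runs = [{"run_id": i, "version": version_of(i)} for i in range(150)]

v1_runs = last_runs_of_version(runs, "v1")  # 3 runs, found quickly
v2_runs = last_runs_of_version(runs, "v2")  # only 2 runs exist so far
v0_runs = last_runs_of_version(runs, "v0")  # all beyond the scan limit
```

Here the v2 and v0 queries come back short, and the estimator would fall back 
to sampling or a size based heuristic instead of scanning further.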

The more common case (give me the last few runs for the UI, for costing, for 
stats etc.) should be the cheaper one, and the rarer (and more selective) one 
with the version specified can be the more expensive one if one is not ready to 
tolerate some leniency in getting fewer rows back in return for guaranteed 
reasonable perf.

Another (but admittedly much weaker) argument would be that storing version as 
a column is cheaper, since that part of the rowkey is not repeatedly stored for 
each column. In OpenTSDB for example, there are many lookup codes in order to 
make the rowkeys compact.
We could for example codify the cluster and, instead of using a larger (human 
readable) name such as cluster@datacenter, use an integer value with a lookup 
table. For hRaven we never ended up making that optimization because the 
additional complexity of lookups and the loss of readable keys did not seem 
worth it. The performance was simply good enough.
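
For reference, the lookup-code idea amounts to something like the following. 
This is a minimal sketch with invented names and a 2-byte code width chosen 
arbitrarily; it is not how OpenTSDB or hRaven actually encode keys:

```python
import struct


class LookupCodes:
    """Hypothetical bidirectional name <-> small integer mapping."""

    def __init__(self):
        self._code_by_name = {}
        self._name_by_code = []

    def encode(self, name):
        code = self._code_by_name.get(name)
        if code is None:
            code = len(self._name_by_code)  # next unused integer
            self._code_by_name[name] = code
            self._name_by_code.append(name)
        return code

    def decode(self, code):
        return self._name_by_code[code]


codes = LookupCodes()
cluster, flow, run_ts = "cluster1@datacenter1", "flowA", 1432300000000

# Human readable key: the full cluster name is repeated in every rowkey.
readable = f"{cluster}!{flow}!{run_ts}".encode("utf-8")

# Compact key: the cluster is replaced by a fixed-width 2-byte code.
compact = struct.pack(">H", codes.encode(cluster)) + f"{flow}!{run_ts}".encode("utf-8")
```

The compact key is shorter, but every read now needs a `decode` step and the 
keys can no longer be eyeballed in the shell, which is the complexity trade-off 
mentioned above.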

This view is skewed toward the use cases that we have seen in our production 
environment with hRaven. I could see somebody arguing for a different set of 
use cases and priorities. As with all schema design, there are probably no 
right or wrong answers in the abstract without considering the relative 
priority of the various uses.

> Decide if  flow version should be part of row key or column
> -----------------------------------------------------------
>
>                 Key: YARN-3699
>                 URL: https://issues.apache.org/jira/browse/YARN-3699
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Vrushali C
>
> Based on discussions in YARN-3411 with [~djp], filing jira for continuing 
> discussion on putting the flow version in rowkey or column. 
> Either phoenix/hbase approach will update the jira with the conclusions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
