[
https://issues.apache.org/jira/browse/YARN-3699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555104#comment-14555104
]
Joep Rottinghuis commented on YARN-3699:
----------------------------------------
The question really seems to be whether version is an identifying feature of a
flow.
By and large the version increases linearly (in hRaven I think we accomplished
that by having a version table with the version-identifying value, i.e. a hash
of the Pig script or a hash of the class that runs the Scalding code; the
timestamp at which we first see a new version then becomes the version
identifier).
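To make the version-table idea concrete, here is a minimal sketch. It is not
hRaven's actual code: the class name VersionTable, the method version_for, and
the choice of SHA-256 as the hash are all assumptions made for illustration.

```python
import hashlib


class VersionTable:
    """Hypothetical in-memory stand-in for a version table."""

    def __init__(self):
        # content hash -> timestamp recorded the first time the hash was seen
        self._first_seen = {}

    def version_for(self, script_bytes, run_timestamp):
        digest = hashlib.sha256(script_bytes).hexdigest()
        # setdefault only stores the timestamp if the hash is new, so later
        # runs of the same code keep the original version identifier.
        return self._first_seen.setdefault(digest, run_timestamp)


table = VersionTable()
v1 = table.version_for(b"A = LOAD 'in'; STORE A INTO 'out';", 1000)
v1_again = table.version_for(b"A = LOAD 'in'; STORE A INTO 'out';", 2000)
v2 = table.version_for(b"A = LOAD 'in2'; STORE A INTO 'out';", 3000)
```

As long as runs are recorded in time order, a changed script gets a larger
identifier than every earlier version, which is what makes the version
increase linearly.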
In any case, we can ask: give me the last 10 runs of this flow. Often you don't
care about the versions. Leaving the version out of the rowkey allows a quick
linear range scan over 10 values.
If version is part of the key, and if it is possible for runs of two versions
to overlap (a user submits a run of the old version, then a newer version, then
the older version again; more complex scenarios are possible), then one cannot
do a quick range scan over just the 10 runs starting with the latest. One has
to scan over all the runs up to the next flow, since in theory any version
could still have had a run.
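The difference between the two layouts can be sketched with sorted Python
tuples standing in for HBase rowkeys. Everything here is an assumption for
illustration: the key layouts, the inverted timestamp used to sort newest
first, and the function names are not the actual timeline service or hRaven
schema.

```python
from bisect import bisect_left

# (flow, version, run timestamp); v1 is rerun after v2 has already run.
runs = [("flowA", "v1", 100), ("flowA", "v2", 200), ("flowA", "v1", 300),
        ("flowB", "v1", 150)]

MAX_TS = 10**6  # assumed upper bound, used to invert timestamps


def inv(ts):
    return MAX_TS - ts  # newer runs get smaller values and sort first


# Layout 1: version left out of the key -> (flow, inverted ts), version as value
keys_no_version = sorted((f, inv(ts), v) for f, v, ts in runs)
# Layout 2: version inside the key -> (flow, version, inverted ts)
keys_with_version = sorted((f, v, inv(ts)) for f, v, ts in runs)


def last_n_no_version(flow, n):
    """Read at most n adjacent rows starting at the flow's first key."""
    start = bisect_left(keys_no_version, (flow,))
    found = []
    for f, inverted, version in keys_no_version[start:start + n]:
        if f != flow:
            break
        found.append((version, MAX_TS - inverted))
    return found, len(found)  # (runs, rows read)


def last_n_with_version(flow, n):
    """Read every row of the flow, then sort by time to find the latest n."""
    start = bisect_left(keys_with_version, (flow,))
    rows = []
    for f, version, inverted in keys_with_version[start:]:
        if f != flow:
            break
        rows.append((version, MAX_TS - inverted))
    rows_read = len(rows)
    rows.sort(key=lambda r: r[1], reverse=True)
    return rows[:n], rows_read
```

Both functions return the same latest runs, but the second one has to read
every row of the flow before it can tell which runs are the most recent.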
The reverse is true if one makes version a column. In that case getting the
last 3 runs of a version becomes more expensive. In hRaven the use case for
this is reducer estimation: we want to query back the last 3 runs of a flow
and compare inputs and outputs of the various jobs in a flow.
However, if one changes the Pig script or Scalding code then the DAG changes
and one can no longer safely use inputs and outputs from those previous runs
(with a different version). In theory we could scan the entire flow to see if
there were ever any runs for this version. Particularly for the reducer
estimation this is not needed. We can simply add a limit and say: give me only
the last 3 runs in the last week. Even a limit of, say, 100 runs is fine. The
query then becomes: give me the last 3 runs of this flow of this specific
version out of the last 100 runs (of any version).
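A minimal sketch of that bounded query, assuming the runs of a flow are already
available newest first (as a range scan over a version-less rowkey would return
them). The function name and the (version, run id) tuples are hypothetical.

```python
def last_runs_of_version(runs_newest_first, version, want=3, scan_limit=100):
    """Return up to `want` run ids of `version` among the newest `scan_limit` runs."""
    matches = []
    for run_version, run_id in runs_newest_first[:scan_limit]:
        # Equivalent of a column value filter on the version column.
        if run_version == version:
            matches.append(run_id)
            if len(matches) == want:
                break  # stop early once enough runs are found
    return matches


flow_runs = [("v2", "r9"), ("v1", "r8"), ("v2", "r7"), ("v2", "r6"), ("v1", "r5")]
recent_v2 = last_runs_of_version(flow_runs, "v2")
# With a tight scan limit the answer may be short, which the caller tolerates.
recent_v1 = last_runs_of_version(flow_runs, "v1", scan_limit=2)
```

The cost is bounded by scan_limit regardless of how many runs the flow has, at
the price of sometimes returning fewer runs than requested.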
We would err on the side of returning fewer runs. Specifically for the
reducer estimation case, that is essentially no worse than the first three runs
of any flow version, so we simply fall back on the sampling-based estimator or
a size-based heuristic.
Even if we do have to get an accurate answer, this still is a larger range scan
over all the runs of a flow with an additional column value filter.
The more common case (give me the last few runs for the UI, for costing, for
stats, etc.) should be the cheaper one, and the rarer (and more selective) one
with the version specified can be the more expensive one if one is not ready to
tolerate some leniency in getting fewer rows back in return for guaranteed
reasonable perf.
Another (but admittedly much, much weaker) argument would be that storing
version as a column is cheaper, since that part of the rowkey is not repeatedly
stored for each column. In OpenTSDB, for example, there are many lookup codes
in order to make the rowkeys compact.
We could for example codify the cluster and, instead of using a larger
(human-readable) name such as cluster@datacenter, use an integer value with a
lookup table. For hRaven we never ended up making that optimization because
the additional complexity of lookups, the loss of readable keys, etc. did not
seem worth it. The performance was simply good enough.
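For illustration only, the lookup-table idea could look like the sketch below.
The ClusterCodes class, the 2-byte code width, and the "!" separator are
assumptions, not something hRaven or OpenTSDB is known to use in this form.

```python
import struct


class ClusterCodes:
    """Hypothetical lookup table mapping cluster names to small integers."""

    def __init__(self):
        self._code_by_name = {}
        self._name_by_code = []

    def encode(self, name):
        # Assign the next free integer the first time a name is seen.
        if name not in self._code_by_name:
            self._code_by_name[name] = len(self._name_by_code)
            self._name_by_code.append(name)
        return self._code_by_name[name]

    def decode(self, code):
        return self._name_by_code[code]


codes = ClusterCodes()
readable_key = b"cluster@datacenter" + b"!" + b"flowA"
# ">H" packs the code as a 2-byte big-endian unsigned integer.
compact_key = struct.pack(">H", codes.encode("cluster@datacenter")) + b"!" + b"flowA"
```

The compact key is shorter, but reading it back requires the extra decode
step, which is the complexity and readability cost mentioned above.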
This view is skewed toward the use cases that we have seen in our production
environment with hRaven. I could see somebody arguing for a different set of
use cases and priorities. As with all schema design, there are probably no
right or wrong answers in the abstract without considering the relative
priority of the various uses.
> Decide if flow version should be part of row key or column
> -----------------------------------------------------------
>
> Key: YARN-3699
> URL: https://issues.apache.org/jira/browse/YARN-3699
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Vrushali C
>
> Based on discussions in YARN-3411 with [~djp], filing jira for continuing
> discussion on putting the flow version in rowkey or column.
> Either phoenix/hbase approach will update the jira with the conclusions..
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)