[
https://issues.apache.org/jira/browse/YARN-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15143909#comment-15143909
]
Joep Rottinghuis commented on YARN-4062:
----------------------------------------
I have been slow to make progress on the review, mainly because other work has
taken my attention.
I think that in general the patch will work as written.
While going through the design again from the top down, I noticed (and discussed
with [~vrushalic]) the following things:
- SUM is an aggregation operation that sums the latest value of each app in a
flow(run) (or the latest value of each aggregation dimension in the higher
level aggregation).
- The current MIN and MAX are actually different things. They are global mins
and global maxes in the sense that they keep only the lowest (or highest) value
we've ever seen by any app in the flow(run). While this is a totally valid
thing to do, there is actually something like a MIN and MAX value for each app
in a flow as well. What we currently call MIN and MAX should probably be called
GLOBAL_MIN and GLOBAL_MAX (or something similar). We can then also have a MIN
and MAX that work similarly: keep the latest value for each app (aggregation
dimension in general) and compute the MIN and MAX at read time. Flush and
compaction then work the same for MIN, MAX, and SUM, while for GLOBAL_MIN and
GLOBAL_MAX we can keep the current code behavior of shedding values as we go.
GLOBAL_MIN and GLOBAL_MAX are appropriate for the existing use case of min start
time, but also for gauges. The new MIN and MAX would be appropriate to answer
questions such as: which app has the smallest number of mappers in this
flow, or rather, what is that number?
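To make the distinction above concrete, here is a minimal sketch in Python (not the patch's Java code). It assumes a hypothetical cell shape of (app_id, timestamp, value); the function names are made up for illustration and do not come from the timeline service code. SUM, MIN, and MAX first reduce to the latest value per app and aggregate at read time, while GLOBAL_MIN and GLOBAL_MAX look at every value ever written and so can shed values as they go.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical cell shape for illustration only: (app_id, timestamp, value).
Cell = Tuple[str, int, int]


def latest_per_app(cells: Iterable[Cell]) -> Dict[str, int]:
    """Keep only the most recent value written by each app."""
    latest: Dict[str, Tuple[int, int]] = {}
    for app_id, ts, value in cells:
        if app_id not in latest or ts > latest[app_id][0]:
            latest[app_id] = (ts, value)
    return {app_id: value for app_id, (_, value) in latest.items()}


# Read-time aggregations: all three work on the latest value of each app.
def read_sum(cells: Iterable[Cell]) -> int:
    return sum(latest_per_app(cells).values())


def read_min(cells: Iterable[Cell]) -> int:
    return min(latest_per_app(cells).values())


def read_max(cells: Iterable[Cell]) -> int:
    return max(latest_per_app(cells).values())


# Global variants: a single running value, so older values can be shed
# as soon as a new one arrives.
def global_min(cells: Iterable[Cell]) -> int:
    return min(value for _, _, value in cells)


def global_max(cells: Iterable[Cell]) -> int:
    return max(value for _, _, value in cells)


# app1 reported 3 and later 7; app2 reported 5 and later 9.
cells = [("app1", 1, 3), ("app1", 2, 7), ("app2", 1, 5), ("app2", 3, 9)]
print(read_sum(cells), read_min(cells), read_max(cells))
print(global_min(cells), global_max(cells))
```

In this made-up data, MIN answers "what is the smallest current value of any app" (7), while GLOBAL_MIN answers "what is the smallest value ever seen" (3).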
With this realization also came the awareness that SUM_FINAL is really a
different thing than SUM, MIN, MAX, GLOBAL_MIN and GLOBAL_MAX, despite what I
had earlier thought (and suggested). The former ("this is the final value") is
something that has to happen at write time. It has to come from the writer
itself as an argument. Ideally the latter set of aggregation dimensions (SUM,
MIN, MAX, etc.) is really set at a per-column level and shouldn't be passed
from the client, but be instrumented by the ColumnHelper infrastructure
instead. We should probably use a different tag value for that.
Both the aggregation dimension and this "FINAL_VALUE" tag (or whatever
abbreviation we use) are needed to determine the right thing to do for
compaction. Only one value needs to have this final value bit / tag set.
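As a rough illustration of how the two pieces of information could combine during compaction, here is a Python sketch for a SUM column. Everything in it is an assumption made for illustration: the TaggedCell shape, the is_final field standing in for the "FINAL_VALUE" tag, the FOLDED marker, and the idea of folding finished apps into one cell. It is not the behavior of the attached patch.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class TaggedCell:
    app_id: str
    timestamp: int
    value: int
    is_final: bool  # hypothetical stand-in for the writer-supplied final tag


FOLDED = "__folded__"  # hypothetical id for the cell that holds folded finals


def compact_sum(cells: Iterable[TaggedCell]) -> List[TaggedCell]:
    """Sketch of compaction for a SUM column (assumed behavior).

    Apps whose value is final can never change again, so they are folded
    into one running-total cell. Apps still running keep their latest cell.
    """
    latest: Dict[str, TaggedCell] = {}
    for cell in cells:
        prev = latest.get(cell.app_id)
        # A final cell always wins; otherwise the newer timestamp wins.
        if prev is None or (cell.is_final, cell.timestamp) > (
            prev.is_final,
            prev.timestamp,
        ):
            latest[cell.app_id] = cell

    finals = [c for c in latest.values() if c.is_final]
    live = sorted(
        (c for c in latest.values() if not c.is_final),
        key=lambda c: c.app_id,
    )
    if finals:
        live.append(
            TaggedCell(
                FOLDED,
                max(c.timestamp for c in finals),
                sum(c.value for c in finals),
                True,
            )
        )
    return live


cells = [
    TaggedCell("app1", 1, 3, False),
    TaggedCell("app1", 2, 7, True),   # only the last write carries the tag
    TaggedCell("app2", 1, 5, False),
    TaggedCell("app2", 3, 9, False),  # app2 is still running
    TaggedCell("app3", 2, 4, True),
]
for cell in compact_sum(cells):
    print(cell)
```

Under these assumptions the read-time SUM is the same before and after compaction (7 + 9 + 4 = 20 versus 9 + 11 = 20), and compacting the output again leaves it unchanged. A MIN or MAX column would fold with min or max instead of sum, which is why the aggregation dimension is needed alongside the final tag.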
I'll continue to try to document all of these things so that it is a bit easier
to see visually what is going on.
> Add the flush and compaction functionality via coprocessors and scanners for
> flow run table
> -------------------------------------------------------------------------------------------
>
> Key: YARN-4062
> URL: https://issues.apache.org/jira/browse/YARN-4062
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: timelineserver
> Reporter: Vrushali C
> Assignee: Vrushali C
> Labels: yarn-2928-1st-milestone
> Attachments: YARN-4062-YARN-2928.1.patch,
> YARN-4062-feature-YARN-2928.01.patch, YARN-4062-feature-YARN-2928.02.patch
>
>
> As part of YARN-3901, a coprocessor and scanner are being added for storing into
> the flow_run table. It also needs flush and compaction processing in the
> coprocessor and perhaps a new scanner to deal with the data during the flush
> and compaction stages.