[ 
https://issues.apache.org/jira/browse/YARN-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15143909#comment-15143909
 ] 

Joep Rottinghuis commented on YARN-4062:
----------------------------------------

I have been slow to make progress on the review, mainly because other work has 
taken my attention.
I think that in general the patch will work as written.
While going through the design again from the top down I noticed (and discussed 
with [~vrushalic]) the following things:
- SUM is an aggregation operation that sums the latest value of each app in a 
flow(run) (or the latest value of each aggregation dimension in the higher 
level aggregation).
- The current MIN and MAX are actually different things. They are global mins 
and global maxes in the sense that they keep only the lowest (or highest) value 
we've ever seen by any app in the flow(run). While this is a totally valid 
thing to do, there is actually something like a MIN and MAX value for each app 
in a flow as well. What we currently call MIN and MAX should probably be called 
GLOBAL_MIN and GLOBAL_MAX (or something similar). We can then also have a MIN 
and MAX that work similarly, keeping the latest value for each app (aggregation 
dimension in general) and then computing the MIN and MAX at read time. The flush 
compaction then works the same for MIN, MAX, and SUM, while for GLOBAL_MIN and 
GLOBAL_MAX we can keep the current code behavior of shedding values as we go.
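
To make the distinction above concrete, here is a minimal in-memory sketch in 
Java. It is not code from the patch: the class and method names 
(FlowAggregationSketch, PerAppLatest, GlobalExtremes) are hypothetical, and 
plain maps stand in for HBase cells. It only illustrates "keep the latest value 
per app and reduce at read time" versus "keep one extreme and shed the rest".

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

public class FlowAggregationSketch {

    /** SUM / MIN / MAX: keep the latest value per app, reduce at read time. */
    static final class PerAppLatest {
        private final Map<String, Long> latestByApp = new LinkedHashMap<>();

        void record(String appId, long value) {
            // A newer value from the same app replaces the older one.
            latestByApp.put(appId, value);
        }

        long sum() {
            long total = 0;
            for (long v : latestByApp.values()) {
                total += v;
            }
            return total;
        }

        // Both throw NoSuchElementException if no app has reported yet.
        long min() { return Collections.min(latestByApp.values()); }
        long max() { return Collections.max(latestByApp.values()); }
    }

    /** GLOBAL_MIN / GLOBAL_MAX: keep one extreme, shed other values on write. */
    static final class GlobalExtremes {
        private long globalMin = Long.MAX_VALUE;
        private long globalMax = Long.MIN_VALUE;

        void record(long value) {
            globalMin = Math.min(globalMin, value);
            globalMax = Math.max(globalMax, value);
        }

        long globalMin() { return globalMin; }
        long globalMax() { return globalMax; }
    }

    public static void main(String[] args) {
        PerAppLatest perApp = new PerAppLatest();
        GlobalExtremes global = new GlobalExtremes();

        // app_1 reports 5 and later 12; app_2 reports 8 and later 20.
        String[] apps = {"app_1", "app_2", "app_1", "app_2"};
        long[] values = {5, 8, 12, 20};
        for (int i = 0; i < apps.length; i++) {
            perApp.record(apps[i], values[i]);
            global.record(values[i]);
        }

        System.out.println("SUM=" + perApp.sum());              // 12 + 20 = 32
        System.out.println("MIN=" + perApp.min());              // 12
        System.out.println("MAX=" + perApp.max());              // 20
        System.out.println("GLOBAL_MIN=" + global.globalMin()); // 5
        System.out.println("GLOBAL_MAX=" + global.globalMax()); // 20
    }
}
```

With these inputs MIN is 12 (the smallest of the latest per-app values) while 
GLOBAL_MIN is 5 (the smallest value ever seen), which is the difference the 
renaming is meant to capture.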

GLOBAL_MIN and GLOBAL_MAX are appropriate for the existing use case of min 
start time, but also for gauges. The new MIN and MAX would be appropriate to 
answer questions such as: which app has the smallest number of mappers in this 
flow, or rather, what is that number?

With this realization also came the awareness that SUM_FINAL is really a 
different thing than SUM, MIN, MAX, GLOBAL_MIN and GLOBAL_MAX, despite what I 
had earlier thought (and suggested). The former, the "this is the final value" 
marker, is something that has to happen at write time. It has to come from the 
writer itself as an argument. Ideally the latter set of aggregation dimensions 
(SUM, MIN, MAX, etc.) is set at a per-column level and shouldn't be passed 
from the client, but be instrumented by the ColumnHelper infrastructure 
instead. We should probably use a different tag value for that.
Both the aggregation dimension and this "FINAL_VALUE" (or whatever abbreviation 
we use) are needed to determine the right thing to do for compaction. Only one 
value needs to have this final value bit / tag set.
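
As a rough illustration of why compaction needs both pieces of information, the 
following Java sketch folds cells under one possible reading of the above. It is 
an assumption, not the patch's behavior: cells whose app has written its final 
value are collapsed into a single cell using the column's aggregation dimension, 
and cells of apps that may still report are kept as they are. All names 
(FinalValueCompactionSketch, Cell, AggOp, COMPACTED) are hypothetical, and the 
input is assumed to already hold only the latest cell per app.

```java
import java.util.ArrayList;
import java.util.List;

public class FinalValueCompactionSketch {

    /** Aggregation dimension; assumed here to be fixed per column. */
    enum AggOp { SUM, MIN, MAX }

    /** Hypothetical stand-in for a stored cell: the latest value of one app. */
    static final class Cell {
        final String appId;
        final long value;
        final boolean finalValue; // supplied by the writer, not by the column

        Cell(String appId, long value, boolean finalValue) {
            this.appId = appId;
            this.value = value;
            this.finalValue = finalValue;
        }
    }

    /** Hypothetical id for the single cell that holds the folded values. */
    static final String COMPACTED = "compacted";

    static long fold(AggOp op, long a, long b) {
        switch (op) {
            case SUM: return a + b;
            case MIN: return Math.min(a, b);
            case MAX: return Math.max(a, b);
            default: throw new IllegalArgumentException("Unsupported: " + op);
        }
    }

    /**
     * Collapses all final cells into one cell and keeps non-final cells,
     * so a later read-time aggregation gives the same answer as before.
     */
    static List<Cell> compact(AggOp op, List<Cell> latestPerApp) {
        List<Cell> out = new ArrayList<>();
        Long folded = null; // null until the first final cell is seen
        for (Cell c : latestPerApp) {
            if (c.finalValue) {
                folded = (folded == null) ? c.value : fold(op, folded, c.value);
            } else {
                out.add(c); // this app may still report a newer value
            }
        }
        if (folded != null) {
            out.add(new Cell(COMPACTED, folded, true));
        }
        return out;
    }
}
```

For example, with final cells of 10 and 5 and one non-final cell of 7, 
compacting under SUM leaves two cells (7 and 15), and reading the SUM still 
yields 22. The aggregation dimension decides how values are folded, and the 
final-value tag decides which values are safe to fold.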

I'll continue to try to document all of these things so that it is a bit easier 
to see visually what is going on.

> Add the flush and compaction functionality via coprocessors and scanners for 
> flow run table
> -------------------------------------------------------------------------------------------
>
>                 Key: YARN-4062
>                 URL: https://issues.apache.org/jira/browse/YARN-4062
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Vrushali C
>            Assignee: Vrushali C
>              Labels: yarn-2928-1st-milestone
>         Attachments: YARN-4062-YARN-2928.1.patch, 
> YARN-4062-feature-YARN-2928.01.patch, YARN-4062-feature-YARN-2928.02.patch
>
>
> As part of YARN-3901, coprocessor and scanner is being added for storing into 
> the flow_run table. It also needs a flush & compaction processing in the 
> coprocessor and perhaps a new scanner to deal with the data during flushing 
> and compaction stages. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
