Vrushali C commented on YARN-4053:

Thanks [~varun_saxena] for the patch and [~djp] , [~gtCarrera], 
[~Naganarasimha], [~sjlee0] and [~jrottinghuis] for the discussion so far!

[~jrottinghuis], [~sjlee0] and I had an offline discussion on this yesterday. 
We discussed the following topics at length:
- metric datatype: long, double, either, or both?
- metric storage and retrieval: single values vs. time series
- metrics in the context of aggregation: how to indicate whether to aggregate 
or not
- operations on metrics: sum vs. average, min/max

To summarize the discussion:

- Our proposal is to proceed with supporting only longs for now. We went over 
several ways to store and query decimal numbers (as Doubles or as 
numerator/denominator pairs), how filters would work when scanning such stored 
values, how aggregation would handle them, etc. We also considered which 
metrics would be stored as Doubles and how precision might affect aggregation. 
We concluded that we should start with storing longs only and make the code 
strictly accept longs (not even ints or shorts).

- For single value vs. time series, we suggest using a column prefix to 
distinguish them. On the read path, we can assume a single value unless the 
client explicitly requests a time series (clients would need to state their 
intent to read a time series).

- Regarding indicating whether to aggregate or not, we suggest relying mostly 
on the flow run aggregation. For use cases that need to access metrics from 
tables other than the flow run table (e.g. time-based aggregation), we need to 
explore ways to specify this information as input (config, etc.).

- The current patch is in line with our proposal of using longs for metrics. 
However, we are considering a different approach: creating a "converter" type 
and implementation. For non-metric columns, a "generic" converter that uses 
the GenericObjectMapper can be created and used implicitly. For the numeric 
(long) columns, a long converter would be used explicitly. We also need to 
revisit how this is done in FlowScanner (the current patch missed one of the 
places there, for example). We need to get at the instances of ColumnPrefix, 
ColumnFamily, etc. and use them to obtain the converter in the flow scanner.
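
To make the converter idea concrete, here is a rough sketch of what a strict 
long converter could look like. The names ValueConverter and LongConverter, 
and the method signatures, are hypothetical and only for illustration; they 
are not taken from the patch. The sketch also assumes values are stored as 
8-byte big-endian longs and uses plain java.nio rather than any HBase utility.

```java
import java.nio.ByteBuffer;

/** Hypothetical converter abstraction (illustrative name, not from the patch). */
interface ValueConverter {
  /** Encodes a value into the bytes that would be stored in a cell. */
  byte[] encodeValue(Object value);

  /** Decodes stored bytes back into a value. */
  Object decodeValue(byte[] bytes);
}

/** Hypothetical strict converter: accepts only Long, not Integer or Short. */
final class LongConverter implements ValueConverter {
  @Override
  public byte[] encodeValue(Object value) {
    if (!(value instanceof Long)) {
      String type = (value == null) ? "null" : value.getClass().getName();
      throw new IllegalArgumentException("Expected Long but got " + type);
    }
    // ByteBuffer defaults to big-endian, giving a fixed 8-byte encoding.
    return ByteBuffer.allocate(Long.BYTES).putLong((Long) value).array();
  }

  @Override
  public Object decodeValue(byte[] bytes) {
    if (bytes == null || bytes.length != Long.BYTES) {
      throw new IllegalArgumentException("Expected exactly 8 bytes");
    }
    return ByteBuffer.wrap(bytes).getLong();
  }
}
```

A "generic" converter for the non-metric columns could implement the same 
interface and delegate to GenericObjectMapper, so that callers such as the 
flow scanner only need to look up the converter for a given column.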

@Varun Would it be fine if I took over this jira to patch it with the above 


> Change the way metric values are stored in HBase Storage
> --------------------------------------------------------
>                 Key: YARN-4053
>                 URL: https://issues.apache.org/jira/browse/YARN-4053
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>    Affects Versions: YARN-2928
>            Reporter: Varun Saxena
>            Assignee: Varun Saxena
>         Attachments: YARN-4053-YARN-2928.01.patch, 
> YARN-4053-YARN-2928.02.patch
> Currently the HBase implementation uses GenericObjectMapper to convert and 
> store values in backend HBase storage. This converts everything into a string 
> representation (ASCII/UTF-8 encoded byte array).
> While this is fine in most cases, it does not quite serve our use case for 
> metrics. 
> So we need to decide how we are going to encode and decode metric values and 
> store them in HBase.
