[ 
https://issues.apache.org/jira/browse/YARN-3908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14641026#comment-14641026
 ] 

Vrushali C commented on YARN-3908:
----------------------------------

Thanks everyone for the discussion. I will upload another patch on this soon. 
Sangjin, Joep and I also had some more offline discussions on this over the 
last few days. We considered two options:
1. Store the event timestamp as the HBase cell timestamp.
2. Store the event timestamp as part of the column key.

With the first approach, time range queries are easier, for example "give me 
the events that occurred in this time range", and the column names look 
cleaner. The downside is that we need to set up the column family to keep 
multiple versions while ensuring that columns other than the event columns do 
not store multiple versions, which is not a very clean way to store it. Yet 
another option is to store the event information in the metrics column family, 
but it does not actually belong there, so we would be mixing things, which 
would make aggregating metrics harder.

Based on these points, we plan to go with approach #2: storing the event 
timestamp as part of the column key. I will be making changes to this patch 
accordingly. The event information will be stored in the info column family, 
and the timestamp will be part of the column name. So it will be stored as:
{code} e!eventId?eventInfoKey?eventTimestamp : eventInfoValue {code}
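
As an illustration of that layout only, the column name could be assembled as in the sketch below. The class and method names are hypothetical and not taken from the patch, and rendering the timestamp as a decimal string is an assumption; the actual patch may encode the timestamp as bytes or use different separators.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the patch: builds an event column name
// in the layout e!eventId?eventInfoKey?eventTimestamp.
final class EventColumnName {
    static final String EVENT_PREFIX = "e";
    static final String PREFIX_SEPARATOR = "!";
    static final String FIELD_SEPARATOR = "?";

    // Assumption: the timestamp is appended as a decimal string.
    static String build(String eventId, String infoKey, long eventTimestamp) {
        return EVENT_PREFIX + PREFIX_SEPARATOR + eventId
            + FIELD_SEPARATOR + infoKey
            + FIELD_SEPARATOR + eventTimestamp;
    }

    // Column qualifiers are byte arrays in HBase, so encode the name as UTF-8.
    static byte[] toBytes(String eventId, String infoKey, long eventTimestamp) {
        return build(eventId, infoKey, eventTimestamp)
            .getBytes(StandardCharsets.UTF_8);
    }
}
```

Because the name is fully determined by (eventId, eventInfoKey, eventTimestamp), writing the same triple again addresses the same column, while a different timestamp yields a different column.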

For the reader:
There is a {code} org.apache.hadoop.hbase.filter.ColumnPrefixFilter {code} 
which can be used to scan specific column keys. With respect to chronological 
ordering, the reader code needs some filtering to pick the event info 
key-values that belong to the latest timestamp. 
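
One possible shape for that reader-side filtering is sketched below in plain Java. The map stands in for the cells of one row that a scan would return; the class and method names are hypothetical, and the sketch assumes that the timestamp is a decimal string at the end of the column name and that event ids and info keys do not contain the separators.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical reader-side filter, not part of the patch.
final class LatestEventInfoReader {
    // Returns the info key-values of the given event that carry its latest
    // timestamp. Columns are named e!eventId?eventInfoKey?eventTimestamp.
    static Map<String, String> latestInfo(Map<String, String> columns,
                                          String eventId) {
        String prefix = "e!" + eventId + "?";
        long latest = Long.MIN_VALUE;
        Map<String, String> result = new HashMap<>();
        for (Map.Entry<String, String> cell : columns.entrySet()) {
            String name = cell.getKey();
            if (!name.startsWith(prefix)) {
                continue; // belongs to another event or is not an event column
            }
            int lastSep = name.lastIndexOf('?');
            if (lastSep < prefix.length()) {
                continue; // no timestamp field, skip malformed names
            }
            long ts;
            try {
                ts = Long.parseLong(name.substring(lastSep + 1));
            } catch (NumberFormatException e) {
                continue; // timestamp is not a decimal number, skip
            }
            if (ts > latest) {
                latest = ts;    // found a newer timestamp:
                result.clear(); // drop values from older timestamps
            }
            if (ts == latest) {
                result.put(name.substring(prefix.length(), lastSep),
                           cell.getValue());
            }
        }
        return result;
    }
}
```

This reads "latest" as the single newest timestamp per event; if the reader should instead return the newest value of each info key independently, the comparison would have to be tracked per key.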

For example, in the example given by [~zjshen] above:
bq. I think proper logic is: if we put <event1, ts1> and <event1, ts2>, we 
should have two separate records persisted; and if we put <event1, ts1, info: 
[k1=v1, k2=v2]> and <event1, ts1, info: [k1=v1']> again, we should update the 
same record and let k1=v1'.
Yes, this will be stored as you describe. But when reading, we will get back 
the values for all event timestamps, since the timestamps are part of the 
column key, so the reader now needs to know which ones to return.
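
The write and read behaviour described here can be modelled with a small in-memory sketch. A sorted map stands in for the columns of one HBase row; this is not the HBase API, the class and method names are hypothetical, and the decimal timestamp encoding is an assumption.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model of one row's event columns.
final class EventColumnStore {
    // Sorted by column name, similar to how columns are ordered within a row.
    private final TreeMap<String, String> columns = new TreeMap<>();

    // Same (eventId, timestamp, infoKey) overwrites the value; a different
    // timestamp creates a separate column.
    void putEventInfo(String eventId, long timestamp,
                      String infoKey, String value) {
        columns.put("e!" + eventId + "?" + infoKey + "?" + timestamp, value);
    }

    // Prefix read: returns the columns of the event for ALL timestamps.
    Map<String, String> readEvent(String eventId) {
        String prefix = "e!" + eventId + "?";
        Map<String, String> out = new TreeMap<>();
        for (Map.Entry<String, String> cell
                : columns.tailMap(prefix, true).entrySet()) {
            if (!cell.getKey().startsWith(prefix)) {
                break; // past the contiguous block of matching names
            }
            out.put(cell.getKey(), cell.getValue());
        }
        return out;
    }
}
```

Putting k1 and k2 for event1 at ts1, then k1 at ts2, then k1 at ts1 again leaves three columns: the ts1 value of k1 is updated in place and the ts2 value is a separate record. A prefix read for event1 returns all three, which is why the reader has to choose among timestamps.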



> Bugs in HBaseTimelineWriterImpl
> -------------------------------
>
>                 Key: YARN-3908
>                 URL: https://issues.apache.org/jira/browse/YARN-3908
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Zhijie Shen
>            Assignee: Vrushali C
>         Attachments: YARN-3908-YARN-2928.001.patch, 
> YARN-3908-YARN-2928.002.patch, YARN-3908-YARN-2928.003.patch, 
> YARN-3908-YARN-2928.004.patch, YARN-3908-YARN-2928.004.patch, 
> YARN-3908-YARN-2928.005.patch
>
>
> 1. In HBaseTimelineWriterImpl, the info column family contains the basic 
> fields of a timeline entity plus events. However, entity#info map is not 
> stored at all.
> 2. event#timestamp is also not persisted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
