[ 
https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735452#comment-14735452
 ] 

Vrushali C commented on YARN-3901:
----------------------------------

Thanks [~sjlee0] for the review!

I will correct the variable ordering for static and private members as well as 
making variables final.

bq. l.210: Strictly speaking, GenericObjectMapper will return an integer if the 
value fits within an integer; so it's not exactly a concern for min/max 
(timestamps) but for caution we might want to stay with Number instead of long
Comparisons are not allowed for Number datatype. 
{code} 
The operator < is undefined for the argument type(s) java.lang.Number, 
java.lang.Number
{code} 

So I would have to do something like {code} Number d = a.longValue() + 
b.longValue(); {code}  Do you think this is better? 

bq. l.52: Is the TimestampGenerator class going to be used outside 
FlowRunCoprocessor? If not, I would argue that we should make it an inner class 
of FlowRunCoprocessor. At least we should make it non-public to keep it within 
the package. If it would see general use outside this class, then it might be 
better to make it a true public class in the common package. I suspect a 
non-public class might be what we want here.
I am thinking I will need this when the flush/compaction scanner is added in. 
If you'd like, I can move it in as a non-public class for now and then move it 
out if needed. 

bq. It's up to you, but you could leave the row key improvement to YARN-4074. 
That might be easier to manage the changes between yours and mine. I'm 
restructuring all *RowKey classes uniformly.
I actually needed this in the unit test while checking the FlowActivityTable 
contents, if you want I can take it out and you can add that test case in when 
you add in the RowKey changes? 

bq. l.144: This would mean that some cell timestamps would have the unit of the 
milliseconds and others would be in nanoseconds. I'm a little bit concerned if 
we ever interpret these timestamps incorrectly. Could there be a more explicit 
way of clearly differentiating them? I don't have good suggestions at the 
moment.
Yeah, I was thinking about that too. Right now, metrics will get their own 
timestamps. For other columns, we'd be using the nanoseconds. I am trying to 
see if we can just use milliseconds.

bq. it might be good to have short comments on what each method is testing
I did try to make the unit test names themselves descriptive like 
testFlowActivityTable or testWriteFlowRunMinMaxToHBase or 
testWriteFlowRunMetricsOneFlow or testWriteFlowActivityOneFlow but I agree some 
more comments in the unit test will surely help. 

Will upload a new patch shortly, thanks! 


> Populate flow run data in the flow_run & flow activity tables
> -------------------------------------------------------------
>
>                 Key: YARN-3901
>                 URL: https://issues.apache.org/jira/browse/YARN-3901
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Vrushali C
>            Assignee: Vrushali C
>         Attachments: YARN-3901-YARN-2928.1.patch, 
> YARN-3901-YARN-2928.2.patch, YARN-3901-YARN-2928.3.patch, 
> YARN-3901-YARN-2928.4.patch
>
>
> As per the schema proposed in YARN-3815 in 
> https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf
> filing jira to track creation and population of data in the flow run table. 
> Some points that are being  considered:
> - Stores per flow run information aggregated across applications, flow version
> RM’s collector writes to on app creation and app completion
> - Per App collector writes to it for metric updates at a slower frequency 
> than the metric updates to application table
> primary key: cluster ! user ! flow ! flow run id
> - Only the latest version of flow-level aggregated metrics will be kept, even 
> if the entity and application level keep a timeseries.
> - The running_apps column will be incremented on app creation, and 
> decremented on app completion.
> - For min_start_time the RM writer will simply write a value with the tag for 
> the applicationId. A coprocessor will return the min value of all written 
> values. - 
> - Upon flush and compactions, the min value between all the cells of this 
> column will be written to the cell without any tag (empty tag) and all the 
> other cells will be discarded.
> - Ditto for the max_end_time, but then the max will be kept.
> - Tags are represented as #type:value. The type can be not set (0), or can 
> indicate running (1) or complete (2). In those cases (for metrics) only 
> complete app metrics are collapsed on compaction.
> - The m! values are aggregated (summed) upon read. Only when applications are 
> completed (indicated by tag type 2) can the values be collapsed.
> - The application ids that have completed and been aggregated into the flow 
> numbers are retained in a separate column for historical tracking: we don’t 
> want to re-aggregate for those upon replay
> 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to