[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run table

2015-09-02 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14727227#comment-14727227
 ] 

Varun Saxena commented on YARN-3901:


Class {{org.apache.hadoop.yarn.server.timelineservice.storage.flow.Attribute}} 
has been missed from the patch

> Populate flow run data in the flow_run table
> 
>
> Key: YARN-3901
> URL: https://issues.apache.org/jira/browse/YARN-3901
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Vrushali C
>Assignee: Vrushali C
> Attachments: YARN-3901-YARN-2928.1.patch, 
> YARN-3901-YARN-2928.2.patch, YARN-3901-YARN-2928.WIP.2.patch, 
> YARN-3901-YARN-2928.WIP.patch
>
>
> As per the schema proposed in YARN-3815 in 
> https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf
> filing jira to track creation and population of data in the flow run table. 
> Some points that are being  considered:
> - Stores per flow run information aggregated across applications, flow version
> RM’s collector writes to on app creation and app completion
> - Per App collector writes to it for metric updates at a slower frequency 
> than the metric updates to application table
> primary key: cluster ! user ! flow ! flow run id
> - Only the latest version of flow-level aggregated metrics will be kept, even 
> if the entity and application level keep a timeseries.
> - The running_apps column will be incremented on app creation, and 
> decremented on app completion.
> - For min_start_time the RM writer will simply write a value with the tag for 
> the applicationId. A coprocessor will return the min value of all written 
> values. - 
> - Upon flush and compactions, the min value between all the cells of this 
> column will be written to the cell without any tag (empty tag) and all the 
> other cells will be discarded.
> - Ditto for the max_end_time, but then the max will be kept.
> - Tags are represented as #type:value. The type can be not set (0), or can 
> indicate running (1) or complete (2). In those cases (for metrics) only 
> complete app metrics are collapsed on compaction.
> - The m! values are aggregated (summed) upon read. Only when applications are 
> completed (indicated by tag type 2) can the values be collapsed.
> - The application ids that have completed and been aggregated into the flow 
> numbers are retained in a separate column for historical tracking: we don’t 
> want to re-aggregate for those upon replay
> 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run table

2015-09-01 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14725852#comment-14725852
 ] 

Varun Saxena commented on YARN-3901:


Yes, exactly the same exception.

> Populate flow run data in the flow_run table
> 
>
> Key: YARN-3901
> URL: https://issues.apache.org/jira/browse/YARN-3901
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Vrushali C
>Assignee: Vrushali C
> Attachments: YARN-3901-YARN-2928.1.patch, 
> YARN-3901-YARN-2928.WIP.2.patch, YARN-3901-YARN-2928.WIP.patch
>
>
> As per the schema proposed in YARN-3815 in 
> https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf
> filing jira to track creation and population of data in the flow run table. 
> Some points that are being  considered:
> - Stores per flow run information aggregated across applications, flow version
> RM’s collector writes to on app creation and app completion
> - Per App collector writes to it for metric updates at a slower frequency 
> than the metric updates to application table
> primary key: cluster ! user ! flow ! flow run id
> - Only the latest version of flow-level aggregated metrics will be kept, even 
> if the entity and application level keep a timeseries.
> - The running_apps column will be incremented on app creation, and 
> decremented on app completion.
> - For min_start_time the RM writer will simply write a value with the tag for 
> the applicationId. A coprocessor will return the min value of all written 
> values. - 
> - Upon flush and compactions, the min value between all the cells of this 
> column will be written to the cell without any tag (empty tag) and all the 
> other cells will be discarded.
> - Ditto for the max_end_time, but then the max will be kept.
> - Tags are represented as #type:value. The type can be not set (0), or can 
> indicate running (1) or complete (2). In those cases (for metrics) only 
> complete app metrics are collapsed on compaction.
> - The m! values are aggregated (summed) upon read. Only when applications are 
> completed (indicated by tag type 2) can the values be collapsed.
> - The application ids that have completed and been aggregated into the flow 
> numbers are retained in a separate column for historical tracking: we don’t 
> want to re-aggregate for those upon replay
> 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run table

2015-09-01 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14725845#comment-14725845
 ] 

Sangjin Lee commented on YARN-3901:
---

Yeah, I can basically reproduce this issue too. What you said is largely 
correct. If you try to read the flow run metrics via 
{{readResultsWithTimestamps()}}, it will choke on 
{{GenericObjectMapper.read()}}. This is what I see:

{noformat}
org.codehaus.jackson.JsonParseException: Illegal character ((CTRL-CHAR, code 
0)): only regular white space (\r, \n, \t) is allowed between tokens
 at [Source: [B@19fcfd96; line: 1, column: 2]
at org.codehaus.jackson.JsonParser._constructError(JsonParser.java:1433)
at 
org.codehaus.jackson.impl.JsonParserMinimalBase._reportError(JsonParserMinimalBase.java:521)
at 
org.codehaus.jackson.impl.JsonParserMinimalBase._throwInvalidSpace(JsonParserMinimalBase.java:467)
at 
org.codehaus.jackson.impl.ReaderBasedParser._skipWSOrEnd(ReaderBasedParser.java:1491)
at 
org.codehaus.jackson.impl.ReaderBasedParser.nextToken(ReaderBasedParser.java:368)
at 
org.codehaus.jackson.map.ObjectReader._initForReading(ObjectReader.java:828)
at 
org.codehaus.jackson.map.ObjectReader._bindAndClose(ObjectReader.java:752)
at 
org.codehaus.jackson.map.ObjectReader.readValue(ObjectReader.java:486)
at 
org.apache.hadoop.yarn.server.timeline.GenericObjectMapper.read(GenericObjectMapper.java:93)
at 
org.apache.hadoop.yarn.server.timeline.GenericObjectMapper.read(GenericObjectMapper.java:77)
at 
org.apache.hadoop.yarn.server.timelineservice.storage.common.ColumnHelper.readResultsWithTimestamps(ColumnHelper.java:202)
at 
org.apache.hadoop.yarn.server.timelineservice.storage.flow.FlowRunColumnPrefix.readResultsWithTimestamps(FlowRunColumnPrefix.java:155)
at 
org.apache.hadoop.yarn.server.timelineservice.storage.HBaseTimelineReaderImpl.readMetrics(HBaseTimelineReaderImpl.java:513)
at 
org.apache.hadoop.yarn.server.timelineservice.storage.HBaseTimelineReaderImpl.readFlowRunEntity(HBaseTimelineReaderImpl.java:669)
at 
org.apache.hadoop.yarn.server.timelineservice.storage.HBaseTimelineReaderImpl.getFlowRunEntity(HBaseTimelineReaderImpl.java:625)
at 
org.apache.hadoop.yarn.server.timelineservice.storage.HBaseTimelineReaderImpl.getEntity(HBaseTimelineReaderImpl.java:122)
{noformat}

Is that what you see, or is it different?

At any rate, the data is being written different that the GenericObjectMapper 
cannot parse it, it seems.

> Populate flow run data in the flow_run table
> 
>
> Key: YARN-3901
> URL: https://issues.apache.org/jira/browse/YARN-3901
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Vrushali C
>Assignee: Vrushali C
> Attachments: YARN-3901-YARN-2928.1.patch, 
> YARN-3901-YARN-2928.WIP.2.patch, YARN-3901-YARN-2928.WIP.patch
>
>
> As per the schema proposed in YARN-3815 in 
> https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf
> filing jira to track creation and population of data in the flow run table. 
> Some points that are being  considered:
> - Stores per flow run information aggregated across applications, flow version
> RM’s collector writes to on app creation and app completion
> - Per App collector writes to it for metric updates at a slower frequency 
> than the metric updates to application table
> primary key: cluster ! user ! flow ! flow run id
> - Only the latest version of flow-level aggregated metrics will be kept, even 
> if the entity and application level keep a timeseries.
> - The running_apps column will be incremented on app creation, and 
> decremented on app completion.
> - For min_start_time the RM writer will simply write a value with the tag for 
> the applicationId. A coprocessor will return the min value of all written 
> values. - 
> - Upon flush and compactions, the min value between all the cells of this 
> column will be written to the cell without any tag (empty tag) and all the 
> other cells will be discarded.
> - Ditto for the max_end_time, but then the max will be kept.
> - Tags are represented as #type:value. The type can be not set (0), or can 
> indicate running (1) or complete (2). In those cases (for metrics) only 
> complete app metrics are collapsed on compaction.
> - The m! values are aggregated (summed) upon read. Only when applications are 
> completed (indicated by tag type 2) can the values be collapsed.
> - The application ids that have completed and been aggregated into the flow 
> numbers are retained in a separate column for historical tracking: we don’t 
> want to re-aggregate for those upon replay
> 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run table

2015-09-01 Thread Vrushali C (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14725770#comment-14725770
 ] 

Vrushali C commented on YARN-3901:
--

Thanks for the update [~varun_saxena]. I will try this out in the order you 
have described and see how the write/read can be corrected. 

> Populate flow run data in the flow_run table
> 
>
> Key: YARN-3901
> URL: https://issues.apache.org/jira/browse/YARN-3901
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Vrushali C
>Assignee: Vrushali C
> Attachments: YARN-3901-YARN-2928.1.patch, 
> YARN-3901-YARN-2928.WIP.2.patch, YARN-3901-YARN-2928.WIP.patch
>
>
> As per the schema proposed in YARN-3815 in 
> https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf
> filing jira to track creation and population of data in the flow run table. 
> Some points that are being  considered:
> - Stores per flow run information aggregated across applications, flow version
> RM’s collector writes to on app creation and app completion
> - Per App collector writes to it for metric updates at a slower frequency 
> than the metric updates to application table
> primary key: cluster ! user ! flow ! flow run id
> - Only the latest version of flow-level aggregated metrics will be kept, even 
> if the entity and application level keep a timeseries.
> - The running_apps column will be incremented on app creation, and 
> decremented on app completion.
> - For min_start_time the RM writer will simply write a value with the tag for 
> the applicationId. A coprocessor will return the min value of all written 
> values. - 
> - Upon flush and compactions, the min value between all the cells of this 
> column will be written to the cell without any tag (empty tag) and all the 
> other cells will be discarded.
> - Ditto for the max_end_time, but then the max will be kept.
> - Tags are represented as #type:value. The type can be not set (0), or can 
> indicate running (1) or complete (2). In those cases (for metrics) only 
> complete app metrics are collapsed on compaction.
> - The m! values are aggregated (summed) upon read. Only when applications are 
> completed (indicated by tag type 2) can the values be collapsed.
> - The application ids that have completed and been aggregated into the flow 
> numbers are retained in a separate column for historical tracking: we don’t 
> want to re-aggregate for those upon replay
> 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run table

2015-09-01 Thread Vrushali C (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14726090#comment-14726090
 ] 

Vrushali C commented on YARN-3901:
--

[~varun_saxena] and [~sjlee0] I see from the code that value is being returned 
as long in the FlowScanner for metrics. I will update the code in my next 
upcoming patch and ensure this value can be read by GenericObjectMapper. 

> Populate flow run data in the flow_run table
> 
>
> Key: YARN-3901
> URL: https://issues.apache.org/jira/browse/YARN-3901
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Vrushali C
>Assignee: Vrushali C
> Attachments: YARN-3901-YARN-2928.1.patch, 
> YARN-3901-YARN-2928.WIP.2.patch, YARN-3901-YARN-2928.WIP.patch
>
>
> As per the schema proposed in YARN-3815 in 
> https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf
> filing jira to track creation and population of data in the flow run table. 
> Some points that are being  considered:
> - Stores per flow run information aggregated across applications, flow version
> RM’s collector writes to on app creation and app completion
> - Per App collector writes to it for metric updates at a slower frequency 
> than the metric updates to application table
> primary key: cluster ! user ! flow ! flow run id
> - Only the latest version of flow-level aggregated metrics will be kept, even 
> if the entity and application level keep a timeseries.
> - The running_apps column will be incremented on app creation, and 
> decremented on app completion.
> - For min_start_time the RM writer will simply write a value with the tag for 
> the applicationId. A coprocessor will return the min value of all written 
> values. - 
> - Upon flush and compactions, the min value between all the cells of this 
> column will be written to the cell without any tag (empty tag) and all the 
> other cells will be discarded.
> - Ditto for the max_end_time, but then the max will be kept.
> - Tags are represented as #type:value. The type can be not set (0), or can 
> indicate running (1) or complete (2). In those cases (for metrics) only 
> complete app metrics are collapsed on compaction.
> - The m! values are aggregated (summed) upon read. Only when applications are 
> completed (indicated by tag type 2) can the values be collapsed.
> - The application ids that have completed and been aggregated into the flow 
> numbers are retained in a separate column for historical tracking: we don’t 
> want to re-aggregate for those upon replay
> 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run table

2015-09-01 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14725521#comment-14725521
 ] 

Varun Saxena commented on YARN-3901:


While writing tests for YARN-4075 encountered below issue :
Applied YARN-3901 and YARN-4074 to work on top of it for YARN-4075. 

When the metrics are being written to flow run table they are being stored as 
longs but they are stored as strings otherwise(for other tables).
The only change I see in YARN-3901 for storing is that in ColumnHelper#store we 
are now using Put#add(KeyValue) instead of Put#addColumn. But even in that, 
when tags/attributes are not used(for other tables), behavior is same as before.
Because of this on the read side, when metrics are read by calling 
ColumnHelper#readResultsWithTimestamps, its throwing an exception while using 
GenericObjectMapper#read because bytes are stored as long.
 
I checked KeyValue#getValue and that's still a String before calling mutate(in 
all cases).
Is some conversion being done by HBase internally due to tags ? 

If yes, why this inconsistent behavior ?

cc [~vrushalic], [~sjlee0]

> Populate flow run data in the flow_run table
> 
>
> Key: YARN-3901
> URL: https://issues.apache.org/jira/browse/YARN-3901
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Vrushali C
>Assignee: Vrushali C
> Attachments: YARN-3901-YARN-2928.1.patch, 
> YARN-3901-YARN-2928.WIP.2.patch, YARN-3901-YARN-2928.WIP.patch
>
>
> As per the schema proposed in YARN-3815 in 
> https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf
> filing jira to track creation and population of data in the flow run table. 
> Some points that are being  considered:
> - Stores per flow run information aggregated across applications, flow version
> RM’s collector writes to on app creation and app completion
> - Per App collector writes to it for metric updates at a slower frequency 
> than the metric updates to application table
> primary key: cluster ! user ! flow ! flow run id
> - Only the latest version of flow-level aggregated metrics will be kept, even 
> if the entity and application level keep a timeseries.
> - The running_apps column will be incremented on app creation, and 
> decremented on app completion.
> - For min_start_time the RM writer will simply write a value with the tag for 
> the applicationId. A coprocessor will return the min value of all written 
> values. - 
> - Upon flush and compactions, the min value between all the cells of this 
> column will be written to the cell without any tag (empty tag) and all the 
> other cells will be discarded.
> - Ditto for the max_end_time, but then the max will be kept.
> - Tags are represented as #type:value. The type can be not set (0), or can 
> indicate running (1) or complete (2). In those cases (for metrics) only 
> complete app metrics are collapsed on compaction.
> - The m! values are aggregated (summed) upon read. Only when applications are 
> completed (indicated by tag type 2) can the values be collapsed.
> - The application ids that have completed and been aggregated into the flow 
> numbers are retained in a separate column for historical tracking: we don’t 
> want to re-aggregate for those upon replay
> 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run table

2015-08-31 Thread Li Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14724416#comment-14724416
 ] 

Li Lu commented on YARN-3901:
-

Hi [~vrushalic], I thought you might be working on a new version of patch, so 
perhaps I should wait for another version and then post my detailed commnets? 
Back to the previous discussions:

bq. During table creation time, we specify the coprocessor class. This can also 
be done later by alter table command as desired.
This is totally fine for our POC. However, we do need to think about deployment 
in the future, together with many other challenges like Phoenix and/or offline 
aggregation. 

bq. There are some differences between the two aggregations, I think. Not sure 
if the classes can be reused without complicating development efforts. For the 
PoC I would like to focus on these tables independently. We could file follow 
up jiras to refactor the code as we see fit when the whole picture emerges, 
does that sound good?
Sure. Actually I was wondering if the strategy we're using here is also 
applicable to app level aggregations, since both of them receives online data 
and store them in HBase (our online storage, compare to Phoenix). Our 
time-based aggregator works in a quite different way, where it reads data from 
online storage and aggregate data in a batched fashion. This said, maybe we 
should use the approach in this patch as a general "online aggregation" 
approach, and provide aggregate APIs in timeline metric class for offline 
aggregators? 

> Populate flow run data in the flow_run table
> 
>
> Key: YARN-3901
> URL: https://issues.apache.org/jira/browse/YARN-3901
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Vrushali C
>Assignee: Vrushali C
> Attachments: YARN-3901-YARN-2928.1.patch, 
> YARN-3901-YARN-2928.WIP.2.patch, YARN-3901-YARN-2928.WIP.patch
>
>
> As per the schema proposed in YARN-3815 in 
> https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf
> filing jira to track creation and population of data in the flow run table. 
> Some points that are being  considered:
> - Stores per flow run information aggregated across applications, flow version
> RM’s collector writes to on app creation and app completion
> - Per App collector writes to it for metric updates at a slower frequency 
> than the metric updates to application table
> primary key: cluster ! user ! flow ! flow run id
> - Only the latest version of flow-level aggregated metrics will be kept, even 
> if the entity and application level keep a timeseries.
> - The running_apps column will be incremented on app creation, and 
> decremented on app completion.
> - For min_start_time the RM writer will simply write a value with the tag for 
> the applicationId. A coprocessor will return the min value of all written 
> values. - 
> - Upon flush and compactions, the min value between all the cells of this 
> column will be written to the cell without any tag (empty tag) and all the 
> other cells will be discarded.
> - Ditto for the max_end_time, but then the max will be kept.
> - Tags are represented as #type:value. The type can be not set (0), or can 
> indicate running (1) or complete (2). In those cases (for metrics) only 
> complete app metrics are collapsed on compaction.
> - The m! values are aggregated (summed) upon read. Only when applications are 
> completed (indicated by tag type 2) can the values be collapsed.
> - The application ids that have completed and been aggregated into the flow 
> numbers are retained in a separate column for historical tracking: we don’t 
> want to re-aggregate for those upon replay
> 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run table

2015-08-31 Thread Vrushali C (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14724518#comment-14724518
 ] 

Vrushali C commented on YARN-3901:
--

Hey [~gtCarrera9]

Yes, I am working on an update to the patch. Had one question though.

bq. This is totally fine for our POC. However, we do need to think about 
deployment in the future, together with many other challenges like Phoenix 
and/or offline aggregation
I am not sure I understand. HBase Coprocessor classes are added to table 
schemas at table creation time or by altering the table. This isn't a short cut 
being done for this PoC. This is the way a coprocessor class is being added. 
Are you saying this is not fine? Am open to recommendations. 

thanks
Vrushali

> Populate flow run data in the flow_run table
> 
>
> Key: YARN-3901
> URL: https://issues.apache.org/jira/browse/YARN-3901
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Vrushali C
>Assignee: Vrushali C
> Attachments: YARN-3901-YARN-2928.1.patch, 
> YARN-3901-YARN-2928.WIP.2.patch, YARN-3901-YARN-2928.WIP.patch
>
>
> As per the schema proposed in YARN-3815 in 
> https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf
> filing jira to track creation and population of data in the flow run table. 
> Some points that are being  considered:
> - Stores per flow run information aggregated across applications, flow version
> RM’s collector writes to on app creation and app completion
> - Per App collector writes to it for metric updates at a slower frequency 
> than the metric updates to application table
> primary key: cluster ! user ! flow ! flow run id
> - Only the latest version of flow-level aggregated metrics will be kept, even 
> if the entity and application level keep a timeseries.
> - The running_apps column will be incremented on app creation, and 
> decremented on app completion.
> - For min_start_time the RM writer will simply write a value with the tag for 
> the applicationId. A coprocessor will return the min value of all written 
> values. - 
> - Upon flush and compactions, the min value between all the cells of this 
> column will be written to the cell without any tag (empty tag) and all the 
> other cells will be discarded.
> - Ditto for the max_end_time, but then the max will be kept.
> - Tags are represented as #type:value. The type can be not set (0), or can 
> indicate running (1) or complete (2). In those cases (for metrics) only 
> complete app metrics are collapsed on compaction.
> - The m! values are aggregated (summed) upon read. Only when applications are 
> completed (indicated by tag type 2) can the values be collapsed.
> - The application ids that have completed and been aggregated into the flow 
> numbers are retained in a separate column for historical tracking: we don’t 
> want to re-aggregate for those upon replay
> 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run table

2015-08-31 Thread Li Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14724527#comment-14724527
 ] 

Li Lu commented on YARN-3901:
-

bq. HBase Coprocessor classes are added to table schemas at table creation time 
or by altering the table. This isn't a short cut being done for this PoC. This 
is the way a coprocessor class is being added. Are you saying this is not fine?
Oh this is totally fine. I'm just trying to make sure we're considering 
deployment issue. I'm OK with the proposal here. 

> Populate flow run data in the flow_run table
> 
>
> Key: YARN-3901
> URL: https://issues.apache.org/jira/browse/YARN-3901
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Vrushali C
>Assignee: Vrushali C
> Attachments: YARN-3901-YARN-2928.1.patch, 
> YARN-3901-YARN-2928.WIP.2.patch, YARN-3901-YARN-2928.WIP.patch
>
>
> As per the schema proposed in YARN-3815 in 
> https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf
> filing jira to track creation and population of data in the flow run table. 
> Some points that are being  considered:
> - Stores per flow run information aggregated across applications, flow version
> RM’s collector writes to on app creation and app completion
> - Per App collector writes to it for metric updates at a slower frequency 
> than the metric updates to application table
> primary key: cluster ! user ! flow ! flow run id
> - Only the latest version of flow-level aggregated metrics will be kept, even 
> if the entity and application level keep a timeseries.
> - The running_apps column will be incremented on app creation, and 
> decremented on app completion.
> - For min_start_time the RM writer will simply write a value with the tag for 
> the applicationId. A coprocessor will return the min value of all written 
> values. - 
> - Upon flush and compactions, the min value between all the cells of this 
> column will be written to the cell without any tag (empty tag) and all the 
> other cells will be discarded.
> - Ditto for the max_end_time, but then the max will be kept.
> - Tags are represented as #type:value. The type can be not set (0), or can 
> indicate running (1) or complete (2). In those cases (for metrics) only 
> complete app metrics are collapsed on compaction.
> - The m! values are aggregated (summed) upon read. Only when applications are 
> completed (indicated by tag type 2) can the values be collapsed.
> - The application ids that have completed and been aggregated into the flow 
> numbers are retained in a separate column for historical tracking: we don’t 
> want to re-aggregate for those upon replay
> 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run table

2015-08-31 Thread Vrushali C (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14724556#comment-14724556
 ] 

Vrushali C commented on YARN-3901:
--

bq.  I'm just trying to make sure we're considering deployment issue. 

Yes, right on deployment and upgrade of coprocessors is something that 
needs to be thought through and we need to have guidelines setup for those. It 
will need restart of region servers and in production, we need to think this 
through carefully. 

> Populate flow run data in the flow_run table
> 
>
> Key: YARN-3901
> URL: https://issues.apache.org/jira/browse/YARN-3901
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Vrushali C
>Assignee: Vrushali C
> Attachments: YARN-3901-YARN-2928.1.patch, 
> YARN-3901-YARN-2928.WIP.2.patch, YARN-3901-YARN-2928.WIP.patch
>
>
> As per the schema proposed in YARN-3815 in 
> https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf
> filing jira to track creation and population of data in the flow run table. 
> Some points that are being  considered:
> - Stores per flow run information aggregated across applications, flow version
> RM’s collector writes to on app creation and app completion
> - Per App collector writes to it for metric updates at a slower frequency 
> than the metric updates to application table
> primary key: cluster ! user ! flow ! flow run id
> - Only the latest version of flow-level aggregated metrics will be kept, even 
> if the entity and application level keep a timeseries.
> - The running_apps column will be incremented on app creation, and 
> decremented on app completion.
> - For min_start_time the RM writer will simply write a value with the tag for 
> the applicationId. A coprocessor will return the min value of all written 
> values. - 
> - Upon flush and compactions, the min value between all the cells of this 
> column will be written to the cell without any tag (empty tag) and all the 
> other cells will be discarded.
> - Ditto for the max_end_time, but then the max will be kept.
> - Tags are represented as #type:value. The type can be not set (0), or can 
> indicate running (1) or complete (2). In those cases (for metrics) only 
> complete app metrics are collapsed on compaction.
> - The m! values are aggregated (summed) upon read. Only when applications are 
> completed (indicated by tag type 2) can the values be collapsed.
> - The application ids that have completed and been aggregated into the flow 
> numbers are retained in a separate column for historical tracking: we don’t 
> want to re-aggregate for those upon replay
> 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run table

2015-08-28 Thread Li Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14720377#comment-14720377
 ] 

Li Lu commented on YARN-3901:
-

Hi [~vrushalic], thanks for the work! While looking at the latest patch, I have 
some general questions. IIUC, we now directly write data into the flow run 
related tables upon application start, finish, and periodic flush, and we only 
perform the aggregations in our coprocessors? I remember this design and I 
think this looks fine, but w.r.t the coprocessors, I'm unclear about:
# How are those coprocessor connected. Is it through an Hbase configuration 
externally, or there're some lines set them up in this patch that I missed 
(which is quite possible)? 
# I noticed you're performing aggregation work in the coprocessor 
(FlowScanner), this is slightly different to the approach in YARN-3816 (app 
level aggregation). My hunch is that we may need some sort of common APIs for 
aggregating metrics, so that we can centralize the aggregation logic? Or, why 
is the flow run level aggregation significantly different to app level 
aggregation (so that we cannot share the same aggregation logic)?

I'll keep looking at this patch later today, more comments may come during the 
weekend or next Monday. 

 Populate flow run data in the flow_run table
 

 Key: YARN-3901
 URL: https://issues.apache.org/jira/browse/YARN-3901
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Vrushali C
Assignee: Vrushali C
 Attachments: YARN-3901-YARN-2928.1.patch, 
 YARN-3901-YARN-2928.WIP.2.patch, YARN-3901-YARN-2928.WIP.patch


 As per the schema proposed in YARN-3815 in 
 https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf
 filing jira to track creation and population of data in the flow run table. 
 Some points that are being  considered:
 - Stores per flow run information aggregated across applications, flow version
 RM’s collector writes to on app creation and app completion
 - Per App collector writes to it for metric updates at a slower frequency 
 than the metric updates to application table
 primary key: cluster ! user ! flow ! flow run id
 - Only the latest version of flow-level aggregated metrics will be kept, even 
 if the entity and application level keep a timeseries.
 - The running_apps column will be incremented on app creation, and 
 decremented on app completion.
 - For min_start_time the RM writer will simply write a value with the tag for 
 the applicationId. A coprocessor will return the min value of all written 
 values. - 
 - Upon flush and compactions, the min value between all the cells of this 
 column will be written to the cell without any tag (empty tag) and all the 
 other cells will be discarded.
 - Ditto for the max_end_time, but then the max will be kept.
 - Tags are represented as #type:value. The type can be not set (0), or can 
 indicate running (1) or complete (2). In those cases (for metrics) only 
 complete app metrics are collapsed on compaction.
 - The m! values are aggregated (summed) upon read. Only when applications are 
 completed (indicated by tag type 2) can the values be collapsed.
 - The application ids that have completed and been aggregated into the flow 
 numbers are retained in a separate column for historical tracking: we don’t 
 want to re-aggregate for those upon replay
 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run table

2015-08-28 Thread Vrushali C (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14720392#comment-14720392
 ] 

Vrushali C commented on YARN-3901:
--

Hi [~gtCarrera9]

Thanks for the first review pass! To answer your questions:
bq.  IIUC, we now directly write data into the flow run related tables upon 
application start, finish, and periodic flush, and we only perform the 
aggregations in our coprocessors
Yes, data is written to flow run and flow activity tables in a quick simple 
write but the correct values to be returned are determined at read time AND 
(TBD) at flush/compaction time. During flush/compaction, the data from various 
cells will be 'merged' into fewer number of cells so that next read calls are 
faster.

bq.  How are those coprocessor connected. Is it through an Hbase configuration 
externally, or there're some lines set them up in this patch that I missed 
(which is quite possible)?

During table creation time, we specify the coprocessor class. This can also be 
done later by alter table command as desired.

bq. I noticed you're performing aggregation work in the coprocessor 
(FlowScanner), this is slightly different to the approach in YARN-3816 (app 
level aggregation). My hunch is that we may need some sort of common APIs for 
aggregating metrics, so that we can centralize the aggregation logic? Or, why 
is the flow run level aggregation significantly different to app level 
aggregation (so that we cannot share the same aggregation logic)?

There are some differences between the two aggregations, I think. Not sure if 
the classes can be reused without complicating development efforts. For the PoC 
I would like to focus on these tables independently. We could file follow up 
jiras to refactor the code as we see fit when the whole picture emerges, does 
that sound good?

Keep the questions coming, thanks!


 Populate flow run data in the flow_run table
 

 Key: YARN-3901
 URL: https://issues.apache.org/jira/browse/YARN-3901
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Vrushali C
Assignee: Vrushali C
 Attachments: YARN-3901-YARN-2928.1.patch, 
 YARN-3901-YARN-2928.WIP.2.patch, YARN-3901-YARN-2928.WIP.patch


 As per the schema proposed in YARN-3815 in 
 https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf
 filing jira to track creation and population of data in the flow run table. 
 Some points that are being  considered:
 - Stores per flow run information aggregated across applications, flow version
 RM’s collector writes to on app creation and app completion
 - Per App collector writes to it for metric updates at a slower frequency 
 than the metric updates to application table
 primary key: cluster ! user ! flow ! flow run id
 - Only the latest version of flow-level aggregated metrics will be kept, even 
 if the entity and application level keep a timeseries.
 - The running_apps column will be incremented on app creation, and 
 decremented on app completion.
 - For min_start_time the RM writer will simply write a value with the tag for 
 the applicationId. A coprocessor will return the min value of all written 
 values. - 
 - Upon flush and compactions, the min value between all the cells of this 
 column will be written to the cell without any tag (empty tag) and all the 
 other cells will be discarded.
 - Ditto for the max_end_time, but then the max will be kept.
 - Tags are represented as #type:value. The type can be not set (0), or can 
 indicate running (1) or complete (2). In those cases (for metrics) only 
 complete app metrics are collapsed on compaction.
 - The m! values are aggregated (summed) upon read. Only when applications are 
 completed (indicated by tag type 2) can the values be collapsed.
 - The application ids that have completed and been aggregated into the flow 
 numbers are retained in a separate column for historical tracking: we don’t 
 want to re-aggregate for those upon replay
 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run table

2015-08-28 Thread Joep Rottinghuis (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14720837#comment-14720837
 ] 

Joep Rottinghuis commented on YARN-3901:


Reviewed / discussed 1.patch with [~vrushalic]
Comments may sound cryptic to others, but roughly we discussed these changes to 
make things generic (and more clear / reusable for the long run):

No timestamp needed in FlowActivity table. Runs can start one day and end 
another.
Probably start without, add later if needed. 

Min/Max how does the use of app-id be needed?
FlowScanner currentMinCell should not consider app ID.
If there is a start time for an app id, and then later another start, we should 
still keep the min, not the latest value.

UI based on FlowActivity can enumerate active flows for that day, plus show 
number of runs, and # of distinct versions.

Update javadoc on FlowRunKey.

FlowRunTable add increment and decrement for number of running apps (during 
start and app end).

MIN, MAX, SUM, SUM_FINAL should be AggOps

Aggregation dimension = metric name (stored in column)
Aggregation compaction dimension = application id

For store, make the Attributes... the last argument.
An attribute is a tuple of String, byte[]

The MIN AggregationOperation should have a createAttribute method that takes
an AggCompactionDimension as argument and return an Attribute.

Assumption is that all the cells in a put are the same operation.

In the general coprocessor, read the attribute (does not have to be unique).
Always add a tag witht eh aggregationCompaction dimension.
Set compaction tag only if compaction needs to be done (if the operation is 
SUM_FINAL).



 Populate flow run data in the flow_run table
 

 Key: YARN-3901
 URL: https://issues.apache.org/jira/browse/YARN-3901
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Vrushali C
Assignee: Vrushali C
 Attachments: YARN-3901-YARN-2928.1.patch, 
 YARN-3901-YARN-2928.WIP.2.patch, YARN-3901-YARN-2928.WIP.patch


 As per the schema proposed in YARN-3815 in 
 https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf
 filing jira to track creation and population of data in the flow run table. 
 Some points that are being  considered:
 - Stores per flow run information aggregated across applications, flow version
 RM’s collector writes to on app creation and app completion
 - Per App collector writes to it for metric updates at a slower frequency 
 than the metric updates to application table
 primary key: cluster ! user ! flow ! flow run id
 - Only the latest version of flow-level aggregated metrics will be kept, even 
 if the entity and application level keep a timeseries.
 - The running_apps column will be incremented on app creation, and 
 decremented on app completion.
 - For min_start_time the RM writer will simply write a value with the tag for 
 the applicationId. A coprocessor will return the min value of all written 
 values. - 
 - Upon flush and compactions, the min value between all the cells of this 
 column will be written to the cell without any tag (empty tag) and all the 
 other cells will be discarded.
 - Ditto for the max_end_time, but then the max will be kept.
 - Tags are represented as #type:value. The type can be not set (0), or can 
 indicate running (1) or complete (2). In those cases (for metrics) only 
 complete app metrics are collapsed on compaction.
 - The m! values are aggregated (summed) upon read. Only when applications are 
 completed (indicated by tag type 2) can the values be collapsed.
 - The application ids that have completed and been aggregated into the flow 
 numbers are retained in a separate column for historical tracking: we don’t 
 want to re-aggregate for those upon replay
 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run table

2015-08-21 Thread Vrushali C (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14707617#comment-14707617
 ] 

Vrushali C commented on YARN-3901:
--


For long running apps, an entry needs to be made into the flow activity table. 
Being tracked in jira YARN-4069. 

 Populate flow run data in the flow_run table
 

 Key: YARN-3901
 URL: https://issues.apache.org/jira/browse/YARN-3901
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Vrushali C
Assignee: Vrushali C
 Attachments: YARN-3901-YARN-2928.1.patch, 
 YARN-3901-YARN-2928.WIP.2.patch, YARN-3901-YARN-2928.WIP.patch


 As per the schema proposed in YARN-3815 in 
 https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf
 filing jira to track creation and population of data in the flow run table. 
 Some points that are being  considered:
 - Stores per flow run information aggregated across applications, flow version
 RM’s collector writes to on app creation and app completion
 - Per App collector writes to it for metric updates at a slower frequency 
 than the metric updates to application table
 primary key: cluster ! user ! flow ! flow run id
 - Only the latest version of flow-level aggregated metrics will be kept, even 
 if the entity and application level keep a timeseries.
 - The running_apps column will be incremented on app creation, and 
 decremented on app completion.
 - For min_start_time the RM writer will simply write a value with the tag for 
 the applicationId. A coprocessor will return the min value of all written 
 values. - 
 - Upon flush and compactions, the min value between all the cells of this 
 column will be written to the cell without any tag (empty tag) and all the 
 other cells will be discarded.
 - Ditto for the max_end_time, but then the max will be kept.
 - Tags are represented as #type:value. The type can be not set (0), or can 
 indicate running (1) or complete (2). In those cases (for metrics) only 
 complete app metrics are collapsed on compaction.
 - The m! values are aggregated (summed) upon read. Only when applications are 
 completed (indicated by tag type 2) can the values be collapsed.
 - The application ids that have completed and been aggregated into the flow 
 numbers are retained in a separate column for historical tracking: we don’t 
 want to re-aggregate for those upon replay
 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run table

2015-08-19 Thread Vrushali C (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14703062#comment-14703062
 ] 

Vrushali C commented on YARN-3901:
--


Filed https://issues.apache.org/jira/browse/YARN-4062 for dealing with flush 
and compaction for the flow run table. This perhaps needs a new scanner to deal 
with the processing during flushing and compaction stages. This work is 
independent of the client interfacing for the flow run table and can proceed in 
parallel hence filed YARN-4062. It will help to have separate tests for this as 
well. 

I will now rebase the WIP patch against the latest branch and add in more 
comments and do a bit more code cleanup and upload a patch for review very soon.

 Populate flow run data in the flow_run table
 

 Key: YARN-3901
 URL: https://issues.apache.org/jira/browse/YARN-3901
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Vrushali C
Assignee: Vrushali C
 Attachments: YARN-3901-YARN-2928.WIP.2.patch, 
 YARN-3901-YARN-2928.WIP.patch


 As per the schema proposed in YARN-3815 in 
 https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf
 filing jira to track creation and population of data in the flow run table. 
 Some points that are being  considered:
 - Stores per flow run information aggregated across applications, flow version
 RM’s collector writes to on app creation and app completion
 - Per App collector writes to it for metric updates at a slower frequency 
 than the metric updates to application table
 primary key: cluster ! user ! flow ! flow run id
 - Only the latest version of flow-level aggregated metrics will be kept, even 
 if the entity and application level keep a timeseries.
 - The running_apps column will be incremented on app creation, and 
 decremented on app completion.
 - For min_start_time the RM writer will simply write a value with the tag for 
 the applicationId. A coprocessor will return the min value of all written 
 values. - 
 - Upon flush and compactions, the min value between all the cells of this 
 column will be written to the cell without any tag (empty tag) and all the 
 other cells will be discarded.
 - Ditto for the max_end_time, but then the max will be kept.
 - Tags are represented as #type:value. The type can be not set (0), or can 
 indicate running (1) or complete (2). In those cases (for metrics) only 
 complete app metrics are collapsed on compaction.
 - The m! values are aggregated (summed) upon read. Only when applications are 
 completed (indicated by tag type 2) can the values be collapsed.
 - The application ids that have completed and been aggregated into the flow 
 numbers are retained in a separate column for historical tracking: we don’t 
 want to re-aggregate for those upon replay
 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run table

2015-08-18 Thread Vrushali C (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14701775#comment-14701775
 ] 

Vrushali C commented on YARN-3901:
--

I see, yes, will name it accordingly. 

 Populate flow run data in the flow_run table
 

 Key: YARN-3901
 URL: https://issues.apache.org/jira/browse/YARN-3901
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Vrushali C
Assignee: Vrushali C
 Attachments: YARN-3901-YARN-2928.WIP.patch


 As per the schema proposed in YARN-3815 in 
 https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf
 filing jira to track creation and population of data in the flow run table. 
 Some points that are being  considered:
 - Stores per flow run information aggregated across applications, flow version
 RM’s collector writes to on app creation and app completion
 - Per App collector writes to it for metric updates at a slower frequency 
 than the metric updates to application table
 primary key: cluster ! user ! flow ! flow run id
 - Only the latest version of flow-level aggregated metrics will be kept, even 
 if the entity and application level keep a timeseries.
 - The running_apps column will be incremented on app creation, and 
 decremented on app completion.
 - For min_start_time the RM writer will simply write a value with the tag for 
 the applicationId. A coprocessor will return the min value of all written 
 values. - 
 - Upon flush and compactions, the min value between all the cells of this 
 column will be written to the cell without any tag (empty tag) and all the 
 other cells will be discarded.
 - Ditto for the max_end_time, but then the max will be kept.
 - Tags are represented as #type:value. The type can be not set (0), or can 
 indicate running (1) or complete (2). In those cases (for metrics) only 
 complete app metrics are collapsed on compaction.
 - The m! values are aggregated (summed) upon read. Only when applications are 
 completed (indicated by tag type 2) can the values be collapsed.
 - The application ids that have completed and been aggregated into the flow 
 numbers are retained in a separate column for historical tracking: we don’t 
 want to re-aggregate for those upon replay
 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run table

2015-08-17 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700546#comment-14700546
 ] 

Sangjin Lee commented on YARN-3901:
---

This may be a good idea. The only thing I'd add to that is the name 
{{equals()}} may not be the best name. The {{equals()}} method is specific 
about the contract, and we shouldn't implement it to support equality between a 
string and the timeline entity type. And since {{TimelineEntiteType}} is an 
enum, you can't override it anyway. Any other method might be just fine. For 
example,

{code}
public enum TimelineEntityType {
  ...
  public boolean typeMatches(String type) {
return toString().equals(type);
  }
}
{code}

 Populate flow run data in the flow_run table
 

 Key: YARN-3901
 URL: https://issues.apache.org/jira/browse/YARN-3901
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Vrushali C
Assignee: Vrushali C
 Attachments: YARN-3901-YARN-2928.WIP.patch


 As per the schema proposed in YARN-3815 in 
 https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf
 filing jira to track creation and population of data in the flow run table. 
 Some points that are being  considered:
 - Stores per flow run information aggregated across applications, flow version
 RM’s collector writes to on app creation and app completion
 - Per App collector writes to it for metric updates at a slower frequency 
 than the metric updates to application table
 primary key: cluster ! user ! flow ! flow run id
 - Only the latest version of flow-level aggregated metrics will be kept, even 
 if the entity and application level keep a timeseries.
 - The running_apps column will be incremented on app creation, and 
 decremented on app completion.
 - For min_start_time the RM writer will simply write a value with the tag for 
 the applicationId. A coprocessor will return the min value of all written 
 values. - 
 - Upon flush and compactions, the min value between all the cells of this 
 column will be written to the cell without any tag (empty tag) and all the 
 other cells will be discarded.
 - Ditto for the max_end_time, but then the max will be kept.
 - Tags are represented as #type:value. The type can be not set (0), or can 
 indicate running (1) or complete (2). In those cases (for metrics) only 
 complete app metrics are collapsed on compaction.
 - The m! values are aggregated (summed) upon read. Only when applications are 
 completed (indicated by tag type 2) can the values be collapsed.
 - The application ids that have completed and been aggregated into the flow 
 numbers are retained in a separate column for historical tracking: we don’t 
 want to re-aggregate for those upon replay
 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run table

2015-08-17 Thread Joep Rottinghuis (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14699937#comment-14699937
 ] 

Joep Rottinghuis commented on YARN-3901:


It seems that in general in order to get information from the client to the 
coprocessor we have two mechanisms:
1) use put attributes
2) use Cell tags.

We know that on the read side tags are stripped off.
Did you have a specific reason to choose one of the other?

 Populate flow run data in the flow_run table
 

 Key: YARN-3901
 URL: https://issues.apache.org/jira/browse/YARN-3901
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Vrushali C
Assignee: Vrushali C
 Attachments: YARN-3901-YARN-2928.WIP.patch


 As per the schema proposed in YARN-3815 in 
 https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf
 filing jira to track creation and population of data in the flow run table. 
 Some points that are being  considered:
 - Stores per flow run information aggregated across applications, flow version
 RM’s collector writes to on app creation and app completion
 - Per App collector writes to it for metric updates at a slower frequency 
 than the metric updates to application table
 primary key: cluster ! user ! flow ! flow run id
 - Only the latest version of flow-level aggregated metrics will be kept, even 
 if the entity and application level keep a timeseries.
 - The running_apps column will be incremented on app creation, and 
 decremented on app completion.
 - For min_start_time the RM writer will simply write a value with the tag for 
 the applicationId. A coprocessor will return the min value of all written 
 values. - 
 - Upon flush and compactions, the min value between all the cells of this 
 column will be written to the cell without any tag (empty tag) and all the 
 other cells will be discarded.
 - Ditto for the max_end_time, but then the max will be kept.
 - Tags are represented as #type:value. The type can be not set (0), or can 
 indicate running (1) or complete (2). In those cases (for metrics) only 
 complete app metrics are collapsed on compaction.
 - The m! values are aggregated (summed) upon read. Only when applications are 
 completed (indicated by tag type 2) can the values be collapsed.
 - The application ids that have completed and been aggregated into the flow 
 numbers are retained in a separate column for historical tracking: we don’t 
 want to re-aggregate for those upon replay
 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run table

2015-08-17 Thread Joep Rottinghuis (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14699804#comment-14699804
 ] 

Joep Rottinghuis commented on YARN-3901:


Presumably the storeWithTag method is just temporary until we have this worked 
out and then the store method would take a list of tags (and for those cases 
not using tags would end up passing null)?

 Populate flow run data in the flow_run table
 

 Key: YARN-3901
 URL: https://issues.apache.org/jira/browse/YARN-3901
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Vrushali C
Assignee: Vrushali C
 Attachments: YARN-3901-YARN-2928.WIP.patch


 As per the schema proposed in YARN-3815 in 
 https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf
 filing jira to track creation and population of data in the flow run table. 
 Some points that are being  considered:
 - Stores per flow run information aggregated across applications, flow version
 RM’s collector writes to on app creation and app completion
 - Per App collector writes to it for metric updates at a slower frequency 
 than the metric updates to application table
 primary key: cluster ! user ! flow ! flow run id
 - Only the latest version of flow-level aggregated metrics will be kept, even 
 if the entity and application level keep a timeseries.
 - The running_apps column will be incremented on app creation, and 
 decremented on app completion.
 - For min_start_time the RM writer will simply write a value with the tag for 
 the applicationId. A coprocessor will return the min value of all written 
 values. - 
 - Upon flush and compactions, the min value between all the cells of this 
 column will be written to the cell without any tag (empty tag) and all the 
 other cells will be discarded.
 - Ditto for the max_end_time, but then the max will be kept.
 - Tags are represented as #type:value. The type can be not set (0), or can 
 indicate running (1) or complete (2). In those cases (for metrics) only 
 complete app metrics are collapsed on compaction.
 - The m! values are aggregated (summed) upon read. Only when applications are 
 completed (indicated by tag type 2) can the values be collapsed.
 - The application ids that have completed and been aggregated into the flow 
 numbers are retained in a separate column for historical tracking: we don’t 
 want to re-aggregate for those upon replay
 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run table

2015-08-17 Thread Joep Rottinghuis (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14699795#comment-14699795
 ] 

Joep Rottinghuis commented on YARN-3901:


Starting to go through. Just as a note (not necessarily related to this patch) 
when we do things like
{noformat}
te.getType().equals(TimelineEntityType.YARN_APPLICATION.toString())
{noformat}

I'm wondering if we should create an equals method on TImelineEntityType so we 
can simply write:

{noformat}
TimelineEntityType.YARN_APPLICATION.equals(te.getType())
{noformat}
That would be shorter and can also deal with nulls w/o need to check if 
te.getType() could ever return a null.

 Populate flow run data in the flow_run table
 

 Key: YARN-3901
 URL: https://issues.apache.org/jira/browse/YARN-3901
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Vrushali C
Assignee: Vrushali C
 Attachments: YARN-3901-YARN-2928.WIP.patch


 As per the schema proposed in YARN-3815 in 
 https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf
 filing jira to track creation and population of data in the flow run table. 
 Some points that are being  considered:
 - Stores per flow run information aggregated across applications, flow version
 RM’s collector writes to on app creation and app completion
 - Per App collector writes to it for metric updates at a slower frequency 
 than the metric updates to application table
 primary key: cluster ! user ! flow ! flow run id
 - Only the latest version of flow-level aggregated metrics will be kept, even 
 if the entity and application level keep a timeseries.
 - The running_apps column will be incremented on app creation, and 
 decremented on app completion.
 - For min_start_time the RM writer will simply write a value with the tag for 
 the applicationId. A coprocessor will return the min value of all written 
 values. - 
 - Upon flush and compactions, the min value between all the cells of this 
 column will be written to the cell without any tag (empty tag) and all the 
 other cells will be discarded.
 - Ditto for the max_end_time, but then the max will be kept.
 - Tags are represented as #type:value. The type can be not set (0), or can 
 indicate running (1) or complete (2). In those cases (for metrics) only 
 complete app metrics are collapsed on compaction.
 - The m! values are aggregated (summed) upon read. Only when applications are 
 completed (indicated by tag type 2) can the values be collapsed.
 - The application ids that have completed and been aggregated into the flow 
 numbers are retained in a separate column for historical tracking: we don’t 
 want to re-aggregate for those upon replay
 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run table

2015-08-17 Thread Vrushali C (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14699978#comment-14699978
 ] 

Vrushali C commented on YARN-3901:
--

Thanks [~jrottinghuis] for looking into my wip patch.

I also thought about adding in the equals method and will do it in the next 
patch. I believe the type can be unset for a TimelineEntity so we do need to 
handle nulls. I ran into that in my test case.

Regarding using tags or put attributes, here is my take on it:
- Accessing put (or get or any mutation attributes is much cleaner) than 
accessing cell tags.
- But attributes are common to the entire Put (or that mutation) where as tags 
are specific to cell (exactly one cell). 
- Attributes of a mutation are not persisted into hbase, cell tags are.

So I would like to use cell tags as often as possible so that its perfectly 
clear that this tag belongs to this exact cell. I think I need to clean up the 
code to remove the Put attributes in the ColumnHelper. Will do it in the next 
patch.


 Populate flow run data in the flow_run table
 

 Key: YARN-3901
 URL: https://issues.apache.org/jira/browse/YARN-3901
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Vrushali C
Assignee: Vrushali C
 Attachments: YARN-3901-YARN-2928.WIP.patch


 As per the schema proposed in YARN-3815 in 
 https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf
 filing jira to track creation and population of data in the flow run table. 
 Some points that are being  considered:
 - Stores per flow run information aggregated across applications, flow version
 RM’s collector writes to on app creation and app completion
 - Per App collector writes to it for metric updates at a slower frequency 
 than the metric updates to application table
 primary key: cluster ! user ! flow ! flow run id
 - Only the latest version of flow-level aggregated metrics will be kept, even 
 if the entity and application level keep a timeseries.
 - The running_apps column will be incremented on app creation, and 
 decremented on app completion.
 - For min_start_time the RM writer will simply write a value with the tag for 
 the applicationId. A coprocessor will return the min value of all written 
 values. - 
 - Upon flush and compactions, the min value between all the cells of this 
 column will be written to the cell without any tag (empty tag) and all the 
 other cells will be discarded.
 - Ditto for the max_end_time, but then the max will be kept.
 - Tags are represented as #type:value. The type can be not set (0), or can 
 indicate running (1) or complete (2). In those cases (for metrics) only 
 complete app metrics are collapsed on compaction.
 - The m! values are aggregated (summed) upon read. Only when applications are 
 completed (indicated by tag type 2) can the values be collapsed.
 - The application ids that have completed and been aggregated into the flow 
 numbers are retained in a separate column for historical tracking: we don’t 
 want to re-aggregate for those upon replay
 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run table

2015-08-17 Thread Joep Rottinghuis (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700255#comment-14700255
 ] 

Joep Rottinghuis commented on YARN-3901:


After discussing with [~vrushalic] we concluded the following:
- Let's keep tag as an implementation detail in the coprocessor.
- Let's add a MapString, byte[] attributes argument to store for columns (and 
column prefixes) in order to pass values along
- Columns themselves know how to add additional attributes, namely the 
operation if needed: MIN, MAX, AGG
- Coprocessor will map these values to tags and store.
- Given that preput is evaluated for multiple items in a batch, reading during 
pre-put will yield incorrect result (even though it appears safe with flush of 
BufferedMutator). Therefore we need to switch to just adding a tag to a cell in 
pre-put and collapse min and max during read (flush and compactions).
- Add an attribute Compact in order to indicate that an app is done (therefore 
separating whether a value can be aggregated or not). Write this only for the 
last write, so that we don't store tags for default/common values and therefore 
keeping storage smaller.
- We don't need TimelineWriterUtils.join
- We don't need TimelineWriterUtils.ONE_IN_BYTES
- Collapse the wip storeWithTags into simply store.
- Coprocessor needs to detect if it is going from one column qualifier to the 
next. The peek method just ensures that the iteration stays within the row. 
Need to sit and think through how to do that most cleanly, perhaps with peek 
being able to show only the same column based on argument?

 Populate flow run data in the flow_run table
 

 Key: YARN-3901
 URL: https://issues.apache.org/jira/browse/YARN-3901
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Vrushali C
Assignee: Vrushali C
 Attachments: YARN-3901-YARN-2928.WIP.patch


 As per the schema proposed in YARN-3815 in 
 https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf
 filing jira to track creation and population of data in the flow run table. 
 Some points that are being  considered:
 - Stores per flow run information aggregated across applications, flow version
 RM’s collector writes to on app creation and app completion
 - Per App collector writes to it for metric updates at a slower frequency 
 than the metric updates to application table
 primary key: cluster ! user ! flow ! flow run id
 - Only the latest version of flow-level aggregated metrics will be kept, even 
 if the entity and application level keep a timeseries.
 - The running_apps column will be incremented on app creation, and 
 decremented on app completion.
 - For min_start_time the RM writer will simply write a value with the tag for 
 the applicationId. A coprocessor will return the min value of all written 
 values. - 
 - Upon flush and compactions, the min value between all the cells of this 
 column will be written to the cell without any tag (empty tag) and all the 
 other cells will be discarded.
 - Ditto for the max_end_time, but then the max will be kept.
 - Tags are represented as #type:value. The type can be not set (0), or can 
 indicate running (1) or complete (2). In those cases (for metrics) only 
 complete app metrics are collapsed on compaction.
 - The m! values are aggregated (summed) upon read. Only when applications are 
 completed (indicated by tag type 2) can the values be collapsed.
 - The application ids that have completed and been aggregated into the flow 
 numbers are retained in a separate column for historical tracking: we don’t 
 want to re-aggregate for those upon replay
 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run table

2015-08-17 Thread Joep Rottinghuis (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700264#comment-14700264
 ] 

Joep Rottinghuis commented on YARN-3901:


peekNext also has to deal with limit correctly.

 Populate flow run data in the flow_run table
 

 Key: YARN-3901
 URL: https://issues.apache.org/jira/browse/YARN-3901
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Vrushali C
Assignee: Vrushali C
 Attachments: YARN-3901-YARN-2928.WIP.patch


 As per the schema proposed in YARN-3815 in 
 https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf
 filing jira to track creation and population of data in the flow run table. 
 Some points that are being  considered:
 - Stores per flow run information aggregated across applications, flow version
 RM’s collector writes to on app creation and app completion
 - Per App collector writes to it for metric updates at a slower frequency 
 than the metric updates to application table
 primary key: cluster ! user ! flow ! flow run id
 - Only the latest version of flow-level aggregated metrics will be kept, even 
 if the entity and application level keep a timeseries.
 - The running_apps column will be incremented on app creation, and 
 decremented on app completion.
 - For min_start_time the RM writer will simply write a value with the tag for 
 the applicationId. A coprocessor will return the min value of all written 
 values. - 
 - Upon flush and compactions, the min value between all the cells of this 
 column will be written to the cell without any tag (empty tag) and all the 
 other cells will be discarded.
 - Ditto for the max_end_time, but then the max will be kept.
 - Tags are represented as #type:value. The type can be not set (0), or can 
 indicate running (1) or complete (2). In those cases (for metrics) only 
 complete app metrics are collapsed on compaction.
 - The m! values are aggregated (summed) upon read. Only when applications are 
 completed (indicated by tag type 2) can the values be collapsed.
 - The application ids that have completed and been aggregated into the flow 
 numbers are retained in a separate column for historical tracking: we don’t 
 want to re-aggregate for those upon replay
 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run table

2015-08-12 Thread Vrushali C (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14693931#comment-14693931
 ] 

Vrushali C commented on YARN-3901:
--

Am working on this. I have been discussing the details with Joep and Sangjin 
and am in the process of adding in coprocessors and cell tags to enable the 
real time aggregation per flow run. It is an exciting use case of cell tags and 
I have a WIP patch. 

 Populate flow run data in the flow_run table
 

 Key: YARN-3901
 URL: https://issues.apache.org/jira/browse/YARN-3901
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Vrushali C
Assignee: Vrushali C

 As per the schema proposed in YARN-3815 in 
 https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf
 filing jira to track creation and population of data in the flow run table. 
 Some points that are being  considered:
 - Stores per flow run information aggregated across applications, flow version
 RM’s collector writes to on app creation and app completion
 - Per App collector writes to it for metric updates at a slower frequency 
 than the metric updates to application table
 primary key: cluster ! user ! flow ! flow run id
 - Only the latest version of flow-level aggregated metrics will be kept, even 
 if the entity and application level keep a timeseries.
 - The running_apps column will be incremented on app creation, and 
 decremented on app completion.
 - For min_start_time the RM writer will simply write a value with the tag for 
 the applicationId. A coprocessor will return the min value of all written 
 values. - 
 - Upon flush and compactions, the min value between all the cells of this 
 column will be written to the cell without any tag (empty tag) and all the 
 other cells will be discarded.
 - Ditto for the max_end_time, but then the max will be kept.
 - Tags are represented as #type:value. The type can be not set (0), or can 
 indicate running (1) or complete (2). In those cases (for metrics) only 
 complete app metrics are collapsed on compaction.
 - The m! values are aggregated (summed) upon read. Only when applications are 
 completed (indicated by tag type 2) can the values be collapsed.
 - The application ids that have completed and been aggregated into the flow 
 numbers are retained in a separate column for historical tracking: we don’t 
 want to re-aggregate for those upon replay
 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run table

2015-08-12 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14694008#comment-14694008
 ] 

Vinod Kumar Vavilapalli commented on YARN-3901:
---

bq. Only the latest version of flow-level aggregated metrics will be kept, even 
if the entity and application level keep a timeseries.
Actually, at entity level we are not going to keep the time-series. 
Applications, Flow-runs, Users etc should be able to keep the time-series, no?

 Populate flow run data in the flow_run table
 

 Key: YARN-3901
 URL: https://issues.apache.org/jira/browse/YARN-3901
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Vrushali C
Assignee: Vrushali C

 As per the schema proposed in YARN-3815 in 
 https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf
 filing jira to track creation and population of data in the flow run table. 
 Some points that are being  considered:
 - Stores per flow run information aggregated across applications, flow version
 RM’s collector writes to on app creation and app completion
 - Per App collector writes to it for metric updates at a slower frequency 
 than the metric updates to application table
 primary key: cluster ! user ! flow ! flow run id
 - Only the latest version of flow-level aggregated metrics will be kept, even 
 if the entity and application level keep a timeseries.
 - The running_apps column will be incremented on app creation, and 
 decremented on app completion.
 - For min_start_time the RM writer will simply write a value with the tag for 
 the applicationId. A coprocessor will return the min value of all written 
 values. - 
 - Upon flush and compactions, the min value between all the cells of this 
 column will be written to the cell without any tag (empty tag) and all the 
 other cells will be discarded.
 - Ditto for the max_end_time, but then the max will be kept.
 - Tags are represented as #type:value. The type can be not set (0), or can 
 indicate running (1) or complete (2). In those cases (for metrics) only 
 complete app metrics are collapsed on compaction.
 - The m! values are aggregated (summed) upon read. Only when applications are 
 completed (indicated by tag type 2) can the values be collapsed.
 - The application ids that have completed and been aggregated into the flow 
 numbers are retained in a separate column for historical tracking: we don’t 
 want to re-aggregate for those upon replay
 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run table

2015-08-12 Thread Vrushali C (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14694055#comment-14694055
 ] 

Vrushali C commented on YARN-3901:
--

The application and entity tables support timeseries storage. 

For flow runs, having a timeseries means having to normalize the time 
granularity for the timeseries. For example, say a flow has 3 apps. Say all 3 
apps run more or less concurrently. App1 has a metric1 value at ts1 but app2 
will have the metric1 at ts2 and say app3 has metric1 value at ts3, now for the 
metric at flow run level, what should be the values at ts1, ts2 and ts3? So we 
would need to normalize it across say per minute or per hour or something. 
There was another discussion on this, do you recollect the jira [~gtCarrera9] ? 
The end values can be aggregated since it's the final value, so we can say for 
this flow run for this metric, here the value (which would just a sum of that 
metric's final values from individual apps).

 Populate flow run data in the flow_run table
 

 Key: YARN-3901
 URL: https://issues.apache.org/jira/browse/YARN-3901
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Vrushali C
Assignee: Vrushali C

 As per the schema proposed in YARN-3815 in 
 https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf
 filing jira to track creation and population of data in the flow run table. 
 Some points that are being  considered:
 - Stores per flow run information aggregated across applications, flow version
 RM’s collector writes to on app creation and app completion
 - Per App collector writes to it for metric updates at a slower frequency 
 than the metric updates to application table
 primary key: cluster ! user ! flow ! flow run id
 - Only the latest version of flow-level aggregated metrics will be kept, even 
 if the entity and application level keep a timeseries.
 - The running_apps column will be incremented on app creation, and 
 decremented on app completion.
 - For min_start_time the RM writer will simply write a value with the tag for 
 the applicationId. A coprocessor will return the min value of all written 
 values. - 
 - Upon flush and compactions, the min value between all the cells of this 
 column will be written to the cell without any tag (empty tag) and all the 
 other cells will be discarded.
 - Ditto for the max_end_time, but then the max will be kept.
 - Tags are represented as #type:value. The type can be not set (0), or can 
 indicate running (1) or complete (2). In those cases (for metrics) only 
 complete app metrics are collapsed on compaction.
 - The m! values are aggregated (summed) upon read. Only when applications are 
 completed (indicated by tag type 2) can the values be collapsed.
 - The application ids that have completed and been aggregated into the flow 
 numbers are retained in a separate column for historical tracking: we don’t 
 want to re-aggregate for those upon replay
 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run table

2015-08-12 Thread Vrushali C (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14694058#comment-14694058
 ] 

Vrushali C commented on YARN-3901:
--


For flows, users, queues which are aggregated on a time interval  like 
hourly/daily/weekly, the timeseries is even more harder to manage. For example, 
for users, we would sum up all the values for a particular metric on a given 
day across all applications. This is one total and can be summed up. How would 
a timeseries look if say app1 and app2 emit metric1 at ts1 but then next 
metric1 value for app1 is at ts2 and for app2 is at ts3. Now if we simply add 
up these values per timestamp at the user level, at ts1 there is a big number 
(sum from both apps), at ts2, there is only one value from app1 and at ts3 
there is only value from app2. So now we again need to normalize across a time 
window so that we can have a decent summation for a metric at a given time.


 Populate flow run data in the flow_run table
 

 Key: YARN-3901
 URL: https://issues.apache.org/jira/browse/YARN-3901
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Vrushali C
Assignee: Vrushali C

 As per the schema proposed in YARN-3815 in 
 https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf
 filing jira to track creation and population of data in the flow run table. 
 Some points that are being  considered:
 - Stores per flow run information aggregated across applications, flow version
 RM’s collector writes to on app creation and app completion
 - Per App collector writes to it for metric updates at a slower frequency 
 than the metric updates to application table
 primary key: cluster ! user ! flow ! flow run id
 - Only the latest version of flow-level aggregated metrics will be kept, even 
 if the entity and application level keep a timeseries.
 - The running_apps column will be incremented on app creation, and 
 decremented on app completion.
 - For min_start_time the RM writer will simply write a value with the tag for 
 the applicationId. A coprocessor will return the min value of all written 
 values. - 
 - Upon flush and compactions, the min value between all the cells of this 
 column will be written to the cell without any tag (empty tag) and all the 
 other cells will be discarded.
 - Ditto for the max_end_time, but then the max will be kept.
 - Tags are represented as #type:value. The type can be not set (0), or can 
 indicate running (1) or complete (2). In those cases (for metrics) only 
 complete app metrics are collapsed on compaction.
 - The m! values are aggregated (summed) upon read. Only when applications are 
 completed (indicated by tag type 2) can the values be collapsed.
 - The application ids that have completed and been aggregated into the flow 
 numbers are retained in a separate column for historical tracking: we don’t 
 want to re-aggregate for those upon replay
 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run table

2015-07-09 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14620799#comment-14620799
 ] 

Sangjin Lee commented on YARN-3901:
---

Hi [~zjshen], I was going to file JIRAs that covers splitting the application 
table and creating the app-to-flow table as well as the flow-version table, and 
work on them. Would you like to work on the app-to-flow table? I could then 
cover the others. Let me know.

 Populate flow run data in the flow_run table
 

 Key: YARN-3901
 URL: https://issues.apache.org/jira/browse/YARN-3901
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Vrushali C
Assignee: Vrushali C

 As per the schema proposed in YARN-3815 in 
 https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf
 filing jira to track creation and population of data in the flow run table. 
 Some points that are being  considered:
 - Stores per flow run information aggregated across applications, flow version
 RM’s collector writes to on app creation and app completion
 - Per App collector writes to it for metric updates at a slower frequency 
 than the metric updates to application table
 primary key: cluster ! user ! flow ! flow run id
 - Only the latest version of flow-level aggregated metrics will be kept, even 
 if the entity and application level keep a timeseries.
 - The running_apps column will be incremented on app creation, and 
 decremented on app completion.
 - For min_start_time the RM writer will simply write a value with the tag for 
 the applicationId. A coprocessor will return the min value of all written 
 values. - 
 - Upon flush and compactions, the min value between all the cells of this 
 column will be written to the cell without any tag (empty tag) and all the 
 other cells will be discarded.
 - Ditto for the max_end_time, but then the max will be kept.
 - Tags are represented as #type:value. The type can be not set (0), or can 
 indicate running (1) or complete (2). In those cases (for metrics) only 
 complete app metrics are collapsed on compaction.
 - The m! values are aggregated (summed) upon read. Only when applications are 
 completed (indicated by tag type 2) can the values be collapsed.
 - The application ids that have completed and been aggregated into the flow 
 numbers are retained in a separate column for historical tracking: we don’t 
 want to re-aggregate for those upon replay
 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run table

2015-07-09 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14621051#comment-14621051
 ] 

Zhijie Shen commented on YARN-3901:
---

Yeah, I have dependency on this table for reader. If nobody is working on this 
table, I can take care of it.

 Populate flow run data in the flow_run table
 

 Key: YARN-3901
 URL: https://issues.apache.org/jira/browse/YARN-3901
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Vrushali C
Assignee: Vrushali C

 As per the schema proposed in YARN-3815 in 
 https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf
 filing jira to track creation and population of data in the flow run table. 
 Some points that are being  considered:
 - Stores per flow run information aggregated across applications, flow version
 RM’s collector writes to on app creation and app completion
 - Per App collector writes to it for metric updates at a slower frequency 
 than the metric updates to application table
 primary key: cluster ! user ! flow ! flow run id
 - Only the latest version of flow-level aggregated metrics will be kept, even 
 if the entity and application level keep a timeseries.
 - The running_apps column will be incremented on app creation, and 
 decremented on app completion.
 - For min_start_time the RM writer will simply write a value with the tag for 
 the applicationId. A coprocessor will return the min value of all written 
 values. - 
 - Upon flush and compactions, the min value between all the cells of this 
 column will be written to the cell without any tag (empty tag) and all the 
 other cells will be discarded.
 - Ditto for the max_end_time, but then the max will be kept.
 - Tags are represented as #type:value. The type can be not set (0), or can 
 indicate running (1) or complete (2). In those cases (for metrics) only 
 complete app metrics are collapsed on compaction.
 - The m! values are aggregated (summed) upon read. Only when applications are 
 completed (indicated by tag type 2) can the values be collapsed.
 - The application ids that have completed and been aggregated into the flow 
 numbers are retained in a separate column for historical tracking: we don’t 
 want to re-aggregate for those upon replay
 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run table

2015-07-08 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14619759#comment-14619759
 ] 

Zhijie Shen commented on YARN-3901:
---

[~vrushalic], just want to confirm with you that the jira won't cover app_flow 
table, right?

I need to flow mapping for implementing the reader apis against HBase backend. 
If it's not covered here, I can help to implement it in the scope of YARN-3049.

 Populate flow run data in the flow_run table
 

 Key: YARN-3901
 URL: https://issues.apache.org/jira/browse/YARN-3901
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Vrushali C
Assignee: Vrushali C

 As per the schema proposed in YARN-3815 in 
 https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf
 filing jira to track creation and population of data in the flow run table. 
 Some points that are being  considered:
 - Stores per flow run information aggregated across applications, flow version
 RM’s collector writes to on app creation and app completion
 - Per App collector writes to it for metric updates at a slower frequency 
 than the metric updates to application table
 primary key: cluster ! user ! flow ! flow run id
 - Only the latest version of flow-level aggregated metrics will be kept, even 
 if the entity and application level keep a timeseries.
 - The running_apps column will be incremented on app creation, and 
 decremented on app completion.
 - For min_start_time the RM writer will simply write a value with the tag for 
 the applicationId. A coprocessor will return the min value of all written 
 values. - 
 - Upon flush and compactions, the min value between all the cells of this 
 column will be written to the cell without any tag (empty tag) and all the 
 other cells will be discarded.
 - Ditto for the max_end_time, but then the max will be kept.
 - Tags are represented as #type:value. The type can be not set (0), or can 
 indicate running (1) or complete (2). In those cases (for metrics) only 
 complete app metrics are collapsed on compaction.
 - The m! values are aggregated (summed) upon read. Only when applications are 
 completed (indicated by tag type 2) can the values be collapsed.
 - The application ids that have completed and been aggregated into the flow 
 numbers are retained in a separate column for historical tracking: we don’t 
 want to re-aggregate for those upon replay
 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)