[jira] [Commented] (FLINK-4733) Port WebFrontend to new metric system

2016-10-31 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15622421#comment-15622421
 ] 

ASF GitHub Bot commented on FLINK-4733:
---

Github user zentol closed the pull request at:

https://github.com/apache/flink/pull/2616


> Port WebFrontend to new metric system
> -
>
> Key: FLINK-4733
> URL: https://issues.apache.org/jira/browse/FLINK-4733
> Project: Flink
>  Issue Type: Improvement
>  Components: Metrics, TaskManager, Webfrontend
>Affects Versions: 1.1.2
>Reporter: Chesnay Schepler
>Assignee: Chesnay Schepler
> Fix For: 1.2.0
>
>
> While the WebFrontend has access to the metric system it still relies on 
> older code in some parts.
> The TaskManager metrics are still gathered using the Codahale library and 
> sent with the heartbeats.
> Task related metrics (numRecordsIn etc) are still gathered using 
> accumulators, which are accessed through the execution graph.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-4733) Port WebFrontend to new metric system

2016-10-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15615178#comment-15615178
 ] 

ASF GitHub Bot commented on FLINK-4733:
---

Github user rmetzger commented on the issue:

https://github.com/apache/flink/pull/2616
  
The `CoordinatorShutdownTest` fixes look reasonable. 




[jira] [Commented] (FLINK-4733) Port WebFrontend to new metric system

2016-10-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15614841#comment-15614841
 ] 

ASF GitHub Bot commented on FLINK-4733:
---

Github user zentol commented on the issue:

https://github.com/apache/flink/pull/2616
  
Both issues should be fixed now, but I'll let travis take another stab at 
it.




[jira] [Commented] (FLINK-4733) Port WebFrontend to new metric system

2016-10-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15614772#comment-15614772
 ] 

ASF GitHub Bot commented on FLINK-4733:
---

Github user zentol commented on the issue:

https://github.com/apache/flink/pull/2616
  
The `CoordinatorShutdownTest` can't be fixed with a cast. This test assumes 
that the actual ExecutionGraph is still available when a job is finished, since 
it tries to access the CheckpointCoordinator within. This, however, no longer 
works; finished jobs are immediately archived, and the archived version does 
not contain the CheckpointCoordinator.

A possible fix is to not let the job fail immediately but block it, ask 
for the ExecutionGraph while it is blocked (which, since the job is still 
running, actually returns the runtime ExecutionGraph), store the reference, 
and then let the job fail.
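
The block-then-fail ordering described above can be sketched roughly as follows. This is only an illustration of the idea, not the actual test code: `RuntimeGraph`, `JobRegistry` and `captureBeforeFailure` are hypothetical stand-ins, and none of them are Flink classes.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

final class BlockThenFailSketch {

    /** Hypothetical stand-in for the runtime graph (not Flink's ExecutionGraph). */
    static final class RuntimeGraph {
        final String checkpointCoordinator = "coordinator";
    }

    /** Hypothetical registry: hands out the runtime graph only while the job runs. */
    static final class JobRegistry {
        private final AtomicReference<RuntimeGraph> running =
                new AtomicReference<>(new RuntimeGraph());

        RuntimeGraph requestGraph() {
            return running.get(); // null once the job has been archived
        }

        void archive() {
            running.set(null);
        }
    }

    /** Blocks the job, grabs the runtime graph, then lets the job fail. */
    static RuntimeGraph captureBeforeFailure(JobRegistry registry) {
        CountDownLatch mayFail = new CountDownLatch(1);

        Thread job = new Thread(() -> {
            try {
                mayFail.await(); // block instead of failing right away
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                registry.archive(); // the "failure": the runtime graph is archived
            }
        });
        job.start();

        // The job is still blocked, so this returns the runtime graph.
        RuntimeGraph graph = registry.requestGraph();
        mayFail.countDown(); // now let the job fail

        try {
            job.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the job", e);
        }
        return graph;
    }
}
```

The stored reference stays usable after the job has failed, while a fresh request at that point only sees the archived state.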




[jira] [Commented] (FLINK-4733) Port WebFrontend to new metric system

2016-10-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15614739#comment-15614739
 ] 

ASF GitHub Bot commented on FLINK-4733:
---

Github user zentol commented on the issue:

https://github.com/apache/flink/pull/2616
  
I just found another bug related to forked chains. Right now only the 
output of a single operator is used as the task output counter, whereas we 
should actually use both of them. Will fix it while merging.




[jira] [Commented] (FLINK-4733) Port WebFrontend to new metric system

2016-10-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15612284#comment-15612284
 ] 

ASF GitHub Bot commented on FLINK-4733:
---

Github user rmetzger commented on the issue:

https://github.com/apache/flink/pull/2616
  
I tested the change locally, it works. +1 to merge.




[jira] [Commented] (FLINK-4733) Port WebFrontend to new metric system

2016-10-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15612172#comment-15612172
 ] 

ASF GitHub Bot commented on FLINK-4733:
---

Github user rmetzger commented on the issue:

https://github.com/apache/flink/pull/2616
  
Thank you for rebasing.
This run had the following error:
https://s3.amazonaws.com/archive.travis-ci.org/jobs/170428661/log.txt
```
Failed tests: 
  CoordinatorShutdownTest.testCoordinatorShutsDownOnFailure:94 org.apache.flink.runtime.executiongraph.ArchivedExecutionGraph cannot be cast to org.apache.flink.runtime.executiongraph.ExecutionGraph
```
I suspect we need to change the cast there to `AccessExecutionGraph`.




[jira] [Commented] (FLINK-4733) Port WebFrontend to new metric system

2016-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15605046#comment-15605046
 ] 

ASF GitHub Bot commented on FLINK-4733:
---

Github user zentol commented on the issue:

https://github.com/apache/flink/pull/2616
  
Note that two new methods were introduced into `StreamConfig`: 
`setChainStart()` and `isChainStart()`, to better determine which output 
counter should be used for the task. Previously this used the same logic as 
the operator name extraction, which was buggy for multi-chains.




[jira] [Commented] (FLINK-4733) Port WebFrontend to new metric system

2016-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15605036#comment-15605036
 ] 

ASF GitHub Bot commented on FLINK-4733:
---

Github user zentol commented on the issue:

https://github.com/apache/flink/pull/2616
  
@rmetzger Rebased version is up.




[jira] [Commented] (FLINK-4733) Port WebFrontend to new metric system

2016-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15604837#comment-15604837
 ] 

ASF GitHub Bot commented on FLINK-4733:
---

Github user zentol commented on the issue:

https://github.com/apache/flink/pull/2616
  
Will rebase now, this will also fix the test failure.




[jira] [Commented] (FLINK-4733) Port WebFrontend to new metric system

2016-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15604835#comment-15604835
 ] 

ASF GitHub Bot commented on FLINK-4733:
---

Github user zentol commented on a diff in the pull request:

https://github.com/apache/flink/pull/2616#discussion_r84868336
  
--- Diff: flink-runtime-web/src/main/java/org/apache/flink/runtime/webmonitor/handlers/JobVertexDetailsHandler.java ---
@@ -99,11 +83,34 @@ public String handleRequest(ExecutionJobVertex jobVertex, Map pa
gen.writeNumberField("end-time", endTime);
gen.writeNumberField("duration", duration);
 
+   IOMetrics ioMetrics = vertex.getCurrentExecutionAttempt().getIOMetrics();
+
+   long numBytesIn = 0;
+   long numBytesOut = 0;
+   long numRecordsIn = 0;
+   long numRecordsOut = 0;
+
+   if (ioMetrics != null) { // execAttempt is already finished, use final metrics stored in ExecutionGraph
+       numBytesIn = ioMetrics.getNumBytesInLocal() + ioMetrics.getNumBytesInRemote();
+       numBytesOut = ioMetrics.getNumBytesOut();
+       numRecordsIn = ioMetrics.getNumRecordsIn();
+       numRecordsOut = ioMetrics.getNumRecordsOut();
+   } else { // execAttempt is still running, use MetricQueryService instead
+       fetcher.update();
+       MetricStore.SubtaskMetricStore metrics = fetcher.getMetricStore().getSubtaskMetricStore(vertex.getJobId().toString(), vertex.getJobvertexId().toString(), vertex.getParallelSubtaskIndex());
+       if (metrics != null) {
+           numBytesIn += Long.valueOf(metrics.getMetric("numBytesInLocal", "0")) + Long.valueOf(metrics.getMetric("numBytesInRemote", "0"));
+           numBytesOut += Long.valueOf(metrics.getMetric("numBytesOut", "0"));
+           numRecordsIn += Long.valueOf(metrics.getMetric("numRecordsIn", "0"));
+           numRecordsOut += Long.valueOf(metrics.getMetric("numRecordsOut", "0"));
--- End diff --

Agreed, I've opened a JIRA for that, see FLINK-4906.




[jira] [Commented] (FLINK-4733) Port WebFrontend to new metric system

2016-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15604826#comment-15604826
 ] 

ASF GitHub Bot commented on FLINK-4733:
---

Github user zentol commented on a diff in the pull request:

https://github.com/apache/flink/pull/2616#discussion_r84867488
  
--- Diff: flink-runtime-web/src/main/java/org/apache/flink/runtime/webmonitor/handlers/JobDetailsHandler.java ---
@@ -147,11 +143,36 @@ public String handleRequest(ExecutionGraph graph, Map params) th
}
gen.writeEndObject();

+   long numBytesIn = 0;
+   long numBytesOut = 0;
+   long numRecordsIn = 0;
+   long numRecordsOut = 0;
+
+   for (ExecutionVertex vertex : ejv.getTaskVertices()) {
+       IOMetrics ioMetrics = vertex.getCurrentExecutionAttempt().getIOMetrics();
+
+       if (ioMetrics != null) { // execAttempt is already finished, use final metrics stored in ExecutionGraph
+           numBytesIn += ioMetrics.getNumBytesInLocal() + ioMetrics.getNumBytesInRemote();
--- End diff --

That's a relic from the previous iterations, `getNumBytesInTotal()` did not 
always exist :)




[jira] [Commented] (FLINK-4733) Port WebFrontend to new metric system

2016-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15604780#comment-15604780
 ] 

ASF GitHub Bot commented on FLINK-4733:
---

Github user rmetzger commented on a diff in the pull request:

https://github.com/apache/flink/pull/2616#discussion_r84697528
  
--- Diff: flink-runtime-web/src/main/java/org/apache/flink/runtime/webmonitor/handlers/JobVertexDetailsHandler.java ---
@@ -99,11 +83,34 @@ public String handleRequest(ExecutionJobVertex jobVertex, Map pa
gen.writeNumberField("end-time", endTime);
gen.writeNumberField("duration", duration);
 
+   IOMetrics ioMetrics = vertex.getCurrentExecutionAttempt().getIOMetrics();
+
+   long numBytesIn = 0;
+   long numBytesOut = 0;
+   long numRecordsIn = 0;
+   long numRecordsOut = 0;
+
+   if (ioMetrics != null) { // execAttempt is already finished, use final metrics stored in ExecutionGraph
+       numBytesIn = ioMetrics.getNumBytesInLocal() + ioMetrics.getNumBytesInRemote();
+       numBytesOut = ioMetrics.getNumBytesOut();
+       numRecordsIn = ioMetrics.getNumRecordsIn();
+       numRecordsOut = ioMetrics.getNumRecordsOut();
+   } else { // execAttempt is still running, use MetricQueryService instead
+       fetcher.update();
+       MetricStore.SubtaskMetricStore metrics = fetcher.getMetricStore().getSubtaskMetricStore(vertex.getJobId().toString(), vertex.getJobvertexId().toString(), vertex.getParallelSubtaskIndex());
+       if (metrics != null) {
+           numBytesIn += Long.valueOf(metrics.getMetric("numBytesInLocal", "0")) + Long.valueOf(metrics.getMetric("numBytesInRemote", "0"));
+           numBytesOut += Long.valueOf(metrics.getMetric("numBytesOut", "0"));
+           numRecordsIn += Long.valueOf(metrics.getMetric("numRecordsIn", "0"));
+           numRecordsOut += Long.valueOf(metrics.getMetric("numRecordsOut", "0"));
--- End diff --

The metric names are used in many places. I think we should use constants 
for them.
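
As a rough illustration of that suggestion, the string literals from the diff could be collected in one place. The class name `IOMetricNames` and the helpers `readLong` and `totalBytesIn` are hypothetical; only the metric name strings themselves come from the diff above.

```java
import java.util.Map;

/** Hypothetical holder for the metric names used in the handlers. */
final class IOMetricNames {
    static final String NUM_BYTES_IN_LOCAL = "numBytesInLocal";
    static final String NUM_BYTES_IN_REMOTE = "numBytesInRemote";
    static final String NUM_BYTES_OUT = "numBytesOut";
    static final String NUM_RECORDS_IN = "numRecordsIn";
    static final String NUM_RECORDS_OUT = "numRecordsOut";

    private IOMetricNames() {}

    /** Reads a metric stored as a string, falling back to 0 when it is absent. */
    static long readLong(Map<String, String> metrics, String name) {
        return Long.parseLong(metrics.getOrDefault(name, "0"));
    }

    /** Local plus remote input bytes, looked up through the shared constants. */
    static long totalBytesIn(Map<String, String> metrics) {
        return readLong(metrics, NUM_BYTES_IN_LOCAL) + readLong(metrics, NUM_BYTES_IN_REMOTE);
    }
}
```

A typo in a metric name then becomes a compile error at the call site instead of a silent fallback to "0".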




[jira] [Commented] (FLINK-4733) Port WebFrontend to new metric system

2016-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15604779#comment-15604779
 ] 

ASF GitHub Bot commented on FLINK-4733:
---

Github user rmetzger commented on a diff in the pull request:

https://github.com/apache/flink/pull/2616#discussion_r84696513
  
--- Diff: 
flink-runtime-web/src/main/java/org/apache/flink/runtime/webmonitor/handlers/JobDetailsHandler.java
 ---
@@ -147,11 +143,36 @@ public String handleRequest(ExecutionGraph graph, 
Map params) th
}
gen.writeEndObject();

+   long numBytesIn = 0;
+   long numBytesOut = 0;
+   long numRecordsIn = 0;
+   long numRecordsOut = 0;
+
+   for (ExecutionVertex vertex : ejv.getTaskVertices()) {
+   IOMetrics ioMetrics = 
vertex.getCurrentExecutionAttempt().getIOMetrics();
+
+   if (ioMetrics != null) { // execAttempt is 
already finished, use final metrics stored in ExecutionGraph
+   numBytesIn += 
ioMetrics.getNumBytesInLocal() + ioMetrics.getNumBytesInRemote();
--- End diff --

(no need to update) There's a `getNumBytesInTotal()` method for this ;) 




[jira] [Commented] (FLINK-4733) Port WebFrontend to new metric system

2016-10-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15559484#comment-15559484
 ] 

ASF GitHub Bot commented on FLINK-4733:
---

GitHub user zentol opened a pull request:

https://github.com/apache/flink/pull/2616

[FLINK-4733] Port WebInterface to metric system

# This PR relies on #2613, #2614 and #2615. Thus, the first 5 commits 
should not be reviewed here.

This PR ports the remaining parts of the WebInterface to rely on the metric 
system.

# TaskManager metrics

In a7011e8305d7c828fabc4245358c2d21568fd561 the TaskManagersHandler is 
modified to use the metric system. In addition, the garbage collector section 
in the WebInterface was enhanced to no longer rely on hard-coded GC names, but 
instead be dynamic. The recently introduced network metrics have been added as 
well.

cbff6d6aab80bc423a09aa6b62c80a2f409d796a then removes the remnants of the 
old metrics that are now unused. This affects the TaskManager (which no longer 
gathers these metrics) and the Heartbeat messages (which no longer include a 
metrics report). As a result the DropWizard dependency was removed. The 
transitive jackson dependency is now explicitly set for both flink-runtime and 
flink-runtime-web.

# Task metrics

The WebInterface shows how many records/bytes each task has received or 
sent. Until now these were gathered with system-specific accumulators.

In cab25496ff5991de60e757f68c5d5139c86f34ba these accumulators were removed.

Under the new system, bytes In/Out is measured per task (since it doesn't 
make sense within chained operators), while records In/Out is measured per 
operator. In order to display the records metrics for each task it was thus 
necessary to "reuse" some operator counters for the task.

This is implemented in 16983485198a61bec0418adb833508dcaf276170 by 
re-registering the numRecordsIn counter of the first operator in the chain and 
the numRecordsOut counter of the last operator on the task level.

This re-use could (sadly) not be done automatically within the metric 
system. Instead, two helper methods were added to the OperatorIOMetricGroup. 
They are called, for example, within BatchTask#invoke() and forward the 
counters to the TaskIOMetricGroup, where they are stored and re-registered.
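
The forwarding described above might look roughly like the sketch below. All types and method names here (`SimpleCounter`, `TaskIoGroup`, `OperatorIoGroup`, `shareInputCounterWithTask`, `shareOutputCounterWithTask`) are hypothetical simplifications made up for illustration; they are not the actual OperatorIOMetricGroup or TaskIOMetricGroup API.

```java
import java.util.HashMap;
import java.util.Map;

final class CounterReuseSketch {

    /** Minimal counter, standing in for the metric system's counter type. */
    static final class SimpleCounter {
        private long count;

        void inc() {
            count++;
        }

        long getCount() {
            return count;
        }
    }

    /** Task-level group that stores forwarded counters under the task's metric names. */
    static final class TaskIoGroup {
        private final Map<String, SimpleCounter> registered = new HashMap<>();

        void reuseRecordsIn(SimpleCounter counter) {
            registered.put("numRecordsIn", counter);
        }

        void reuseRecordsOut(SimpleCounter counter) {
            registered.put("numRecordsOut", counter);
        }

        long read(String name) {
            SimpleCounter counter = registered.get(name);
            return counter == null ? 0L : counter.getCount();
        }
    }

    /** Operator-level group with two helpers that forward its counters to the task. */
    static final class OperatorIoGroup {
        final SimpleCounter numRecordsIn = new SimpleCounter();
        final SimpleCounter numRecordsOut = new SimpleCounter();
        private final TaskIoGroup task;

        OperatorIoGroup(TaskIoGroup task) {
            this.task = task;
        }

        /** Called for the first operator in the chain. */
        void shareInputCounterWithTask() {
            task.reuseRecordsIn(numRecordsIn);
        }

        /** Called for the last operator in the chain. */
        void shareOutputCounterWithTask() {
            task.reuseRecordsOut(numRecordsOut);
        }
    }
}
```

Because the same counter object is registered twice, the task-level value follows the operator-level one without any copying.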

With these metrics being re-registered they can be accessed easily via the 
MetricQueryService from the WebInterface handlers. The downside is that this 
service provides no guarantee that the most up-to-date metrics for a finished 
task will be transferred. It was thus necessary to store a snapshot of these 
IOMetrics within the ExecutionGraph, similar to the system accumulators, which 
the handlers could access as well.

The handlers were finally adjusted in 
8be5145a9406dc8d6d661299c9ee98aa09233df4. For running tasks they access metrics 
via the MetricQueryService, whereas for finished tasks they rely on the metrics 
stored in the ExecutionGraph.
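
The choice between the two sources can be sketched as below. This is a simplified illustration under stated assumptions: `IoSnapshot` is a hypothetical stand-in for the snapshot stored in the ExecutionGraph, and a plain `Map<String, String>` stands in for the metrics obtained via the MetricQueryService.

```java
import java.util.Map;

final class IoMetricsSourceSketch {

    /** Hypothetical final snapshot stored for a finished execution attempt. */
    static final class IoSnapshot {
        final long numRecordsIn;
        final long numRecordsOut;

        IoSnapshot(long numRecordsIn, long numRecordsOut) {
            this.numRecordsIn = numRecordsIn;
            this.numRecordsOut = numRecordsOut;
        }
    }

    /**
     * Returns {recordsIn, recordsOut}. A non-null snapshot means the attempt has
     * finished, so its final values win; otherwise the live metrics are used,
     * with absent entries (or an absent store) counted as 0.
     */
    static long[] recordsInOut(IoSnapshot finalSnapshot, Map<String, String> liveMetrics) {
        if (finalSnapshot != null) {
            return new long[] {finalSnapshot.numRecordsIn, finalSnapshot.numRecordsOut};
        }
        if (liveMetrics == null) {
            return new long[] {0L, 0L};
        }
        return new long[] {
            Long.parseLong(liveMetrics.getOrDefault("numRecordsIn", "0")),
            Long.parseLong(liveMetrics.getOrDefault("numRecordsOut", "0"))
        };
    }
}
```

Preferring the snapshot avoids reporting a stale live value for a task that has already finished.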



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zentol/flink 4733_metrics_port

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/2616.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2616


commit 5f0f3598fa5d0fdf8b61d591e2bb94b74924ee0d
Author: zentol 
Date:   2016-10-07T11:02:10Z

[FLINK-4773] [metrics] [refactor] Rename IOMetricGroup to TaskIOMetricGroup

commit df40a58c74e7f0fc3feec4a5848f1627bf4537dd
Author: zentol 
Date:   2016-10-05T13:04:03Z

[FLINK-4773] [metrics] [refactor] Introduce OperatorIOMetricGroup

commit 2685f6a908a0ce4cc9fe3d97beca005ea3d59ee5
Author: zentol 
Date:   2016-10-07T08:11:31Z

[FLINK-4772] [metrics] Store metrics as strings in MetricStore

commit 33297e716a0a327fad20331813a582642c5e68e3
Author: zentol 
Date:   2016-10-07T08:16:49Z

[FLINK-4775] [metrics] Simplify MetricStore access

commit dfed8166272b361684594f61b401c38f0d68ebd6
Author: zentol 
Date:   2016-10-07T11:11:58Z

[FLINK-4774] [metrics] [hotfix] Fix scope concatenation in QueryScopeInfo

commit a7011e8305d7c828fabc4245358c2d21568fd561
Author: zentol 
Date:   2016-10-07T11:12:31Z

[FLINK-4733] [metrics] Port TaskManagersHandler

commit cbff6d6aab80bc423a09aa6b62c80a2f409d796a
Author: zentol 
Date:   2016-10-07T11:12:41Z

[FLINK-4733] [metrics] Remove old TaskManager metrics

commit cab25496ff5991de60e757f68c5d5139c86f34ba
Author: zentol 
Date:   2016-10-05T13:12:22Z

[FLINK-4733] [metrics] Remove system accumulators

commit 16983485198a61bec0418adb833508dcaf276170