[GitHub] flink pull request: [FLINK-1501] Add metrics library for monitorin...

2015-03-30 Thread rmetzger
Github user rmetzger commented on the pull request:

https://github.com/apache/flink/pull/421#issuecomment-87798034
  
Confirmed ;)

Looking forward to talking to you tomorrow. My Google Hangouts ID is 
metrob...@gmail.com.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [FLINK-1501] Add metrics library for monitorin...

2015-03-30 Thread bhatsachin
Github user bhatsachin commented on the pull request:

https://github.com/apache/flink/pull/421#issuecomment-87796282
  
Great. 17:00 India Time Zone (UTC+05:30) would be perfect.




[GitHub] flink pull request: [FLINK-1501] Add metrics library for monitorin...

2015-03-30 Thread rmetzger
Github user rmetzger commented on the pull request:

https://github.com/apache/flink/pull/421#issuecomment-87781100
  
Cool. 
I'm available between 16:30 and 20:30 India Time Zone (UTC+05:30). Does that 
work for you?




[GitHub] flink pull request: [FLINK-1501] Add metrics library for monitorin...

2015-03-30 Thread bhatsachin
Github user bhatsachin commented on the pull request:

https://github.com/apache/flink/pull/421#issuecomment-87776384
  
Thanks a lot, Robert. Let us have a hangout session tomorrow (Tuesday). 
Please suggest a time that is convenient for you. My other friends from IIT Mandi will 
also join in.




[GitHub] flink pull request: [FLINK-1501] Add metrics library for monitorin...

2015-03-27 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/flink/pull/421




[GitHub] flink pull request: [FLINK-1501] Add metrics library for monitorin...

2015-03-27 Thread rmetzger
Github user rmetzger commented on the pull request:

https://github.com/apache/flink/pull/421#issuecomment-86898969
  
I've filed a JIRA for the changes requested here: 
https://issues.apache.org/jira/browse/FLINK-1792




[GitHub] flink pull request: [FLINK-1501] Add metrics library for monitorin...

2015-03-27 Thread rmetzger
Github user rmetzger commented on the pull request:

https://github.com/apache/flink/pull/421#issuecomment-86898299
  
Hey @bhatsachin,
I've merged the change to master.

If you want, we can do a quick Hangout or Skype call to discuss potential 
contributions from your side.






[GitHub] flink pull request: [FLINK-1501] Add metrics library for monitorin...

2015-03-26 Thread rmetzger
Github user rmetzger commented on the pull request:

https://github.com/apache/flink/pull/421#issuecomment-86517135
  
I rebased the code to the current master in the branch "flink1501-rebased" 
in my github (https://github.com/rmetzger/flink/tree/flink1501-rebased).
As soon as the tests pass, I'll merge it to master: 
https://travis-ci.org/rmetzger/flink/builds/55949997




[GitHub] flink pull request: [FLINK-1501] Add metrics library for monitorin...

2015-03-26 Thread rmetzger
Github user rmetzger commented on the pull request:

https://github.com/apache/flink/pull/421#issuecomment-86440012
  
No, there are no objections.
I'll rebase this pull request to the current master and merge it later 
today.

I think the next steps are:
- Get the CPU utilization in % from each TaskManager process
- Remove the metrics graph from the overview and show only the current 
stats as numbers (CPU load, heap utilization), with a button to enable the 
detailed graph.
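
A minimal sketch of the "stats as numbers" idea, assuming the values are read from the standard JVM management beans. The class and method names (`HeapStatsSketch`, `heapUtilizationPercent`, `summaryLine`) are made up for illustration and are not part of this pull request. It covers heap utilization and the OS load average only; a true CPU utilization in % would need a different source.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.util.Locale;

class HeapStatsSketch {

    // Percentage of the heap currently in use.
    // MemoryUsage.getMax() may be -1 (undefined), so fall back to the committed size.
    static double heapUtilizationPercent(long used, long committed, long max) {
        long capacity = max > 0 ? max : committed;
        if (capacity <= 0) {
            return 0.0;
        }
        return 100.0 * used / capacity;
    }

    // One line of numbers instead of a chart.
    // getSystemLoadAverage() returns a negative value where it is unsupported.
    static String summaryLine() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        double pct = heapUtilizationPercent(heap.getUsed(), heap.getCommitted(), heap.getMax());
        double load = ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage();
        return String.format(Locale.ROOT, "load: %.2f, heap: %.1f%%", load, pct);
    }

    public static void main(String[] args) {
        System.out.println(summaryLine());
    }
}
```

Keeping the percentage calculation separate from the bean access makes it easy to check without a running TaskManager.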






[GitHub] flink pull request: [FLINK-1501] Add metrics library for monitorin...

2015-03-26 Thread bhatsachin
Github user bhatsachin commented on the pull request:

https://github.com/apache/flink/pull/421#issuecomment-86423248
  
Hey Robert, could you please list the further monitoring enhancements 
required for this pull request? Are there any objections to the merge?




[GitHub] flink pull request: [FLINK-1501] Add metrics library for monitorin...

2015-03-23 Thread rmetzger
Github user rmetzger commented on the pull request:

https://github.com/apache/flink/pull/421#issuecomment-84935408
  
Hey @bhatsachin, I've started working on the per-job monitoring, but it's 
currently a work in progress and I have not found time to finish it yet.

If you are interested in working on the topic, I would suggest 
enhancing the monitoring I've added in this pull request (the TaskManager 
monitoring).


If nobody has any objections, I would like to merge this change in the next 
24 hours.




[GitHub] flink pull request: [FLINK-1501] Add metrics library for monitorin...

2015-03-21 Thread bhatsachin
Github user bhatsachin commented on the pull request:

https://github.com/apache/flink/pull/421#issuecomment-84282886
  
Hey, I am one of the IIT Mandi students contributing to Flink. I would like 
to pick up a task pertaining to the monitoring enhancements. Robert, what is 
the status on the per job monitoring task? Are there any further changes to 
this PR?




[GitHub] flink pull request: [FLINK-1501] Add metrics library for monitorin...

2015-02-23 Thread mxm
Github user mxm commented on the pull request:

https://github.com/apache/flink/pull/421#issuecomment-75600898
  
I didn't fully understand that you wanted to have one chart containing all 
task managers' load. That's a good thing. If it is only one chart, the overhead 
to update it should not be as high as creating a chart for every task manager 
(like it is now).




[GitHub] flink pull request: [FLINK-1501] Add metrics library for monitorin...

2015-02-23 Thread fhueske
Github user fhueske commented on the pull request:

https://github.com/apache/flink/pull/421#issuecomment-75578714
  
I am not sure about that. Would you like to scroll through, say, 100 detailed 
charts where only 5 fit on a screen to check whether there are one or more 
misbehaving nodes?
Three random nodes don't tell you a lot. There might always be another 
misbehaving one that you have to look for whenever you think something is not 
working right.

Having a small overview with the possibility for detailed analysis is the 
way to go in the long run, IMO.




[GitHub] flink pull request: [FLINK-1501] Add metrics library for monitorin...

2015-02-23 Thread mxm
Github user mxm commented on the pull request:

https://github.com/apache/flink/pull/421#issuecomment-75566621
  
@fhueske The user could do so by selecting "Show all task managers" and 
then identifying the struggling task manager. For large cluster setups, it makes 
sense to sample from just a few task managers.




[GitHub] flink pull request: [FLINK-1501] Add metrics library for monitorin...

2015-02-23 Thread fhueske
Github user fhueske commented on the pull request:

https://github.com/apache/flink/pull/421#issuecomment-75564012
  
Thanks for the detailed response.
I am not sure how helpful it is to show three random TMs (incl. a shuffling 
button to show other random ones). I think it is not uncommon that a single node or a 
few nodes are struggling (data skew, hardware problems, ...), and IMO it would be very 
cool if a user could quickly identify such a node and get the detailed stats.




[GitHub] flink pull request: [FLINK-1501] Add metrics library for monitorin...

2015-02-23 Thread mxm
Github user mxm commented on the pull request:

https://github.com/apache/flink/pull/421#issuecomment-75554187
  
Looks really nice and informative :+1: 

Some suggestions:

- It would be great if one could specify the number of task managers to 
see. If the number of task managers shown is smaller than the total number of 
task managers, there should be a "shuffle" button to show a random selection of 
task managers.

- Add some information about the different memory statistics. The labels 
might not be intuitive for the average user.
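
The "shuffle" suggestion in the first bullet could be sketched roughly as follows. This is only an illustration: `TaskManagerSampleSketch`, `randomSample` and `needsShuffleButton` are hypothetical names, and the task managers are represented by plain strings rather than anything from the Flink code base.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class TaskManagerSampleSketch {

    // Returns up to `count` randomly chosen entries without modifying the input list.
    static <T> List<T> randomSample(List<T> all, int count, Random rnd) {
        List<T> copy = new ArrayList<>(all);
        Collections.shuffle(copy, rnd);
        int n = Math.max(0, Math.min(count, copy.size()));
        return new ArrayList<>(copy.subList(0, n));
    }

    // The shuffle button only makes sense when some task managers are hidden.
    static boolean needsShuffleButton(int shown, int total) {
        return shown < total;
    }

    public static void main(String[] args) {
        List<String> tms = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            tms.add("tm-" + i);
        }
        List<String> shown = randomSample(tms, 3, new Random());
        System.out.println(shown + " shuffle button: " + needsShuffleButton(shown.size(), tms.size()));
    }
}
```

Each press of the button would simply call `randomSample` again with the user-chosen count.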




[GitHub] flink pull request: [FLINK-1501] Add metrics library for monitorin...

2015-02-23 Thread rmetzger
Github user rmetzger commented on the pull request:

https://github.com/apache/flink/pull/421#issuecomment-75545711
  
Thanks everybody for the positive feedback!
> What does the OS load mean? It would be really awesome to show the CPU 
load, too. I think this is a helpful indicator.

On the OS load: 
http://blog.scoutapp.com/articles/2009/07/31/understanding-load-averages

I totally agree that the OS load is not a very good metric for our 
purposes. 
The reason I didn't try to get better metrics is that I didn't 
want to play "ugly tricks" to get them.
My code gets the metrics only via the management beans. The 
`OperatingSystemMXBean` exposes only the load average and the number of processor 
cores:

http://docs.oracle.com/javase/7/docs/api/java/lang/management/OperatingSystemMXBean.html#getSystemLoadAverage()
There is another implementation of the `OperatingSystemMXBean` 
(https://docs.oracle.com/javase/7/docs/jre/api/management/extension/com/sun/management/OperatingSystemMXBean.html)
 which also exposes methods like `getProcessCpuLoad()`.
But the availability of this management bean depends on the JVM 
version in use, etc.

Another way to get the CPU load of the process would be parsing the output 
of `ps` or `top`. But that also falls into the category of "ugly tricks".
I think we should aim for getting those metrics into the system as well. 
Adding them is a matter of registering another Gauge in the TaskManager's 
metrics registry and visualizing the JSON output.
I hope that these kinds of refinements are done by external contributors.
Once this PR has been merged, I'll file a JIRA to improve the CPU 
monitoring.
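
To illustrate what "registering another Gauge" might look like, here is a rough sketch that is not the code in this PR. `SimpleGauge` is a stand-in for the metrics library's gauge type (a single method returning the current value), and the map is a stand-in for the TaskManager's metrics registry. Both, along with the metric names, are assumptions made so the example compiles with the JDK alone. The extended bean is used only if the running JVM provides it.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the metrics library's gauge type:
// a single method that returns the current value.
interface SimpleGauge<T> {
    T getValue();
}

class CpuGaugeSketch {

    // Process CPU load in [0.0, 1.0], or a negative value when the extended
    // com.sun.management bean (or the measurement itself) is not available.
    static double processCpuLoad() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.OperatingSystemMXBean) {
            return ((com.sun.management.OperatingSystemMXBean) os).getProcessCpuLoad();
        }
        return -1.0;
    }

    // Hypothetical registry: a plain map from metric name to gauge.
    static Map<String, SimpleGauge<Double>> buildGauges() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        Map<String, SimpleGauge<Double>> gauges = new LinkedHashMap<>();
        gauges.put("os.load", () -> os.getSystemLoadAverage());
        gauges.put("os.cores", () -> (double) os.getAvailableProcessors());
        gauges.put("process.cpu", () -> processCpuLoad());
        return gauges;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, SimpleGauge<Double>> e : buildGauges().entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue().getValue());
        }
    }
}
```

The `instanceof` check keeps the code working on JVMs without the extended bean, at the cost of reporting a negative value there.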

> What are the current options for showing the detailed metrics? I see a 
"show 3 TMs" and "show all TMs" button in the screenshot? Can you select which 
three to show?

No, you cannot choose which three TMs. 
I added these buttons because on a large Flink cluster (50+ nodes), 
updating all the charts puts quite some load on the browser. Usually it's 
sufficient to monitor the load of only a few TMs, because they are (ideally) 
doing mostly the same work.
But I agree that there is room for improvement.

> How about we open a document and sketch the design of the monitoring and 
create smaller PRs to get there step-by-step.

I totally agree that we should do small incremental improvements. 
As I said in the PR description, the primary purpose of this PR is to get 
the basic monitoring infrastructure in place. How we present the data in the 
end is subject to further PRs.


I have started working on the "per-job" monitoring and found that I have to 
change some details of this PR as well.
Depending on my progress on the "per-job" monitoring, I might contribute the 
changes here together with the "per-job" metrics. If I don't have enough time 
this week to open a PR for the per-job metrics, I'll merge this 
change to master.




[GitHub] flink pull request: [FLINK-1501] Add metrics library for monitorin...

2015-02-20 Thread fhueske
Github user fhueske commented on the pull request:

https://github.com/apache/flink/pull/421#issuecomment-75313413
  
This is much needed monitoring and really great!
What are the current options for showing the detailed metrics? I see a 
"show 3 TMs" and "show all TMs" button in the screenshot? Can you select which 
three to show? 

How about showing small indicators (load, % non-Flink-managed heap, GC 
interval) with a simple color coding (red for hot, blue for cool). This would 
help to find TMs which are more loaded than others. The detailed view could be 
opened by clicking on the TMs.
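
As a rough illustration of the color coding, the load could be normalized by the number of cores and mapped to a small set of levels. The thresholds (0.7 and 1.0 load per core), the extra "warm" level, and all names here are assumptions made for the sketch, not something agreed on in this thread.

```java
class LoadIndicatorSketch {

    enum Heat { UNKNOWN, COOL, WARM, HOT }

    // Classifies a TaskManager by its load average per core.
    // The 0.7 and 1.0 thresholds are illustrative only.
    static Heat classify(double loadAverage, int cores) {
        if (loadAverage < 0 || cores <= 0) {
            return Heat.UNKNOWN; // the load average is negative where it is unsupported
        }
        double perCore = loadAverage / cores;
        if (perCore >= 1.0) {
            return Heat.HOT;
        }
        if (perCore >= 0.7) {
            return Heat.WARM;
        }
        return Heat.COOL;
    }

    // Red for hot, blue for cool, as suggested above.
    static String cssColor(Heat heat) {
        switch (heat) {
            case HOT:
                return "red";
            case WARM:
                return "orange";
            case COOL:
                return "blue";
            default:
                return "gray";
        }
    }

    public static void main(String[] args) {
        System.out.println(cssColor(classify(6.5, 4)));
    }
}
```

The same pattern would apply to the other indicators (non-Flink-managed heap, GC interval), each with its own thresholds.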

But we do not need to get the perfect solution at once. 
How about we open a document and sketch the design of the monitoring and 
create smaller PRs to get there step-by-step.
This PR is definitely a huge step in the right direction!




[GitHub] flink pull request: [FLINK-1501] Add metrics library for monitorin...

2015-02-20 Thread vasia
Github user vasia commented on the pull request:

https://github.com/apache/flink/pull/421#issuecomment-75218017
  
This looks great! ^^




[GitHub] flink pull request: [FLINK-1501] Add metrics library for monitorin...

2015-02-20 Thread tillrohrmann
Github user tillrohrmann commented on the pull request:

https://github.com/apache/flink/pull/421#issuecomment-75209814
  
Indeed :-)

What does the OS load mean? It would be really awesome to show the CPU 
load, too. I think this is a helpful indicator. 




[GitHub] flink pull request: [FLINK-1501] Add metrics library for monitorin...

2015-02-19 Thread hsaputra
Github user hsaputra commented on the pull request:

https://github.com/apache/flink/pull/421#issuecomment-75091767
  
You are the man, Robert!




[GitHub] flink pull request: [FLINK-1501] Add metrics library for monitorin...

2015-02-19 Thread rmetzger
GitHub user rmetzger opened a pull request:

https://github.com/apache/flink/pull/421

[FLINK-1501] Add metrics library for monitoring TaskManagers

Hey,
I've spent some time exploring the 
[metrics](https://dropwizard.github.io/metrics/3.1.0/) library for improving 
the performance monitoring in Flink.

This pull request is a first step in that direction. The primary 
objective is a clean integration of the JVM monitoring into our system.

I spent probably 80% of the time making the JavaScript frontend work. 
For that, I've used [rickshaw](https://github.com/shutterstock/rickshaw), a 
library also used by projects like Apache Ambari for creating nice graphs.
Still, the visualization is not perfect and I would like to see incremental 
improvements there.

The next step for me will be metrics for individual jobs.


![newmonitoring](https://cloud.githubusercontent.com/assets/89049/6268186/391731bc-b84b-11e4-8379-cbd5428651c4.png)

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rmetzger/flink flink1501

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/421.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #421


commit 13d17153ccb6adb84f74e72261223b61382f4371
Author: Robert Metzger 
Date:   2015-02-07T10:33:31Z

[FLINK-1501] Add metrics library for monitoring TaskManagers



