[GitHub] flink pull request: [FLINK-1501] Add metrics library for monitorin...
Github user rmetzger commented on the pull request: https://github.com/apache/flink/pull/421#issuecomment-87798034 Confirmed ;) Looking forward to talking to you tomorrow. My google hangout id is metrob...@gmail.com. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [FLINK-1501] Add metrics library for monitorin...
Github user bhatsachin commented on the pull request: https://github.com/apache/flink/pull/421#issuecomment-87796282 Great. 17:00 India Time Zone (UTC+05:30) would be perfect.
[GitHub] flink pull request: [FLINK-1501] Add metrics library for monitorin...
Github user rmetzger commented on the pull request: https://github.com/apache/flink/pull/421#issuecomment-87781100 Cool. I'm available between 16:30 - 20:30 India Time Zone (UTC+05:30). Is that possible for you?
[GitHub] flink pull request: [FLINK-1501] Add metrics library for monitorin...
Github user bhatsachin commented on the pull request: https://github.com/apache/flink/pull/421#issuecomment-87776384 Thanks a lot Robert, let us have a hangout session tomorrow (Tuesday). Please suggest a time of your convenience. My other friends from IIT Mandi will also join in.
[GitHub] flink pull request: [FLINK-1501] Add metrics library for monitorin...
Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/421
[GitHub] flink pull request: [FLINK-1501] Add metrics library for monitorin...
Github user rmetzger commented on the pull request: https://github.com/apache/flink/pull/421#issuecomment-86898969 I've filed a JIRA for the changes requested here: https://issues.apache.org/jira/browse/FLINK-1792
[GitHub] flink pull request: [FLINK-1501] Add metrics library for monitorin...
Github user rmetzger commented on the pull request: https://github.com/apache/flink/pull/421#issuecomment-86898299 Hey @bhatsachin, I've merged the change to master. If you want, we can do a quick hangout or skype call to discuss potential contributions from your side.
[GitHub] flink pull request: [FLINK-1501] Add metrics library for monitorin...
Github user rmetzger commented on the pull request: https://github.com/apache/flink/pull/421#issuecomment-86517135 I rebased the code to the current master in the branch "flink1501-rebased" in my github (https://github.com/rmetzger/flink/tree/flink1501-rebased). As soon as the tests are going through, I'll merge it to master: https://travis-ci.org/rmetzger/flink/builds/55949997
[GitHub] flink pull request: [FLINK-1501] Add metrics library for monitorin...
Github user rmetzger commented on the pull request: https://github.com/apache/flink/pull/421#issuecomment-86440012

No, there are no objections. I'll rebase this pull request to the current master and merge it later today.

I think the next steps are:
- Get the CPU utilization in % from each TaskManager process
- Remove the metrics graph from the overview and only show the current stats as numbers (CPU load, heap utilization), and add a button to enable the detailed graph.
[GitHub] flink pull request: [FLINK-1501] Add metrics library for monitorin...
Github user bhatsachin commented on the pull request: https://github.com/apache/flink/pull/421#issuecomment-86423248 Hey Robert, could you please list the further monitoring enhancements required for this pull request? Also, are there any objections to the merge?
[GitHub] flink pull request: [FLINK-1501] Add metrics library for monitorin...
Github user rmetzger commented on the pull request: https://github.com/apache/flink/pull/421#issuecomment-84935408 Hey @bhatsachin, I've started working on the per-job monitoring, but it's currently a work in progress and I have not found time to finish it yet. If you are interested in working on the topic, I would actually suggest enhancing the monitoring I've added in this pull request (the TaskManager monitoring). If nobody has any objections, I would like to merge this change in the next 24 hours.
[GitHub] flink pull request: [FLINK-1501] Add metrics library for monitorin...
Github user bhatsachin commented on the pull request: https://github.com/apache/flink/pull/421#issuecomment-84282886 Hey, I am one of the IIT Mandi students contributing to Flink. I would like to pick up a task pertaining to the monitoring enhancements. Robert, what is the status on the per job monitoring task? Are there any further changes to this PR?
[GitHub] flink pull request: [FLINK-1501] Add metrics library for monitorin...
Github user mxm commented on the pull request: https://github.com/apache/flink/pull/421#issuecomment-75600898 I didn't fully understand that you wanted to have one chart containing all task managers' load. That's a good thing. If it is only one chart, the overhead to update it should not be as high as creating a chart for every task manager (like it is now).
[GitHub] flink pull request: [FLINK-1501] Add metrics library for monitorin...
Github user fhueske commented on the pull request: https://github.com/apache/flink/pull/421#issuecomment-75578714 I am not sure about that. Would you like to scroll through, say, 100 detailed charts where only 5 fit on a screen to check whether there are one or more misbehaving nodes? Three random nodes don't tell you a lot. There might always be another misbehaving one that you have to check for if you think something is not working right. Having a small overview with the possibility for detailed analysis is the way to go in the long run, IMO.
[GitHub] flink pull request: [FLINK-1501] Add metrics library for monitorin...
Github user mxm commented on the pull request: https://github.com/apache/flink/pull/421#issuecomment-75566621 @fhueske The user could do so by selecting "Show all task managers" and then identify the struggling task manager. For large cluster setups, it makes sense to sample just from a few task managers.
[GitHub] flink pull request: [FLINK-1501] Add metrics library for monitorin...
Github user fhueske commented on the pull request: https://github.com/apache/flink/pull/421#issuecomment-75564012 Thanks for the detailed response. I am not sure how helpful it is to show three random TMs (incl. a shuffling button to show other random ones). I think it is not uncommon that one or a few nodes are struggling (data skew, hw problems, ...), and it would be IMO very cool if a user could quickly identify such a node and get the detailed stats.
[GitHub] flink pull request: [FLINK-1501] Add metrics library for monitorin...
Github user mxm commented on the pull request: https://github.com/apache/flink/pull/421#issuecomment-75554187

Looks really nice and informative :+1:

Some suggestions:
- It would be great if one could specify the number of task managers to see. If the number of task managers shown is smaller than the total number of task managers, there should be a "shuffle" button to show a random selection of task managers.
- Add some information about the different memory statistics. The labels might not be intuitive for the average user.
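
The "shuffle" suggestion above boils down to drawing a random subset of a requested size. The following is only a sketch of that selection logic in Java; the frontend in this pull request is JavaScript, and the class name `TaskManagerSampler`, its methods, and the task manager ids are hypothetical names made up for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Illustrative sketch only: names are hypothetical and not part of Flink.
final class TaskManagerSampler {

    /** Returns up to n randomly chosen entries of all, leaving the input list untouched. */
    static <T> List<T> sample(List<T> all, int n, Random rnd) {
        if (n < 0) {
            throw new IllegalArgumentException("n must not be negative");
        }
        List<T> copy = new ArrayList<>(all);   // shuffle a copy, not the caller's list
        Collections.shuffle(copy, rnd);
        return new ArrayList<>(copy.subList(0, Math.min(n, copy.size())));
    }

    /** A shuffle button only makes sense when some task managers are hidden. */
    static boolean needsShuffleButton(int shown, int total) {
        return shown < total;
    }

    public static void main(String[] args) {
        List<String> taskManagers = List.of("tm-1", "tm-2", "tm-3", "tm-4", "tm-5");
        System.out.println(sample(taskManagers, 3, new Random()));
    }
}
```

Each press of the button would call `sample` again with a fresh random source, so repeated presses eventually show every task manager.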
[GitHub] flink pull request: [FLINK-1501] Add metrics library for monitorin...
Github user rmetzger commented on the pull request: https://github.com/apache/flink/pull/421#issuecomment-75545711

Thanks everybody for the positive feedback!

> What does the OS load mean? It would be really awesome to show the CPU load, too. I think this is a helpful indicator.

On the OS load: http://blog.scoutapp.com/articles/2009/07/31/understanding-load-averages

I totally agree that the OS load is not a very good metric for our purposes. The reason why I didn't try to get better metrics for this is that I didn't want to play "ugly tricks" to get them. My code is getting the metrics only via the management beans. The `OperatingSystemMXBean` is only exposing the load and the number of processor cores: http://docs.oracle.com/javase/7/docs/api/java/lang/management/OperatingSystemMXBean.html#getSystemLoadAverage()

There is another implementation of the `OperatingSystemMXBean` (https://docs.oracle.com/javase/7/docs/jre/api/management/extension/com/sun/management/OperatingSystemMXBean.html) which is also exposing stuff like `getProcessCpuLoad()`. But the availability of this management bean depends on the used JVM version etc. Another way to get the CPU load of the process would be parsing the output of `ps` or `top`. But that also falls into the category of "ugly tricks".

I think we should aim for getting those metrics into the system as well. Adding them is a matter of registering another Gauge in the TaskManager's metrics registry and visualizing the JSON output. I hope that these kinds of refinements are done by external contributors. Once this PR has been merged, I'll file a JIRA to improve the CPU monitoring.

> What are the current options for showing the detailed metrics? I see a "show 3 TMs" and "show all TMs" button in the screenshot? Can you select which three to show?

No, you cannot choose which three TMs. I added these buttons because starting a large Flink cluster (50+ nodes) will cause quite some load on the browser updating all the charts. Usually it's sufficient to monitor the load of only a few TMs, because they are mostly doing the same work (ideally). But I agree that there is room for improvement.

> How about we open a document and sketch the design of the monitoring and create smaller PRs to get there step-by-step.

I totally agree that we should do small incremental improvements. As I said in the PR description, the primary purpose of this PR is to get the basic monitoring infrastructure in place; how we present the stuff in the end is subject to further PRs.

I have started working on the "per-job" monitoring and found that I have to change some details of this PR as well. Depending on my progress on the "per-job" monitoring, I might contribute the changes here together with the "per-job" metrics. If I don't have enough time this week to open a PR for the per-job metrics, I'll merge this change to master.
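
To make the trade-off between the two management beans concrete, here is a minimal Java sketch (Java 8 or later, because of the lambdas) that reads the system load average from the standard `OperatingSystemMXBean` and falls back gracefully when the `com.sun.management` extension is not implemented by the running JVM. It is not the code from this pull request: the class `CpuGauges`, the gauge names, and the use of `Supplier<Double>` as a stand-in for a metrics-library gauge are all assumptions made for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Illustrative sketch only: class and gauge names are hypothetical, not Flink's.
final class CpuGauges {

    /** System load average for the last minute; negative if the platform does not report it. */
    static double systemLoadAverage() {
        return ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage();
    }

    /** CPU load of this JVM process in [0.0, 1.0]; negative if it is not available. */
    static double processCpuLoad() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        // The extended bean is only implemented by some JVMs, so check before casting.
        if (os instanceof com.sun.management.OperatingSystemMXBean) {
            return ((com.sun.management.OperatingSystemMXBean) os).getProcessCpuLoad();
        }
        return -1.0;
    }

    /** Gauges by name; Supplier stands in for a metrics-library gauge (assumption). */
    static Map<String, Supplier<Double>> gauges() {
        Map<String, Supplier<Double>> gauges = new LinkedHashMap<>();
        gauges.put("cpu.cores", () -> (double) Runtime.getRuntime().availableProcessors());
        gauges.put("cpu.systemLoadAverage", CpuGauges::systemLoadAverage);
        gauges.put("cpu.processLoad", CpuGauges::processCpuLoad);
        return gauges;
    }

    public static void main(String[] args) {
        gauges().forEach((name, gauge) -> System.out.println(name + " = " + gauge.get()));
    }
}
```

Note that the `instanceof` check only helps on JVMs that ship the `com.sun.management` classes; on a JVM that lacks them entirely, the lookup would presumably have to go through reflection instead.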
[GitHub] flink pull request: [FLINK-1501] Add metrics library for monitorin...
Github user fhueske commented on the pull request: https://github.com/apache/flink/pull/421#issuecomment-75313413 This is much needed monitoring and really great! What are the current options for showing the detailed metrics? I see "show 3 TMs" and "show all TMs" buttons in the screenshot. Can you select which three to show? How about showing small indicators (load, % non-Flink-managed heap, GC interval) with a simple color coding (red for hot, blue for cool)? This would help to find TMs which are more loaded than others. The detailed view could be opened by clicking on the TMs. But we do not need to get the perfect solution at once. How about we open a document and sketch the design of the monitoring and create smaller PRs to get there step-by-step? This PR is definitely a huge step in the right direction!
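
One possible reading of the color-coding idea is to normalize the load average by the number of cores and blend from blue to red. The Java sketch below only illustrates that mapping: the class `LoadIndicator`, the choice of "load per core" as the heat value, and the linear blend are assumptions, not something proposed in this thread or implemented in the pull request.

```java
// Illustrative sketch only: the normalization and the color blend are assumptions.
final class LoadIndicator {

    /** Load per core clamped to [0.0, 1.0]; an unknown (negative) load is treated as cool. */
    static double heat(double loadAverage, int cores) {
        if (cores <= 0 || loadAverage < 0) {
            return 0.0;
        }
        return Math.min(1.0, loadAverage / cores);
    }

    /** Linear blend from blue (heat 0.0) to red (heat 1.0) as a CSS hex color. */
    static String color(double heat) {
        int red = (int) Math.round(255 * heat);
        int blue = 255 - red;
        return String.format("#%02x00%02x", red, blue);
    }

    public static void main(String[] args) {
        // A load average of 8 on a 4-core machine is fully "hot".
        System.out.println(color(heat(8.0, 4)));
    }
}
```

The same mapping could be applied to the other indicators mentioned above (heap share, GC interval) once each is normalized to the range 0 to 1.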
[GitHub] flink pull request: [FLINK-1501] Add metrics library for monitorin...
Github user vasia commented on the pull request: https://github.com/apache/flink/pull/421#issuecomment-75218017 This looks great! ^^
[GitHub] flink pull request: [FLINK-1501] Add metrics library for monitorin...
Github user tillrohrmann commented on the pull request: https://github.com/apache/flink/pull/421#issuecomment-75209814 Indeed :-) What does the OS load mean? It would be really awesome to show the CPU load, too. I think this is a helpful indicator.
[GitHub] flink pull request: [FLINK-1501] Add metrics library for monitorin...
Github user hsaputra commented on the pull request: https://github.com/apache/flink/pull/421#issuecomment-75091767 You are the man, Robert!
[GitHub] flink pull request: [FLINK-1501] Add metrics library for monitorin...
GitHub user rmetzger opened a pull request: https://github.com/apache/flink/pull/421

[FLINK-1501] Add metrics library for monitoring TaskManagers

Hey, I've spent some time exploring the [metrics](https://dropwizard.github.io/metrics/3.1.0/) library for improving the performance monitoring in Flink. This pull request is a first step in that direction. The primary objective is a clean integration of the JVM monitoring into our system.

I spent probably 80% of the time on making the JavaScript frontend work. For that, I've used [rickshaw](https://github.com/shutterstock/rickshaw), a project also used by projects like Apache Ambari for creating nice graphs. Still, the visualization is not perfect and I would like to see incremental improvements there.

The next step for me will be metrics for individual jobs.

![newmonitoring](https://cloud.githubusercontent.com/assets/89049/6268186/391731bc-b84b-11e4-8379-cbd5428651c4.png)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rmetzger/flink flink1501

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/421.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #421

commit 13d17153ccb6adb84f74e72261223b61382f4371
Author: Robert Metzger
Date: 2015-02-07T10:33:31Z

    [FLINK-1501] Add metrics library for monitoring TaskManagers