Thanks for taking a look. I took the liberty of creating a pull request for this.
https://github.com/apache/flink/pull/6211

It would be great if you guys could take a look at it and see if it makes sense. I tried it out on our servers and it seems to do the job.

On Tue, 26 Jun 2018 at 18:47, Chesnay Schepler <[email protected]> wrote:

> Great work on debugging this, you're exactly right.
>
> The children we add to the collector have to be removed individually when
> a metric is unregistered.
>
> If the collector is an io.prometheus.client.Gauge we can use the #remove()
> method.
> For histograms we will have to modify our HistogramSummaryProxy class to
> allow removing individual histograms.
>
> I've filed FLINK-9665 <https://issues.apache.org/jira/browse/FLINK-9665>.
>
> On 26.06.2018 17:28, jelmer wrote:
>
> Hi Chesnay, sorry for the late reply. I did not have time to look into
> this sooner.
>
> I did what you suggested. I added some logging to the PrometheusReporter
> like this:
>
> https://github.com/jelmerk/flink/commit/58779ee60a8c3961f3eb2c487c603c33822bba8a
>
> And deployed a custom build of the reporter to our test environment.
>
> I managed to reproduce the issue like this:
>
> 1. Deploy job A : it lands on worker 1
> 2. Deploy job B : it lands on worker 1, take note of the job id
> 3. Redeploy job B by canceling it from a savepoint and deploying it again
> from the savepoint : it lands on worker 3
> 4. Execute curl -s http://localhost:9249/metrics | grep "job id from step 2"
> on worker 1. The metrics are still exposed even though the job is
> canceled.
>
> I attached a piece of the log to the email. What I notice is that the two
> jobs register metrics with the same scoped metric name. In this case
> flink_taskmanager_job_task_buffers_inputQueueLength.
>
> The prometheus exporter seems to use reference counting for the metrics,
> and the metrics will only be removed when the count is 0. Canceling job B
> will lower the counter by 5, but because job A is still deployed the count
> is not 0, so the metric never gets unregistered.
>
> Canceling job A will remove the lingering metrics from the old job B.
>
> It seems to me that this is a bug and that the children that are being
> added in notifyOfAddedMetric
> <https://github.com/jelmerk/flink/commit/58779ee60a8c3961f3eb2c487c603c33822bba8a#diff-36ff6f170e359d30a1265b43659443bfR163>
> should be removed in notifyOfRemovedMetric.
>
> Can you confirm this?
>
>
> --Jelmer
>
>
>
> On Fri, 15 Jun 2018 at 18:01, Chesnay Schepler <[email protected]> wrote:
>
>> I remember that another user reported something similar, but he wasn't
>> using the PrometheusReporter. See
>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/JVM-metrics-disappearing-after-job-crash-restart-tt20420.html
>>
>> We couldn't find the cause, but my suspicion was FLINK-8946, which will be
>> fixed in 1.4.3.
>> You could cherry-pick 8b046fafb6ee77a86e360f6b792e7f73399239bd and see
>> whether this actually caused it.
>>
>> Alternatively, if you can reproduce this it would be immensely helpful if
>> you could modify the PrometheusReporter and log all notifications about
>> added or removed metrics.
>>
>> On 15.06.2018 15:42, Till Rohrmann wrote:
>>
>> Hi,
>>
>> this sounds very strange. I just tried it out locally with a
>> standard metric and the Prometheus metrics seem to be unregistered after
>> the job has reached a terminal state. Thus, it looks as if the standard
>> metrics are properly removed from `CollectorRegistry.defaultRegistry`.
>> Could you check the log files whether they contain anything suspicious
>> about a failed metric deregistration a la
>> `There was a problem unregistering metric`?
>>
>> I've also pulled in Chesnay who knows more about the metric reporters.
>>
>> Cheers,
>> Till
>>
>> On Thu, Jun 14, 2018 at 11:34 PM jelmer <[email protected]> wrote:
>>
>>> Hi
>>>
>>> We are using flink-metrics-prometheus for reporting on Apache Flink 1.4.2
>>>
>>> And I am looking into an issue where it seems that somehow in some cases
>>> the metrics registered
>>> by org.apache.flink.runtime.metrics.groups.TaskIOMetricGroup
>>> (flink_taskmanager_job_task_buffers_outPoolUsage etc) are not being
>>> unregistered in prometheus in case of a job restart
>>>
>>> Eventually this seems to cause a java.lang.NoClassDefFoundError:
>>> org/apache/kafka/common/metrics/stats/Rate$1 error when a new version of
>>> the job is deployed because the jar file
>>> in /tmp/blobStore-foo/job_bar/blob_p-baz-qux has been removed upon
>>> deployment of the new job but the url classloader still points to it and it
>>> cannot find stats/Rate$1 (some synthetic code generated
>>> by the java compiler because it's a switch on an enum)
>>>
>>> Has anybody come across this issue? Has it possibly been fixed in 1.5?
>>> Can somebody give any pointers as to where to look to tackle this?
>>>
>>> Attached screenshot shows the classloader that cannot be garbage
>>> collected, with the gc root
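
For readers following the thread, the reference-counting behaviour described above can be sketched with a small toy model. This is not Flink's PrometheusReporter code and it does not use the Prometheus client library. ToyCollector, ToyReporter, ReporterDemo, the removeChildOnUnregister flag and the "job-a" / "job-b" label values are all hypothetical names made up for illustration. Only the method names notifyOfAddedMetric and notifyOfRemovedMetric and the metric name are taken from the thread. The sketch assumes one collector per scoped metric name, shared by all jobs, with one child per set of label values.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy stand-in for a collector: one child (sample) per list of label values. */
class ToyCollector {
    private final Map<List<String>, Double> children = new HashMap<>();

    void setChild(List<String> labelValues, double value) {
        children.put(labelValues, value);
    }

    void removeChild(List<String> labelValues) {
        children.remove(labelValues);
    }

    boolean hasChild(List<String> labelValues) {
        return children.containsKey(labelValues);
    }
}

/** Toy reporter: one collector per scoped metric name, guarded by a reference count. */
class ToyReporter {
    private static final class Entry {
        final ToyCollector collector = new ToyCollector();
        int refCount;
    }

    private final Map<String, Entry> collectorsByName = new HashMap<>();
    // Hypothetical switch: false models the reported behaviour, true the proposed fix.
    private final boolean removeChildOnUnregister;

    ToyReporter(boolean removeChildOnUnregister) {
        this.removeChildOnUnregister = removeChildOnUnregister;
    }

    void notifyOfAddedMetric(String scopedName, List<String> labelValues, double value) {
        Entry entry = collectorsByName.computeIfAbsent(scopedName, k -> new Entry());
        entry.collector.setChild(labelValues, value);
        entry.refCount++;
    }

    void notifyOfRemovedMetric(String scopedName, List<String> labelValues) {
        Entry entry = collectorsByName.get(scopedName);
        if (entry == null) {
            return;
        }
        if (removeChildOnUnregister) {
            // Proposed fix: drop this metric's own child right away.
            entry.collector.removeChild(labelValues);
        }
        // The whole collector only disappears once no metric uses the name any more.
        if (--entry.refCount == 0) {
            collectorsByName.remove(scopedName);
        }
    }

    boolean isExposed(String scopedName, List<String> labelValues) {
        Entry entry = collectorsByName.get(scopedName);
        return entry != null && entry.collector.hasChild(labelValues);
    }
}

class ReporterDemo {
    public static void main(String[] args) {
        String name = "flink_taskmanager_job_task_buffers_inputQueueLength";
        List<String> jobA = Arrays.asList("job-a");
        List<String> jobB = Arrays.asList("job-b");
        for (boolean fixed : new boolean[] {false, true}) {
            ToyReporter reporter = new ToyReporter(fixed);
            reporter.notifyOfAddedMetric(name, jobA, 0.0);
            reporter.notifyOfAddedMetric(name, jobB, 0.0);
            reporter.notifyOfRemovedMetric(name, jobB); // job B is canceled, job A keeps running
            System.out.println("removeChildOnUnregister=" + fixed
                    + " -> job B still exposed: " + reporter.isExposed(name, jobB));
        }
    }
}
```

In this model, with the flag off, job B's child stays visible for as long as job A keeps the shared name alive, which matches the lingering metrics seen in the reproduction steps. With the flag on, the child is dropped as soon as job B's metric is unregistered. Whether the actual pull request takes exactly this shape is not something the sketch claims.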
