Re: TaskIOMetricGroup metrics not unregistered in prometheus on job failure ?

jelmer Wed, 27 Jun 2018 00:59:01 -0700

Thanks for taking a look. I took the liberty of creating a pull request for
this.


https://github.com/apache/flink/pull/6211

It would be great if you guys could take a look at it and see if it makes
sense. I tried it out on our servers and it seems to do the job.
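
For anyone following along, here is a minimal sketch of the behaviour
discussed in the quoted messages below. It is a simplified model in plain
Java, not the actual PrometheusReporter code and not the code in the pull
request. The class name, the method signatures and the "jobA"/"jobB" label
values are assumptions made purely for illustration. It models one shared
collector per scoped metric name with a reference count, and shows why the
child of a canceled job lingers unless it is removed individually.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simplified, hypothetical model of a reporter that shares one collector per metric name. */
final class RefCountedReporterSketch {

    /** One collector per scoped metric name: its children (keyed by label values) and a reference count. */
    private static final class Collector {
        final Map<List<String>, Integer> children = new HashMap<>();
        int refCount;
    }

    private final Map<String, Collector> collectors = new HashMap<>();
    private final boolean removeChildOnUnregister;

    RefCountedReporterSketch(boolean removeChildOnUnregister) {
        this.removeChildOnUnregister = removeChildOnUnregister;
    }

    /** Registers a child under the shared collector and bumps the reference count. */
    void notifyOfAddedMetric(String name, List<String> labelValues, int value) {
        Collector collector = collectors.computeIfAbsent(name, n -> new Collector());
        collector.children.put(labelValues, value);
        collector.refCount++;
    }

    /** Lowers the reference count; with the fix enabled it also removes the individual child. */
    void notifyOfRemovedMetric(String name, List<String> labelValues) {
        Collector collector = collectors.get(name);
        if (collector == null) {
            return;
        }
        if (removeChildOnUnregister) {
            collector.children.remove(labelValues);
        }
        collector.refCount--;
        if (collector.refCount == 0) {
            // Only now does the whole collector (and every child left in it) disappear.
            collectors.remove(name);
        }
    }

    /** Number of children a scrape would still expose for the given metric name. */
    int exposedChildren(String name) {
        Collector collector = collectors.get(name);
        return collector == null ? 0 : collector.children.size();
    }

    public static void main(String[] args) {
        String name = "flink_taskmanager_job_task_buffers_inputQueueLength";
        for (boolean fix : new boolean[] {false, true}) {
            RefCountedReporterSketch reporter = new RefCountedReporterSketch(fix);
            reporter.notifyOfAddedMetric(name, Arrays.asList("jobA"), 0);
            reporter.notifyOfAddedMetric(name, Arrays.asList("jobB"), 0);
            reporter.notifyOfRemovedMetric(name, Arrays.asList("jobB")); // cancel job B
            System.out.println("fix=" + fix + " exposed=" + reporter.exposedChildren(name));
        }
    }
}
```

Without the per-child removal, job B's child stays exposed for as long as
job A keeps the shared collector alive. With it, only job A's child remains
after job B is canceled.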


On Tue, 26 Jun 2018 at 18:47, Chesnay Schepler <[email protected]> wrote:

> Great work on debugging this, you're exactly right.
>
> The children we add to the collector have to be removed individually when
> a metric is unregistered.
>
> If the collector is a io.prometheus.client.Gauge we can use the #remove() 
> method.
> For histograms we will have to modify our HistogramSummaryProxy class to
> allow removing individual histograms.
>
> I've filed FLINK-9665 <https://issues.apache.org/jira/browse/FLINK-9665>.
>
> On 26.06.2018 17:28, jelmer wrote:
>
> Hi Chesnay, sorry for the late reply. I did not have time to look into
> this sooner.
>
> I did what you suggested and added some logging to the PrometheusReporter,
> like this:
>
>
> https://github.com/jelmerk/flink/commit/58779ee60a8c3961f3eb2c487c603c33822bba8a
>
> And deployed a custom build of the reporter to our test environment.
>
> I managed to reproduce the issue like this:
>
> 1. Deploy job A: it lands on worker 1.
> 2. Deploy job B: it lands on worker 1. Take note of the job id.
> 3. Redeploy job B by canceling it with a savepoint and deploying it again
> from the savepoint: it lands on worker 3.
> 4. Execute curl -s http://localhost:9249/metrics | grep "job id from step
> 2" on worker 1. The metrics are still exposed even though the job is
> canceled.
>
> I attached a piece of the log to the email. What I notice is that the two
> jobs register metrics with the same scoped metric name. In this case
> flink_taskmanager_job_task_buffers_inputQueueLength.
>
> The Prometheus exporter seems to use reference counting for the metrics,
> and a metric will only be removed when its count reaches 0. Canceling job B
> will lower the counter by 5, but because job A is still deployed the count
> does not reach 0, so the metric never gets unregistered.
>
> Canceling job A will remove the lingering metrics from the old job B.
>
> It seems to me that this is a bug and that the children that are being
> added in notifyOfAddedMetric
> <https://github.com/jelmerk/flink/commit/58779ee60a8c3961f3eb2c487c603c33822bba8a#diff-36ff6f170e359d30a1265b43659443bfR163>
> should be removed in notifyOfRemovedMetric.
>
> Can you confirm this?
>
>
> --Jelmer
>
>
>
> On Fri, 15 Jun 2018 at 18:01, Chesnay Schepler <[email protected]> wrote:
>
>> I remember that another user reported something similar, but he wasn't
>> using the PrometheusReporter. See
>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/JVM-metrics-disappearing-after-job-crash-restart-tt20420.html
>>
>> We couldn't find the cause, but my suspicion was FLINK-8946, which will be
>> fixed in 1.4.3.
>> You could cherry-pick 8b046fafb6ee77a86e360f6b792e7f73399239bd and see
>> whether this actually caused it.
>>
>> Alternatively, if you can reproduce this it would be immensely helpful if
>> you could modify the PrometheusReporter and log all notifications about
>> added or removed metrics.
>>
>> On 15.06.2018 15:42, Till Rohrmann wrote:
>>
>> Hi,
>>
>> this sounds very strange. I just tried it out locally with a
>> standard metric and the Prometheus metrics seem to be unregistered after
>> the job has reached a terminal state. Thus, it looks as if the standard
>> metrics are properly removed from `CollectorRegistry.defaultRegistry`.
>> Could you check the log files whether they contain anything suspicious
>> about a failed metric deregistration a la `There was a problem
>> unregistering metric`?
>>
>> I've also pulled in Chesnay who knows more about the metric reporters.
>>
>> Cheers,
>> Till
>>
>> On Thu, Jun 14, 2018 at 11:34 PM jelmer <[email protected]> wrote:
>>
>>> Hi
>>>
>>> We are using flink-metrics-prometheus for reporting on Apache Flink 1.4.2.
>>>
>>> I am looking into an issue where, in some cases, the metrics registered
>>> by org.apache.flink.runtime.metrics.groups.TaskIOMetricGroup
>>> (flink_taskmanager_job_task_buffers_outPoolUsage etc.) are not being
>>> unregistered in Prometheus when a job restarts.
>>>
>>> Eventually this seems to cause a java.lang.NoClassDefFoundError:
>>> org/apache/kafka/common/metrics/stats/Rate$1 error when a new version of
>>> the job is deployed, because the jar file
>>> in /tmp/blobStore-foo/job_bar/blob_p-baz-qux has been removed upon
>>> deployment of the new job but the URL classloader still points to it and
>>> cannot find stats/Rate$1 (a synthetic class generated
>>> by the Java compiler because of a switch on an enum).
>>>
>>> Has anybody come across this issue? Has it possibly been fixed in 1.5?
>>> Can somebody give any pointers as to where to look to tackle this?
>>>
>>> The attached screenshot shows the classloader that cannot be garbage
>>> collected, together with its GC root.
>>>
>>>
>>
>
