Fabian,

It does look like it may be related. I'll add a comment. After digging a bit more, I found that the crash and lack of metrics were precipitated by the JobManager instance crashing and cycling, which caused the job to restart.
Chesnay,

I didn't see anything interesting in our logs. Our reporter config is fairly straightforward (I think):

metrics.reporter.nr.class: com.newrelic.flink.NewRelicReporter
metrics.reporter.nr.interval: 60 SECONDS
metrics.reporters: nr

Nik Davis
Software Engineer
New Relic

On Mon, Jun 4, 2018 at 1:56 AM, Chesnay Schepler <ches...@apache.org> wrote:

> Can you show us the metrics-related configuration parameters in
> flink-conf.yaml?
>
> Please also check the logs for any warnings from the MetricGroup and
> MetricRegistry classes.
>
> On 04.06.2018 10:44, Fabian Hueske wrote:
>
> Hi Nik,
>
> Can you have a look at this JIRA ticket [1] and check whether it is related to
> the problems you are facing?
> If so, would you mind leaving a comment there?
>
> Thank you,
> Fabian
>
> [1] https://issues.apache.org/jira/browse/FLINK-8946
>
> 2018-05-31 4:41 GMT+02:00 Nikolas Davis <nda...@newrelic.com>:
>
>> We keep track of metrics by using the value of
>> MetricGroup::getMetricIdentifier, which returns the fully qualified
>> metric name. The query that we use to monitor metrics filters for metric
>> IDs that match '%Status.JVM.Memory%'. As long as the new metrics come
>> online via the MetricReporter interface, I think the chart would be
>> continuous; we would just see the old JVM memory metrics cycle into new
>> metrics.
>>
>> Nik Davis
>> Software Engineer
>> New Relic
>>
>> On Wed, May 30, 2018 at 5:30 PM, Ajay Tripathy <aj...@yelp.com> wrote:
>>
>>> How are your metrics dimensionalized/named? Task managers often have
>>> UIDs generated for them. The task ID dimension will change on restart. If
>>> you name your metric based on this 'task_id', there will be a discontinuity
>>> with the old metric.
>>>
>>> On Wed, May 30, 2018 at 4:49 PM, Nikolas Davis <nda...@newrelic.com> wrote:
>>>
>>>> Howdy,
>>>>
>>>> We are seeing our task manager JVM metrics disappear over time. This
>>>> last time we correlated it with our job crashing and restarting. I wasn't
>>>> able to grab the failing exception to share. Any thoughts?
>>>>
>>>> We track metrics through the MetricReporter interface. As far as I can
>>>> tell, this more or less only affects the JVM metrics; i.e., most or all other
>>>> metrics continue reporting fine as the job is automatically restarted.
>>>>
>>>> Nik Davis
>>>> Software Engineer
>>>> New Relic
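
---

For reference, below is a minimal sketch of how a reporter built on Flink's MetricReporter interface might track metrics by their fully qualified identifier (via MetricGroup::getMetricIdentifier) and filter for the '%Status.JVM.Memory%' pattern discussed in the thread. The class name and the println sink are illustrative only; the actual com.newrelic.flink.NewRelicReporter is not shown here.

import org.apache.flink.metrics.Metric;
import org.apache.flink.metrics.MetricConfig;
import org.apache.flink.metrics.MetricGroup;
import org.apache.flink.metrics.reporter.MetricReporter;
import org.apache.flink.metrics.reporter.Scheduled;

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical reporter, not the actual NewRelicReporter.
public class IdentifierFilteringReporter implements MetricReporter, Scheduled {

    // Metrics keyed by fully qualified identifier, e.g.
    // "<host>.taskmanager.<tm-id>.Status.JVM.Memory.Heap.Used".
    // Note that the task manager ID embedded in the identifier changes on
    // restart, which is the discontinuity Ajay described.
    private final Map<String, Metric> metrics = new ConcurrentHashMap<>();

    @Override
    public void open(MetricConfig config) {
        // Reporter-specific settings (e.g. credentials) would be read here.
    }

    @Override
    public void close() {}

    @Override
    public void notifyOfAddedMetric(Metric metric, String metricName, MetricGroup group) {
        // getMetricIdentifier returns the fully qualified metric name.
        metrics.put(group.getMetricIdentifier(metricName), metric);
    }

    @Override
    public void notifyOfRemovedMetric(Metric metric, String metricName, MetricGroup group) {
        metrics.remove(group.getMetricIdentifier(metricName));
    }

    @Override
    public void report() {
        // Called on the configured interval (60 SECONDS in the config above).
        // Mirrors the '%Status.JVM.Memory%' query filter from the thread.
        metrics.keySet().stream()
                .filter(id -> id.contains("Status.JVM.Memory"))
                .forEach(id -> System.out.println("reporting: " + id));
    }
}

If JVM memory metrics stop arriving after a JobManager restart, logging inside notifyOfAddedMetric/notifyOfRemovedMetric in a sketch like this would show whether the registry ever re-registers them.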