Hi,

1)
Do you want to expose those metrics as Flink metrics? Or emit those
"metrics"/counters as values to some external system (like Kafka)? The
problem discussed in [1] was that the metrics (Counters) did not fit
in memory, so David suggested holding them in Flink's state and treating
the measured values as regular output of the job.

For the former option, picture a single operator that consumes your CDCs
and outputs something (filtered CDCs? processed CDCs?) to Kafka, while
keeping some counters that you can access via Flink's metric system. The
latter would be the same operator, but instead of a single output it would
have multiple outputs, writing the "counters" as well, for example to
Kafka (or any other system of your choice). Both options are viable; each
has its own pros and cons.
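
A framework-agnostic sketch of that second option (in a real Flink job the
secondary stream would typically be a side output via an `OutputTag`, each
stream going to its own Kafka sink; the `Consumer<String>` sinks and all
names here are illustrative placeholders):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

// Sketch: one operator, two logical outputs -- processed CDC events go to
// the main sink, counter records go to a second sink (e.g. a Kafka topic).
public class CdcCounter {
    private final Map<String, Long> countsPerTable = new HashMap<>();
    private final Consumer<String> mainOut;     // processed events
    private final Consumer<String> metricsOut;  // "counters" as regular records

    public CdcCounter(Consumer<String> mainOut, Consumer<String> metricsOut) {
        this.mainOut = mainOut;
        this.metricsOut = metricsOut;
    }

    public void process(String table, String cdcMessage) {
        countsPerTable.merge(table, 1L, Long::sum); // count per table
        mainOut.accept(cdcMessage);                 // main output, unchanged
    }

    // In Flink this could be driven by a processing-time timer.
    public void flushCounters() {
        countsPerTable.forEach((table, n) ->
            metricsOut.accept(table + "=" + n));
    }
}
```

Since the counters leave the job as ordinary records, the external system
(Kafka, a database, ...) becomes their source of truth instead of the
metrics backend.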

2) You need to persist your metrics somewhere. Why don't you use Flink's
state for that purpose? Upon recovery/initialisation, you can read the
recovered value from state and set the metric to that recovered value.
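
The pattern, stripped of the Flink APIs (in a real job the persisted value
would live in managed state, e.g. a `ValueState<Long>` read back during
`initializeState()`, and the live metric would be a `Counter` from the
operator's metric group; this class is only an illustration of the idea):

```java
// Framework-free sketch of "restore the metric from persisted state".
// Flink does not checkpoint metrics themselves, so on recovery you
// re-apply the last value that you persisted in state.
public class RestorableCounter {
    private long count;

    // Called once on (re)initialization with whatever was recovered from
    // state; 0 on a fresh start.
    public RestorableCounter(long recovered) {
        this.count = recovered;
    }

    public void inc() {
        count++;          // update the live metric on each record...
    }

    public long snapshot() {
        return count;     // ...and persist this value on every checkpoint
    }
}
```

On restart, constructing the counter from the recovered snapshot gives you
continuity instead of starting again from 0.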

3) That seems to be a question a bit unrelated to Flink. Try searching
online for how to calculate percentiles. I haven't thought about it much,
but histograms or sorting all of the values seem to be the options. It's
probably best to use some existing library to do that for you.
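
As a concrete illustration of the sorting approach: a percentile can be
read directly off the sorted sample. A minimal Java sketch using the
nearest-rank method (one common convention; libraries may interpolate
instead, so results can differ slightly):

```java
import java.util.Arrays;

public class Percentiles {
    // Nearest-rank percentile: the smallest value such that at least
    // p percent of the samples are <= it.
    static long percentile(long[] latenciesMs, double p) {
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        // ceil(p/100 * n) gives the 1-based rank; subtract 1 for the index.
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank - 1, 0)];
    }

    public static void main(String[] args) {
        long[] latencies = {10, 20, 30, 40, 50, 60, 70, 80, 90, 100};
        System.out.println(percentile(latencies, 95)); // prints 100
        System.out.println(percentile(latencies, 50)); // prints 50
    }
}
```

Sorting every value is exact but memory-hungry; histogram-based sketches
trade a small error for bounded memory, which is why they are the usual
choice for streaming latencies.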

4) Could you rephrase your question?

Best,
Piotrek

On Sun, 28 Feb 2021 at 14:53, Prasanna kumar <prasannakumarram...@gmail.com>
wrote:

> Hi flinksters,
>
> Scenario: We have cdc messages from our rdbms(various tables) flowing to
> Kafka.  Our flink job reads the CDC messages and creates events based on
> certain rules.
>
> I am using Prometheus and Grafana.
>
> Following are the metrics that I need to calculate:
>
> A) Number of CDC messages wrt each table.
> B) Number of events created wrt each event type.
> C) Average/P99/P95 latency (event created ts - cdc operation ts)
>
> For A and B, I created counters and am able to see the metrics flowing into
> Prometheus. A few questions I have here:
>
> 1) How to create labels for counters in Flink? I did not find any easy
> way to do it. Right now I see that I need to create counters for each
> type of table and event. I referred to one of the community discussions
> [1]. Is there any way apart from this?
>
> 2) When the job gets restarted, the counters go back to 0. How to
> prevent that and get continuity?
>
> For C, I calculated the latency in code for each event and assigned it to
> a histogram. A few questions I have here:
>
> 3) I read in a few blogs [2] that histogram is the best way to get
> latencies. Is there any better idea?
>
> 4) How to create buckets for various ranges? I also read in a community
> email that Flink exposes histograms as summaries. I should also be able
> to see the latencies across timelines.
>
> [1]
> https://stackoverflow.com/questions/58456830/how-to-use-multiple-counters-in-flink
> [2] https://povilasv.me/prometheus-tracking-request-duration/
>
> Thanks,
> Prasanna.
>
