Re: flink-metrics: how to get the application id

2023-09-15 Thread im huzi
Unsubscribe
On Wed, Aug 30, 2023 at 19:14 allanqinjy  wrote:

> hi,
> A question for everyone: when reporting metrics to Prometheus, the jobname gets a randomly
> generated suffix (in the source code it is new AbstractID()). Is there a way to get the
> application id of the job being reported here?


Re: flink-metrics: how to get the application id

2023-08-30 Thread Feng Jin
hi,

You can try reading the _APP_ID environment variable:
System.getenv(YarnConfigKeys.ENV_APP_ID);

https://github.com/apache/flink/blob/6c9bb3716a3a92f3b5326558c6238432c669556d/flink-yarn/src/main/java/org/apache/flink/yarn/YarnConfigKeys.java#L28
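For example, a minimal sketch (assuming the variable is set in the YARN container
environment; the class and metric names are illustrative) that attaches the application
id to user metrics as a key/value group instead of relying on the job name:

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Counter;

public class AppIdTaggedMap extends RichMapFunction<String, String> {

    private transient Counter reported;

    @Override
    public void open(Configuration parameters) {
        // _APP_ID is the variable behind YarnConfigKeys.ENV_APP_ID
        String appId = System.getenv("_APP_ID");
        reported = getRuntimeContext()
                .getMetricGroup()
                .addGroup("application_id", appId == null ? "unknown" : appId)
                .counter("recordsReported");
    }

    @Override
    public String map(String value) {
        reported.inc();
        return value;
    }
}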


Best,
Feng

On Wed, Aug 30, 2023 at 7:14 PM allanqinjy  wrote:

> hi,
> A question for everyone: when reporting metrics to Prometheus, the jobname gets a randomly
> generated suffix (in the source code it is new AbstractID()). Is there a way to get the
> application id of the job being reported here?


Re: Flink metrics via Prometheus or OpenTelemetry

2022-02-24 Thread Nicolaus Weidner
Hi Sigalit,

first of all, have you read the docs page on metrics [1], and in particular
the Prometheus section on metrics reporters [2]?
Apart from that, there is also a (somewhat older) blog post about
integrating Flink with Prometheus, including a link to a repo with example
code [3].

Hope that helps to get you started!
Best,
Nico

[1]
https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/ops/metrics/#metrics
[2]
https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/metric_reporters/#prometheus
[3] https://flink.apache.org/features/2019/03/11/prometheus-monitoring.html

On Wed, Feb 23, 2022 at 8:42 AM Sigalit Eliazov  wrote:

> Hello. I am looking for a way to expose Flink metrics via OpenTelemetry to
> the GCP Cloud Monitoring dashboard.
> Does anyone have experience with that?
>
> If it is not directly possible, we thought about using Prometheus as
> middleware. If you have experience with that I would appreciate any
> guidance.
>
> Thanks
>


Re: Flink Metrics Naming

2021-06-01 Thread Chesnay Schepler

Some more background on MetricGroups:
Internally there are (mostly) 3 types of metric groups:
On the one hand we have the ComponentMetricGroups (like
TaskManagerMetricGroup) that describe a high-level Flink entity and
just add a constant expression to the logical scope (like taskmanager,
task etc.). These exist to support scope formats (although this
should've been implemented differently, but that's another story).


On the other hand we have groups created via addGroup(String), which are
added to the logical scope as is; this is sometimes good (e.g.,
addGroup("KafkaConsumer")) and sometimes isn't (e.g., addGroup(...)).
Finally, there is an addGroup(String, String) variant, which behaves like
a key-value pair (and similarly to the ComponentMetricGroups). The key
part is added to the logical scope, and a label is usually added as well.


Due to historical reasons some parts in Flink use addGroup(String) 
despite the key-value pair variant being more appropriate; the latter 
was only added later, as was the logical scope as a whole for that matter.


With that said, the logical scope and labels suffer a bit from being
retrofitted onto an existing design and some early mistakes in the metric
structuring.
Ideally (imo), things would work like this (*bold* parts signify changes
to the current behavior):
- addGroup(String) is *sparsely used* and only for high-level
hierarchies (job, operator, source, kafka). It is added as is to the
logical scope, creates no label, and is *excluded from the metric
identifier*.
- addGroup(String, String) has *no effect on the logical scope*, creates
a label, and is added as <key>.<value> to the metric identifier.


The core issue with these kinds of changes, however, is backwards
compatibility. We would have to do a sweep over the code-base to migrate
inappropriate usages of addGroup(String) to the key-value pair variant,
probably remove some unnecessary groups (e.g., "Status", which is used for
CPU metrics and whatnot) and finally make changes to the metric system
internals, all of which need a codepath that retains the current behavior.


Simply put, for immediate needs I would probably encourage you to create
a modified PrometheusReporter which determines the logical scope as you
see fit; it could just ignore the logical scope entirely (although I'm
not sure how well Prometheus handles one metric having multiple instances
with different label sets, e.g., numRecordsIn for operators/tasks), or
exclude user-defined groups with something hacky like only using the
first 4 parts of the logical scope.
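As a purely illustrative sketch of that last idea (not Flink API; the public
MetricGroup interface does not expose the logical scope directly, so this works on
the scope components instead, while a modified reporter would use its own scope
handling):

import org.apache.flink.metrics.MetricGroup;

final class ScopeTruncation {

    // Build a shortened name from at most the first 4 scope components plus the
    // metric name, leaving everything else to labels (group.getAllVariables()).
    static String truncatedName(MetricGroup group, String metricName) {
        String[] scope = group.getScopeComponents();
        int keep = Math.min(4, scope.length);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < keep; i++) {
            sb.append(scope[i]).append('_');
        }
        return sb.append(metricName).toString();
    }
}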


On 6/1/2021 4:56 PM, Mason Chen wrote:
Upon further inspection, it seems like the user scope is not universal
(i.e. it comes through the connectors and not UDFs like a rich map
function), but the question of whether the process makes sense still stands.


On Jun 1, 2021, at 10:38 AM, Mason Chen > wrote:


Makes sense. We are primarily concerned with removing the metric 
labels from the names as the user metrics get too long. i.e. the 
groups from `addGroup` are concatenated in the metric name.


Do you think there would be any issues with removing the group 
information in the metric name and putting them into a label instead? 
It seems like most metrics internally don't use `addGroup` to create
group information but rather create another subclass of metric
group.


Perhaps, I should ONLY apply this custom logic to metrics with the 
“user” scope? Other scoped metrics (e.g. operator, task operator, 
etc.) shouldn’t have these group names in the metric names in my 
experience...


An example just for clarity, 
flink__group1_group2_metricName{group1=…, group2=…, 
flink tags}


=>
flink__metricName{group_info=group1_group2, group1=…, 
group2=…, flink tags}


On Jun 1, 2021, at 9:57 AM, Chesnay Schepler > wrote:


The uniqueness of metrics and the naming of the Prometheus reporter 
are somewhat related but also somewhat orthogonal.


Prometheus works similar to JMX in that the metric name (e.g., 
taskmanager.job.task.operator.numRecordsIn) is more or less a 
_class_ of metrics, with tags/labels allowing you to select a 
specific instance of that metric.


Restricting metric names to 1 level of the hierarchy would present a 
few issues:
a) Effectively, all metric names that Flink uses become
reserved keywords that users must not use, which will lead to 
headaches when adding more metrics or forwarding metrics from 
libraries (e.g., kafka), because we could always break existing 
user-defined metrics.
b) You'd need a cluster-wide lookup that is aware of all hierarchies 
to ensure consistency across all processes.


In the end, there are significantly easier ways to solve the issue 
of the metric name being too long, i.e., give the user more control 
over the logical scope (taskmanager.job.task.operator), be it 
shortening the names (t.j.t.o), limiting the depth (e.g, 
operator.numRecordsIn), removing it outright (but I'd prefer some context
to be present for clarity) or supporting something similar to scope formats.

Re: Flink Metrics Naming

2021-06-01 Thread Mason Chen
Upon further inspection, it seems like the user scope is not universal (i.e.
it comes through the connectors and not UDFs like a rich map function), but the
question of whether the process makes sense still stands.

> On Jun 1, 2021, at 10:38 AM, Mason Chen  wrote:
> 
> Makes sense. We are primarily concerned with removing the metric labels from 
> the names as the user metrics get too long. i.e. the groups from `addGroup` 
> are concatenated in the metric name.
> 
> Do you think there would be any issues with removing the group information in 
> the metric name and putting them into a label instead? It seems like most
> metrics internally don't use `addGroup` to create group information but
> rather create another subclass of metric group.
> 
> Perhaps, I should ONLY apply this custom logic to metrics with the “user” 
> scope? Other scoped metrics (e.g. operator, task operator, etc.) shouldn’t 
> have these group names in the metric names in my experience...
> 
> An example just for clarity, 
> flink__group1_group2_metricName{group1=…, group2=…, flink tags}
> 
> =>
> 
> flink__metricName{group_info=group1_group2, group1=…, group2=…, 
> flink tags}
> 
>> On Jun 1, 2021, at 9:57 AM, Chesnay Schepler > > wrote:
>> 
>> The uniqueness of metrics and the naming of the Prometheus reporter are 
>> somewhat related but also somewhat orthogonal.
>> 
>> Prometheus works similar to JMX in that the metric name (e.g., 
>> taskmanager.job.task.operator.numRecordsIn) is more or less a _class_ of 
>> metrics, with tags/labels allowing you to select a specific instance of that 
>> metric.
>> 
>> Restricting metric names to 1 level of the hierarchy would present a few 
>> issues:
>> a) Effectively, all metric names that Flink uses become reserved
>> keywords that users must not use, which will lead to headaches when adding 
>> more metrics or forwarding metrics from libraries (e.g., kafka), because we 
>> could always break existing user-defined metrics.
>> b) You'd need a cluster-wide lookup that is aware of all hierarchies to 
>> ensure consistency across all processes.
>> 
>> In the end, there are significantly easier ways to solve the issue of the 
>> metric name being too long, i.e., give the user more control over the 
>> logical scope (taskmanager.job.task.operator), be it shortening the names 
>> (t.j.t.o), limiting the depth (e.g, operator.numRecordsIn), removing it 
>> outright (but I'd prefer some context to be present for clarity) or 
>> supporting something similar to scope formats.
>> I'm reasonably certain there are some tickets already in this direction, we 
>> just don't get around to doing them because for the most part the metric 
>> system works well enough and there are bigger fish to fry.
>> 
>> On 6/1/2021 3:39 PM, Till Rohrmann wrote:
>>> Hi Mason,
>>> 
>>> The idea is that a metric is not uniquely identified by its name alone but 
>>> instead by its path. The groups in which it is defined specify this path 
>>> (similar to directories). That's why it is valid to specify two metrics 
>>> with the same name if they reside in different groups.
>>> 
>>> I think Prometheus does not support such a tree structure and that's why 
>>> the path is exposed via labels if I am not mistaken. So long story short, 
>>> what you are seeing is a combination of how Flink organizes metrics and 
>>> what can be reported to Prometheus. 
>>> 
>>> I am also pulling in Chesnay who is more familiar with this part of the 
>>> code.
>>> 
>>> Cheers,
>>> Till
>>> 
>>> On Fri, May 28, 2021 at 7:33 PM Mason Chen >> > wrote:
>>> Can anyone give insight as to why Flink allows 2 metrics with the same 
>>> “name”?
>>> 
>>> For example,
>>> 
>>> getRuntimeContext().getMetricGroup().addGroup("group", "group1").counter("myMetricName");
>>> 
>>> And
>>> 
>>> getRuntimeContext().getMetricGroup().addGroup("other_group",
>>> "other_group1").counter("myMetricName");
>>> 
>>> Are totally valid.
>>> 
>>> 
>>> It seems that it has led to some not-so-great implementations: the
>>> Prometheus reporter attaching the labels to the metric name, making the
>>> name quite verbose.
>>> 
>>> 
>> 
> 



Re: Flink Metrics Naming

2021-06-01 Thread Mason Chen
Makes sense. We are primarily concerned with removing the metric labels from 
the names as the user metrics get too long. i.e. the groups from `addGroup` are 
concatenated in the metric name.

Do you think there would be any issues with removing the group information in 
the metric name and putting them into a label instead? It seems like most
metrics internally don't use `addGroup` to create group information but rather
create another subclass of metric group.

Perhaps, I should ONLY apply this custom logic to metrics with the “user” 
scope? Other scoped metrics (e.g. operator, task operator, etc.) shouldn’t have 
these group names in the metric names in my experience...

An example just for clarity, 
flink__group1_group2_metricName{group1=…, group2=…, flink tags}

=>

flink__metricName{group_info=group1_group2, group1=…, group2=…, 
flink tags}

> On Jun 1, 2021, at 9:57 AM, Chesnay Schepler  wrote:
> 
> The uniqueness of metrics and the naming of the Prometheus reporter are 
> somewhat related but also somewhat orthogonal.
> 
> Prometheus works similar to JMX in that the metric name (e.g., 
> taskmanager.job.task.operator.numRecordsIn) is more or less a _class_ of 
> metrics, with tags/labels allowing you to select a specific instance of that 
> metric.
> 
> Restricting metric names to 1 level of the hierarchy would present a few 
> issues:
> a) Effectively, all metric names that Flink uses become reserved
> keywords that users must not use, which will lead to headaches when adding 
> more metrics or forwarding metrics from libraries (e.g., kafka), because we 
> could always break existing user-defined metrics.
> b) You'd need a cluster-wide lookup that is aware of all hierarchies to 
> ensure consistency across all processes.
> 
> In the end, there are significantly easier ways to solve the issue of the 
> metric name being too long, i.e., give the user more control over the logical 
> scope (taskmanager.job.task.operator), be it shortening the names (t.j.t.o), 
> limiting the depth (e.g, operator.numRecordsIn), removing it outright (but 
> I'd prefer some context to be present for clarity) or supporting something 
> similar to scope formats.
> I'm reasonably certain there are some tickets already in this direction, we 
> just don't get around to doing them because for the most part the metric 
> system works well enough and there are bigger fish to fry.
> 
> On 6/1/2021 3:39 PM, Till Rohrmann wrote:
>> Hi Mason,
>> 
>> The idea is that a metric is not uniquely identified by its name alone but 
>> instead by its path. The groups in which it is defined specify this path 
>> (similar to directories). That's why it is valid to specify two metrics with 
>> the same name if they reside in different groups.
>> 
>> I think Prometheus does not support such a tree structure and that's why the 
>> path is exposed via labels if I am not mistaken. So long story short, what 
>> you are seeing is a combination of how Flink organizes metrics and what can 
>> be reported to Prometheus. 
>> 
>> I am also pulling in Chesnay who is more familiar with this part of the code.
>> 
>> Cheers,
>> Till
>> 
>> On Fri, May 28, 2021 at 7:33 PM Mason Chen > > wrote:
>> Can anyone give insight as to why Flink allows 2 metrics with the same 
>> “name”?
>> 
>> For example,
>> 
>> getRuntimeContext().getMetricGroup().addGroup("group", "group1").counter("myMetricName");
>> 
>> And
>> 
>> getRuntimeContext().getMetricGroup().addGroup("other_group",
>> "other_group1").counter("myMetricName");
>> 
>> Are totally valid.
>> 
>> 
>> It seems that it has led to some not-so-great implementations: the
>> Prometheus reporter attaching the labels to the metric name, making the
>> name quite verbose.
>> 
>> 
> 



Re: Flink Metrics Naming

2021-06-01 Thread Chesnay Schepler
The uniqueness of metrics and the naming of the Prometheus reporter are 
somewhat related but also somewhat orthogonal.


Prometheus works similar to JMX in that the metric name (e.g., 
taskmanager.job.task.operator.numRecordsIn) is more or less a _class_ of 
metrics, with tags/labels allowing you to select a specific instance of 
that metric.


Restricting metric names to 1 level of the hierarchy would present a few 
issues:
a) Effectively, all metric names that Flink uses become
reserved keywords that users must not use, which will lead to headaches 
when adding more metrics or forwarding metrics from libraries (e.g., 
kafka), because we could always break existing user-defined metrics.
b) You'd need a cluster-wide lookup that is aware of all hierarchies to 
ensure consistency across all processes.


In the end, there are significantly easier ways to solve the issue of 
the metric name being too long, i.e., give the user more control over 
the logical scope (taskmanager.job.task.operator), be it shortening the 
names (t.j.t.o), limiting the depth (e.g, operator.numRecordsIn), 
removing it outright (but I'd prefer some context to be present for 
clarity) or supporting something similar to scope formats.
I'm reasonably certain there are some tickets already in this direction, 
we just don't get around to doing them because for the most part the 
metric system works well enough and there are bigger fish to fry.


On 6/1/2021 3:39 PM, Till Rohrmann wrote:

Hi Mason,

The idea is that a metric is not uniquely identified by its name alone 
but instead by its path. The groups in which it is defined specify 
this path (similar to directories). That's why it is valid to specify 
two metrics with the same name if they reside in different groups.


I think Prometheus does not support such a tree structure and that's 
why the path is exposed via labels if I am not mistaken. So long story 
short, what you are seeing is a combination of how Flink organizes 
metrics and what can be reported to Prometheus.


I am also pulling in Chesnay who is more familiar with this part of 
the code.


Cheers,
Till

On Fri, May 28, 2021 at 7:33 PM Mason Chen > wrote:


Can anyone give insight as to why Flink allows 2 metrics with the
same “name”?

For example,

getRuntimeContext().getMetricGroup().addGroup("group", "group1").counter("myMetricName");

And

getRuntimeContext().getMetricGroup().addGroup("other_group",
"other_group1").counter("myMetricName");

Are totally valid.


It seems that it has led to some not-so-great implementations: the
Prometheus reporter attaching the labels to the metric name,
making the name quite verbose.






Re: Flink Metrics Naming

2021-06-01 Thread Till Rohrmann
Hi Mason,

The idea is that a metric is not uniquely identified by its name alone but
instead by its path. The groups in which it is defined specify this path
(similar to directories). That's why it is valid to specify two metrics
with the same name if they reside in different groups.

I think Prometheus does not support such a tree structure and that's why
the path is exposed via labels if I am not mistaken. So long story short,
what you are seeing is a combination of how Flink organizes metrics and
what can be reported to Prometheus.

I am also pulling in Chesnay who is more familiar with this part of the
code.

Cheers,
Till

On Fri, May 28, 2021 at 7:33 PM Mason Chen  wrote:

> Can anyone give insight as to why Flink allows 2 metrics with the same
> “name”?
>
> For example,
>
> getRuntimeContext().getMetricGroup().addGroup("group", "group1").counter("myMetricName");
>
> And
>
> getRuntimeContext().getMetricGroup().addGroup("other_group",
> "other_group1").counter("myMetricName");
>
> Are totally valid.
>
>
> It seems that it has led to some not-so-great implementations: the
> Prometheus reporter attaching the labels to the metric name, making the
> name quite verbose.
>
>
>


Re: Flink Metrics emitted from a Kubernetes Application Cluster

2021-04-09 Thread Chesnay Schepler

This is currently not possible. See also FLINK-8358

On 4/9/2021 4:47 AM, Claude M wrote:

Hello,

I've setup Flink as an Application Cluster in Kubernetes. Now I'm 
looking into monitoring the Flink cluster in Datadog. This is what is 
configured in the flink-conf.yaml to emit metrics:


metrics.scope.jm: flink.jobmanager
metrics.scope.jm.job: flink.jobmanager.job
metrics.scope.tm: flink.taskmanager
metrics.scope.tm.job: flink.taskmanager.job
metrics.scope.task: flink.task
metrics.scope.operator: flink.operator
metrics.reporter.dghttp.class: 
org.apache.flink.metrics.datadog.DatadogHttpReporter

metrics.reporter.dghttp.apikey: {{ datadog_api_key }}
metrics.reporter.dghttp.tags: environment: {{ environment }}

When it gets to Datadog though, the metrics for the flink.jobmanager
and flink.taskmanager are filtered by the host, which is the Pod IP.
However, I would like it to use the pod name. How can this be
accomplished?



Thanks





Re: Flink Metrics

2021-03-03 Thread Piotr Nowojski
Hi,

1)
Do you want to output those metrics as Flink metrics? Or output those
"metrics"/counters as values to some external system (like Kafka)? The
problem discussed in [1] was that the metrics (Counters) were not fitting
in memory, so David suggested holding them in Flink's state and treating the
measured values as regular output of the job.

You can think of the former option as a single operator that
consumes your CDCs and outputs something (filtered CDCs? processed CDCs?) to
Kafka, while keeping some metrics that you can access via Flink's metrics
system. The latter would be the same operator, but instead of a single output
it would have multiple outputs, also writing the "counters" for example to
Kafka (or any other system of your choice). Both options are viable, each
has its own pros and cons.

2) You need to persist your metrics somewhere. Why don't you use Flink's
state for that purpose? Upon recovery/initialisation, you can get the
recovered value from state and update/set the metric value to that recovered
value.
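For example, a minimal sketch of that pattern (class, state and metric names are
illustrative, not from your job; it assumes operator state is sufficient for the
counter):

import java.util.Collections;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Gauge;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;

public class RestoredCounterMap extends RichMapFunction<String, String>
        implements CheckpointedFunction {

    private transient ListState<Long> checkpointedCount;
    private long count;

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        checkpointedCount = context.getOperatorStateStore()
                .getListState(new ListStateDescriptor<>("cdcMessageCount", Long.class));
        for (Long c : checkpointedCount.get()) {
            count += c; // restore the previously measured value after a restart
        }
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        checkpointedCount.update(Collections.singletonList(count));
    }

    @Override
    public void open(Configuration parameters) {
        // re-register the (possibly restored) value as a gauge after every (re)start
        getRuntimeContext().getMetricGroup().gauge("cdcMessages", (Gauge<Long>) () -> count);
    }

    @Override
    public String map(String value) {
        count++;
        return value;
    }
}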

3) That seems to be a question a bit unrelated to Flink. Try searching
online how to calculate percentiles. I haven't thought about it, but
histograms or sorting all of the values seems to be the options. Probably
best if you would use some existing library to do that for you.

4) Could you rephrase your question?

Best,
Piotrek

niedz., 28 lut 2021 o 14:53 Prasanna kumar 
napisał(a):

> Hi flinksters,
>
> Scenario: We have CDC messages from our RDBMS (various tables) flowing to
> Kafka. Our Flink job reads the CDC messages and creates events based on
> certain rules.
>
> I am using Prometheus and Grafana.
>
> Following are the metrics that I need to calculate:
>
> A) Number of CDC messages with respect to each table.
> B) Number of events created with respect to each event type.
> C) Average/P99/P95 latency (event created ts - cdc operation ts)
>
> For A and B, I created counters and am able to see the metrics flowing into
> Prometheus. A few questions I have here:
>
> 1) How to create labels for counters in Flink? I did not find any easy
> method to do it. Right now I see that I need to create counters for each
> type of table and event. I referred to one of the community discussions
> [1]. Is there any way apart from this?
>
> 2) When the job gets restarted, the counters go back to 0. How to
> prevent that and keep continuity?
>
> For C, I calculated the latency in code for each event and assigned it to a
> histogram. A few questions I have here:
>
> 3) I read in a few blogs [2] that a histogram is the best way to get
> latencies. Is there any better idea?
>
> 4) How to create buckets for various ranges? I also read in a community
> email that Flink implements histograms as summaries. I should also be able
> to see the latencies across timelines.
>
> [1]
> https://stackoverflow.com/questions/58456830/how-to-use-multiple-counters-in-flink
> [2] https://povilasv.me/prometheus-tracking-request-duration/
>
> Thanks,
> Prasanna.
>


Re: Flink Metrics in kubernetes

2020-05-13 Thread Averell
Hi Gary,

Sorry for the false alarm. It's caused by a bug in my deployment - no
metrics were added into the registry.
Sorry for wasting your time.

Thanks and best regards,
Averell 



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/


Re: Flink Metrics in kubernetes

2020-05-12 Thread Averell
Hi Gary,

Thanks for the help.
Below is the output from jstack; it does not seem to be blocked.



In my JobManager log, there's this WARN, I am not sure whether it's relevant
at all.


Attached is the full jstack dump: k8xDump.txt

Thanks and regards,
Averell



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/


Re: Flink Metrics in kubernetes

2020-05-12 Thread Gary Yao
Hi Averell,

If you are seeing the log message from [1] and Scheduled#report() is
not called, the thread in the "Flink-MetricRegistry" thread pool might
be blocked. You can use the jstack utility to see on which task the
thread pool is blocked.

Best,
Gary

[1] 
https://github.com/apache/flink/blob/e346215edcf2252cc60c5cef507ea77ce2ac9aca/flink-runtime/src/main/java/org/apache/flink/runtime/metrics/MetricRegistryImpl.java#L141

On Tue, May 12, 2020 at 4:32 PM Averell  wrote:
>
> Hi,
>
> I'm trying to configure Flink running natively in Kubernetes to push some metrics
> to NewRelic (using a custom ScheduledDropwizardReporter).
>
> From the logs, I could see that an instance of ScheduledDropwizardReporter
> has already been created successfully (the overridden getReporter() method
> was called).
> An instance of MetricRegistryImpl was also created successfully (this log
> was shown: "Periodically reporting metrics in intervals of 30 SECONDS for
> reporter my_newrelic_reporter").
>
> However, the report() method was not called.
>
> When running on my laptop, there's no issue at all.
> Are there any special things that I need to care for when running in
> Kubernetes?
>
> Thanks a lot.
>
> Regards,
> Averell
>
>
>
>
>
> --
> Sent from: 
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/


Re: Flink Metrics - PrometheusReporter

2020-01-22 Thread Sidney Feiner
Ok, I configured the PrometheusReporter's ports to be a range and now every
TaskManager has its own port where I can see its metrics. Thank you very much!
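For reference, the relevant part of the configuration looks roughly like this (the
reporter name "prom" and the port range are just examples):

metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
metrics.reporter.prom.port: 9250-9260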


Sidney Feiner / Data Platform Developer
M: +972.528197720 / Skype: sidney.feiner.startapp




From: Chesnay Schepler 
Sent: Wednesday, January 22, 2020 6:07 PM
To: Sidney Feiner ; flink-u...@apache.org 

Subject: Re: Flink Metrics - PrometheusReporter

Metrics are exposed via reporters by each process separately, whereas the WebUI 
aggregates metrics.

As such you have to configure Prometheus to also scrape the TaskExecutors.

On 22/01/2020 16:58, Sidney Feiner wrote:
Hey,
I've been trying to use the PrometheusReporter and when I used it locally on my
computer, I would access the port I configured and see all the metrics I've 
created.
In production, we use High Availability mode and when I try to access the 
JobManager's metrics in the port I've configured on the PrometheusReporter, I 
see some very basic metrics - default Flink metrics, but I can't see any of my 
custom metrics.

The weird thing is that I can see those metrics through Flink's UI in the Metrics tab (screenshot attached).

Does anybody have a clue why my custom metrics are configured but not being 
reported in high availability but are reported when I run the job locally 
through IntelliJ?

Thanks 



Sidney Feiner / Data Platform Developer
M: +972.528197720 / Skype: sidney.feiner.startapp





Re: Flink Metrics - PrometheusReporter

2020-01-22 Thread Chesnay Schepler
Metrics are exposed via reporters by each process separately, whereas 
the WebUI aggregates metrics.


As such you have to configure Prometheus to also scrape the TaskExecutors.
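For example, a Prometheus scrape configuration covering both process types might look
roughly like this (host names are placeholders; 9249 is the reporter's default port):

scrape_configs:
  - job_name: 'flink'
    static_configs:
      - targets: ['jobmanager:9249', 'taskmanager-1:9249', 'taskmanager-2:9249']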

On 22/01/2020 16:58, Sidney Feiner wrote:

Hey,
I've been trying to use the PrometheusReporter and when I used it
locally on my computer, I would access the port I configured and see 
all the metrics I've created.
In production, we use High Availability mode and when I try to access 
the JobManager's metrics in the port I've configured on the 
PrometheusReporter, I see some very basic metrics - default Flink 
metrics, but I can't see any of my custom metrics.


Weird thing is I can see those metrics through Flink's UI in the 
Metrics tab:



Does anybody have a clue why my custom metrics are configured but not 
being reported in high availability but are reported when I run the 
job locally through IntelliJ?


Thanks 



Sidney Feiner / Data Platform Developer
M: +972.528197720 / Skype: sidney.feiner.startapp





Re: Flink metrics reporters documentation

2019-10-10 Thread Aleksey Pak
Hi Flavio,

Below is my explanation to your question, based on anecdotal evidence:

As you may know, Flink distribution package is already scala version
specific and bundles some jar artifacts.
User Flink job is supposed to be compiled against some of those jars (with
maven's `provided` scope). For example, it can be Flink CEP library.
In such cases, jar names are usually preserved as is (so you would
reference the same artifact dependency name in your application build and
when you want to copy it from `/opt` to `/lib` folder).

Some of the jars are not supposed to be used by your application directly,
but rather as "plugins" in your Flink cluster (here I mean "plugins" in a
broader sense than the plugins mechanism used by file systems introduced
in Flink 1.9).
File systems, metrics reporters are good candidates for this. The reason
that original jar artifacts are scala version specific is rather
"incidental" (imo) - it just happens that they may depend on some core
Flink libraries that still have scala code.
In practice the implementation of those libraries is not scala dependent,
but to be strict (and safe) they are built separately for different scala
versions (what you see in the maven central).

My understanding is that one of the goals is to move Scala away from the core
libraries (to some API-level library); this should make some of the
component builds Scala-independent.
Removing the Scala version from those jar names in the distribution is probably
done with that future plan in mind (so that the user experience stays the
same).

Regards,
Aleksey


On Thu, Oct 10, 2019 at 10:59 AM Flavio Pompermaier 
wrote:

> Sorry,
> I just discovered that those jars are actually in the opt folder within
> Flink dist. However, the second point still holds: why is there a single
> influxdb jar inside Flink's opt folder while on Maven there are 2 versions
> (one for Scala 2.11 and one for Scala 2.12)?
>
> Best,
> Flavio
>
> On Thu, Oct 10, 2019 at 10:49 AM Flavio Pompermaier 
> wrote:
>
>> Hi to all,
>> I was trying to configure monitoring on my cluster so I went to the
>> metric reporters documentation.
>> There are 2 things that are not clear to me:
>>
>>1. In all reporters the documentation says to take the jars from the /opt
>>folder.. obviously this is not true. Wouldn't it be better to provide a link
>>to the jar directly (on Maven Central for example)?
>>2. If you look at the influxdb dependency, the documentation says to use
>>flink-metrics-influxdb-1.9.0.jar but there's no such "unified" jar; on
>>Maven Central there are two versions: one for Scala 2.11 and one for Scala
>>2.12
>>
>> Should I open 2 JIRA tickets to improve those 2 aspects (if I'm not
>> wrong..)?
>>
>> Best,
>> Flavio
>>
>
>


Re: Flink metrics reporters documentation

2019-10-10 Thread Flavio Pompermaier
Sorry,
I just discovered that those jars are actually in the opt folder within
Flink dist. However, the second point still holds: why is there a single
influxdb jar inside Flink's opt folder while on Maven there are 2 versions
(one for Scala 2.11 and one for Scala 2.12)?

Best,
Flavio

On Thu, Oct 10, 2019 at 10:49 AM Flavio Pompermaier 
wrote:

> Hi to all,
> I was trying to configure monitoring on my cluster so I went to the metric
> reporters documentation.
> There are 2 things that are not clear to me:
>
>1. In all reporters the documentation says to take the jars from the /opt
>folder.. obviously this is not true. Wouldn't it be better to provide a link to
>the jar directly (on Maven Central for example)?
>2. If you look at the influxdb dependency, the documentation says to use
>flink-metrics-influxdb-1.9.0.jar but there's no such "unified" jar; on
>Maven Central there are two versions: one for Scala 2.11 and one for Scala 2.12
>
> Should I open 2 JIRA tickets to improve those 2 aspects (if I'm not
> wrong..)?
>
> Best,
> Flavio
>


Re: Flink metrics scope for YARN single job

2019-08-15 Thread Vasily Melnik
Hi Biao!

>  Do you mean "distinguish metrics from different JobManager running on
same host"?
Exactly.

>Will give you a feedback if there is a conclusion.
Thanks!



On Thu, 15 Aug 2019 at 06:40, Biao Liu  wrote:

> Hi Vasily,
>
> > Is there any way to distinguish logs from different JobManager running
> on same host?
>
> Do you mean "distinguish metrics from different JobManager running on
> same host"?
> I guess there is no other variable you could use for now.
>
> But I think it's reasonable to support this requirement. I would like to
> discuss with the devs to hear their opinions. Will give you a feedback if
> there is a conclusion.
>
> Thanks,
> Biao /'bɪ.aʊ/
>
>
>
> On Wed, 14 Aug 2019 at 19:46, Vasily Melnik <
> vasily.mel...@glowbyteconsulting.com> wrote:
>
>> Hi,
>> I want to run Flink apps on YARN in single job mode and keep metrics in
>> Graphite. But as I see, the only variable I can use for JobManager scope
>> customization is:
>> https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/metrics.html#list-of-all-variables
>>
>> Is there any way to distinguish logs from different JobManager running on
>> same host?
>>
>>
>> Thanks in advance.
>>
>


Re: Flink metrics scope for YARN single job

2019-08-14 Thread Biao Liu
Hi Vasily,

> Is there any way to distinguish logs from different JobManager running on
same host?

Do you mean "distinguish metrics from different JobManager running on same
host"?
I guess there is no other variable you could use for now.

But I think it's reasonable to support this requirement. I would like to
discuss with the devs to hear their opinions. Will give you a feedback if
there is a conclusion.

Thanks,
Biao /'bɪ.aʊ/



On Wed, 14 Aug 2019 at 19:46, Vasily Melnik <
vasily.mel...@glowbyteconsulting.com> wrote:

> Hi,
> I want to run Flink apps on YARN in single job mode and keep metrics in
> Graphite. But as i see, the only variable i can use for JobManager scope
> customization is :
> https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/metrics.html#list-of-all-variables
>
> Is there any way to distinguish logs from different JobManager running on
> same host?
>
>
> Thanks in advance.
>


Re: flink metrics Reporter question

2019-05-15 Thread Xintong Song
Taking only the first part of the hostname is meant to stay consistent with how HDFS does it; you can refer to the issue from back then, where the author specifically explains why it was done this way.
https://issues.apache.org/jira/browse/FLINK-1170?focusedCommentId=14175285=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14175285

Thank you~

Xintong Song



On Wed, May 15, 2019 at 9:11 PM Yun Tang  wrote:

> Hi 嘉诚
>
> I'm not sure which Flink version you are using, but the logic of showing only the first part of the
> host-name has always been there, because in most scenarios the first part of the host-name is
> enough to identify it. For the implementation, see [1] and [2].
>
> Inspired by your report, I created a JIRA [3] to track this issue; the fix is to provide a metrics
> option so that the full hostname can be shown in your scenario.
>
> Best,
> Yun Tang
>
>
> [1]
> https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskManagerRunner.java#L365
> [2]
> https://github.com/apache/flink/blob/505b54c182867ccac5d1724d72f4085425ac08a8/flink-core/src/main/java/org/apache/flink/util/NetUtils.java#L59
> [3] https://issues.apache.org/jira/browse/FLINK-12520
> 
> From: 戴嘉诚 
> Sent: Wednesday, May 15, 2019 20:24
> To: user-zh@flink.apache.org
> Subject: flink metrics Reporter question
>
> Hi all,
> Following the official documentation, I set up the flink metrics reporter and loaded the
> Slf4jReporter. The reporter runs fine, but I found a problem: when the TaskManager prints
> its metrics, the output is:
> ambari.taskmanager.container_e31_1557826320302_0005_01_02.Status.JVM.ClassLoader.ClassesLoaded:
> 12044
> Judging from the source code, the scope format here should be <host>.taskmanager.<tm_id>.<metrics>: <value>
>
>
> But here is the problem: the host shown is "ambari", while the machine name configured on my server is the full "ambari.host12.yy". Everything after the first part of the host is dropped, so I cannot tell which machine this record comes from.
>
> Meanwhile, in the JobManager logs I can see the full machine name, for example:
> ambari.host02.yy.jobmanager.Status.JVM.Memory.NonHeap.Max: -1
>
> How should I proceed if I want to get the full machine name in the TaskManager's reporter? Is this a bug, or am I using it incorrectly?
>


Re: flink metrics Reporter question

2019-05-15 Thread Yun Tang
Hi 嘉诚

I'm not sure which Flink version you are using, but the logic of showing only the first part of the
host-name has always been there, because in most scenarios the first part of the host-name is enough to identify it. For the implementation, see [1] and [2].

Inspired by your report, I created a JIRA [3] to track this issue; the fix is to provide a metrics
option so that the full hostname can be shown in your scenario.

Best,
Yun Tang


[1] 
https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskManagerRunner.java#L365
[2] 
https://github.com/apache/flink/blob/505b54c182867ccac5d1724d72f4085425ac08a8/flink-core/src/main/java/org/apache/flink/util/NetUtils.java#L59
[3] https://issues.apache.org/jira/browse/FLINK-12520

From: 戴嘉诚 
Sent: Wednesday, May 15, 2019 20:24
To: user-zh@flink.apache.org
Subject: flink metrics Reporter question

Hi all,
Following the official documentation, I set up the flink metrics reporter and loaded the
Slf4jReporter. The reporter runs fine, but I found a problem: when the TaskManager prints
its metrics, the output is:
ambari.taskmanager.container_e31_1557826320302_0005_01_02.Status.JVM.ClassLoader.ClassesLoaded:
 12044
Judging from the source code, the scope format here should be <host>.taskmanager.<tm_id>.<metrics>: <value>

But here is the problem: the host shown is "ambari", while the machine name configured on my server is the full "ambari.host12.yy". Everything after the first part of the host is dropped, so I cannot tell which machine this record comes from.

Meanwhile, in the JobManager logs I can see the full machine name, for example:
ambari.host02.yy.jobmanager.Status.JVM.Memory.NonHeap.Max: -1

How should I proceed if I want to get the full machine name in the TaskManager's reporter? Is this a bug, or am I using it incorrectly?


Re: Flink Metrics

2019-04-18 Thread Zhu Zhu
Hi Brian,

You can implement a new org.apache.flink.metrics.reporter.MetricReporter as
you like and register it with Flink in the Flink configuration.

e.g.

metrics.reporters: my_reporter
metrics.reporter.my_reporter.class: xxx
metrics.reporter.my_reporter.config1: yyy
metrics.reporter.my_reporter.config2: zzz
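For example, a minimal Scheduled reporter that periodically writes the collected values
(which would include the TaskIOMetricGroup counters such as numRecordsIn/numRecordsOut)
to the log could look roughly like this; the class name and the logging choice are
illustrative, not an existing Flink reporter:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.flink.metrics.Counter;
import org.apache.flink.metrics.Gauge;
import org.apache.flink.metrics.Metric;
import org.apache.flink.metrics.MetricConfig;
import org.apache.flink.metrics.MetricGroup;
import org.apache.flink.metrics.reporter.MetricReporter;
import org.apache.flink.metrics.reporter.Scheduled;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LoggingReporter implements MetricReporter, Scheduled {

    private static final Logger LOG = LoggerFactory.getLogger(LoggingReporter.class);

    private final Map<String, Metric> metrics = new ConcurrentHashMap<>();

    @Override
    public void open(MetricConfig config) {}

    @Override
    public void close() {}

    @Override
    public void notifyOfAddedMetric(Metric metric, String metricName, MetricGroup group) {
        metrics.put(group.getMetricIdentifier(metricName), metric);
    }

    @Override
    public void notifyOfRemovedMetric(Metric metric, String metricName, MetricGroup group) {
        metrics.remove(group.getMetricIdentifier(metricName));
    }

    @Override
    public void report() {
        // called by Flink at the configured interval (metrics.reporter.<name>.interval)
        for (Map.Entry<String, Metric> e : metrics.entrySet()) {
            Metric m = e.getValue();
            if (m instanceof Counter) {
                LOG.info("{} = {}", e.getKey(), ((Counter) m).getCount());
            } else if (m instanceof Gauge) {
                LOG.info("{} = {}", e.getKey(), ((Gauge<?>) m).getValue());
            }
        }
    }
}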


Thanks,
Zhu



Brian Ramprasad  于2019年4月18日周四 下午2:37写道:

> Hi,
>
> I am trying to profile my Flink job.  For example I want to output the
> results of the TaskIOMetricGroup to a log file. Does anyone know if there
> is a way to access this object at runtime and execute the methods to get
> the data from within my user code that I submit to the Flink to start a job?
>
>
>
> Thanks
> Brian R


Re: Flink metrics missing from UI 1.7.2

2019-03-23 Thread Padarn Wilson
Aha! This is almost certainly it. I remembered thinking something like this
might be a problem. I'll need to change the deployment a bit to add this
(not straightforward to edit the YAML in my case), but thanks!

On Sun, Mar 24, 2019 at 10:01 AM dawid <
apache-flink-user-mailing-list-arch...@davidhaglund.se> wrote:

> Padarn Wilson-2 wrote
> > I am running Flink 1.7.2 on Kubernetes in a setup with task manager and
> job
> > manager separate.
> >
> > I'm having trouble seeing the metrics from my Flink job in the UI
> > dashboard. Actually I'm using the Datadog reporter to expose most of my
> > metrics, but latency tracking does not seem to be exported.
> >
> > Is there anything extra that needs to be enabled to make sure metrics are
> > exported and viewable to the Flink UI?
>
> With Flink 1.7 on Kubernetes you need to make sure the task managers are
> registering to the job manager with their IP addresses and not the
> hostnames, see the taskmanager-deployment.yaml manifest in [1], with the
> K8S_POD_IP environment variable and setting
> -Dtaskmanager.host=$(K8S_POD_IP).
>
> [1]
>
> https://ci.apache.org/projects/flink/flink-docs-release-1.7/ops/deployment/kubernetes.html#appendix
>
> /David
>
>
>
> --
> Sent from:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
>


Re: Flink metrics missing from UI 1.7.2

2019-03-23 Thread dawid
Padarn Wilson-2 wrote
> I am running Flink 1.7.2 on Kubernetes in a setup with task manager and job
> manager separate.
> 
> I'm having trouble seeing the metrics from my Flink job in the UI
> dashboard. Actually I'm using the Datadog reporter to expose most of my
> metrics, but latency tracking does not seem to be exported.
> 
> Is there anything extra that needs to be enabled to make sure metrics are
> exported and viewable to the Flink UI?

With Flink 1.7 on Kubernetes you need to make sure the task managers are
registering to the job manager with their IP addresses and not the
hostnames, see the taskmanager-deployment.yaml manifest in [1], with the
K8S_POD_IP environment variable and setting
-Dtaskmanager.host=$(K8S_POD_IP). 

[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.7/ops/deployment/kubernetes.html#appendix

/David



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/


Re: Flink metrics missing from UI 1.7.2

2019-03-23 Thread Padarn Wilson
Thanks David. I cannot see the metrics there, so let me play around a bit
more and make sure they are enabled correctly.

On Sat, Mar 23, 2019 at 9:19 PM David Anderson  wrote:

> > I have done this (actually I do it in my flink-conf.yaml), but I am not
> seeing any metrics at all in the Flink UI,
> > let alone the latency tracking. The latency tracking itself does not
> seem to be exported to datadog (should it be?)
>
> The latency metrics are job metrics, and are not shown in the Flink UI.
> They are available via the REST API, and I believe they should also be
> exported to datadog. You will find them at
>
> http://localhost:8081/jobs/<job-id>/metrics
>
> with IDs like
>
>
> latency.source_id.bc764cd8ddf7a0cff126f51c16239658.operator_id.ea632d67b7d595e5b851708ae9ad79d6.operator_subtask_index.0.latency_p90
>
> On Sat, Mar 23, 2019 at 1:53 PM Padarn Wilson  wrote:
>
>> Thanks David.
>>
>> I have done this (actually I do it in my flink-conf.yaml), but I am not
>> seeing any metrics at all in the Flink UI, let alone the latency tracking.
>> The latency tracking itself does not seem to be exported to datadog (should
>> it be?)
>>
>>
>>
>> On Sat, Mar 23, 2019 at 8:43 PM David Anderson 
>> wrote:
>>
>>> Because latency tracking is expensive, it is turned off by default. You
>>> turn it on by setting the interval; that looks something like this:
>>>
>>> env.getConfig().setLatencyTrackingInterval(1000);
>>>
>>> The full set of configuration options is described in the docs:
>>> https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#metrics
>>>
>>> David Anderson | Training Coordinator
>>> Follow us @VervericaData
>>>
>>> --
>>> Join Flink Forward - The Apache Flink Conference
>>> Stream Processing | Event Driven | Real Time
>>>
>>>
>>> On Sat, Mar 23, 2019 at 1:03 PM Padarn Wilson  wrote:
>>> >
>>> > Hi User,
>>> >
>>> > I am running Flink 1.7.2 on Kubernetes in a setup with task manager and
>>> job manager separate.
>>> >
>>> > I'm having trouble seeing the metrics from my Flink job in the UI
>>> dashboard. Actually I'm using the Datadog reporter to expose most of my
>>> metrics, but latency tracking does not seem to be exported.
>>> >
>>> > Is there anything extra that needs to be enabled to make sure metrics
>>> are exported and viewable to the Flink UI?
>>> >
>>> > Thanks
>>>



Re: Flink metrics missing from UI 1.7.2

2019-03-23 Thread David Anderson
> I have done this (actually I do it in my flink-conf.yaml), but I am not
seeing any metrics at all in the Flink UI,
> let alone the latency tracking. The latency tracking itself does not seem
to be exported to datadog (should it be?)

The latency metrics are job metrics, and are not shown in the Flink UI.
They are available via the REST API, and I believe they should also be
exported to datadog. You will find them at

http://localhost:8081/jobs/<job-id>/metrics

with IDs like


latency.source_id.bc764cd8ddf7a0cff126f51c16239658.operator_id.ea632d67b7d595e5b851708ae9ad79d6.operator_subtask_index.0.latency_p90

On Sat, Mar 23, 2019 at 1:53 PM Padarn Wilson  wrote:

> Thanks David.
>
> I have done this (actually I do it in my flink-conf.yaml), but I am not
> seeing any metrics at all in the Flink UI, let alone the latency tracking.
> The latency tracking itself does not seem to be exported to datadog (should
> it be?)
>
>
>
> On Sat, Mar 23, 2019 at 8:43 PM David Anderson 
> wrote:
>
>> Because latency tracking is expensive, it is turned off by default. You
>> turn it on by setting the interval; that looks something like this:
>>
>> env.getConfig().setLatencyTrackingInterval(1000);
>>
>> The full set of configuration options is described in the docs:
>> https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#metrics
>>
>> David Anderson | Training Coordinator
>> Follow us @VervericaData
>>
>> --
>> Join Flink Forward - The Apache Flink Conference
>> Stream Processing | Event Driven | Real Time
>>
>>
>> On Sat, Mar 23, 2019 at 1:03 PM Padarn Wilson  wrote:
>> >
>> > Hi User,
>> >
>> > I am running Flink 1.7.2 on Kubernetes in a setup with task manager and
>> job manager separate.
>> >
>> > I'm having trouble seeing the metrics from my Flink job in the UI
>> dashboard. Actually I'm using the Datadog reporter to expose most of my
>> metrics, but latency tracking does not seem to be exported.
>> >
>> > Is there anything extra that needs to be enabled to make sure metrics
>> are exported and viewable to the Flink UI?
>> >
>> > Thanks
>>
>>>


Re: Flink metrics missing from UI 1.7.2

2019-03-23 Thread David Anderson
Because latency tracking is expensive, it is turned off by default. You
turn it on by setting the interval; that looks something like this:

env.getConfig().setLatencyTrackingInterval(1000);
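For reference, the equivalent setting in flink-conf.yaml (assuming Flink 1.7+) should be:

metrics.latency.interval: 1000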

The full set of configuration options is described in the docs:
https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#metrics

David Anderson | Training Coordinator
Follow us @VervericaData

--
Join Flink Forward - The Apache Flink Conference
Stream Processing | Event Driven | Real Time


On Sat, Mar 23, 2019 at 1:03 PM Padarn Wilson  wrote:
>
> Hi User,
>
> I am running Flink 1.7.2 on Kubernetes in a setup with task manager and
job manager separate.
>
> I'm having trouble seeing the metrics from my Flink job in the UI
dashboard. Actually I'm using the Datadog reporter to expose most of my
metrics, but latency tracking does not seem to be exported.
>
> Is there anything extra that needs to be enabled to make sure metrics are
exported and viewable to the Flink UI?
>
> Thanks

>


Re: Flink metrics in kubernetes deployment

2018-12-18 Thread Chesnay Schepler
If you're working with 1.7/master you're probably running into 
https://issues.apache.org/jira/browse/FLINK-11127 .


On 17.12.2018 18:12, eric hoffmann wrote:

Hi,
In a Kubernetes deployment, I'm not able to display metrics in the dashboard. I
tried to expose and fix the metrics.internal.query-service.port variable,
but nothing. Do you have any ideas?
Thx
Eric






Re: Flink metrics related problems/questions

2017-05-22 Thread Aljoscha Krettek
Ah ok, the onTimer() and processElement() methods are all protected by 
synchronized blocks on the same lock. So that shouldn’t be a problem.

> On 22. May 2017, at 15:08, Chesnay Schepler  wrote:
> 
> Yes, that could cause the observed issue.
> 
> The default implementations are not thread-safe; if you do concurrent writes 
> they may be lost/overwritten.
> You will have to either guard accesses to that metric with a synchronized 
> block or implement your own thread-safe counter.
> 
> On 22.05.2017 14:17, Aljoscha Krettek wrote:
>> @Chesnay With timers it will happen that onTimer() is called from a 
>> different Thread than the Thread that is calling processElement(). If Metrics
>> updates happen in both, would that be a problem?
>> 
>>> On 19. May 2017, at 11:57, Chesnay Schepler  wrote:
>>> 
>>> 2. isn't quite accurate actually; metrics on the TaskManager are not 
>>> persisted across restarts.
>>> 
>>> On 19.05.2017 11:21, Chesnay Schepler wrote:
 1. This shouldn't happen. Do you access the counter from different threads?
 
 2. Metrics in general are not persisted across restarts, and there is no 
 way to configure flink to do so at the moment.
 
 3. Counters are sent as gauges since as far as I know StatsD counters are 
 not allowed to be decremented.
 
 On 19.05.2017 08:56, jaxbihani wrote:
> Background: We are running a job using ProcessFunction which reads data from
> Kafka, fires ~5-10K timers per second, and sends matched events to a KafkaSink.
> We are collecting metrics for the number of active timers, number of timers
> scheduled, etc. We use the StatsD reporter, monitor with a Grafana dashboard,
> and use RocksDBStateBackend backed by HDFS as the state backend.
> 
> Observations/Problems:
> 1. *Counter value suddenly got reset:*  While job was running fine, on one
> fine moment, metric of a monotonically increasing counter (Counter where 
> we
> just used inc() operation) suddenly became 0 and then resumed from there
> onwards. Only exception in the logs were related to transient connectivity
> issues to datanodes. Also there was no other indicator of any failure
> observed after inspecting system metrics/checkpoint metrics.  It happened
> just once across multiple runs of a same job.
> 2. *Counters not retained during flink restart with savepoint*: Cancelled
> job with -s option taking savepoint and then restarted the job using the
> savepoint.  After restart metrics started from 0. I was expecting metric
> value of a given operator would also be part of state.
> 3. *Counter metrics getting sent as Gauge*: Using tcpdump I was inspecting
> the format in which metric are sent to statsd. I observed that even the
> metric which in my code were counters, were sent as gauges. I didn't get 
> why
> that was so.
> 
> Can anyone please add more insights into why above mentioned behaviors 
> would
> have happened?
> Also does flink store metric values as a part of state for stateful
> operators? Is there any way to configure that?
> 
> 
> 
> 
> -- 
> View this message in context: 
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-metrics-related-problems-questions-tp13218.html
> Sent from the Apache Flink User Mailing List archive. mailing list 
> archive at Nabble.com.
> 
 
>> 
> 



Re: Flink metrics related problems/questions

2017-05-22 Thread Chesnay Schepler

Yes, that could cause the observed issue.

The default implementations are not thread-safe; if you do concurrent 
writes they may be lost/overwritten.
You will have to either guard accesses to that metric with a 
synchronized block or implement your own thread-safe counter.


On 22.05.2017 14:17, Aljoscha Krettek wrote:

@Chesnay With timers it will happen that onTimer() is called from a different 
Thread than the Thread that is calling processElement(). If Metrics updates
happen in both, would that be a problem?


On 19. May 2017, at 11:57, Chesnay Schepler  wrote:

2. isn't quite accurate actually; metrics on the TaskManager are not persisted 
across restarts.

On 19.05.2017 11:21, Chesnay Schepler wrote:

1. This shouldn't happen. Do you access the counter from different threads?

2. Metrics in general are not persisted across restarts, and there is no way to 
configure flink to do so at the moment.

3. Counters are sent as gauges since as far as I know StatsD counters are not 
allowed to be decremented.

On 19.05.2017 08:56, jaxbihani wrote:

Background: We are running a job using ProcessFunction which reads data from
Kafka, fires ~5-10K timers per second, and sends matched events to a KafkaSink.
We are collecting metrics for the number of active timers, number of timers
scheduled, etc. We use the StatsD reporter, monitor with a Grafana dashboard,
and use RocksDBStateBackend backed by HDFS as the state backend.

Observations/Problems:
1. *Counter value suddenly got reset:*  While job was running fine, on one
fine moment, metric of a monotonically increasing counter (Counter where we
just used inc() operation) suddenly became 0 and then resumed from there
onwards. Only exception in the logs were related to transient connectivity
issues to datanodes. Also there was no other indicator of any failure
observed after inspecting system metrics/checkpoint metrics.  It happened
just once across multiple runs of a same job.
2. *Counters not retained during flink restart with savepoint*: Cancelled
job with -s option taking savepoint and then restarted the job using the
savepoint.  After restart metrics started from 0. I was expecting metric
value of a given operator would also be part of state.
3. *Counter metrics getting sent as Gauge*: Using tcpdump I was inspecting
the format in which metric are sent to statsd. I observed that even the
metric which in my code were counters, were sent as gauges. I didn't get why
that was so.

Can anyone please add more insights into why above mentioned behaviors would
have happened?
Also does flink store metric values as a part of state for stateful
operators? Is there any way to configure that?




--
View this message in context: 
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-metrics-related-problems-questions-tp13218.html
Sent from the Apache Flink User Mailing List archive. mailing list archive at 
Nabble.com.









Re: Flink metrics related problems/questions

2017-05-22 Thread Aljoscha Krettek
@Chesnay With timers it will happen that onTimer() is called from a different 
Thread than the Thread that is calling processElement(). If Metrics updates
happen in both, would that be a problem?

> On 19. May 2017, at 11:57, Chesnay Schepler  wrote:
> 
> 2. isn't quite accurate actually; metrics on the TaskManager are not 
> persisted across restarts.
> 
> On 19.05.2017 11:21, Chesnay Schepler wrote:
>> 1. This shouldn't happen. Do you access the counter from different threads?
>> 
>> 2. Metrics in general are not persisted across restarts, and there is no way 
>> to configure flink to do so at the moment.
>> 
>> 3. Counters are sent as gauges since as far as I know StatsD counters are 
>> not allowed to be decremented.
>> 
>> On 19.05.2017 08:56, jaxbihani wrote:
>>> Background: We are running a job using ProcessFunction which reads data from
>>> Kafka, fires ~5-10K timers per second, and sends matched events to a KafkaSink.
>>> We are collecting metrics for the number of active timers, number of timers
>>> scheduled, etc. We use the StatsD reporter, monitor with a Grafana dashboard,
>>> and use RocksDBStateBackend backed by HDFS as the state backend.
>>> 
>>> Observations/Problems:
>>> 1. *Counter value suddenly got reset:*  While job was running fine, on one
>>> fine moment, metric of a monotonically increasing counter (Counter where we
>>> just used inc() operation) suddenly became 0 and then resumed from there
>>> onwards. Only exception in the logs were related to transient connectivity
>>> issues to datanodes. Also there was no other indicator of any failure
>>> observed after inspecting system metrics/checkpoint metrics.  It happened
>>> just once across multiple runs of a same job.
>>> 2. *Counters not retained during flink restart with savepoint*: Cancelled
>>> job with -s option taking savepoint and then restarted the job using the
>>> savepoint.  After restart metrics started from 0. I was expecting metric
>>> value of a given operator would also be part of state.
>>> 3. *Counter metrics getting sent as Gauge*: Using tcpdump I was inspecting
>>> the format in which metric are sent to statsd. I observed that even the
>>> metric which in my code were counters, were sent as gauges. I didn't get why
>>> that was so.
>>> 
>>> Can anyone please add more insights into why above mentioned behaviors would
>>> have happened?
>>> Also does flink store metric values as a part of state for stateful
>>> operators? Is there any way to configure that?
>>> 
>>> 
>>> 
>>> 
>>> -- 
>>> View this message in context: 
>>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-metrics-related-problems-questions-tp13218.html
>>> Sent from the Apache Flink User Mailing List archive. mailing list archive 
>>> at Nabble.com.
>>> 
>> 
>> 
> 



Re: Flink metrics related problems/questions

2017-05-19 Thread Chesnay Schepler
2. isn't quite accurate actually; metrics on the TaskManager are not 
persisted across restarts.


On 19.05.2017 11:21, Chesnay Schepler wrote:
1. This shouldn't happen. Do you access the counter from different 
threads?


2. Metrics in general are not persisted across restarts, and there is 
no way to configure flink to do so at the moment.


3. Counters are sent as gauges since as far as I know StatsD counters 
are not allowed to be decremented.


On 19.05.2017 08:56, jaxbihani wrote:
Background: We are running a job using ProcessFunction which reads data from
Kafka, fires ~5-10K timers per second, and sends matched events to a KafkaSink.
We are collecting metrics for the number of active timers, number of timers
scheduled, etc. We use the StatsD reporter, monitor with a Grafana dashboard,
and use RocksDBStateBackend backed by HDFS as the state backend.

Observations/Problems:
1. *Counter value suddenly got reset:* While the job was running fine, at one
fine moment the metric of a monotonically increasing counter (a Counter where we
just used the inc() operation) suddenly became 0 and then resumed from there
onwards. The only exceptions in the logs were related to transient connectivity
issues to datanodes. Also there was no other indicator of any failure
observed after inspecting system metrics/checkpoint metrics. It happened
just once across multiple runs of the same job.
2. *Counters not retained during flink restart with savepoint*: Cancelled the
job with the -s option taking a savepoint and then restarted the job using the
savepoint. After the restart, metrics started from 0. I was expecting the metric
value of a given operator to also be part of the state.
3. *Counter metrics getting sent as Gauge*: Using tcpdump I was inspecting
the format in which metrics are sent to StatsD. I observed that even the
metrics which in my code were counters were sent as gauges. I didn't get why
that was so.

Can anyone please add more insights into why the above mentioned behaviors
would have happened?
Also, does Flink store metric values as part of the state for stateful
operators? Is there any way to configure that?




--
View this message in context: 
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-metrics-related-problems-questions-tp13218.html
Sent from the Apache Flink User Mailing List archive. mailing list 
archive at Nabble.com.









Re: Flink metrics related problems/questions

2017-05-19 Thread Chesnay Schepler

1. This shouldn't happen. Do you access the counter from different threads?

2. Metrics in general are not persisted across restarts, and there is no 
way to configure flink to do so at the moment.


3. Counters are sent as gauges since as far as I know StatsD counters 
are not allowed to be decremented.
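
If the counter value needs to survive a restore from a savepoint (point 2), a
possible workaround on the user side is to keep the count in checkpointed
operator state and expose it to the metric system through a gauge. A minimal
sketch, assuming the CheckpointedFunction and metric group APIs of current
Flink versions (the class, state and metric names below are only illustrative):

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Gauge;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;

// The running count lives in operator state, so it is restored together with a
// checkpoint/savepoint; the metric system only samples it through a gauge.
public class RestorableCountingMap extends RichMapFunction<String, String>
        implements CheckpointedFunction {

    private transient ListState<Long> checkpointedCount;
    private long count; // plain field is fine for a sketch; reporters only sample it

    @Override
    public void open(Configuration parameters) {
        getRuntimeContext().getMetricGroup()
                .gauge("eventsSeen", (Gauge<Long>) () -> count);
    }

    @Override
    public String map(String value) {
        count++;
        return value;
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        checkpointedCount.clear();
        checkpointedCount.add(count);
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        checkpointedCount = context.getOperatorStateStore().getListState(
                new ListStateDescriptor<>("events-seen", Long.class));
        if (context.isRestored()) {
            for (Long c : checkpointedCount.get()) {
                count += c;
            }
        }
    }
}

Since the StatsD reporter sends counters as gauges anyway (point 3), nothing is
lost by exposing the restored value as a gauge.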


On 19.05.2017 08:56, jaxbihani wrote:

Background: We are running a job using a ProcessFunction which reads data from
Kafka, fires ~5-10K timers per second and sends matched events to a KafkaSink.
We are collecting metrics for the number of active timers, the number of timers
scheduled, etc. We use the statsd reporter, monitor using a Grafana dashboard,
and use a RocksDBStateBackend backed by HDFS as state.

Observations/Problems:
1. *Counter value suddenly got reset:*  While the job was running fine, at one
point the metric of a monotonically increasing counter (a Counter where we just
used the inc() operation) suddenly became 0 and then resumed from there
onwards. The only exceptions in the logs were related to transient connectivity
issues to datanodes. There was also no other indicator of any failure after
inspecting system metrics/checkpoint metrics. It happened just once across
multiple runs of the same job.
2. *Counters not retained during flink restart with savepoint*: Cancelled the
job with the -s option, taking a savepoint, and then restarted the job using
the savepoint. After the restart, metrics started from 0. I was expecting the
metric value of a given operator to also be part of state.
3. *Counter metrics getting sent as Gauge*: Using tcpdump I was inspecting the
format in which metrics are sent to statsd. I observed that even the metrics
which were counters in my code were sent as gauges. I didn't get why that was
so.

Can anyone please add more insights into why the above mentioned behaviors
would have happened?
Also, does flink store metric values as part of the state for stateful
operators? Is there any way to configure that?




--
View this message in context: 
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-metrics-related-problems-questions-tp13218.html
Sent from the Apache Flink User Mailing List archive. mailing list archive at 
Nabble.com.





Re: Flink Metrics - InfluxDB + Grafana | Help with query influxDB query for Grafana to plot 'numRecordsIn' & 'numRecordsOut' for each operator/operation

2016-11-03 Thread Philipp Bussche
Hi there,
I am using Graphite, and querying it in Grafana is super easy. You just
select fields and they come up automatically for you to select from,
depending on what your metric structure in Graphite looks like. You can also
use wildcards.
The only thing I had to do, because I am also using containers to run my
Flink components, was to define rather static names for the jobmanager and
task managers so that I wouldn't get too many new entities in my graphs when
I restart my task manager containers in particular.
Thanks
Philipp



--
View this message in context: 
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-Metrics-InfluxDB-Grafana-Help-with-query-influxDB-query-for-Grafana-to-plot-numRecordsIn-numRen-tp9775p9847.html
Sent from the Apache Flink User Mailing List archive. mailing list archive at 
Nabble.com.


Re: Flink Metrics - InfluxDB + Grafana | Help with query influxDB query for Grafana to plot 'numRecordsIn' & 'numRecordsOut' for each operator/operation

2016-11-02 Thread Anchit Jatana
Hi Jamie,

Thanks for sharing your thoughts. I'll try and integrate with Graphite to
see if this gets resolved.

Regards,
Anchit



--
View this message in context: 
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-Metrics-InfluxDB-Grafana-Help-with-query-influxDB-query-for-Grafana-to-plot-numRecordsIn-numRen-tp9775p9838.html
Sent from the Apache Flink User Mailing List archive. mailing list archive at 
Nabble.com.


Re: Flink Metrics - InfluxDB + Grafana | Help with query influxDB query for Grafana to plot 'numRecordsIn' & 'numRecordsOut' for each operator/operation

2016-11-02 Thread Jamie Grier
Hi Anchit,

That last bit is very interesting - the fact that it works fine with
subtasks <= 30.  It could be that either Influx or Grafana are not able to
keep up with the data being produced.  I would guess that the culprit is
Grafana if looking at any particular subtask index works fine and only the
full aggregation shows issues.  I'm not familiar enough with Grafana to
know which parts of the queries are "pushed down" to the database and which
are done in Grafana.  This might also vary by backend database.

Anecdotally, I've also seen scenarios using Grafana and Influx together
where the system seems to get overwhelmed fairly easily..  I suspect the
Graphite/Grafana combo would work a lot better in production setups.

This might be relevant:

https://github.com/grafana/grafana/issues/2634

-Jamie



On Tue, Nov 1, 2016 at 5:48 PM, Anchit Jatana 
wrote:

> I've set the metric reporting frequency for InfluxDB to 10s. In the
> screenshot, I'm using a Grafana query interval of 1s. I've tried 10s and more
> too; the graph shape changes a bit, but the incorrect negative values are
> still plotted (it makes no difference).
>
> Something to add: if the subtasks are less than or equal to 30, the same query
> yields correct results. For subtask index > 30 (in my case 50) it plots junk
> negative and positive values.
>
> Regards,
> Anchit
>
>
>
> --
> View this message in context: http://apache-flink-user-
> mailing-list-archive.2336050.n4.nabble.com/Flink-Metrics-
> InfluxDB-Grafana-Help-with-query-influxDB-query-for-
> Grafana-to-plot-numRecordsIn-numRen-tp9775p9819.html
> Sent from the Apache Flink User Mailing List archive. mailing list archive
> at Nabble.com.
>



-- 

Jamie Grier
data Artisans, Director of Applications Engineering
@jamiegrier 
ja...@data-artisans.com


Re: Flink Metrics - InfluxDB + Grafana | Help with query influxDB query for Grafana to plot 'numRecordsIn' & 'numRecordsOut' for each operator/operation

2016-11-01 Thread Anchit Jatana
I've set the metric reporting frequency for InfluxDB to 10s. In the
screenshot, I'm using a Grafana query interval of 1s. I've tried 10s and more
too; the graph shape changes a bit, but the incorrect negative values are
still plotted (it makes no difference).

Something to add: if the subtasks are less than or equal to 30, the same query
yields correct results. For subtask index > 30 (in my case 50) it plots junk
negative and positive values.

Regards,
Anchit



--
View this message in context: 
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-Metrics-InfluxDB-Grafana-Help-with-query-influxDB-query-for-Grafana-to-plot-numRecordsIn-numRen-tp9775p9819.html
Sent from the Apache Flink User Mailing List archive. mailing list archive at 
Nabble.com.


Re: Flink Metrics - InfluxDB + Grafana | Help with query influxDB query for Grafana to plot 'numRecordsIn' & 'numRecordsOut' for each operator/operation

2016-11-01 Thread Jamie Grier
Hmm.  I can't recreate that behavior here.  I have seen some issues like
this if you're grouping by a time interval different from the metrics
reporting interval you're using, though.  How often are you reporting
metrics to Influx?  Are you using the same interval in your Grafana
queries?  I see in your queries you are using a time interval of 10
seconds.  Have you tried 1 second?  Do you see the same behavior?

-Jamie


On Tue, Nov 1, 2016 at 4:30 PM, Anchit Jatana 
wrote:

> Hi Jamie,
>
> Thank you so much for your response.
>
> The below query:
>
> SELECT derivative(sum("count"), 1s) FROM "numRecordsIn" WHERE "task_name" =
> 'Sink: Unnamed' AND $timeFilter GROUP BY time(1s)
>
> behaves the same as with the use of the templating variable in the 'All'
> case, i.e. it shows a plot of junk 'negative values'.
>
> It shows accurate results/plot when an additional where clause for
> "subtask_index" is applied to the query.
>
> But without the "subtask_index" where clause (which means for all the
> subtask_indexes) it shows some junk/incorrect values on the graph (both
> highly positive & highly negative values in orders of millions)
>
> Images:
>
>  n4.nabble.com/file/n9816/Incorrect_%28for_all_subtasks%29.png>
>
>  n4.nabble.com/file/n9816/Correct_for_specific_subtask.png>
>
> Regards,
> Anchit
>
>
>
> --
> View this message in context: http://apache-flink-user-
> mailing-list-archive.2336050.n4.nabble.com/Flink-Metrics-
> InfluxDB-Grafana-Help-with-query-influxDB-query-for-
> Grafana-to-plot-numRecordsIn-numRen-tp9775p9816.html
> Sent from the Apache Flink User Mailing List archive. mailing list archive
> at Nabble.com.
>



-- 

Jamie Grier
data Artisans, Director of Applications Engineering
@jamiegrier 
ja...@data-artisans.com


Re: Flink Metrics - InfluxDB + Grafana | Help with query influxDB query for Grafana to plot 'numRecordsIn' & 'numRecordsOut' for each operator/operation

2016-11-01 Thread Anchit Jatana
Hi Jamie,

Thank you so much for your response. 

The below query:

SELECT derivative(sum("count"), 1s) FROM "numRecordsIn" WHERE "task_name" =
'Sink: Unnamed' AND $timeFilter GROUP BY time(1s)

behaves the same as with the use of the templating variable in the 'All'
case, i.e. it shows a plot of junk 'negative values'.

It shows accurate results/plot when an additional where clause for
"subtask_index" is applied to the query.

But without the "subtask_index" where clause (which means for all the
subtask_indexes) it shows some junk/incorrect values on the graph (both
highly positive & highly negative values in orders of millions)

Images:

http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/n9816/Incorrect_%28for_all_subtasks%29.png
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/n9816/Correct_for_specific_subtask.png

Regards,
Anchit



--
View this message in context: 
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-Metrics-InfluxDB-Grafana-Help-with-query-influxDB-query-for-Grafana-to-plot-numRecordsIn-numRen-tp9775p9816.html
Sent from the Apache Flink User Mailing List archive. mailing list archive at 
Nabble.com.


Re: Flink Metrics - InfluxDB + Grafana | Help with query influxDB query for Grafana to plot 'numRecordsIn' & 'numRecordsOut' for each operator/operation

2016-11-01 Thread Jamie Grier
Ahh.. I haven’t used templating all that much but this also works for your
subtask variable so that you don’t have to enumerate all the possible
values:

Template Variable Type: query

query: SHOW TAG VALUES FROM numRecordsIn WITH KEY = "subtask_index"
​

On Tue, Nov 1, 2016 at 2:51 PM, Jamie Grier  wrote:

> Another note.  In the example the template variable type is "custom" and
> the values have to be enumerated manually.  So in your case you would have
> to configure all the possible values of "subtask" to be 0-49.
>
> On Tue, Nov 1, 2016 at 2:43 PM, Jamie Grier 
> wrote:
>
>> This works well for me. This will aggregate the data across all sub-task
>> instances:
>>
>> SELECT derivative(sum("count"), 1s) FROM "numRecordsIn" WHERE "task_name"
>> = 'Sink: Unnamed' AND $timeFilter GROUP BY time(1s)
>>
>> You can also plot each sub-task instance separately on the same graph by
>> doing:
>>
>> SELECT derivative(sum("count"), 1s) FROM "numRecordsIn" WHERE "task_name"
>> = 'Sink: Unnamed' AND $timeFilter GROUP BY time(1s), "subtask_index"
>>
>> Or select just a single subtask instance by using:
>>
>> SELECT derivative(sum("count"), 1s) FROM "numRecordsIn" WHERE "task_name"
>> = 'Sink: Unnamed' AND "subtask_index" = '7' AND $timeFilter GROUP BY
>> time(1s)
>>
>> I haven’t used the templating features much but this also seems to work
>> fine and allows you to select an individual subtask_index or ‘all’ and it
>> works as it should — summing across all subtasks when you select ‘all’.
>>
>> SELECT derivative(sum("count"), 1s) FROM "numRecordsIn" WHERE "task_name"
>> = 'Sink: Unnamed' AND "subtask_index" =~ /^$subtask$/ AND $timeFilter GROUP
>> BY time(1s)
>> ​
>>
>> On Fri, Oct 28, 2016 at 2:53 PM, Anchit Jatana <
>> development.anc...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> I'm trying to plot the flink application metrics using grafana backed by
>>> influxdb. I need to plot/monitor the 'numRecordsIn' & 'numRecordsOut' for
>>> each operator/operation. I'm finding it hard to generate the influxdb query
>>> in grafana which can help me make this plot.
>>>
>>> I am able to plot the 'numRecordsIn' & 'numRecordsOut' for each
>>> subtask(parallelism set to 50) of the operator but not the operator as a
>>> whole.
>>>
>>> If somebody has knowledge or has successfully implemented this kind of a
>>> plot on grafana backed by influxdb, please share with me the process/query
>>> to achieve the same.
>>>
>>> Below is the query which I have to monitor the 'numRecordsIn' &
>>> 'numRecordsOut' for each subtask
>>>
>>> SELECT derivative(sum("count"), 10s) FROM "numRecordsOut" WHERE
>>> "task_name" = 'Source: Reading from Kafka' AND "subtask_index" =~
>>> /^$subtask$/ AND $timeFilter GROUP BY time(10s), "task_name"
>>>
>>> PS: $subtask is the templating variable that I'm using in order to have
>>> multiple subtask values. I have tried the 'All' option for this templating
>>> variable - this gives me an incorrect plot showing negative values, while
>>> selecting individual subtask values from the templating variable drop down
>>> yields correct results.
>>>
>>> Thank you!
>>>
>>> Regards,
>>> Anchit
>>>
>>>
>>>
>>
>>
>> --
>>
>> Jamie Grier
>> data Artisans, Director of Applications Engineering
>> @jamiegrier 
>> ja...@data-artisans.com
>>
>>
>
>
> --
>
> Jamie Grier
> data Artisans, Director of Applications Engineering
> @jamiegrier 
> ja...@data-artisans.com
>
>


-- 

Jamie Grier
data Artisans, Director of Applications Engineering
@jamiegrier 
ja...@data-artisans.com


Re: Flink Metrics - InfluxDB + Grafana | Help with query influxDB query for Grafana to plot 'numRecordsIn' & 'numRecordsOut' for each operator/operation

2016-11-01 Thread Jamie Grier
Another note.  In the example the template variable type is "custom" and
the values have to be enumerated manually.  So in your case you would have
to configure all the possible values of "subtask" to be 0-49.

On Tue, Nov 1, 2016 at 2:43 PM, Jamie Grier  wrote:

> This works well for me. This will aggregate the data across all sub-task
> instances:
>
> SELECT derivative(sum("count"), 1s) FROM "numRecordsIn" WHERE "task_name"
> = 'Sink: Unnamed' AND $timeFilter GROUP BY time(1s)
>
> You can also plot each sub-task instance separately on the same graph by
> doing:
>
> SELECT derivative(sum("count"), 1s) FROM "numRecordsIn" WHERE "task_name"
> = 'Sink: Unnamed' AND $timeFilter GROUP BY time(1s), "subtask_index"
>
> Or select just a single subtask instance by using:
>
> SELECT derivative(sum("count"), 1s) FROM "numRecordsIn" WHERE "task_name"
> = 'Sink: Unnamed' AND "subtask_index" = '7' AND $timeFilter GROUP BY
> time(1s)
>
> I haven’t used the templating features much but this also seems to work
> fine and allows you to select an individual subtask_index or ‘all’ and it
> works as it should — summing across all subtasks when you select ‘all’.
>
> SELECT derivative(sum("count"), 1s) FROM "numRecordsIn" WHERE "task_name"
> = 'Sink: Unnamed' AND "subtask_index" =~ /^$subtask$/ AND $timeFilter GROUP
> BY time(1s)
> ​
>
> On Fri, Oct 28, 2016 at 2:53 PM, Anchit Jatana <
> development.anc...@gmail.com> wrote:
>
>> Hi All,
>>
>> I'm trying to plot the flink application metrics using grafana backed by
>> influxdb. I need to plot/monitor the 'numRecordsIn' & 'numRecordsOut' for
>> each operator/operation. I'm finding it hard to generate the influxdb query
>> in grafana which can help me make this plot.
>>
>> I am able to plot the 'numRecordsIn' & 'numRecordsOut' for each
>> subtask(parallelism set to 50) of the operator but not the operator as a
>> whole.
>>
>> If somebody has knowledge or has successfully implemented this kind of a
>> plot on grafana backed by influxdb, please share with me the process/query
>> to achieve the same.
>>
>> Below is the query which I have to monitor the 'numRecordsIn' &
>> 'numRecordsOut' for each subtask
>>
>> SELECT derivative(sum("count"), 10s) FROM "numRecordsOut" WHERE
>> "task_name" = 'Source: Reading from Kafka' AND "subtask_index" =~
>> /^$subtask$/ AND $timeFilter GROUP BY time(10s), "task_name"
>>
>> PS: $subtask is the templating variable that I'm using in order to have
>> multiple subtask values. I have tried the 'All' option for this templating
>> variable - this gives me an incorrect plot showing negative values, while
>> selecting individual subtask values from the templating variable drop down
>> yields correct results.
>>
>> Thank you!
>>
>> Regards,
>> Anchit
>>
>>
>>
>
>
> --
>
> Jamie Grier
> data Artisans, Director of Applications Engineering
> @jamiegrier 
> ja...@data-artisans.com
>
>


-- 

Jamie Grier
data Artisans, Director of Applications Engineering
@jamiegrier 
ja...@data-artisans.com


Re: Flink Metrics - InfluxDB + Grafana | Help with query influxDB query for Grafana to plot 'numRecordsIn' & 'numRecordsOut' for each operator/operation

2016-11-01 Thread Jamie Grier
This works well for me. This will aggregate the data across all sub-task
instances:

SELECT derivative(sum("count"), 1s) FROM "numRecordsIn" WHERE "task_name" =
'Sink: Unnamed' AND $timeFilter GROUP BY time(1s)

You can also plot each sub-task instance separately on the same graph by
doing:

SELECT derivative(sum("count"), 1s) FROM "numRecordsIn" WHERE "task_name" =
'Sink: Unnamed' AND $timeFilter GROUP BY time(1s), "subtask_index"

Or select just a single subtask instance by using:

SELECT derivative(sum("count"), 1s) FROM "numRecordsIn" WHERE "task_name" =
'Sink: Unnamed' AND "subtask_index" = '7' AND $timeFilter GROUP BY time(1s)

I haven’t used the templating features much but this also seems to work
fine and allows you to select an individual subtask_index or ‘all’ and it
works as it should — summing across all subtasks when you select ‘all’.

SELECT derivative(sum("count"), 1s) FROM "numRecordsIn" WHERE "task_name" =
'Sink: Unnamed' AND "subtask_index" =~ /^$subtask$/ AND $timeFilter GROUP
BY time(1s)
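
One thing that makes these queries a bit easier to maintain, as far as I
understand it: the "task_name" tag is derived from the names given to sources,
operators and sinks in the job, so naming them explicitly avoids filtering on
generated names like 'Sink: Unnamed'. A rough sketch of what that looks like
with the regular DataStream API (the toy pipeline below is purely illustrative):

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class NamedOperatorsJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Explicit names end up in the metric scope / reporter tags
        // instead of generated ones like "Sink: Unnamed".
        env.fromElements("a", "b", "c")
                .name("Demo source")
                .map(new MapFunction<String, String>() {
                    @Override
                    public String map(String value) {
                        return value.toUpperCase();
                    }
                })
                .name("Uppercase")
                .print()
                .name("Demo sink");

        env.execute("named-operators-demo");
    }
}

Note that with operator chaining the task name is composed from the chained
operators' names, so it can still change if the chaining changes.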
​

On Fri, Oct 28, 2016 at 2:53 PM, Anchit Jatana  wrote:

> Hi All,
>
> I'm trying to plot the flink application metrics using grafana backed by
> influxdb. I need to plot/monitor the 'numRecordsIn' & 'numRecordsOut' for
> each operator/operation. I'm finding it hard to generate the influxdb query
> in grafana which can help me make this plot.
>
> I am able to plot the 'numRecordsIn' & 'numRecordsOut' for each
> subtask(parallelism set to 50) of the operator but not the operator as a
> whole.
>
> If somebody has knowledge or has successfully implemented this kind of a
> plot on grafana backed by influxdb, please share with me the process/query
> to achieve the same.
>
> Below is the query which I have to monitor the 'numRecordsIn' &
> 'numRecordsOut' for each subtask
>
> SELECT derivative(sum("count"), 10s) FROM "numRecordsOut" WHERE
> "task_name" = 'Source: Reading from Kafka' AND "subtask_index" =~
> /^$subtask$/ AND $timeFilter GROUP BY time(10s), "task_name"
>
> PS: $subtask is the templating variable that I'm using in order to have
> multiple subtask values. I have tried the 'All' option for this templating
> variable - this gives me an incorrect plot showing negative values, while
> selecting individual subtask values from the templating variable drop down
> yields correct results.
>
> Thank you!
>
> Regards,
> Anchit
>
>
>


-- 

Jamie Grier
data Artisans, Director of Applications Engineering
@jamiegrier 
ja...@data-artisans.com


Re: Flink Metrics

2016-10-18 Thread Aljoscha Krettek
https://ci.apache.org/projects/flink/flink-docs-release-1.2/monitoring/metrics.html
Or this:
https://ci.apache.org/projects/flink/flink-docs-release-1.1/apis/metrics.html
if you prefer Flink 1.1

On Mon, 17 Oct 2016 at 19:16 amir bahmanyari <amirto...@yahoo.com> wrote:

> Hi colleagues,
> Is there a link that describes Flink metrics & provides examples on how to
> utilize them, please?
> I really appreciate it...
> Cheers
>
> --
> *From:* Till Rohrmann <trohrm...@apache.org>
> *To:* user@flink.apache.org
> *Cc:* d...@flink.apache.org
> *Sent:* Monday, October 17, 2016 12:52 AM
> *Subject:* Re: Flink Metrics
>
> Hi Govind,
>
> I think the DropwizardMeterWrapper implementation is just a reference
> implementation where it was decided to report the minute rate. You can
> define your own meter class which allows to configure the rate interval
> accordingly.
>
> Concerning Timers, I think nobody requested this metric so far. If you
> want, then you can open a JIRA issue and contribute it. The community would
> really appreciate that.
>
> Cheers,
> Till
> ​
>
> On Mon, Oct 17, 2016 at 5:26 AM, Govindarajan Srinivasaraghavan <
> govindragh...@gmail.com> wrote:
>
> > Hi,
> >
> > I am currently using flink 1.2 snapshot and instrumenting my pipeline
> with
> > flink metrics. One small suggestion I have is currently the Meter
> interface
> > only supports getRate() which is always the one minute rate.
> >
> > It would be great if all the rates (1 min, 5 min & 15 min) are exposed to
> > get a better picture in terms of performance.
> >
> > Also is there any reason why timers are not part of flink metrics core?
> >
> > Regards,
> > Govind
> >
>
>
>


Re: Flink Metrics

2016-10-17 Thread amir bahmanyari
Hi colleagues,
Is there a link that describes Flink metrics & provides examples on how to utilize them, please?
I really appreciate it... Cheers

  From: Till Rohrmann <trohrm...@apache.org>
 To: user@flink.apache.org 
Cc: d...@flink.apache.org
 Sent: Monday, October 17, 2016 12:52 AM
 Subject: Re: Flink Metrics
   
Hi Govind,

I think the DropwizardMeterWrapper implementation is just a reference
implementation where it was decided to report the minute rate. You can
define your own meter class which allows to configure the rate interval
accordingly.

Concerning Timers, I think nobody requested this metric so far. If you
want, then you can open a JIRA issue and contribute it. The community would
really appreciate that.

Cheers,
Till
​

On Mon, Oct 17, 2016 at 5:26 AM, Govindarajan Srinivasaraghavan <
govindragh...@gmail.com> wrote:

> Hi,
>
> I am currently using flink 1.2 snapshot and instrumenting my pipeline with
> flink metrics. One small suggestion I have is currently the Meter interface
> only supports getRate() which is always the one minute rate.
>
> It would be great if all the rates (1 min, 5 min & 15 min) are exposed to get
> a better picture in terms of performance.
>
> Also is there any reason why timers are not part of flink metrics core?
>
> Regards,
> Govind
>

   

Re: Flink Metrics

2016-10-17 Thread Chesnay Schepler

Hello,

we could also offer a small utility method that creates 3 flink meters, 
each reporting one rate of a DW meter.
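
Until such a utility exists in Flink itself, something similar can be sketched
in user code (assuming the Dropwizard meter is on the classpath; the helper and
metric names below are made up for illustration):

import java.util.function.DoubleSupplier;

import org.apache.flink.metrics.Meter;
import org.apache.flink.metrics.MetricGroup;

// One Dropwizard meter drives three Flink meters, each exposing a different
// rate; events are marked once, on the returned Dropwizard meter.
public final class MultiRateMeters {

    private MultiRateMeters() {
    }

    public static com.codahale.metrics.Meter register(MetricGroup group, String baseName) {
        final com.codahale.metrics.Meter dw = new com.codahale.metrics.Meter();
        group.meter(baseName + "_1m", adapt(dw, dw::getOneMinuteRate));
        group.meter(baseName + "_5m", adapt(dw, dw::getFiveMinuteRate));
        group.meter(baseName + "_15m", adapt(dw, dw::getFifteenMinuteRate));
        return dw;
    }

    private static Meter adapt(final com.codahale.metrics.Meter dw, final DoubleSupplier rate) {
        return new Meter() {
            @Override
            public void markEvent() {
                dw.mark();
            }

            @Override
            public void markEvent(long n) {
                dw.mark(n);
            }

            @Override
            public double getRate() {
                return rate.getAsDouble();
            }

            @Override
            public long getCount() {
                return dw.getCount();
            }
        };
    }
}

A function would call register(getRuntimeContext().getMetricGroup(), "myEvents")
once in open() and then mark events on the returned Dropwizard meter.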


Timers weren't added yet since, as Till said, no one requested them yet 
and we haven't found a proper internal use-case for them


Regards,
Chesnay

On 17.10.2016 09:52, Till Rohrmann wrote:


Hi Govind,

I think the DropwizardMeterWrapper implementation is just a 
reference implementation where it was decided to report the minute 
rate. You can define your own meter class which allows to configure 
the rate interval accordingly.


Concerning Timers, I think nobody requested this metric so far. If you 
want, then you can open a JIRA issue and contribute it. The community 
would really appreciate that.


Cheers,
Till

​

On Mon, Oct 17, 2016 at 5:26 AM, Govindarajan Srinivasaraghavan wrote:


Hi,

I am currently using flink 1.2 snapshot and instrumenting my
pipeline with flink metrics. One small suggestion I have is
currently the Meter interface only supports getRate() which is
always the one minute rate.

It would be great if all the rates (1 min, 5 min & 15 min) are
exposed to get a better picture in terms of performance.

Also is there any reason why timers are not part of flink metrics
core?

Regards,
Govind






Re: Flink Metrics

2016-10-17 Thread Till Rohrmann
Hi Govind,

I think the DropwizardMeterWrapper implementation is just a reference
implementation where it was decided to report the minute rate. You can
define your own meter class which allows to configure the rate interval
accordingly.

Concerning Timers, I think nobody requested this metric so far. If you
want, then you can open a JIRA issue and contribute it. The community would
really appreciate that.

Cheers,
Till
​

On Mon, Oct 17, 2016 at 5:26 AM, Govindarajan Srinivasaraghavan <
govindragh...@gmail.com> wrote:

> Hi,
>
> I am currently using flink 1.2 snapshot and instrumenting my pipeline with
> flink metrics. One small suggestion I have is currently the Meter interface
> only supports getRate() which is always the one minute rate.
>
> It would be great if all the rates (1 min, 5 min & 15 min) are exposed to get
> a better picture in terms of performance.
>
> Also is there any reason why timers are not part of flink metrics core?
>
> Regards,
> Govind
>