Re: Flink Metrics Naming

Mason Chen Tue, 01 Jun 2021 07:38:27 -0700

Makes sense. We are primarily concerned with removing the metric labels from 
the names as the user metrics get too long. i.e. the groups from `addGroup` are 
concatenated in the metric name.


Do you think there would be any issues with removing the group information in 
the metric name and putting them into a label instead? In seems like most 
metrics internally, don’t use `addGroup` to create group information but rather 
by creating another subclass of metric group.

Perhaps, I should ONLY apply this custom logic to metrics with the “user” 
scope? Other scoped metrics (e.g. operator, task operator, etc.) shouldn’t have 
these group names in the metric names in my experience...

An example just for clarity, 
flink_<system_scope>_group1_group2_metricName{group1=…, group2=…, flink tags}

=>

flink_<system_scope>_metricName{group_info=group1_group2, group1=…, group2=…, 
flink tags}

> On Jun 1, 2021, at 9:57 AM, Chesnay Schepler <ches...@apache.org> wrote:
> 
> The uniqueness of metrics and the naming of the Prometheus reporter are 
> somewhat related but also somewhat orthogonal.
> 
> Prometheus works similar to JMX in that the metric name (e.g., 
> taskmanager.job.task.operator.numRecordsIn) is more or less a _class_ of 
> metrics, with tags/labels allowing you to select a specific instance of that 
> metric.
> 
> Restricting metric names to 1 level of the hierarchy would present a few 
> issues:
> a) Effectively, all metric names that Flink uses effectively become reserved 
> keywords that users must not use, which will lead to headaches when adding 
> more metrics or forwarding metrics from libraries (e.g., kafka), because we 
> could always break existing user-defined metrics.
> b) You'd need a cluster-wide lookup that is aware of all hierarchies to 
> ensure consistency across all processes.
> 
> In the end, there are significantly easier ways to solve the issue of the 
> metric name being too long, i.e., give the user more control over the logical 
> scope (taskmanager.job.task.operator), be it shortening the names (t.j.t.o), 
> limiting the depth (e.g, operator.numRecordsIn), removing it outright (but 
> I'd prefer some context to be present for clarity) or supporting something 
> similar to scope formats.
> I'm reasonably certain there are some tickets already in this direction, we 
> just don't get around to doing them because for the most part the metric 
> system works good enough and there are bigger fish to fry.
> 
> On 6/1/2021 3:39 PM, Till Rohrmann wrote:
>> Hi Mason,
>> 
>> The idea is that a metric is not uniquely identified by its name alone but 
>> instead by its path. The groups in which it is defined specify this path 
>> (similar to directories). That's why it is valid to specify two metrics with 
>> the same name if they reside in different groups.
>> 
>> I think Prometheus does not support such a tree structure and that's why the 
>> path is exposed via labels if I am not mistaken. So long story short, what 
>> you are seeing is a combination of how Flink organizes metrics and what can 
>> be reported to Prometheus. 
>> 
>> I am also pulling in Chesnay who is more familiar with this part of the code.
>> 
>> Cheers,
>> Till
>> 
>> On Fri, May 28, 2021 at 7:33 PM Mason Chen <mason.c...@apple.com 
>> <mailto:mason.c...@apple.com>> wrote:
>> Can anyone give insight as to why Flink allows 2 metrics with the same 
>> “name”?
>> 
>> For example,
>> 
>> getRuntimeContext.addGroup(“group”, “group1”).counter(“myMetricName”);
>> 
>> And
>> 
>> getRuntimeContext.addGroup(“other_group”, 
>> “other_group1”).counter(“myMetricName”);
>> 
>> Are totally valid.
>> 
>> 
>> It seems that it has lead to some not-so-great implementations—the 
>> prometheus reporter and attaching the labels to the metric name, making the 
>> name quite verbose.
>> 
>> 
>

Re: Flink Metrics Naming

Reply via email to