[jira] [Commented] (FLINK-7200) Make metrics more Datadog friendly

2017-07-16 Thread Robert Batts (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16089048#comment-16089048
 ] 

Robert Batts commented on FLINK-7200:
-

You're absolutely right. This is what I get for opening Jira on a Friday.

> Make metrics more Datadog friendly
> --
>
> Key: FLINK-7200
> URL: https://issues.apache.org/jira/browse/FLINK-7200
> Project: Flink
>  Issue Type: Improvement
>  Components: Metrics
> Affects Versions: 1.3.1
> Reporter: Robert Batts
> Priority: Minor
>
> The metric names currently emitted by the Datadog Reporter are a poor fit for 
> how Datadog organizes metrics. Take, for example, the metric reported by the 
> Datadog Kafka integration:
> kafka.consumer_lag= [topic:, consumer_group: , partition: ]
> Through the use of tags (in this case topic, consumer_group, and partition) 
> you can create graphs in Datadog filtered to a specific topic and 
> consumer_group and then averaged on each partition. This allows you to 
> visualize something like a heatmap for lag on each partition for a consumer.
> So what am I suggesting for Flink? I think the tags for Datadog are already 
> in a great place: tags like job_id and subtask_id would be great for 
> filtering and grouping. The metric name, however, is too specific to a 
> taskmanager and subtask. Currently, the metrics look something like this:
> flink_w04.taskmanager.4f378aff5730.TwitterExample.ExtractHashtags.7.numRecordsOut
> {host}.taskmanager.{tm_id}.{job_name}.{operator_name}.{subtask_index}.{metric_name}
> What I am suggesting is something more like this:
> taskmanager.TwitterExample.ExtractHashtags.numRecordsOut
> taskmanager.{job_name}.{operator_name}.{metric_name}
> (or even taskmanager.{metric_name}, but that would be a lot of tags on a 
> single metric)
> By doing this, someone could graph numRecordsOut for an entire task from a 
> single metric in Datadog, rather than combining a separate metric for every 
> tm_id and subtask_index (a set that changes whenever a taskmanager drops out 
> of the cluster). Additionally, given the tags already being sent to Datadog, 
> a lot of grouping and filtering would become available if everything were 
> reported under a simplified metric name.
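
To make the proposal in the quoted description concrete, here is a minimal Python sketch of the idea: the host, tm_id, and subtask index move out of the metric name and into tags, so every subtask reports under one shared name. The sample values, the second subtask, and the helper name split_name are made up for illustration, and the sketch assumes no component contains a dot. It is not how the Datadog Reporter is implemented.

```python
from collections import defaultdict

# Hypothetical samples: (fully scoped metric name, value). Values are invented.
samples = [
    ("flink_w04.taskmanager.4f378aff5730.TwitterExample.ExtractHashtags.7.numRecordsOut", 120),
    ("flink_w04.taskmanager.4f378aff5730.TwitterExample.ExtractHashtags.8.numRecordsOut", 80),
]

def split_name(full_name):
    """Split a fully scoped name into the proposed short name plus tags.

    Assumes exactly seven dot-separated components, as in the example above.
    """
    host, _, tm_id, job, operator, subtask, metric = full_name.split(".")
    short_name = f"taskmanager.{job}.{operator}.{metric}"
    tags = {"host": host, "tm_id": tm_id, "subtask_index": subtask}
    return short_name, tags

# With a shared name, all subtasks aggregate under a single metric.
totals = defaultdict(int)
for name, value in samples:
    short_name, _tags = split_name(name)
    totals[short_name] += value
```

Both samples end up under taskmanager.TwitterExample.ExtractHashtags.numRecordsOut, while the per-subtask detail remains available through the tags.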



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7200) Make metrics more Datadog friendly

2017-07-15 Thread Chesnay Schepler (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16088602#comment-16088602
 ] 

Chesnay Schepler commented on FLINK-7200:
-

You can configure the components contained in the metric name using scope 
formats, as described in the metrics documentation: 
https://ci.apache.org/projects/flink/flink-docs-release-1.4/monitoring/metrics.html#scope

All your suggestions can be accomplished with this feature.
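
As a rough illustration of what a scope format does, the Python sketch below expands a format string with <variable> placeholders into a metric name. It only emulates the substitution; the function expand_scope and the variable dictionary are hypothetical, and the assumption that the operator scope is set through a key such as metrics.scope.operator in flink-conf.yaml should be checked against the linked documentation for your Flink version.

```python
import re

def expand_scope(scope_format, variables, metric_name):
    """Replace each <variable> in the scope format, then append the metric name."""
    def lookup(match):
        return variables[match.group(1)]
    scope = re.sub(r"<(\w+)>", lookup, scope_format)
    return f"{scope}.{metric_name}"

# Values taken from the example metric in the issue description.
variables = {
    "host": "flink_w04",
    "tm_id": "4f378aff5730",
    "job_name": "TwitterExample",
    "operator_name": "ExtractHashtags",
    "subtask_index": "7",
}

# Format matching the names currently reported, and the shortened one proposed.
current_format = "<host>.taskmanager.<tm_id>.<job_name>.<operator_name>.<subtask_index>"
proposed_format = "taskmanager.<job_name>.<operator_name>"

current_name = expand_scope(current_format, variables, "numRecordsOut")
proposed_name = expand_scope(proposed_format, variables, "numRecordsOut")
```

Note that with the shortened format, subtasks no longer differ by name, so telling them apart relies on the tags sent alongside the metric.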



