[MediaWiki-commits] [Gerrit] operations/puppet[production]: Correct and simplify EventLogging monitoring

2016-10-18 Thread Ottomata (Code Review)
Ottomata has submitted this change and it was merged.

Change subject: Correct and simplify EventLogging monitoring
..


Correct and simplify EventLogging monitoring

EventLogging monitoring was incorrectly calculating the difference
between valid and invalid events.  The valid event metric included
EventError which is a kafka topic where invalid events are sent.  This
fixes that and also removes the use of server-side event monitoring or
mention in role::eventlogging.

Bug: T147321
Change-Id: I8b7aadecb9cf2ef43f2b7a4a638d797271dfac9e
---
M manifests/role/eventlogging.pp
M modules/eventlogging/manifests/monitoring/graphite.pp
2 files changed, 14 insertions(+), 18 deletions(-)

Approvals:
  Ottomata: Verified; Looks good to me, approved



diff --git a/manifests/role/eventlogging.pp b/manifests/role/eventlogging.pp
index 2845f8d..f3086ce 100644
--- a/manifests/role/eventlogging.pp
+++ b/manifests/role/eventlogging.pp
@@ -55,9 +55,8 @@
 # to your query params.
 $kafka_base_uri= inline_template('kafka:///<%= 
@kafka_brokers_array.join(":9092,") + ":9092" %>')
 
-# Read in server side and client side raw events from
-# Kafka, process them, and send events to schema
-# based topics in Kafka.
+# Read in raw events from Kafka, process them, and send them to
+# the schema corresponding to their topic in Kafka.
 $kafka_schema_uri  = "${kafka_base_uri}?topic=eventlogging_{schema}"
 
 # The downstream eventlogging MySQL consumer expects schemas to be
@@ -70,7 +69,6 @@
 default => 
"${kafka_base_uri}?topic=eventlogging-valid-mixed&blacklist=${mixed_schema_blacklist}"
 }
 
-$kafka_server_side_raw_uri = 
"${kafka_base_uri}?topic=eventlogging-server-side"
 $kafka_client_side_raw_uri = 
"${kafka_base_uri}?topic=eventlogging-client-side"
 
 # This check was written for eventlog1001, so only include it there.,
diff --git a/modules/eventlogging/manifests/monitoring/graphite.pp 
b/modules/eventlogging/manifests/monitoring/graphite.pp
index 5fc9dd0..0275a4d 100644
--- a/modules/eventlogging/manifests/monitoring/graphite.pp
+++ b/modules/eventlogging/manifests/monitoring/graphite.pp
@@ -9,8 +9,9 @@
 #kafka::server::jmxtrans
 #
 class eventlogging::monitoring::graphite($kafka_brokers_graphite_wildcard) {
-$raw_events_rate_metric   = 
"sumSeries(kafka.cluster.analytics-eqiad.kafka.${kafka_brokers_graphite_wildcard}.kafka.server.BrokerTopicMetrics.MessagesInPerSec.{eventlogging-client-side,eventlogging-server-side}.OneMinuteRate)"
-$valid_events_rate_metric = 
"sumSeries(kafka.cluster.analytics-eqiad.kafka.${kafka_brokers_graphite_wildcard}.kafka.server.BrokerTopicMetrics.MessagesInPerSec.eventlogging_*.OneMinuteRate)"
+$raw_events_rate_metric   = 
"sumSeries(kafka.cluster.analytics-eqiad.kafka.${kafka_brokers_graphite_wildcard}.kafka.server.BrokerTopicMetrics.MessagesInPerSec.eventlogging-client-side.OneMinuteRate)"
+$error_events_rate_metric = 
"sumSeries(kafka.cluster.analytics-eqiad.kafka.${kafka_brokers_graphite_wildcard}.kafka.server.BrokerTopicMetrics.MessagesInPerSec.eventlogging_EventError.OneMinuteRate)"
+$navigation_timing_events_rate_metric = 
"sumSeries(kafka.cluster.analytics-eqiad.kafka.${kafka_brokers_graphite_wildcard}.kafka.server.BrokerTopicMetrics.MessagesInPerSec.eventlogging_NavigationTiming.OneMinuteRate)"
 
 # Warn if 15% of overall event throughput goes beyond 1000 events/s
 # in a 15 min period.
@@ -28,7 +29,6 @@
 # Alarms if 15% of Navigation Timing event throughput goes under 1 req/sec
 # in a 15 min period
 # https://meta.wikimedia.org/wiki/Schema:NavigationTiming
-$navigation_timing_events_rate_metric = 
"sumSeries(kafka.cluster.analytics-eqiad.kafka.${kafka_brokers_graphite_wildcard}.kafka.server.BrokerTopicMetrics.MessagesInPerSec.eventlogging_NavigationTiming.OneMinuteRate)"
 monitoring::graphite_threshold { 
'eventlogging_NavigationTiming_throughput':
 description   => 'Throughput of EventLogging NavigationTiming events',
 metric=> $navigation_timing_events_rate_metric,
@@ -40,19 +40,17 @@
 under => true
 }
 
-# Warn/Alert if the difference between raw and valid EventLogging
-# alerts gets too big.  We put a 10 minute lag because of metrics
-# not being correct in graphite before.
-# If the difference gets too big, either the validation step is
-# overloaded, or high volume schemas are failing validation.
-monitoring::graphite_threshold { 'eventlogging_difference_raw_validated':
-description   => 'Difference between raw and validated EventLogging 
overall message rates',
-metric=> 
"absolute(diffSeries(${raw_events_rate_metric},${valid_events_rate_metric}))",
+# Warn if 15% of overall error event throughput goes above 20 events/s
+# in a 15 minute period.
+# The EventError topic coun

[MediaWiki-commits] [Gerrit] operations/puppet[production]: Correct and simplify EventLogging monitoring

2016-10-18 Thread Milimetric (Code Review)
Milimetric has uploaded a new change for review.

  https://gerrit.wikimedia.org/r/316567

Change subject: Correct and simplify EventLogging monitoring
..

Correct and simplify EventLogging monitoring

EventLogging monitoring was incorrectly calculating the difference
between valid and invalid events.  The valid event metric included
EventError which is a kafka topic where invalid events are sent.  This
fixes that and also removes the use of server-side event monitoring or
mention in role::eventlogging.

Bug: T147321
Change-Id: I8b7aadecb9cf2ef43f2b7a4a638d797271dfac9e
---
M manifests/role/eventlogging.pp
M modules/eventlogging/manifests/monitoring/graphite.pp
2 files changed, 12 insertions(+), 18 deletions(-)


  git pull ssh://gerrit.wikimedia.org:29418/operations/puppet 
refs/changes/67/316567/1

diff --git a/manifests/role/eventlogging.pp b/manifests/role/eventlogging.pp
index 2845f8d..f3086ce 100644
--- a/manifests/role/eventlogging.pp
+++ b/manifests/role/eventlogging.pp
@@ -55,9 +55,8 @@
 # to your query params.
 $kafka_base_uri= inline_template('kafka:///<%= 
@kafka_brokers_array.join(":9092,") + ":9092" %>')
 
-# Read in server side and client side raw events from
-# Kafka, process them, and send events to schema
-# based topics in Kafka.
+# Read in raw events from Kafka, process them, and send them to
+# the schema corresponding to their topic in Kafka.
 $kafka_schema_uri  = "${kafka_base_uri}?topic=eventlogging_{schema}"
 
 # The downstream eventlogging MySQL consumer expects schemas to be
@@ -70,7 +69,6 @@
 default => 
"${kafka_base_uri}?topic=eventlogging-valid-mixed&blacklist=${mixed_schema_blacklist}"
 }
 
-$kafka_server_side_raw_uri = 
"${kafka_base_uri}?topic=eventlogging-server-side"
 $kafka_client_side_raw_uri = 
"${kafka_base_uri}?topic=eventlogging-client-side"
 
 # This check was written for eventlog1001, so only include it there.,
diff --git a/modules/eventlogging/manifests/monitoring/graphite.pp 
b/modules/eventlogging/manifests/monitoring/graphite.pp
index 5fc9dd0..67dc0a3 100644
--- a/modules/eventlogging/manifests/monitoring/graphite.pp
+++ b/modules/eventlogging/manifests/monitoring/graphite.pp
@@ -9,8 +9,9 @@
 #kafka::server::jmxtrans
 #
 class eventlogging::monitoring::graphite($kafka_brokers_graphite_wildcard) {
-$raw_events_rate_metric   = 
"sumSeries(kafka.cluster.analytics-eqiad.kafka.${kafka_brokers_graphite_wildcard}.kafka.server.BrokerTopicMetrics.MessagesInPerSec.{eventlogging-client-side,eventlogging-server-side}.OneMinuteRate)"
-$valid_events_rate_metric = 
"sumSeries(kafka.cluster.analytics-eqiad.kafka.${kafka_brokers_graphite_wildcard}.kafka.server.BrokerTopicMetrics.MessagesInPerSec.eventlogging_*.OneMinuteRate)"
+$raw_events_rate_metric   = 
"sumSeries(kafka.cluster.analytics-eqiad.kafka.${kafka_brokers_graphite_wildcard}.kafka.server.BrokerTopicMetrics.MessagesInPerSec.eventlogging-client-side.OneMinuteRate)"
+$invalid_events_rate_metric = 
"sumSeries(kafka.cluster.analytics-eqiad.kafka.${kafka_brokers_graphite_wildcard}.kafka.server.BrokerTopicMetrics.MessagesInPerSec.eventlogging_EventError.OneMinuteRate)"
+$navigation_timing_events_rate_metric = 
"sumSeries(kafka.cluster.analytics-eqiad.kafka.${kafka_brokers_graphite_wildcard}.kafka.server.BrokerTopicMetrics.MessagesInPerSec.eventlogging_NavigationTiming.OneMinuteRate)"
 
 # Warn if 15% of overall event throughput goes beyond 1000 events/s
 # in a 15 min period.
@@ -28,7 +29,6 @@
 # Alarms if 15% of Navigation Timing event throughput goes under 1 req/sec
 # in a 15 min period
 # https://meta.wikimedia.org/wiki/Schema:NavigationTiming
-$navigation_timing_events_rate_metric = 
"sumSeries(kafka.cluster.analytics-eqiad.kafka.${kafka_brokers_graphite_wildcard}.kafka.server.BrokerTopicMetrics.MessagesInPerSec.eventlogging_NavigationTiming.OneMinuteRate)"
 monitoring::graphite_threshold { 
'eventlogging_NavigationTiming_throughput':
 description   => 'Throughput of EventLogging NavigationTiming events',
 metric=> $navigation_timing_events_rate_metric,
@@ -40,19 +40,15 @@
 under => true
 }
 
-# Warn/Alert if the difference between raw and valid EventLogging
-# alerts gets too big.  We put a 10 minute lag because of metrics
-# not being correct in graphite before.
-# If the difference gets too big, either the validation step is
-# overloaded, or high volume schemas are failing validation.
-monitoring::graphite_threshold { 'eventlogging_difference_raw_validated':
-description   => 'Difference between raw and validated EventLogging 
overall message rates',
-metric=> 
"absolute(diffSeries(${raw_events_rate_metric},${valid_events_rate_metric}))",
+# Warn if 15% of overall invalid event throughput goes above 20 event