[MediaWiki-commits] [Gerrit] operations/puppet[production]: mediawiki::appserver::api: add load monitoring

2018-01-03 Thread Giuseppe Lavagetto (Code Review)
Giuseppe Lavagetto has submitted this change and it was merged. ( https://gerrit.wikimedia.org/r/401714 )

Change subject: mediawiki::appserver::api: add load monitoring
..


mediawiki::appserver::api: add load monitoring

We've had quite a few cases of HHVM API appservers with high CPU usage
causing noticeable latencies for users; the easiest way to detect such
deadlocks is simply to check the machine's CPU usage/load - at least
that's what I do manually.

This change won't solve the issue per se, but it will proactively make ops
aware of what is going on.

Bug: T182568
Bug: T184048
Change-Id: I06af45cbf8f42ade5753dc7397c6e1aa2b32c4ea
---
A modules/profile/manifests/mediawiki/api.pp
M modules/role/manifests/mediawiki/appserver/api.pp
2 files changed, 23 insertions(+), 8 deletions(-)

Approvals:
  Giuseppe Lavagetto: Looks good to me, approved
  jenkins-bot: Verified



diff --git a/modules/profile/manifests/mediawiki/api.pp b/modules/profile/manifests/mediawiki/api.pp
new file mode 100644
index 000..0f0e9d6
--- /dev/null
+++ b/modules/profile/manifests/mediawiki/api.pp
@@ -0,0 +1,22 @@
+# == Class profile::mediawiki::api
+#
+# Specific settings for the mediawiki API servers
+class profile::mediawiki::api {
+    # Using fastcgi we need more local ports
+    sysctl::parameters { 'raise_port_range':
+        values   => { 'net.ipv4.local_port_range' => '22500 65535', },
+        priority => 90,
+    }
+
+    # Check the load to detect clearly hosts hanging (see T184048, T182568)
+    $nproc = $facts['processorcount']
+    $warning = join([ $nproc * 0.95, $nproc * 0.8, $nproc * 0.75], ',')
+    $critical = join([ $nproc * 1.5, $nproc * 1.1, $nproc * 1], ',')
+    # Since we're checking the load, that is already a moving average, we can
+    # alarm at the first occurrence
+    nrpe::monitor_service { 'cpu_load':
+        description  => 'High CPU load on API appserver',
+        nrpe_command => "/usr/lib/nagios/plugins/check_load -w ${warning} -c ${critical}",
+        retries      => 1,
+    }
+}
diff --git a/modules/role/manifests/mediawiki/appserver/api.pp b/modules/role/manifests/mediawiki/appserver/api.pp
index 83494bb..c36d4b7 100644
--- a/modules/role/manifests/mediawiki/appserver/api.pp
+++ b/modules/role/manifests/mediawiki/appserver/api.pp
@@ -5,12 +5,5 @@
     include ::profile::base::firewall
     include ::profile::prometheus::apache_exporter
     include ::profile::prometheus::hhvm_exporter
-
-    # Using fastcgi we need more local ports
-    sysctl::parameters { 'raise_port_range':
-        values   => {
-            'net.ipv4.local_port_range' => '22500 65535',
-        },
-        priority => 90,
-    }
+    include ::profile::mediawiki::api
 }
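
As a rough illustration of how the thresholds in profile::mediawiki::api
scale with the core count, here is a minimal standalone sketch. The value 32
is an assumed core count chosen for the example (the manifest itself reads
$facts['processorcount']), and notice() is used here only to print the
resulting command line. It assumes a join() function is available, as in the
change above.

```puppet
# Illustration only: 32 is an assumed core count, not a value from this change.
$nproc = 32

# check_load takes three comma-separated thresholds per level, applied to the
# 1-, 5- and 15-minute load averages respectively.
$warning  = join([ $nproc * 0.95, $nproc * 0.8, $nproc * 0.75 ], ',')
$critical = join([ $nproc * 1.5,  $nproc * 1.1, $nproc * 1    ], ',')

# Print the command line that the NRPE check would run.
notice("/usr/lib/nagios/plugins/check_load -w ${warning} -c ${critical}")
```

With 32 cores this works out to warning thresholds of about 30.4, 25.6 and 24,
and critical thresholds of about 48, 35.2 and 32. In other words, the check
warns when the load approaches the number of cores and goes critical once the
sustained load exceeds it.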

-- 
To view, visit https://gerrit.wikimedia.org/r/401714
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I06af45cbf8f42ade5753dc7397c6e1aa2b32c4ea
Gerrit-PatchSet: 5
Gerrit-Project: operations/puppet
Gerrit-Branch: production
Gerrit-Owner: Giuseppe Lavagetto 
Gerrit-Reviewer: Elukey 
Gerrit-Reviewer: Giuseppe Lavagetto 
Gerrit-Reviewer: Muehlenhoff 
Gerrit-Reviewer: jenkins-bot <>

___
MediaWiki-commits mailing list
MediaWiki-commits@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits


[MediaWiki-commits] [Gerrit] operations/puppet[production]: mediawiki::appserver::api: add load monitoring

2018-01-03 Thread Giuseppe Lavagetto (Code Review)
Giuseppe Lavagetto has uploaded a new change for review. ( https://gerrit.wikimedia.org/r/401714 )

Change subject: mediawiki::appserver::api: add load monitoring
..

mediawiki::appserver::api: add load monitoring

We've had quite a few cases of HHVM API appservers with high CPU usage
causing noticeable latencies for users; the easiest way to detect such
deadlocks is simply to check the machine's CPU usage/load - at least
that's what I do manually.

This change won't solve the issue per se, but it will proactively make ops
aware of what is going on.

Bug: T182568, T184048
Change-Id: I06af45cbf8f42ade5753dc7397c6e1aa2b32c4ea
---
M modules/role/manifests/mediawiki/appserver/api.pp
1 file changed, 11 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.wikimedia.org:29418/operations/puppet refs/changes/14/401714/1

diff --git a/modules/role/manifests/mediawiki/appserver/api.pp b/modules/role/manifests/mediawiki/appserver/api.pp
index 83494bb..39d2ffb 100644
--- a/modules/role/manifests/mediawiki/appserver/api.pp
+++ b/modules/role/manifests/mediawiki/appserver/api.pp
@@ -13,4 +13,15 @@
         },
         priority => 90,
     }
+
+    # Check the load to detect clearly hosts hanging (see T184048, T182568)
+    $nproc = $facts['processorcount']
+    $warning = join([ $nproc * 0.95, $nproc * 0.8, $nproc * 0.75], ',')
+    $critical = join([ $nproc * 1.5, $nproc * 1.1, $nproc * 1], ',')
+    # Since we're checking the load, that is already a moving average, we can
+    # alarm at the first occurrence
+    nrpe::monitor_service { 'cpu_load':
+        command => "check_load -w ${warning} -c ${critical}",
+        retries => 1,
+    }
 }

-- 
To view, visit https://gerrit.wikimedia.org/r/401714
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I06af45cbf8f42ade5753dc7397c6e1aa2b32c4ea
Gerrit-PatchSet: 1
Gerrit-Project: operations/puppet
Gerrit-Branch: production
Gerrit-Owner: Giuseppe Lavagetto 
