Hello,

it is probably worth explaining why we have had this headless chicken
run lately with all these new servers, and why we didn't do this
slowly but surely before the slowdowns hit us.

We use Ganglia to understand cluster capacity, and the main overview
page is at:
http://ganglia.wikimedia.org/pmtpa/?gw=fwd&gs=Wikimedia%40http%3A%2F%2Fganglia.wikimedia.org%2F

One of the things that distorted our understanding was that the
aggregate graph didn't exclude servers that were out of rotation for
one reason or another, so besides the highly loaded servers, the
average calculation had hosts at 0 load in the mix - thus showing
quite some white space in the aggregate. Fixing that immediately
showed a somewhat worse situation than it used to look before (though
per-host statistics already showed the problem).
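To illustrate with made-up numbers (the loads below are hypothetical,
not real cluster data): a couple of out-of-rotation hosts at 0 load
make the naive mean look like we have plenty of headroom.

```python
# Toy illustration (made-up numbers): hosts that are out of rotation
# report 0 load and drag the cluster-wide average down.
loads = [0.9, 0.85, 0.95, 0.0, 0.0]  # last two hosts are out of rotation

naive_avg = sum(loads) / len(loads)           # what the old aggregate plotted
in_rotation = [l for l in loads if l > 0]     # drop idle, out-of-rotation hosts
real_avg = sum(in_rotation) / len(in_rotation)

print(f"naive average: {naive_avg:.2f}")  # 0.54 - looks like headroom
print(f"real average:  {real_avg:.2f}")   # 0.90 - serving boxes nearly full
```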

Another issue is that any long-term graph (say monthly or yearly)
shows averages that do not represent peak-time load properly.
So we do see the average increase, but it does not look _that_ frightening.
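Same idea, again with hypothetical numbers: averaging a day of hourly
samples flattens the peak hours, which is exactly the number capacity
planning cares about.

```python
# Hypothetical day of hourly CPU loads: quiet at night, slammed at peak.
hourly = [0.3] * 8 + [0.6] * 8 + [0.95] * 8  # night / daytime / peak hours

daily_avg = sum(hourly) / len(hourly)  # what a long-term graph plots
daily_max = max(hourly)                # what we actually need to watch

print(f"average: {daily_avg:.2f}")  # 0.62 - not _that_ frightening
print(f"peak:    {daily_max:.2f}")  # 0.95 - very frightening
```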

Then... we had a _sharp_ CPU usage increase three weeks ago. That
still needs investigation - but it could be anything, from some evil
metatemplate introduced on a major wiki (maybe the {{cite}} stuff
changed? :) to some bot hitting slower code paths, to simply bad code.

So, from operations perspective, it would be really nice to have:

a) A long-term data collection of maximum CPU load values (uhm, say,  
maximum hourly averages).
b) Graphing / longterm data collection for profiling points
c) Ability to profile template costs/impacts apart from general Parser  
profiling.
d) Alarm when we hit something above a threshold. We have to notice  
that within an hour, not within days. :)
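For (a) and (d), a minimal sketch of the shape of the thing - keep a
rolling history of hourly maxima and alert over a threshold. All names
and the 0.85 threshold here are made up, and alert() is a placeholder:

```python
from collections import deque

THRESHOLD = 0.85          # made-up alert level
HISTORY_HOURS = 24 * 365  # keep a year of hourly maxima

hourly_maxima = deque(maxlen=HISTORY_HOURS)

def alert(msg):
    # Placeholder - in practice this would page someone or mail the list.
    print("ALARM:", msg)

def record_hour(samples):
    """Store the max CPU load seen this hour; alarm if over threshold."""
    peak = max(samples)
    hourly_maxima.append(peak)
    if peak > THRESHOLD:
        alert(f"hourly CPU peak {peak:.2f} above {THRESHOLD}")
```

The deque keeps long-term maxima bounded in memory, so graphing a
year of peak values stays cheap.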

Or of course, being more attentive helps too.

Cheers,
-- 
Domas Mituzas -- http://dammit.lt/ -- [[user:midom]]
_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l