Hi all,

There was some downtime this morning (about 15 minutes), triggered by
pushing a software update.

I pushed a software update, and nagios immediately started reporting
servers crashing. Judging by ganglia, this was the familiar problem
where scap pushes a few 4-CPU apaches into swap, which then crash and
come back a few minutes later. This time, however, a key memcached
node evidently fell over as well, causing a database overload that
left the site mostly inaccessible for about ten minutes.

I prepared to revert the software update, but determined that the
update itself was not the problem, and that another scap would only
exacerbate the issue. The problem then resolved itself.

We need to fix things up so the scap script is less liable to push  
machines into swap :)

--
Andrew Garrett
[email protected]
http://werdn.us/


_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
