Our database cannot keep up and we're trying to figure out why.
Symptoms: - Nagios is reporting too many processes, 100+ waiting to execute - Thousands of slow queries over 1 sec The facts: - We have a cluster of four mail servers running SpamAssassin (v3.2.4), and they all use one MySQL(v4.1.22) server (RHEL4.6) running InnoDB, which replicates to a back-up server in case the primary one goes down. - We are doing about 200 writes a second. - We are not CPU bound, memory, disk, or IO bound. Here is a typical slow query: # u...@host: readwrite[readwrite] @ [10.193.24.70] # Query_time: 6 Lock_time: 0 Rows_sent: 0 Rows_examined: 0 INSERT INTO bayes_token (id, token, spam_count, ham_count, atime) VALUES ('1','<C4><A2><E6><86>\"','1','0','1234301303') ON DUPLICATE KEY UPDATE spam_count = GREATEST(spam_count + '1', 0), ham_count = GREATEST(ham_count + '0', 0), atime = GREATEST(atime, '1234301303'); Occasionally I see a slow querry like: # u...@host: readwrite[readwrite] @ [10.193.24.67] # Query_time: 2 Lock_time: 0 Rows_sent: 0 Rows_examined: 0 UPDATE bayes_token SET atime = '1234376219' WHERE id = '1' AND token IN ('³^A~O}^P','6X;¡^C','ãC~_RÕ','XâÍ D','ÐIJ| Õ','~AA-¤¨','GD~^N~\','{z~M³^L','&~L^L6^N','Rlñ^U','¾Õ8õ^_','»á«| w','!ÇAåÝ','P3>~Cý','~L~Z^Ct~V','ÅÙ^F=','Ý~GÝÇä','g«\'^H~A','<pÕ~Q.','N²ì¬^X','Ö°~J^^«','å4ëIF','\ZøÍá¤','!,^OÑ^W','¸~F~_Ä~E','ù)=à','ðÌx¤ã','x^Uç¦Z','ïöÏÏ~M','²t¶¥~C','~\z¯)Í','~Càí~Lc','¤±~K×v','í~]ÅP¨','Ù9¹%Ê','ª~C¨kÆ','d?ºªF','znÞ¦ ','~Rù?~N!','~^KÂËÈ','2^VÉ9w','áÒ&¿Ô','À¸Þ~Xþ','¹Ñ^Kñß','Y\r~_~O^L','^YbL9K','W{ág.','^°fc·','¸÷~R^]ü','RèD^Ug','~]yºIÙ','¦^R.^ ','×^F9H~L','[~Q~I^Oí','× µÎö','~_hÞNÏ','VL>| !','~BÄ>)í','^\ô~J»ç','Ã8ÁËn','s/y[T','^]SÙ~Dô','dã~XU','¹r[~]%','~...@ñ ~Q÷','~Kº)Y~J','Ü^[^R^S^','ºzØèP','D~K^Oc~H','^...@w~v','*Õµ^TÒ','ÐÙ\nu!','Ë5ç\\<','èë,=è','~WGW~Eâ','úæE¼_','\0~U¨Çã','Bî~íå','ôY^T>~^','^KMp0~\','<Mvå[',']É=ÓÄ','\\CI^Pî','V¦ñâÒ','G(~Dû^O','·ä\"MÀ',':+~TÉO','ËÐ\'|¡','~J#Ràî','~FqDK[','\r`ª~D»','~]~Qª| ÿ','^...@ßÜ','´±95Û','±{^¸s','^\i g^?','£^\`T~P','Yì,÷Ë','µ~_~Iß-','~C~V~V^^/','4÷ËHJ','^DϪ¼^U','t~\Õº~Q','\"¹[ªÛ','K¨~Mé^Q',' ïò\0\0','^S~L/^R~J','B~Yw~YZ','\\áÍ~Xd','êaÙè¤','`^Vß(| ','J¾^^^Bÿ','<W~Qóp','âWЯÆ','k&(~Z^^','ë~HGÁJ','h^\ç~Cú','~M~Q[F^O','W^L=~^?','÷~M^L~_Y','vçeH~K','Á}ª^Qä','¿kÿër','~...@kû','\0^Uhc*','ÖyE7~^','~]\0¾4g','µÔû[~X','©\noÃ\0','P9ÐÖ[','O¦ä^U^D','R±èîç','©^_~YKá','^LP¦Õé','DÛ^NÍ£','A^HÇä','½x~Z^RÑ','%)~ClÈ','§»´Æ','^N¤Ä0Ð','¸+ü8Â','Ów±Ýl','ÝÌ*FT','/>D}w','n·h^U','©¼4^AÜ','cï~M%ª','~BÎÈ~Gä','{5ä~Yq','/üå~BØ','Û¡ö8(') AND atime < '1234376219'; Here is a dump magic to give you an idea of the number of records: 0.000 0 3 0 non-token data: bayes db version 0.000 0 23548001 0 non-token data: nspam 0.000 0 298117 0 non-token data: nham 0.000 0 5490277 0 non-token data: ntokens 0.000 0 1234333018 0 non-token data: oldest atime 0.000 0 1234376307 0 non-token data: newest atime 0.000 0 0 0 non-token data: last journal sync atime 0.000 0 1234376221 0 non-token data: last expiry atime 0.000 0 43200 0 non-token data: last expire atime delta 0.000 0 64446 0 non-token data: last expire reduction count I don't think we're poorly configure, nor do I think we're receiving a huge number of messages for a medium sized business (according to our Cacti graphs we usually get 1000-4000 messages every 5 minutes, sometimes spiking to 8,000 over 5 minutes). At the present rate of spam growth and the DB not being able to keep up, out mail system is going to keel over in the next week or two. We're still researching, but our DBA suspects the issue is with Spamassassin locking the tables and possibly poorly written SQL queries. Has anyone else had this or similar problems? If so what was your solution? Thanks, -Adam