Our database cannot keep up and we're trying to figure out why.

 

Symptoms:

-          Nagios is reporting too many processes, 100+ waiting to execute

-          Thousands of slow queries over 1 sec

 

The facts:

-          We have a cluster of four mail servers running SpamAssassin 
(v3.2.4), and they all use one MySQL(v4.1.22) server (RHEL4.6) running InnoDB, 
which replicates to a back-up server in case the primary one goes down.

-          We are doing about 200 writes a second.

-          We are not CPU bound, memory, disk, or IO bound. 

 

Here is a typical slow query:

# u...@host: readwrite[readwrite] @  [10.193.24.70]

# Query_time: 6  Lock_time: 0  Rows_sent: 0  Rows_examined: 0

INSERT INTO bayes_token

               (id, token, spam_count, ham_count, atime)

               VALUES ('1','<C4><A2><E6><86>\"','1','0','1234301303')

               ON DUPLICATE KEY UPDATE spam_count = GREATEST(spam_count + '1', 
0),

                                       ham_count = GREATEST(ham_count + '0', 0),

                                       atime = GREATEST(atime, '1234301303');

 

Occasionally I see a slow querry like:

# u...@host: readwrite[readwrite] @  [10.193.24.67]

# Query_time: 2  Lock_time: 0  Rows_sent: 0  Rows_examined: 0

UPDATE bayes_token SET atime = '1234376219' WHERE id = '1' AND token IN 
('³^A~O}^P','6X;¡^C','ãC~_RÕ','XâÍ D','ÐIJ| 
Õ','~AA-¤¨','GD~^N~\','{z~M³^L','&~L^L6^N','Rlñ­^U','¾Õ8õ^_','»á«| 
w','!ÇAåÝ','P3>~Cý','~L~Z^Ct~V','­ÅÙ^F=','Ý~GÝÇä','g«\'^H~A','<pÕ~Q.','N²ì¬^X','Ö°~J^^«','å4ëIF','\ZøÍá¤','!,^OÑ^W','¸~F~_Ä~E','ù)=­à','ðÌx¤ã','x^Uç¦Z','ïöÏÏ~M','²t¶¥~C','~\z¯)Í','~Càí~Lc','¤±~K×v','í~]ÅP¨','Ù9¹%Ê','ª~C¨kÆ','d?ºªF','znÞ¦
    
','~Rù?~N!','~^KÂËÈ','2^VÉ9w','áÒ&¿Ô','À¸Þ~Xþ','¹Ñ^Kñß','Y\r~_~O^L','^YbL9K','W{ág.','^°fc·','¸÷~R^]ü','RèD^Ug','~]yºIÙ','¦^R.^
 ','×^F9H~L','[~Q~I^Oí','× µÎö','~_hÞNÏ','VL>| 
!','~BÄ>)í','^\ô~J»ç','Ã8ÁËn','s/y[T','^]SÙ~Dô','dã­~XU','¹r[~]%','~...@ñ    
~Q÷','~Kº)Y~J','Ü^[^R^S^','ºzØèP','D~K^Oc~H','^...@w~v','*Õµ^TÒ','ÐÙ\nu!','Ë5ç\\<','èë,=è','~WGW~Eâ','úæE¼_','\0~U¨Çã','Bî~íå','ôY^T>~^','^KMp0~\','<Mvå[',']É=ÓÄ','\\CI^Pî','V¦ñâÒ','G(~Dû^O','·ä\"MÀ',':+~TÉO','ËÐ\'|¡','~J#Ràî','~FqDK[','\r`ª~D»','~]~Qª|
 ÿ','^...@ßÜ','´±95Û','±{^¸s','^\i  
g^?','£^\`T~P','Yì,÷Ë','µ~_~Iß-','~C~V~V^^/','4÷ËHJ','^DϪ¼^U','t~\Õº~Q','\"¹[ªÛ','K¨~Mé^Q','
   ïò\0\0','^S~L/^R~J','B~Yw~YZ','\\áÍ~Xd','êaÙè¤','`^Vß(| 
','J¾^^^Bÿ','<W~Qóp','âWЯÆ','k&(~Z^^','ë~HGÁJ','h^\ç~Cú','~M~Q[F^O','W^L=~^?','÷~M^L~_Y','vçeH~K','Á}ª^Qä','¿kÿër','~...@kû','\0^Uhc*','ÖyE7~^','~]\0¾4g','µÔû[~X','©\noÃ\0','P9ÐÖ[','O¦ä^U^D','R±èîç','©^_~YKá','^LP¦Õé','DÛ^NÍ£','A­^HÇä','½x~Z^RÑ','%)~ClÈ','§­»´Æ','^N¤Ä0Ð','¸+ü8Â','Ów±Ýl','ÝÌ*FT','/>D}w','n·­h^U','©¼4^AÜ','cï~M%ª','~BÎÈ~Gä','{5ä~Yq','/üå~BØ','Û¡ö8(')
 AND atime < '1234376219';

 

 

Here is a dump magic to give you an idea of the number of records:

0.000          0          3          0  non-token data: bayes db version

0.000          0   23548001          0  non-token data: nspam

0.000          0     298117          0  non-token data: nham

0.000          0    5490277          0  non-token data: ntokens

0.000          0 1234333018          0  non-token data: oldest atime

0.000          0 1234376307          0  non-token data: newest atime

0.000          0          0          0  non-token data: last journal sync atime

0.000          0 1234376221          0  non-token data: last expiry atime

0.000          0      43200          0  non-token data: last expire atime delta

0.000          0      64446          0  non-token data: last expire reduction 
count

 

 

I don't think we're poorly configure, nor do I think we're receiving a huge 
number of messages for a medium sized business (according to our Cacti graphs 
we usually get 1000-4000 messages every 5 minutes, sometimes spiking to 8,000 
over 5 minutes).

 

At the present rate of spam growth and the DB not being able to keep up, out 
mail system is going to keel over in the next week or two.

 

We're still researching, but our DBA suspects the issue is with Spamassassin 
locking the tables and possibly poorly written SQL queries.

 

Has anyone else had this or similar problems? If so what was your solution?

 

Thanks,

-Adam

Reply via email to