Well the simplest fix is the one that I did not implement in the first place, using "ON DUPLICATE KEY". However, I did not implement that because of its only being in version 4.1 of MySQL and I still use Debian stable for most production machines, which runs 4.0.x.

Anyway, I will poke at it some more.

- James


Jim C. Nasby wrote:
On Tue, Dec 27, 2005 at 09:33:11PM -0500, James Keating wrote:

Indeed. My thinking behind storing both system totals ($TOTALS) and user totals in the database was for easy error checking. Each day you can quickly/easily run through the database and look for potential errors (possibly resulting from the race condition that you mentioned). However, I have not seen any miscalculations yet.


Well, there won't be any errors with an ACID database (ie: not MyISAM)
and if the race condition is handled properly.


Regarding the race condition, I would be more than happy to fix it, and any input would be wonderful on that. I will poke around and see what I can come up with on my own.


I could give you a fix if it was PostgreSQL, but I'm not really a MySQL
person.


Jim C. Nasby wrote:

Neat plugin. I have two comments:

I wouldn't store $TOTALS or the total column in the database, as both
can easily be calculated when retrieving the data.

There is a race condition, especially for $TOTALS. First you check for
existence, then you try and do an insert or an update based on that. If
more than one SA thread is running, you can end up with duplicated
entries. I think MySQL has a merge command that would be more
appropriate.

Also, it appears that this would work on PostgreSQL with minimal
modifications. If anyone's interested in that I'd be happy to help.

On Mon, Dec 26, 2005 at 11:03:45AM -0500, James Keating wrote:


Morning All,

I have released a real-time stats collection plugin (http://wiki.apache.org/spamassassin/StatsPlugin)for SpamAssassin. The plugin collects data into a MySQL DB based on the date and username that SpamAssassin was invoked with. This creates a single row for each user, each day. Also, a row for the total number of messages for that day is created. Each entry contains the date, user name, domain name, number of hams, number of spams and total number of messages processed for that user (or for the system). This is an example of what you would see in your table -


day             username        domain          spam    ham     total
2005-12-26      $TOTALS         $TOTALS         106     229     335
2005-12-26      user1           domain.com      11      0       11
2005-12-26      user2           domain.com      25      15      40
2005-12-26      user3           domain.com      27      3       30
2005-12-26      user4           domain.com      0       24      24
2005-12-26      user5           domain.com      1       0       1
2005-12-26      user6           domain.com      0       1       1
etc...

A new entry for each user is created each day, so there is historical information for as long as you want to make it. As I said before this is collected in real-time, so when a new message is processed by SpamAssassin the numbers are automatically increased for an existing entry or a new entry is made. This elevates the need to continually have scripts parse log files.

As always I cannot guarantee that this will work with your installation/configuration of SpamAssassin or that it will not make your system explode, but it has worked for me in a production environment for quiet some time. Also, this can only be used with SpamAssassin 3.1+. The code and more details about installing/configuring it can be seen on the SpamAssassin wiki - http://wiki.apache.org/spamassassin/StatsPlugin

Let me know if you run into problems or have suggestions for modifications.

- James




Reply via email to