Well the simplest fix is the one that I did not implement in the first
place, using "ON DUPLICATE KEY". However, I did not implement that
because of its only being in version 4.1 of MySQL and I still use Debian
stable for most production machines, which runs 4.0.x.
Anyway, I will poke at it some more.
- James
Jim C. Nasby wrote:
On Tue, Dec 27, 2005 at 09:33:11PM -0500, James Keating wrote:
Indeed. My thinking behind storing both system totals ($TOTALS) and user
totals in the database was for easy error checking. Each day you can
quickly/easily run through the database and look for potential errors
(possibly resulting from the race condition that you mentioned).
However, I have not seen any miscalculations yet.
Well, there won't be any errors with an ACID database (ie: not MyISAM)
and if the race condition is handled properly.
Regarding the race condition, I would be more than happy to fix it, and
any input would be wonderful on that. I will poke around and see what I
can come up with on my own.
I could give you a fix if it was PostgreSQL, but I'm not really a MySQL
person.
Jim C. Nasby wrote:
Neat plugin. I have two comments:
I wouldn't store $TOTALS or the total column in the database, as both
can easily be calculated when retrieving the data.
There is a race condition, especially for $TOTALS. First you check for
existence, then you try and do an insert or an update based on that. If
more than one SA thread is running, you can end up with duplicated
entries. I think MySQL has a merge command that would be more
appropriate.
Also, it appears that this would work on PostgreSQL with minimal
modifications. If anyone's interested in that I'd be happy to help.
On Mon, Dec 26, 2005 at 11:03:45AM -0500, James Keating wrote:
Morning All,
I have released a real-time stats collection plugin
(http://wiki.apache.org/spamassassin/StatsPlugin)for SpamAssassin.
The plugin collects data into a MySQL DB based on the date and username
that SpamAssassin was invoked with. This creates a single row for each
user, each day. Also, a row for the total number of messages for that
day is created. Each entry contains the date, user name, domain name,
number of hams, number of spams and total number of messages processed
for that user (or for the system). This is an example of what you would
see in your table -
day username domain spam ham total
2005-12-26 $TOTALS $TOTALS 106 229 335
2005-12-26 user1 domain.com 11 0 11
2005-12-26 user2 domain.com 25 15 40
2005-12-26 user3 domain.com 27 3 30
2005-12-26 user4 domain.com 0 24 24
2005-12-26 user5 domain.com 1 0 1
2005-12-26 user6 domain.com 0 1 1
etc...
A new entry for each user is created each day, so there is historical
information for as long as you want to make it. As I said before this
is collected in real-time, so when a new message is processed by
SpamAssassin the numbers are automatically increased for an existing
entry or a new entry is made. This elevates the need to continually
have scripts parse log files.
As always I cannot guarantee that this will work with your
installation/configuration of SpamAssassin or that it will not make your
system explode, but it has worked for me in a production environment for
quiet some time. Also, this can only be used with SpamAssassin 3.1+.
The code and more details about installing/configuring it can be seen on
the SpamAssassin wiki - http://wiki.apache.org/spamassassin/StatsPlugin
Let me know if you run into problems or have suggestions for
modifications.
- James