Just to see what the effects would be, I did some crude patches to make the following changes to Bayes SQL:
In the bayes_token table I changed username from a VARCHAR to an INT, with the idea that we could use the user's uid instead of the username to identify the user.
Also in the bayes_token table I changed token from VARCHAR to BIGINT, and patched SQL.pm to convert each token string into the low order 15 hex digits of the SHA-1 hash of the string. By putting a "0x" in front of that value in the SELECT, MySQL treats the string as a 64-bit integer even though perl itself doesn't support 64-bit integers.
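For concreteness, here is a minimal sketch of the kind of helper that conversion needs; the name hash_token and the details are mine, not the actual patch:

    use Digest::SHA1 qw(sha1_hex);

    # Map a token string to the low-order 15 hex digits of its SHA-1
    # hash.  Written into the SQL text with a "0x" prefix, MySQL treats
    # the value as a 64-bit integer, so perl never has to handle it as
    # a number itself.
    sub hash_token {
        my ($token) = @_;
        return substr(sha1_hex($token), -15);
    }

    # e.g.  my $sql_literal = "0x" . hash_token($token);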
Those two changes leave the bayes_token table with no variable-length fields, which according to the MySQL documentation makes access more efficient. It also reduces the database size quite a bit.
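Roughly what the type changes amount to, assuming the stock MySQL bayes_token schema and a freshly retrained database (the connection details are placeholders, and this is not a data migration -- existing VARCHAR tokens would have to be rehashed, not just retyped):

    use DBI;

    my $dbh = DBI->connect("DBI:mysql:spamassassin", "sa_user", "sa_pass",
                           { RaiseError => 1 });

    # username holds the numeric uid, token holds 15 hex digits of the
    # SHA-1 hash, so every column in bayes_token is now fixed length.
    $dbh->do("ALTER TABLE bayes_token MODIFY username INT NOT NULL DEFAULT 0");
    $dbh->do("ALTER TABLE bayes_token MODIFY token BIGINT NOT NULL DEFAULT 0");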
I then added a tok_get_all routine to SQL.pm that uses a single SELECT ... FROM bayes_token WHERE ... token IN ( ... ) to get all the information from the database at once instead of issuing a separate SELECT query for each token.
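A rough sketch of that idea, assuming the stock spam_count/ham_count/atime columns; the actual tok_get_all in the patch differs in detail:

    use DBI;
    use Digest::SHA1 qw(sha1_hex);

    # One SELECT with an IN (...) list of hashed tokens instead of a
    # separate query per token.
    sub tok_get_all_sketch {
        my ($dbh, $uid, @tokens) = @_;
        return {} unless @tokens;

        # Each list entry is "0x" plus the low-order 15 hex digits of
        # the token's SHA-1 hash, so MySQL compares it directly against
        # the BIGINT token column.
        my $in_list = join(",", map { "0x" . substr(sha1_hex($_), -15) } @tokens);

        my $sth = $dbh->prepare(
            "SELECT token, spam_count, ham_count, atime
               FROM bayes_token
              WHERE username = ? AND token IN ($in_list)");
        $sth->execute($uid);

        my %found;
        while (my ($tok, $spam, $ham, $atime) = $sth->fetchrow_array()) {
            $found{$tok} = [ $spam, $ham, $atime ];
        }
        $sth->finish();
        return \%found;
    }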
I tested this by training Bayes on approximately 1000 ham and 1500 spam, which resulted in about 150,000 tokens, then running 1000 other spam messages through spamc with no network tests and autolearning off.
There was no appreciable change in the time for sa-learn, but of course the database is smaller with the smaller fixed fields.
The baseline test of running 1000 spams through spamc took about 25 minutes on my machine.
After I changed the username and token fields to the integer formats, it took about 17 minutes.
After I then added the SELECT ... token IN (...) query, it went down to 14 minutes.
When I turned off Bayes completely, running the same messages through spamc took 5 minutes.
So it looks like, if we are willing to sacrifice being able to see the tokens in readable form when someone dumps the Bayes database, we can make things about twice as fast and the database a lot smaller.
-- sidney
