I've run some more careful tests comparing the following:
I changed username in the tokens table from varchar to INT and used the bayes override username config option to force it to my unix UID. This was without any change to the code, only to the schema.
I enabled DELAYED_KEY_WRITE.
The test consists of emptying the Bayes database, training it on 960 ham and 1600 spam, then timing five consecutive runs of feeding one set of 1000 spam messages to spamc while running spamd -L.
The variables in the testing were
1a. Take the low order 64 bits of the SHA1 hash of the tokens as a hex string and store it as a BIGINT field
or
2a. Take the low order 40 bits of the SHA1 hash of the tokens and store it as a CHAR(5) BINARY field
The other variable was
1b. Get token values from the database one at a time using the tok_get function
2b. Get token values for a message all at once using a tok_get_all function that looks up all of them using one SELECT ... token IN (?,?,...)
I fond that it is easy to work with arbitrary binary data as CHAR fields if you allow prepare or prepare-cache to handle the quoting when it matches the data to the wildcard ? characters.
The results show not much difference in speed between the BIGINT and CHAR(5) BINARY cases. BIGINT is slightly faster, but requires about 15% more space in the db.
Using tok_get_all does make a difference in speed.
The approximate average times for the tests were 25 minutes using unmodified SpamAssassin, 17.5 minutes with the fixed length fields and tok_get, and 15.5 minutes with fixed length fields and tok_get_all
I also did one run with updating of atime disabled and found almost no difference in speed. I do think that there is another optimization to be dome with that anyway: Does anyone see any reason to expire tokens based on atime with a granularity of less than one day rather than the seconds that we are using now? If atime is in days rather than seconds we could use a two byte in for it instead of four bytes, saving two bytes per token in the database and not updating as often.
I'll create a bugzilla ticket for this and post my test code as a patch when I get a chance to.
-- sidney
