Do you have a benchmark script?
Not really. I initialize an empty database by running sa-learn on a folder of 967 ham message files (Maildir format) and then running sa-learn on another folder of 1641 spam message files.
After that, with bayes_auto_learn 0 in user_prefs, I run spamd -L and just do

    time for i in directory_of_another_1000_spams/* ; do spamc -R < "$i" | grep BAYES ; echo "$i" ; done
Oh, I just thought of something... Is there an optimization that skips updating a token's atime when the increment is smaller than some amount, like an hour or a day? If there is, that would explain why the first run-through after I initialize the database takes longer than all the others. If we don't do that, we should; it would save most of the updates on the most frequent significant tokens.
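Something along these lines in the store's atime-update path is what I have in mind. This is only a sketch; the accessor/mutator names and the granularity constant are illustrative, not the actual BayesStore API:

    # Hypothetical sketch of the idea; get_token_atime/set_token_atime
    # are assumed names, not real BayesStore methods.
    use constant ATIME_GRANULARITY => 60 * 60;   # e.g. one hour

    sub maybe_update_atime {
        my ($self, $token, $new_atime) = @_;
        my $old_atime = $self->get_token_atime($token);  # assumed accessor
        # Skip the write when the increment is below the granularity;
        # the most frequent tokens then avoid most of their updates.
        return if defined $old_atime
               && ($new_atime - $old_atime) < ATIME_GRANULARITY;
        $self->set_token_atime($token, $new_atime);      # assumed mutator
    }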
I think we might want to see test results without the tok_get_all change, taking small steps so we can see exactly where the performance gains come from.
Yes, my plan was that after I got the code working I would wrap the changes in config options to make testing different combinations easier, and then systematically get some numbers. I just ran one test as I implemented each change, and reported those early results. But since I lack patience :-), now that I see the results might have been an artifact, as soon as I get the chance I will comment out the call to tok_get_all, uncomment the version that calls tok_get for each token, and see what repeated runs of that do with the fixed-length fields.
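For example, a gate along these lines would make flipping between the two paths a config change instead of commenting code in and out. The option name is hypothetical and the return shapes are assumed for the sketch:

    # Hypothetical config gate; "bayes_use_tok_get_all" is an
    # illustrative option name, not an existing setting.
    my @tokinfo;
    if ($self->{conf}->{bayes_use_tok_get_all}) {
        # one batched query for the whole uniquified token list
        @tokinfo = $self->{store}->tok_get_all(@tokens);
    }
    else {
        # one query per token, the original code path
        @tokinfo = map { [ $_, $self->{store}->tok_get($_) ] } @tokens;
    }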
Keep it at 0. The MySQL query cache works differently from the caching I'm used to with Oracle, so it doesn't matter for MySQL.
Ok, reading about it in the MySQL docs I see that it only benefits repeated identical queries. A single message generates no repeated queries, since the token array is uniquified, and in a production environment you would probably fill the cache with queries for different usernames before seeing any repeats. So it would not be worth it.
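To make that concrete, the per-message lookup is roughly of this shape (table, column, and variable names here are illustrative, not the actual SQL schema). The query cache only matches byte-identical statement text, and the statement text varies with every message's token list, so it essentially never hits:

    use DBI;

    # Illustrative sketch only; schema and names are assumptions. Real
    # code would use placeholders, but the point stands either way: the
    # statement text differs for every message, so the query cache,
    # which matches byte-identical statements, never gets a hit.
    my $dbh = DBI->connect("dbi:mysql:database=bayes", "user", "pass");
    my $username = "someone";
    my @uniquified_tokens = qw(viagra mortgage refinance);  # per-message
    my $in_list = join(",", map { $dbh->quote($_) } @uniquified_tokens);
    my $sql = "SELECT token, spam_count, ham_count, atime"
            . " FROM bayes_token"
            . " WHERE username = " . $dbh->quote($username)
            . " AND token IN ($in_list)";
    my $rows = $dbh->selectall_arrayref($sql);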
-- sidney
