3) Using IN is problematic because you have to limit the number of elements in the list, at least in Oracle; I'm not sure about other databases. Also, building up random dynamic SELECT queries like this can completely blow out your query cache. I suggest that if we do this, we cap the list at some number X and issue as many queries of at most X elements as it takes to get all of the data. It will take some experimentation to determine the optimal value for X.
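The chunking idea above could look something like this minimal Python/sqlite3 sketch (the `bayes_token` table and column names are hypothetical, and the real code would be Perl/DBI, but the batching logic is the same):

```python
# Cap the number of IN-list placeholders per SELECT at some limit X and
# issue as many queries as needed to cover every token.
import sqlite3

CHUNK_SIZE = 100  # the "X" above; tune experimentally per database

def select_tokens(conn, tokens, chunk_size=CHUNK_SIZE):
    """Fetch (token, spam_count, ham_count) rows in batches of chunk_size."""
    rows = []
    cur = conn.cursor()
    for i in range(0, len(tokens), chunk_size):
        chunk = tokens[i:i + chunk_size]
        placeholders = ",".join("?" * len(chunk))
        cur.execute(
            "SELECT token, spam_count, ham_count FROM bayes_token "
            "WHERE token IN (%s)" % placeholders,
            chunk,
        )
        rows.extend(cur.fetchall())
    return rows
```

Because every chunk reuses the same statement shape for a given chunk size, the database sees far fewer distinct query strings than with one arbitrary-length IN list per message.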
That brings up an important point. The query is constructed from a parsing of the email, which may be carefully constructed by a spammer who will know what will be done with it. The average email in my corpus has 262 tokens. What would happen if a spammer creates something with very many unique short tokens that totals to less than the 256kbyte message length limit we usually use for what to scan? For that matter, what would it do to performance if a spammer included 20,000 random four character words in an invisible font? That's another argument for attempting to avoid processing invisible "bayes poison" words.
Speaking of poison, does the method we use to insert the token into the SELECT query string protect against SQL injection attacks, by making sure that the token cannot contain special characters that would alter the query string? Reading the documentation on DBI's prepare_cached method, it appears that values bound to placeholders are quoted automatically, but does that take care of all SQL injection issues?
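To illustrate the distinction, here is a small Python/sqlite3 sketch of placeholder binding versus string interpolation; DBI's `?` placeholders (with prepare_cached or plain prepare) behave the same way, in that the token is passed to the driver as data and never becomes part of the SQL text:

```python
import sqlite3

def count_token(conn, token):
    # Safe: the token is a bound parameter, never spliced into the SQL
    # string itself, so quotes inside it cannot change the query.
    cur = conn.execute(
        "SELECT COUNT(*) FROM bayes_token WHERE token = ?", (token,)
    )
    return cur.fetchone()[0]

# The unsafe alternative would be interpolating the token directly, e.g.
#   conn.execute("... WHERE token = '%s'" % token)
# where a token like  x' OR '1'='1  changes the query's meaning.
```

Placeholders close off injection through the token value itself, though they do nothing about other avenues (e.g. table or column names built from untrusted input), which is why the "all SQL injection issues" question is still worth asking.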
4) SELECT is fast; it's UPDATE that is slow.
That's one reason I would like to see something done to separate scanning and training. Perhaps there is nothing that has to be done... As long as auto-learning is optional, it can be disabled when Bayes is run for scoring, and training can be done separately. In fact, high-volume installations can use train-on-error, so the training volume will be very low. If SELECT is so much faster than UPDATE, that will work out perfectly.
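One way the split could look, sketched in Python/sqlite3 (hypothetical schema, not the actual SpamAssassin code): scanning only issues SELECTs, while a separate training path groups all of a message's per-token updates into a single transaction, so the slow UPDATE cost is paid once per trained message rather than once per token.

```python
import sqlite3

def train(conn, tokens, is_spam):
    """Record one message's tokens; one transaction per message."""
    col = "spam_count" if is_spam else "ham_count"
    with conn:  # commit (or roll back) the whole message atomically
        for tok in tokens:
            updated = conn.execute(
                "UPDATE bayes_token SET %s = %s + 1 WHERE token = ?"
                % (col, col),
                (tok,),
            ).rowcount
            if not updated:  # token not seen before: insert it
                conn.execute(
                    "INSERT INTO bayes_token (token, spam_count, ham_count) "
                    "VALUES (?, ?, ?)",
                    (tok, int(is_spam), int(not is_spam)),
                )
```

With train-on-error, this path runs rarely, so even a slow UPDATE stays off the per-message scanning cost.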
-- sidney
