3) Using IN is problematic because you have to limit the number of elements in the list, at least in Oracle; I'm not sure about other databases. Also, building up random dynamic SELECT queries like this can completely blow out your query cache. I suggest that if we do this, we cap the list at some number X and issue as many queries of at most X elements as it takes to get all of the data. It will take some experimentation to determine the optimal value for X.
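The chunking idea above could look something like this minimal Python/sqlite3 sketch (the `bayes_token` table and column names are hypothetical, and the real code would be Perl/DBI, but the batching logic is the same):

```python
# Cap the number of IN-list placeholders per SELECT at some limit X and
# issue as many queries as needed to cover every token.
import sqlite3

CHUNK_SIZE = 100  # the "X" above; tune experimentally per database

def select_tokens(conn, tokens, chunk_size=CHUNK_SIZE):
    """Fetch (token, spam_count, ham_count) rows in batches of chunk_size."""
    rows = []
    cur = conn.cursor()
    for i in range(0, len(tokens), chunk_size):
        chunk = tokens[i:i + chunk_size]
        placeholders = ",".join("?" * len(chunk))
        cur.execute(
            "SELECT token, spam_count, ham_count FROM bayes_token "
            "WHERE token IN (%s)" % placeholders,
            chunk,
        )
        rows.extend(cur.fetchall())
    return rows
```

Because every chunk reuses the same statement shape for a given chunk size, the database sees far fewer distinct query strings than with one arbitrary-length IN list per message.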
That brings up an important point. The query is constructed from a parsing of the email, which may be carefully constructed by a spammer who will know what will be done with it. The average email in my corpus has 262 tokens. What would happen if a spammer creates something with very many unique short tokens that totals to less than the 256kbyte message length limit we usually use for what to scan? For that matter, what would it do to performance if a spammer included 20,000 random four character words in an invisible font? That's another argument for attempting to avoid processing invisible "bayes poison" words.
Speaking of poison, does the method we use to insert the token into the SELECT query string protect against SQL injection attacks, by making sure that the token cannot contain special characters that would alter the query string? Reading the documentation on DBI's prepare_cached method, it appears that values bound to placeholders are quoted automatically, but does that take care of all SQL injection issues?
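To illustrate the distinction, here is a small Python/sqlite3 sketch of placeholder binding versus string interpolation; DBI's `?` placeholders (with prepare_cached or plain prepare) behave the same way, in that the token is passed to the driver as data and never becomes part of the SQL text:

```python
import sqlite3

def count_token(conn, token):
    # Safe: the token is a bound parameter, never spliced into the SQL
    # string itself, so quotes inside it cannot change the query.
    cur = conn.execute(
        "SELECT COUNT(*) FROM bayes_token WHERE token = ?", (token,)
    )
    return cur.fetchone()[0]

# The unsafe alternative would be interpolating the token directly, e.g.
#   conn.execute("... WHERE token = '%s'" % token)
# where a token like  x' OR '1'='1  changes the query's meaning.
```

Placeholders close off injection through the token value itself, though they do nothing about other avenues (e.g. table or column names built from untrusted input), which is why the "all SQL injection issues" question is still worth asking.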
4) SELECT is fast; it's UPDATE that is slow.
That's one reason I would like to see something done to separate scanning and training. Perhaps there is nothing that has to be done... As long as auto-learning is optional, it can be disabled when Bayes is run for scoring, and training can be done separately. In fact, high-volume installations can use train-on-error, so the training volume will be very low. If SELECT is so much faster than UPDATE, that will work out perfectly.
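One way the split could look, sketched in Python/sqlite3 (hypothetical schema, not the actual SpamAssassin code): scanning only issues SELECTs, while a separate training path groups all of a message's per-token updates into a single transaction, so the slow UPDATE cost is paid once per trained message rather than once per token.

```python
import sqlite3

def train(conn, tokens, is_spam):
    """Record one message's tokens; one transaction per message."""
    col = "spam_count" if is_spam else "ham_count"
    with conn:  # commit (or roll back) the whole message atomically
        for tok in tokens:
            updated = conn.execute(
                "UPDATE bayes_token SET %s = %s + 1 WHERE token = ?"
                % (col, col),
                (tok,),
            ).rowcount
            if not updated:  # token not seen before: insert it
                conn.execute(
                    "INSERT INTO bayes_token (token, spam_count, ham_count) "
                    "VALUES (?, ?, ?)",
                    (tok, int(is_spam), int(not is_spam)),
                )
```

With train-on-error, this path runs rarely, so even a slow UPDATE stays off the per-message scanning cost.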
-- sidney
