Implementing a "chi-square-based spam filter" - asking Gary Robinson's advice

Vincenzo Gianferrari Pini Sun, 17 Sep 2006 04:45:01 -0700

Dear Gary,

At the Apache James Server Project (http://james.apache.org) we have implemented a Java-based Bayesian anti-spam filter following Paul Graham's approach (both the original one, http://paulgraham.com/spam.html, and the enhanced one, http://paulgraham.com/better.html). It is available in the new 2.3.0 release that we are publishing these days.

For the next release, we would like to implement the "chi-square-based spam filter" approach described in your "Handling Redundancy in Email Token Probabilities" paper (http://garyrob.blogs.com//handlingtokenredundancy94.pdf). To do that we need to understand a few points: could you help and advise us?

I'm CCing this email to the server-dev@james.apache.org list: could you include the list when you reply?

Our questions follow. I will refer explicitly to the terminology and formula numbers used in your paper mentioned above.


  1. Based on Paul Graham's approach, in computing b(w) and g(w) we use
     in the numerator of the formulas "the total count of occurrences
     of word w in the spam (ham) e-mails" instead of "the number of
     spam (ham) e-mails containing the word w" as you do. Paul's
     counters are persisted on disk, and there are already some users
     that have extensively trained their systems building their own
     "corpuses". It would be a pity not to be able to use such
     collected data when using your approach (we would like to simply
     add a configuration switch to our filters that optionally - or by
     default - activates your approach).
     In your blog
     (http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html)
     I found the following comment from you:

         "Note 2: In calculating p(w) Graham counts every instance of
         word w in every email in which it appears. Obviously, if a
         word appears once in an email, there is a greater probability
         that it will appear again in that same email than if it hadn't
         already appeared at all. So, the random variable is not really
         independent under Graham's technique which is one reason why,
         in the description above, we only count the first occurrence.
         However, we are pragmatic and whatever works best in practice
         is what we should do. There is some evidence at this point
         that using the Graham counting technique leads to slightly
         better results than using the "pure" technique above. This may
         because it is not ignoring any of the data. So, p(w) and n
         should simply be computed the way that gives the best results."

     This looks quite clear to me; can you confirm that we can use
     Paul's counters in computing b(w) and g(w), and that you "endorse"
     this as leading "to slightly better results" than the technique
     mentioned in your paper?

  2. In computing f(w) in formula (1), what do you suggest to use for
     the values of "s" and "x"? We will let them be configuration
     parameters as others, but we should use a sound default.

  3. In computing "H" and "S" in formulas (2) and (3), which definition
     of the "inverse chi-square function" and its related cdf (invC( )
     below) should we use? I found two definitions, for example in
     http://en.wikipedia.org/wiki/Inverse-chi-square_distribution

  4. Still in computing "H" and "S", how many degrees of freedom "n"
     should we use? I would assume *one*, since there are two values
     (ham and spam) minus one constraint, as their probabilities must
     sum to one (in which case question 3 above would be pointless).
     But I'm not sure at all: can you either confirm or give me a hint?

  5. We already have a Java routine that computes the chi-square cdf
     C(X,n). It seems to me that invC(X,n) could be computed from it
     with one of the following formulas (depending on the answer to
     question 3 above). Could you help me confirm this?
     invC(X,n) = 1 - C(1/X,n)
     invC(X,n) = 1 - C(n/X,n)

  6. What do you suggest as a practical default "cutoff" value for "I",
     as the default ESF value "y" in formula (6) for "H", and as the
     "y" for "S"? And what default for the "exclusion radius"?
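
As a concrete reference for questions 2 to 5, the following Java sketch
shows one possible reading of formulas (1) to (3). It rests on assumptions
that the questions above still leave open: that f(w) has the form
(s*x + n*p(w)) / (s + n), that the "inverse chi-square function" is the
upper-tail probability 1 - C(X, dof), that the degrees of freedom are twice
the number of tokens combined (as in Fisher's method), and that the
indicator is I = (1 + H - S) / 2. The class and method names, and the
values s = 1 and x = 0.5 used in main, are hypothetical placeholders and
not recommendations taken from the paper.

```java
final class ChiSquareSketch {

    // Assumed form of formula (1): p is p(w), n is the number of
    // training observations of w, s is the strength given to the
    // background value x.
    static double f(double p, int n, double s, double x) {
        return (s * x + n * p) / (s + n);
    }

    // Upper-tail chi-square probability, 1 - C(chi, dof), using the
    // closed form that holds when dof is even:
    //   exp(-chi/2) * sum_{i=0}^{dof/2 - 1} (chi/2)^i / i!
    static double upperTail(double chi, int dof) {
        if (dof <= 0 || dof % 2 != 0) {
            throw new IllegalArgumentException("dof must be positive and even");
        }
        double m = chi / 2.0;
        double term = Math.exp(-m); // underflows to 0 for very large chi
        double sum = term;
        for (int i = 1; i < dof / 2; i++) {
            term *= m / i;
            sum += term;
        }
        return Math.min(sum, 1.0); // guard against rounding above 1
    }

    // Combines the f(w) values of the selected tokens into the
    // indicator I, assuming every f(w) lies strictly between 0 and 1.
    static double indicator(double[] fw) {
        double lnProdF = 0.0;
        double lnProdNotF = 0.0;
        for (double v : fw) {
            lnProdF += Math.log(v);
            lnProdNotF += Math.log(1.0 - v);
        }
        int dof = 2 * fw.length;
        double h = upperTail(-2.0 * lnProdF, dof);
        double s = upperTail(-2.0 * lnProdNotF, dof);
        return (1.0 + h - s) / 2.0;
    }

    public static void main(String[] args) {
        // Placeholder inputs: three tokens with raw p(w) and counts n.
        double[] fw = {
            f(0.99, 20, 1.0, 0.5),
            f(0.95, 5, 1.0, 0.5),
            f(0.20, 1, 1.0, 0.5)
        };
        System.out.println("I = " + indicator(fw));
    }
}
```

Under this reading, a token never seen in training (n = 0) gets f(w) = x,
and a message whose tokens are all neutral (f(w) = 0.5) gets I = 0.5.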

I hope these are not too many questions :-) .

I look forward to your answer.

If you are interested we will keep you informed about our future progress.

Thank you.

Vincenzo Gianferrari Pini

