Hi Gary,
I had a look at your suggested links; they answered some of my
questions: see below.
Gary Robinson wrote:
Hi Vincenzo --
<snip/>
Bogofilter is the only one currently using the "handling redundancy"
extension to the original technique. (See
http://www.bgl.nu/bogofilter/esf.html for more info.) I think it's only
used optionally though; I think the default is the original technique
described here in my Linux Journal article:
http://www.linuxjournal.com/article/6467. It increases filtering
accuracy very significantly as you can see from that bogofilter link,
but does take a fair amount of cpu time to find the optimal parms, and
spending that time is crucial to make it work.
<snip/>
Anyway, those are thoughts that come to mind. I'll respond to your
specific question as soon as I can -- I've got a heavy meeting and
deadline schedule next week. Also, I'm not sure if you saw the Linux
Journal article mentioned above -- if you haven't you might want to
look at it. If so let me know if it resolves any of your questions.
Gary
Gary Robinson
CTO
Emergent Music, LLC
[EMAIL PROTECTED]
207-942-3463
Company: http://www.goombah.com
Blog: http://www.garyrobinson.net
On Sun, 17 Sep 2006 13:44:19 +0200, Vincenzo Gianferrari Pini wrote:
Dear Gary,
at the Apache James Server Project (http://james.apache.org) we have
implemented a (java based) bayesian anti-spam filter following Paul
Graham's approach (both the original one -
http://paulgraham.com/spam.html - and the enhanced one -
http://paulgraham.com/better.html). It is available in the new 2.3.0
version that we are releasing these days.
We would like, for the next release, to implement the
"chi-square-based spam filter" approach described in your "Handling
Redundancy in Email Token Probabilities" paper
(http://garyrob.blogs.com//handlingtokenredundancy94.pdf). But to do
that we need to understand a few points: can you help and advise us?
I'm CCing this email to the server-dev@james.apache.org list: could you
keep the list in CC when you reply?
Our questions follow. I will explicitly refer to the terminology
and formula numbers used in your above-mentioned paper.
1. Based on Paul Graham's approach, in computing b(w) and g(w) we use
in the numerator of the formulas "the total count of occurrences
of word w in the spam (ham) e-mails" instead of "the number of
spam (ham) e-mails containing the word w" as you do. Paul's
counters are persisted on disk, and there are already some users
who have extensively trained their systems, building their own
corpora. It would be a pity not to be able to use such
collected data when using your approach (we would like to simply
add a configuration switch to our filters that optionally - or by
default - activates your approach).
In your blog
(http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html)
I found the following comment from you:
"Note 2: In calculating p(w) Graham counts every instance of
word w in every email in which it appears. Obviously, if a
word appears once in an email, there is a greater probability
that it will appear again in that same email than if it hadn't
already appeared at all. So, the random variable is not really
independent under Graham's technique which is one reason why,
in the description above, we only count the first occurrence.
However, we are pragmatic and whatever works best in practice
is what we should do. There is some evidence at this point
that using the Graham counting technique leads to slightly
better results than using the "pure" technique above. This may be
because it is not ignoring any of the data. So, p(w) and n
should simply be computed the way that gives the best results."
This looks quite clear to me; can you then confirm that we can use
Paul's counters in computing b(w) and g(w), and that you "endorse"
it as leading "to slightly better results" than the technique
mentioned in your paper?
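To make the difference concrete, here is a minimal Java sketch of the two
counting strategies as we understand them (class and field names are ours,
purely for illustration):

    import java.util.*;

    class TokenCounters {
        private final Map<String, Integer> spamCounts = new HashMap<>();
        private final Map<String, Integer> hamCounts = new HashMap<>();
        private final boolean grahamCounting; // true = Paul's counters

        TokenCounters(boolean grahamCounting) {
            this.grahamCounting = grahamCounting;
        }

        // Graham-style: every occurrence of w in the message increments
        // the counter; "pure" style: at most one increment per message.
        void train(List<String> messageTokens, boolean isSpam) {
            Map<String, Integer> counts = isSpam ? spamCounts : hamCounts;
            Collection<String> tokens = grahamCounting
                    ? messageTokens                 // total occurrences
                    : new HashSet<>(messageTokens); // unique tokens only
            for (String w : tokens) {
                counts.merge(w, 1, Integer::sum);
            }
        }
    }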
2. In computing f(w) in formula (1), what values do you suggest for
"s" and "x"? We will make them configuration parameters like the
others, but we should use a sound default.
For "x" a good starting point might be 0.5, or the average p(w) for
words in the training set that have counts of 10 or more. For "s" see
question 6 below.
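For reference, formula (1) itself is straightforward to code; in the
sketch below the defaults s = 1.0 and x = 0.5 are just our working
assumptions, to be tuned, not values taken from your paper:

    // Formula (1): f(w) = (s*x + n*p(w)) / (s + n)
    //   pw : basic probability estimate p(w) for word w
    //   n  : number of data points we have for word w
    //   s  : strength given to the background belief x
    //   x  : assumed probability for a word we know nothing about
    static double f(double pw, double n, double s, double x) {
        return (s * x + n * pw) / (s + n);
    }

    // our working defaults (assumptions to be tuned):
    // double fw = f(pw, n, 1.0, 0.5);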
3. In computing "H" and "S" in formulas (2) and (3), which of the two
definitions of the "inverse chi-square function" and related cdf
(invC( ) below) should we use, among those that I found for example
in http://en.wikipedia.org/wiki/Inverse-chi-square_distribution?
In your "A Python Implementation of the Inverse Chi-Square Function"
(http://www.linuxjournal.com/articles/lj/0107/6467/6467s2.html) you show
for such "inverse" an algorithm that is simply equivalent to "one minus
the traditional chi-square cumulative distribution function":
invC(X,n) = 1.0 - C(X,n)
that simply is P(X < Chi2) = 1.0 - P(Chi2 <= X)
If this is the case, and I understood correctly, IMHO you should not use
the term "Inverse Chi-Square Function" in your papers, but simply the
term "Chi-Square Function", defining it as "P(Chi2 <= X)", and use
"1.0 - C(...)" in your formulas, as the word "inverse" has different
meanings and can be misleading (at least it was for me).
For example, see the Wikipedia definition above, and notice that in
Microsoft Excel the function CHIDIST(X,n) is equivalent to our
invC(X,n) above, while Excel's CHIINV(P,n) is quite different, meaning
X = CHIINV(CHIDIST(X,n),n)
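For what it's worth, here is a direct Java port of the Python routine from
your article; it computes exactly 1.0 - C(X,n), i.e. P(Chi2 >= X), and only
works for an even number of degrees of freedom:

    // Port of the article's Python routine: returns P(Chi2 >= x2)
    // with df degrees of freedom; df must be even.
    static double invC(double x2, int df) {
        if (df % 2 != 0) {
            throw new IllegalArgumentException("df must be even");
        }
        double m = x2 / 2.0;
        double term = Math.exp(-m);
        double sum = term;
        for (int i = 1; i < df / 2; i++) {
            term *= m / i;
            sum += term;
        }
        return Math.min(sum, 1.0);
    }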
4. Still in computing "H" and "S", how many degrees of freedom "n"
should we use? I would assume *one*: two values (ham and
spam) minus one constraint, as their probabilities must sum to
one (in this case question 3 above would be pointless). But I'm
not sure at all: can you either confirm this or give me a hint?
"n" is "the number of unique tokens in the message" (I would add "that
fall outside the exclusion radius (called 'mindev' by Greg)").
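Putting questions 3 and 4 together, the combining step would then look
roughly like the sketch below, using the invC() port above with 2n degrees
of freedom; the reading of formulas (2) and (3), and of the indicator
I = (1 + H - S) / 2, is our interpretation of the paper, to be confirmed:

    // fws = the f(w) values of the n unique tokens that fall outside
    // the exclusion radius (assumes n >= 1).
    // Low H suggests ham, low S suggests spam.
    static double[] combine(double[] fws) {
        int n = fws.length;
        double sumLnF = 0.0;
        double sumLn1mF = 0.0;
        for (double fw : fws) {
            sumLnF += Math.log(fw);
            sumLn1mF += Math.log(1.0 - fw);
        }
        double h = invC(-2.0 * sumLnF, 2 * n);    // formula (2)
        double s = invC(-2.0 * sumLn1mF, 2 * n);  // formula (3)
        double i = (1.0 + h - s) / 2.0;           // indicator: near 1 => spam
        return new double[] { h, s, i };
    }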
5. As we already have a Java routine that computes the
chi-square cdf C(X,n), I found that to compute invC(X,n)
we could use one of the following formulas (depending on the
outcome of question 3 above). Could you help me confirm
this?
invC(X,n) = 1 - C(1/X,n)
invC(X,n) = 1 - C(n/X,n)
This question disappears after the answer to question 3 above.
6. What do you suggest as a practical default "cutoff" value for "I",
as the default ESF value "y" in formula (6) for "H", and which "y"
for "S"? And which default for the "exclusion radius"?
Greg comes up with some good suggested values (including for "s"),
though they show that they depend quite a lot on the training set used,
which disturbs me a little: I would have expected to find some
invariance based on a common perception of "spamminess".
In addition, IMHO we should use an "S < 0.001" (or lower) constraint
to minimize false positives as much as possible. I think that the
ability to assess the risk of false positives is the most important
aspect of your whole chi-square approach.
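In code, the decision rule we have in mind would look roughly like this
(both thresholds configurable; the numbers below are placeholders, not
values suggested by you or by Greg):

    // classify as spam only when the indicator is high AND the
    // chi-square test gives overwhelming spam evidence (low S),
    // to keep the false-positive risk under explicit control
    static boolean isSpam(double i, double s) {
        final double CUTOFF = 0.9;   // cutoff on the indicator I
        final double MAX_S = 0.001;  // S constraint against false positives
        return i > CUTOFF && s < MAX_S;
    }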
I hope you do not find these questions too many :-) .
I'm looking forward to your answer.
If you are interested, we will keep you informed about our future progress.
Thank you.
Vincenzo Gianferrari Pini
What do you think about the above considerations?
Thanks,
Vincenzo