Just found the following in Gary Robinson's blog (http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html):

<snip>
Note 2: In calculating p(w) Graham counts every instance of word w in every email in which it appears. Obviously, if a word appears once in an email, there is a greater probability that it will appear again in that same email than if it hadn't already appeared at all. So, the random variable is not really independent under Graham's technique which is one reason why, in the description above, we only count the first occurrence. However, we are pragmatic and whatever works best in practice is what we should do. There is some evidence at this point that using the Graham counting technique leads to slightly better results than using the "pure" technique above. This may because it is not ignoring any of the data. So, p(w) and n should simply be computed the way that gives the best results.
</snip>

It means that Gary endorses using the Graham counting technique as leading "to slightly better results" than the technique used in his paper.

It means that we can (and should) use our counters while using the rest of his approach (my second - and straightforward - suggested approach in this thread).
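
To make the difference between the two counting schemes concrete, here is a minimal Python sketch. The function names and the toy messages are made up for illustration and are not taken from our code or from Gary's; the sketch only assumes that each message has already been split into a list of tokens.

```python
from collections import Counter

def count_all_occurrences(messages):
    """Graham-style counting: every instance of a token in every message."""
    counts = Counter()
    for tokens in messages:
        counts.update(tokens)
    return counts

def count_first_occurrence(messages):
    """'Pure' counting: a token counts at most once per message."""
    counts = Counter()
    for tokens in messages:
        counts.update(set(tokens))
    return counts

# Hypothetical, already-tokenized spam corpus (illustration only).
spam = [["free", "free", "offer"], ["free", "money"]]

graham_counts = count_all_occurrences(spam)
pure_counts = count_first_occurrence(spam)
```

With this toy corpus the two schemes differ only for "free", which appears twice in the first message.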

Do you agree, or am I missing something :-) ?

If all the above is true, I'll definitely start looking more closely in the near future at how to implement the chi-square technique described by Gary in his paper (http://garyrob.blogs.com//handlingtokenredundancy94.pdf). As I said, it should be relatively easy.
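
As a rough idea of what a chi-square combining step could look like, here is a minimal Python sketch of Fisher's method for combining per-token probabilities through the inverse chi-square function. This is an assumption about the general shape of the technique, not a transcription of the linked paper: the names chi2q and combine are hypothetical, and the per-token probabilities are assumed to lie strictly between 0 and 1.

```python
import math

def chi2q(x2, df):
    """Upper-tail probability of a chi-square value x2 with an even
    number of degrees of freedom df (closed-form series)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, df // 2):
        term *= m / i
        total += term
    # Guard against floating-point rounding pushing the sum above 1.
    return min(total, 1.0)

def combine(probs):
    """Combine per-token spam probabilities into one indicator in [0, 1].

    Assumes every probability is strictly between 0 and 1, so that
    both logarithms are defined. Values near 1 suggest spam, values
    near 0 suggest ham, values near 0.5 are inconclusive.
    """
    n = len(probs)
    h = chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    s = chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    return (1.0 + h - s) / 2.0
```

The per-token probabilities fed to combine could come from either counting scheme, which is why the choice of counters and the choice of combining step can be made independently.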

Perhaps, as you suggest, I could double-check with Gary.

Any advice is welcome!

Vincenzo

Noel J. Bergman wrote:

Vincenzo,

I have not looked deeply at the algorithms or implementation.  You and a few
others have spent considerable time at it.  Therefore I will defer to your
best judgement.

Since you've clarified a specific difference in the computations, does it
make sense to engage Gary Robinson and Paul Graham in discussion about the
difference(s) you've noted?

        --- Noel



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



