Just found the following in Gary Robinson's blog
(http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html):
<snip>
Note 2: In calculating p(w) Graham counts every instance of word w in
every email in which it appears. Obviously, if a word appears once in an
email, there is a greater probability that it will appear again in that
same email than if it hadn't already appeared at all. So, the random
variable is not really independent under Graham's technique which is one
reason why, in the description above, we only count the first
occurrence. However, we are pragmatic and whatever works best in
practice is what we should do. There is some evidence at this point that
using the Graham counting technique leads to slightly better results
than using the "pure" technique above. This may be because it is not
ignoring any of the data. So, p(w) and n should simply be computed the
way that gives the best results.
</snip>
So Gary endorses the Graham counting technique as leading "to slightly
better results" than the technique used in his paper. That means we can
(and should) keep using our counters together with the rest of his
approach (the second, more straightforward approach I suggested in this
thread).
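To make the difference between the two counting techniques concrete,
here is a minimal sketch in Python (chosen only for brevity). The
function names and the sample messages are made up for illustration;
they are not taken from our code or from Gary's description.

```python
from collections import Counter


def count_every_occurrence(messages):
    """Graham-style counting: every instance of a token in every message."""
    counts = Counter()
    for tokens in messages:
        counts.update(tokens)
    return counts


def count_first_occurrence(messages):
    """'Pure' counting: a token counts at most once per message."""
    counts = Counter()
    for tokens in messages:
        # set() drops repeats within a single message
        counts.update(set(tokens))
    return counts


# Hypothetical, already tokenized spam messages (illustration only).
spam = [
    ["free", "money", "free", "free"],
    ["free", "offer"],
]

every = count_every_occurrence(spam)
first = count_first_occurrence(spam)

print(every["free"], first["free"])
```

With these sample messages the Graham-style count for "free" is 4,
while the first-occurrence count is 2. Tokens that never repeat within
a message, such as "money", get the same count either way.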
Do you agree, or am I missing something :-) ?
If all of the above is true, I'll definitely start looking in the near
future at how to implement the chi-square technique described by Gary in
his paper (http://garyrob.blogs.com//handlingtokenredundancy94.pdf). As
I said, it should be relatively easy.
Perhaps, as you suggest, I could double-check with Gary.
Advice welcome!
Vincenzo
Noel J. Bergman wrote:
Vincenzo,
I have not looked deeply at the algorithms or implementation. You and a few
others have spent considerable time at it. Therefore I will defer to your
best judgement.
Since you've clarified a specific difference in the computations, does it
make sense to engage Gary Robinson and Paul Graham in discussion about the
difference(s) you've noted?
--- Noel
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]