I took a deep look at our current BayesianAnalyzer code and at Gary Robinson's blog and paper (http://garyrob.blogs.com//handlingtokenredundancy94.pdf).

It would be quite simple to change our current code to optionally support the proposed extensions (I already have my own Java code available to efficiently compute the [inverse] chi-square function).
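
For reference, the routine is tiny. Here is a minimal sketch (not my actual code, and handling only the even degrees of freedom that Gary's combining rule needs):

    /**
     * Probability that a chi-square variate with v degrees of freedom
     * exceeds x2, for even v (Gary combines n tokens with v = 2*n,
     * so v is always even).
     */
    public static double chi2Q(double x2, int v) {
        double m = x2 / 2.0;
        double term = Math.exp(-m);
        double sum = term;
        for (int i = 1; i < v / 2; i++) {
            term *= m / i;
            sum += term;
        }
        return Math.min(sum, 1.0); // guard against rounding above 1
    }

Gary's confidence indicators are then obtained by feeding -2 * sum(ln f(w)) (and the analogous sum over 1 - f(w)), with 2n degrees of freedom, through this function.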

*But* there is a problem if we want to keep using the current corpus data. In spite of what Gary says on page 1 of his paper, when computing b(w) and g(w), Paul Graham's approach (both the original one - http://paulgraham.com/spam.html - and the enhanced one upon which our current code is based and the corpuses are built - http://paulgraham.com/better.html) collects different data:

for Gary:

    b(w) = (the number of spam e-mails containing the word w) / (the total number of spam e-mails)

while for Paul and us:

    b(w) = (the total count of occurrences of the word w in the spam e-mails) / (the total number of spam e-mails)

The same mismatch holds when computing g(w) and f(w), where for Gary m = (the number of spam e-mails containing the word w) + (the number of ham e-mails containing the word w).
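
To make the difference concrete, here is a self-contained toy example (corpus and token made up, of course):

    import java.util.List;

    public class CountingDemo {
        public static void main(String[] args) {
            // Three spam e-mails, already tokenized;
            // "viagra" occurs twice in the first one.
            List<List<String>> spam = List.of(
                    List.of("buy", "viagra", "viagra"),
                    List.of("cheap", "viagra"),
                    List.of("hello", "meeting"));
            String w = "viagra";

            // Gary: one hit per e-mail containing w at least once.
            long mailsContaining = spam.stream()
                    .filter(mail -> mail.contains(w)).count();        // 2
            // Paul and us: every occurrence counts, even in one e-mail.
            long occurrences = spam.stream()
                    .flatMap(List::stream).filter(w::equals).count(); // 3

            System.out.println("Gary's b(w): "
                    + (double) mailsContaining / spam.size()); // 2/3
            System.out.println("our b(w):    "
                    + (double) occurrences / spam.size());     // 3/3 = 1.0
        }
    }

Note that with our counting b(w) can even exceed 1.0, whenever a word occurs on average more than once per e-mail, so it is not a probability in Gary's sense.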

In fact, in http://paulgraham.com/spam.html one can read the following:
...
The especially observant will notice that while I consider each corpus to be a single long stream of text for purposes of counting occurrences, I use the number of emails in each, rather than their combined length, as the divisor in calculating spam probabilities. This adds another slight bias to protect against false positives.
...
and that's how I coded the current stuff.

To fully support Gary's proposal we would need to collect a third field in the bayesiananalysis_ham and spam tables, containing this different counter, and the corpuses would have to be rebuilt from scratch.

A different (and straightforward) approach would be to keep using our b(w) and g(w) formulas, setting m = (total count of words) when applying Gary's method. I feel the results would not differ much from Gary's, but this should be studied carefully from a theoretical point of view, especially for the confidence tests, which are where Gary's approach is particularly interesting (see the sketch below).
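
For clarity, the place where m enters is Gary's degree-of-belief adjustment f(w). A sketch of it:

    // Gary's f(w): a weighted blend of the raw token probability p(w)
    // and a prior x, where s says how strongly to trust the prior and
    // m says how much data backs p(w). He suggests s = 1 and x = 0.5
    // as starting values.
    static double f(double pw, double m, double s, double x) {
        return (s * x + m * pw) / (s + m);
    }

Whether m counts e-mails containing w (Gary's definition) or all word occurrences (the shortcut above) changes how fast f(w) converges to p(w): with repeated occurrences inflating m, we would trust sparse data more than Gary intends, and his chi-square confidence derivation assumes his counting.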

Personally I think that, for now, it would be more useful to keep our current approach, while allowing MAX_INTERESTING_TOKENS, INTERESTINGNESS_THRESHOLD and DEFAULT_TOKEN_PROBABILITY (currently hardcoded to Paul's suggested values) to be varied in config.xml.
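
Something like the following in the mailet's section of config.xml would do (element names and values purely illustrative, not the actual hardcoded defaults):

    <!-- hypothetical names and values, only to illustrate the idea -->
    <maxInterestingTokens>15</maxInterestingTokens>
    <interestingnessThreshold>0.4</interestingnessThreshold>
    <defaultTokenProbability>0.4</defaultTokenProbability>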

Let me know your thoughts.

Vincenzo

Noel J. Bergman wrote:
> Norman Maurer wrote:
>> Noel J. Bergman:
>>>> I think we will not do this until jasen gets published under an
>>>> acceptable license.
>>> We already have Bayesian support under a suitable license.  Wouldn't
>>> hurt to go to the same articles and add the improvements suggested by
>>> Gary Robinson.
>> What kind of suggestions?
>
> Yes, to look for any suggestions that Gary Robinson made on how to
> improve the results of Bayesian analysis, e.g.,
> http://garyrob.blogs.com//handlingtokenredundancy94.pdf, that aren't
> already in our code.
>
>         --- Noel


