I took a deep look at our current BayesianAnalyzer code and at Gary
Robinson's blog and paper
(http://garyrob.blogs.com//handlingtokenredundancy94.pdf).
It would be quite simple to change our current code to optionally support
the proposed extensions (I already have my own Java code available to
efficiently compute the [inverse] chi-square function).
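For the curious, a minimal sketch of such a function (the usual series
form of chi2Q, valid for an even number of degrees of freedom;
illustrative only, not my actual code) could look like:

    // Probability that a chi-square variable with v degrees of
    // freedom (v must be even) exceeds x2.
    public static double chi2Q(double x2, int v) {
        double m = x2 / 2.0;
        double sum = Math.exp(-m);
        double term = sum;
        // sum the first v/2 terms of the series exp(-m) * m^i / i!
        for (int i = 1; i < v / 2; i++) {
            term *= m / i;
            sum += term;
        }
        return Math.min(sum, 1.0); // guard against rounding drift
    }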
*But* there is a problem if we want to allow the use of the current
corpus data. In spite of what Gary says on page 1 of his paper, when
computing b(w) and g(w), Paul Graham's approach (both the original
one - http://paulgraham.com/spam.html - and the enhanced one upon which
our current code is based and the corpuses are built -
http://paulgraham.com/better.html) collects different data:
for Gary: b(w) = (the number of spam e-mails containing the word w) /
(the total number of spam e-mails),
while for Paul and us: b(w) = (the total count of occurrences of the word
w in the spam e-mails) / (the total number of spam e-mails).
Similarly for computing g(w) and f(w), where m = (the number of spam
e-mails containing the word w) + (the number of ham e-mails containing
the word w).
In fact, in http://paulgraham.com/spam.html one can read the following:
...
The especially observant will notice that while I consider each corpus
to be a single long stream of text for purposes of counting occurrences,
I use the number of emails in each, rather than their combined length,
as the divisor in calculating spam probabilities. This adds another
slight bias to protect against false positives.
...
and that's how I coded the current stuff.
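To make the difference concrete, here is a sketch of the two counting
schemes (class and method names are illustrative, not our actual code):

    import java.util.*;

    class TokenCounting {
        // Gary's scheme: a token counts at most once per e-mail.
        static void countPerMessage(Map<String, Integer> counts,
                                    List<String> tokens) {
            for (String t : new HashSet<>(tokens)) {
                counts.merge(t, 1, Integer::sum);
            }
        }

        // Paul's (and our current) scheme: every occurrence counts.
        static void countPerOccurrence(Map<String, Integer> counts,
                                       List<String> tokens) {
            for (String t : tokens) {
                counts.merge(t, 1, Integer::sum);
            }
        }
    }

In both cases the divisor stays the number of e-mails in the corpus.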
To fully support Gary's proposal we should collect a third field in the
bayesiananalysis_ham and spam tables, containing his different counter,
and the corpuses should be rebuilt from scratch.
A different (and straightforward) approach would be to keep our b(w) and
g(w) formulas and set m = (total count of words), while otherwise
following Gary's approach.
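For clarity, Gary's f(w) adjustment with that m would then be computed as
follows (the parameter names follow his paper; the Java method itself is
a sketch, illustrative only):

    // f(w) = (s*x + m*p) / (s + m), per Gary's paper, where
    //   p = the word's raw probability from b(w) and g(w),
    //   m = the number of data points for the word,
    //   x = the assumed probability for a word we have no data on,
    //   s = the strength given to that assumption.
    static double f(double p, double m, double s, double x) {
        return (s * x + m * p) / (s + m);
    }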
I feel that the result would not be so different from Gary's, but it
should be carefully studied from a theoretical point of view, especially
when computing the confidence tests, which is where Gary's approach is
particularly interesting.
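For reference, here is how that combining step could look once the f(w)
values are in hand, reusing the chi2Q sketch above (one common
formulation of Gary's chi-square test, equivalent to his up to
notation; again a sketch, not our code):

    // Combine the token probabilities fws[] into a single indicator:
    // near 1 means spam, near 0 means ham, near 0.5 means unsure.
    static double combine(double[] fws) {
        double lnH = 0.0; // sum of ln f(w)
        double lnS = 0.0; // sum of ln (1 - f(w))
        for (double fw : fws) {
            lnH += Math.log(fw);
            lnS += Math.log(1.0 - fw);
        }
        int n = fws.length;
        double H = 1.0 - chi2Q(-2.0 * lnH, 2 * n); // ham evidence
        double S = 1.0 - chi2Q(-2.0 * lnS, 2 * n); // spam evidence
        return (S - H + 1.0) / 2.0;
    }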
Personally I think that it would be more useful (for now) to keep our
current approach, allowing MAX_INTERESTING_TOKENS,
INTERESTINGNESS_THRESHOLD and DEFAULT_TOKEN_PROBABILITY (currently
hardcoded to Paul's suggestions) to be varied in config.xml.
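A hypothetical sketch of what that could look like in config.xml
(element names and values here are placeholders, open for discussion):

    <!-- placeholder names and values, for illustration only -->
    <bayesianAnalysis>
        <maxInterestingTokens>15</maxInterestingTokens>
        <interestingnessThreshold>0.46</interestingnessThreshold>
        <defaultTokenProbability>0.4</defaultTokenProbability>
    </bayesianAnalysis>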
Let me know your thoughts.
Vincenzo
Noel J. Bergman wrote:
Norman Maurer wrote:
Noel J. Bergman:
I think we will not do this until jasen gets published
under an acceptable license
We already have Bayesian support under a suitable license. Wouldn't
hurt to go to the same articles and add the improvements suggested by
Gary Robinson.
Which suggestions?
Yes, to look for any suggestions that Gary Robinson made on how to improve
the results of Bayesian analysis, e.g.,
http://garyrob.blogs.com//handlingtokenredundancy94.pdf that aren't already
in our code.
--- Noel
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]