I took a deep look at our current BayesianAnalyzer code and at Gary Robinson's blog and paper (http://garyrob.blogs.com//handlingtokenredundancy94.pdf).

It would be quite simple to change our current code to optionally support the proposed extensions (I already have my own Java code available to efficiently compute the [inverse] chi-square function).
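
For reference, the routine is tiny. Here is a minimal sketch (not my actual code, and handling only the even degrees of freedom that Gary's combining rule needs):

    /**
     * Probability that a chi-square variate with v degrees of freedom
     * exceeds x2, for even v (Gary combines n tokens with v = 2*n,
     * so v is always even).
     */
    public static double chi2Q(double x2, int v) {
        double m = x2 / 2.0;
        double term = Math.exp(-m);
        double sum = term;
        for (int i = 1; i < v / 2; i++) {
            term *= m / i;
            sum += term;
        }
        return Math.min(sum, 1.0); // guard against rounding above 1
    }

Gary's confidence indicators are then obtained by feeding -2 * sum(ln f(w)) (and the analogous sum over 1 - f(w)), with 2n degrees of freedom, through this function.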

*But* there is a problem if we want to keep using the current corpus data. In spite of what Gary says on page 1 of his paper, when computing b(w) and g(w), Paul Graham's approach (both the original one - http://paulgraham.com/spam.html - and the enhanced one upon which our current code is based and the corpuses are built - http://paulgraham.com/better.html) collects different data:

for Gary:

    b(w) = (the number of spam e-mails containing the word w) / (the total number of spam e-mails)

while for Paul and us:

    b(w) = (the total count of occurrences of the word w in the spam e-mails) / (the total number of spam e-mails)

The same mismatch holds when computing g(w) and f(w), where for Gary m = (the number of spam e-mails containing the word w) + (the number of ham e-mails containing the word w).
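
To make the difference concrete, here is a self-contained toy example (corpus and token made up, of course):

    import java.util.List;

    public class CountingDemo {
        public static void main(String[] args) {
            // Three spam e-mails, already tokenized;
            // "viagra" occurs twice in the first one.
            List<List<String>> spam = List.of(
                    List.of("buy", "viagra", "viagra"),
                    List.of("cheap", "viagra"),
                    List.of("hello", "meeting"));
            String w = "viagra";

            // Gary: one hit per e-mail containing w at least once.
            long mailsContaining = spam.stream()
                    .filter(mail -> mail.contains(w)).count();        // 2
            // Paul and us: every occurrence counts, even in one e-mail.
            long occurrences = spam.stream()
                    .flatMap(List::stream).filter(w::equals).count(); // 3

            System.out.println("Gary's b(w): "
                    + (double) mailsContaining / spam.size()); // 2/3
            System.out.println("our b(w):    "
                    + (double) occurrences / spam.size());     // 3/3 = 1.0
        }
    }

Note that with our counting b(w) can even exceed 1.0, whenever a word occurs on average more than once per e-mail, so it is not a probability in Gary's sense.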

In fact, in http://paulgraham.com/spam.html one can read the following:
...
The especially observant will notice that while I consider each corpus to be a single long stream of text for purposes of counting occurrences, I use the number of emails in each, rather than their combined length, as the divisor in calculating spam probabilities. This adds another slight bias to protect against false positives.
...
and that's how I coded the current stuff.

To fully support Gary's proposal we would need to collect a third field in the bayesiananalysis_ham and spam tables, containing this different counter, and the corpuses would have to be rebuilt from scratch.

A different (and straightforward) approach would be to keep using our b(w) and g(w) formulas, setting m = (total count of words) when applying Gary's method. I feel the results would not differ much from Gary's, but this should be studied carefully from a theoretical point of view, especially for the confidence tests, which are where Gary's approach is particularly interesting (see the sketch below).
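
For clarity, the place where m enters is Gary's degree-of-belief adjustment f(w). A sketch of it:

    // Gary's f(w): a weighted blend of the raw token probability p(w)
    // and a prior x, where s says how strongly to trust the prior and
    // m says how much data backs p(w). He suggests s = 1 and x = 0.5
    // as starting values.
    static double f(double pw, double m, double s, double x) {
        return (s * x + m * pw) / (s + m);
    }

Whether m counts e-mails containing w (Gary's definition) or all word occurrences (the shortcut above) changes how fast f(w) converges to p(w): with repeated occurrences inflating m, we would trust sparse data more than Gary intends, and his chi-square confidence derivation assumes his counting.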

Personally I think that, for now, it would be more useful to keep our current approach, while allowing MAX_INTERESTING_TOKENS, INTERESTINGNESS_THRESHOLD and DEFAULT_TOKEN_PROBABILITY (currently hardcoded to Paul's suggested values) to be varied in config.xml.
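
Something like the following in the mailet's section of config.xml would do (element names and values purely illustrative, not the actual hardcoded defaults):

    <!-- hypothetical names and values, only to illustrate the idea -->
    <maxInterestingTokens>15</maxInterestingTokens>
    <interestingnessThreshold>0.4</interestingnessThreshold>
    <defaultTokenProbability>0.4</defaultTokenProbability>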

Let me know your thoughts.

Vincenzo

Noel J. Bergman wrote:
> Norman Maurer wrote:
>> Noel J. Bergman:
>>>> I think we will not do this until jasen gets published under an
>>>> acceptable license.
>>> We already have Bayesian support under a suitable license.  Wouldn't
>>> hurt to go to the same articles and add the improvements suggested by
>>> Gary Robinson.
>> What kind of suggestions?
>
> Yes, to look for any suggestions that Gary Robinson made on how to
> improve the results of Bayesian analysis, e.g.,
> http://garyrob.blogs.com//handlingtokenredundancy94.pdf, that aren't
> already in our code.
>
>         --- Noel


