On Wednesday, 30.08.2006, at 16:08 +0200, Vincenzo Gianferrari
Pini wrote:
> I took a deep look at our current BayesianAnalyzer code and at Gary 
> Robinson's blog and paper 
> (http://garyrob.blogs.com//handlingtokenredundancy94.pdf).
> 
> It would be quite simple to change our current code to optionally support 
> the proposed extensions (I already have my own Java code available to 
> efficiently compute the [inverse] chi-square function).
> 
> *But* there is a problem, in case we want to allow the use of the 
> current corpus data. In spite of what Gary says on page 1 of his paper, 
> when computing b(w) and g(w), Paul Graham's approach (both the original 
> one - http://paulgraham.com/spam.html - and the enhanced one upon which 
> our current code is based and the corpora are built - 
> http://paulgraham.com/better.html) collects different data:
> 
> for Gary: b(w) = (the number of spam e-mails containing the word w) / 
> (the total number of spam e-mails),
> while for Paul and us: b(w) = (the total count of occurrences of word w 
> in the spam e-mails) / (the total number of spam e-mails).
> 
> Similarly for computing g(w) and f(w), where m = (the number of spam 
> e-mails containing the word w) + (the number of ham e-mails containing 
> the word w).
> 
> In fact, in http://paulgraham.com/spam.html the following may be read:
> ...
> The especially observant will notice that while I consider each corpus 
> to be a single long stream of text for purposes of counting occurrences, 
> I use the number of emails in each, rather than their combined length, 
> as the divisor in calculating spam probabilities. This adds another 
> slight bias to protect against false positives.
> ...
> and that's how I coded the current stuff.
> 
> To fully support Gary's proposal we should collect a third field in the 
> bayesiananalysis_ham and spam tables, containing his different counter, 
> and the corpora should be rebuilt from scratch.

+1 That sounds like the "right" way.
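To make the counting difference concrete, here is a minimal sketch (my own illustrative code, not the actual BayesianAnalyzer implementation) of the two b(w) conventions, plus Robinson's f(w) with the m Vincenzo describes; s and x are tuning parameters, with s = 1 and x = 0.5 being the values commonly quoted from Robinson's writings:

```java
import java.util.Arrays;
import java.util.List;

public class TokenCounting {

    // Paul's b(w): total occurrences of w across the spam corpus,
    // divided by the number of spam e-mails.
    static double grahamB(List<String[]> spamMails, String w) {
        long occurrences = spamMails.stream()
                .flatMap(Arrays::stream)
                .filter(w::equals)
                .count();
        return (double) occurrences / spamMails.size();
    }

    // Gary's b(w): number of spam e-mails containing w at least once,
    // divided by the number of spam e-mails.
    static double robinsonB(List<String[]> spamMails, String w) {
        long mailsContaining = spamMails.stream()
                .filter(mail -> Arrays.asList(mail).contains(w))
                .count();
        return (double) mailsContaining / spamMails.size();
    }

    // Robinson's f(w) = (s*x + m*p(w)) / (s + m); note that only m depends
    // on which counting convention the corpus tables were built with.
    static double f(double pw, long m, double s, double x) {
        return (s * x + m * pw) / (s + m);
    }

    public static void main(String[] args) {
        List<String[]> spam = List.of(
                new String[]{"cheap", "cheap", "pills"},
                new String[]{"cheap", "offer"},
                new String[]{"hello", "friend"});

        System.out.println(grahamB(spam, "cheap"));   // 3 occurrences / 3 mails = 1.0
        System.out.println(robinsonB(spam, "cheap")); // 2 of 3 mails contain it, ~0.67
        System.out.println(f(0.8, 2, 1.0, 0.5));      // (0.5 + 1.6) / 3, ~0.7
    }
}
```

With repeated tokens the two b(w) values diverge, which is why the existing corpus data can't simply be reused for Gary's formulas.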

> 
> A different (and straightforward) approach would be to continue to use 
> our b(w) and g(w) formulas, and to set m = (total count of words) when 
> applying Gary's approach.
> I feel that the result would not be so different from Gary's, but it 
> should be carefully studied from a theoretical point of view, especially 
> when computing the confidence tests, which is where Gary's approach is 
> particularly interesting.
> 
> Personally I think that it would be more useful (for now) to keep our 
> current approach, allowing MAX_INTERESTING_TOKENS, 
> INTERESTINGNESS_THRESHOLD and DEFAULT_TOKEN_PROBABILITY (currently 
> hardcoded to Paul's suggested values) to be varied in config.xml.
> 
> Let me know your thoughts.
> 
> Vincenzo
> 
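On the config.xml point, a minimal sketch of what the parameterization might look like (the property names, the Properties-based mechanism, and the default values below are all my assumptions for illustration, not James's actual configuration handling):

```java
import java.util.Properties;

public class BayesianConfig {
    // Stand-ins for the currently hardcoded values; illustrative only.
    static final int    DEFAULT_MAX_INTERESTING_TOKENS    = 15;
    static final double DEFAULT_INTERESTINGNESS_THRESHOLD = 0.46;
    static final double DEFAULT_TOKEN_PROBABILITY         = 0.4;

    final int maxInterestingTokens;
    final double interestingnessThreshold;
    final double defaultTokenProbability;

    // Reads the three tunables from a Properties object (hypothetical key
    // names), falling back to the defaults above when a key is absent.
    BayesianConfig(Properties p) {
        maxInterestingTokens = Integer.parseInt(
                p.getProperty("maxInterestingTokens",
                        String.valueOf(DEFAULT_MAX_INTERESTING_TOKENS)));
        interestingnessThreshold = Double.parseDouble(
                p.getProperty("interestingnessThreshold",
                        String.valueOf(DEFAULT_INTERESTINGNESS_THRESHOLD)));
        defaultTokenProbability = Double.parseDouble(
                p.getProperty("defaultTokenProbability",
                        String.valueOf(DEFAULT_TOKEN_PROBABILITY)));
    }

    public static void main(String[] args) {
        Properties p = new Properties();
        p.setProperty("maxInterestingTokens", "20"); // override one value
        BayesianConfig cfg = new BayesianConfig(p);
        System.out.println(cfg.maxInterestingTokens);    // overridden: 20
        System.out.println(cfg.defaultTokenProbability); // default kept
    }
}
```

The same fallback-to-default pattern would apply whatever configuration source James actually reads these from.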
> Noel J. Bergman wrote:
> 
> >Norman Maurer wrote:
> >
> >>Noel J. Bergman:
> >>
> >>>>I think we will not do this until jasen will get published
> >>>>under an acceptable license
> >>>
> >>>We already have Bayesian support under a suitable license.  Wouldn't
> >>>hurt to go to the same articles and add the improvements suggested by
> >>>Gary Robinson.
> >>
> >>What suggestions?
> >
> >Yes, to look for any suggestions that Gary Robinson made on how to improve
> >the results of Bayesian analysis, e.g.,
> >http://garyrob.blogs.com//handlingtokenredundancy94.pdf that aren't already
> >in our code.
> >
> >     --- Noel
> >
bye
Norman

Attachment: signature.asc
Description: This is a digitally signed message part
