On Wednesday, 30.08.2006 at 16:08 +0200, Vincenzo Gianferrari Pini wrote:
> I took a deep look at our current BayesianAnalyzer code and at Gary
> Robinson's blog and paper
> (http://garyrob.blogs.com//handlingtokenredundancy94.pdf).
>
> It would be quite simple to change our current code to optionally support
> the proposed extensions (I already have my own Java code available to
> efficiently compute the [inverse] chi-square function).
>
> *But* there is a problem, in case we want to allow the use of the
> current corpus data. In spite of what Gary says on page 1 of his paper,
> when computing b(w) and g(w), Paul Graham's approach (both the original
> one - http://paulgraham.com/spam.html - and the enhanced one upon which
> our current code is based and the corpuses are built -
> http://paulgraham.com/better.html) collects different data:
>
> for Gary: b(w) = (the number of spam e-mails containing the word w) /
> (the total number of spam e-mails),
> while for Paul and us: b(w) = (the total count of occurrences of word w
> in the spam e-mails) / (the total number of spam e-mails).
>
> The same holds for computing g(w) and f(w), where m = (the number of
> spam e-mails containing the word w) + (the number of ham e-mails
> containing the word w).
>
> In fact, in http://paulgraham.com/spam.html one can read the following:
> ...
> The especially observant will notice that while I consider each corpus
> to be a single long stream of text for purposes of counting occurrences,
> I use the number of emails in each, rather than their combined length,
> as the divisor in calculating spam probabilities. This adds another
> slight bias to protect against false positives.
> ...
> and that's how I coded the current stuff.
>
> To fully support Gary's proposal we should collect a third field in the
> bayesiananalysis_ham and spam tables, containing his different counter,
> and the corpuses should be rebuilt from scratch.
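Just to make the difference between the two counting schemes concrete,
here is a rough sketch of how both counters could be collected in one
pass (the TokenCountingSketch class below is purely illustrative, not
our actual BayesianAnalyzer code):

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    // Hypothetical illustration of the two counting schemes;
    // NOT our actual BayesianAnalyzer code.
    public class TokenCountingSketch {

        // Paul/us: every occurrence of a token counts.
        private final Map<String, Integer> occurrenceCount =
                new HashMap<String, Integer>();
        // Gary: a token counts at most once per e-mail.
        private final Map<String, Integer> mailCount =
                new HashMap<String, Integer>();
        private int totalSpamMails = 0;

        // Feed one spam e-mail, already tokenized.
        public void addSpamMail(List<String> tokens) {
            totalSpamMails++;
            Set<String> seenInThisMail = new HashSet<String>();
            for (String token : tokens) {
                bump(occurrenceCount, token);      // Paul-style count
                if (seenInThisMail.add(token)) {
                    bump(mailCount, token);        // Gary-style count
                }
            }
        }

        // b(w) as our current (Paul Graham based) code computes it.
        public double grahamB(String w) {
            return count(occurrenceCount, w) / (double) totalSpamMails;
        }

        // b(w) as Gary Robinson's paper defines it.
        public double robinsonB(String w) {
            return count(mailCount, w) / (double) totalSpamMails;
        }

        private static void bump(Map<String, Integer> m, String w) {
            Integer c = m.get(w);
            m.put(w, c == null ? 1 : c.intValue() + 1);
        }

        private static int count(Map<String, Integer> m, String w) {
            Integer c = m.get(w);
            return c == null ? 0 : c.intValue();
        }
    }

A spam mail repeating one token three times raises grahamB(w) by
3 / totalSpamMails, but robinsonB(w) by only 1 / totalSpamMails; Gary's
b(w) therefore stays a true probability (at most 1), while ours can
exceed 1. The mailCount map corresponds to the third field you propose
for the bayesiananalysis tables.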
+1 That sounds like the "right" way.

> A different (and straightforward) approach would be to continue using
> our b(w) and g(w) formulas, setting m = (total count of words), while
> otherwise following Gary's approach.
> I feel that the result would not be very different from Gary's, but it
> should be studied carefully from a theoretical point of view, especially
> for the confidence tests, which is where Gary's approach is particularly
> interesting.
>
> Personally I think that it would be more useful (for now) to keep our
> current approach, while allowing MAX_INTERESTING_TOKENS,
> INTERESTINGNESS_THRESHOLD and DEFAULT_TOKEN_PROBABILITY (currently
> hardcoded to Paul's suggested values) to be varied in config.xml.
>
> Let me know your thoughts.
>
> Vincenzo
>
> Noel J. Bergman wrote:
>
> > Norman Maurer wrote:
> >
> > > Noel J. Bergman:
> > >
> > > > > I think we will not do this until jasen gets published
> > > > > under an acceptable license
> > > >
> > > > We already have Bayesian support under a suitable license. Wouldn't
> > > > hurt to go to the same articles and add the improvements suggested
> > > > by Gary Robinson.
> > >
> > > What suggestions?
> >
> > Yes, to look for any suggestions that Gary Robinson made on how to
> > improve the results of Bayesian analysis, e.g.,
> > http://garyrob.blogs.com//handlingtokenredundancy94.pdf, that aren't
> > already in our code.
> >
> > --- Noel

bye Norman
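P.S.: Regarding the confidence tests, here is a minimal sketch of the
chi-square combining step, assuming the per-token probabilities f(w)
have already been computed. The class and method names are my own, and
the (1 + S - H) / 2 indicator follows the formulation in Gary's later
write-ups, so treat it as a starting point rather than a drop-in
implementation:

    public class ChiSquareCombiningSketch {

        // Survival function of a chi-square distribution with an even
        // number of degrees of freedom v (the "inverse chi-square"
        // function), via the closed-form series
        // exp(-m) * sum_{i < v/2} m^i / i!, where m = x / 2.
        static double chi2Q(double x, int v) {
            double m = x / 2.0;
            double term = Math.exp(-m);
            double sum = term;
            for (int i = 1; i < v / 2; i++) {
                term *= m / i;
                sum += term;
            }
            return Math.min(sum, 1.0);
        }

        // Combine per-token spam probabilities f(w) into one message
        // score. By Fisher's method, -2 * sum(ln p_i) is chi-square
        // distributed with 2n degrees of freedom when the p_i are
        // uniform random, which is the null hypothesis the confidence
        // test is built on.
        static double combine(double[] f) {
            int n = f.length;
            double lnSpam = 0.0;
            double lnHam = 0.0;
            for (int i = 0; i < n; i++) {
                // clamp away 0 and 1 so the logarithms stay finite
                double p = Math.max(1e-9, Math.min(1.0 - 1e-9, f[i]));
                lnSpam += Math.log(p);
                lnHam += Math.log(1.0 - p);
            }
            double s = chi2Q(-2.0 * lnSpam, 2 * n); // near 1 if spammy
            double h = chi2Q(-2.0 * lnHam, 2 * n);  // near 1 if hammy
            return (1.0 + s - h) / 2.0;  // 1 = spam, 0 = ham, 0.5 = unsure
        }

        public static void main(String[] args) {
            double[] spammy = { 0.97, 0.99, 0.90, 0.95 };
            double[] unsure = { 0.60, 0.40, 0.50, 0.55 };
            System.out.println(combine(spammy)); // close to 1.0
            System.out.println(combine(unsure)); // close to 0.5
        }
    }

The useful property for the confidence test is the middle ground: when
the token evidence is contradictory, the score stays near 0.5 instead
of being pushed towards 0 or 1.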