Actually, I made several changes to Chris Means' submission, and I think that my 
version is by now quite different from Danny's.

The changes were:

1) Chris' submission implemented Paul Graham's "A Plan for Spam" 
(http://www.paulgraham.com/spam.html). I upgraded it to Paul Graham's "Better Bayesian 
Filtering" (http://paulgraham.com/better.html), which includes:

1.1) Tokens:

1.1.1) Case is preserved.

1.1.2) Exclamation points are constituent characters.

1.1.3) Periods and commas are constituents if they occur between two digits. This lets 
me get IP addresses and prices intact.

1.1.4) Tokens that occur within the To, From, Subject, and Return-Path lines, or 
within urls, get marked accordingly. E.g. ``foo'' in the Subject line becomes 
``Subject*foo''. (The asterisk could be any character you don't allow as a 
constituent.) 

1.1.5) Added "degeneration": if, for example, the filter sees ``FREE!!!'' in the Subject 
line and doesn't have a probability for it, it will search the corpus in the following 
order: Subject*Free!!! Subject*free!!! Subject*FREE! Subject*Free! Subject*free! 
Subject*FREE Subject*Free Subject*free FREE!!! Free!!! free!!! FREE! Free! free! FREE 
Free free.
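
A minimal sketch of how such a degeneration lookup could be written. This is not the 
code in my version: the class and method names are hypothetical, and returning the 
first candidate found in the corpus is an assumption (picking the candidate whose 
probability is farthest from .5 would be another option).

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, for illustration only.
final class Degeneration {

    /** Ordered fallback tokens for a token such as "Subject*FREE!!!". */
    static List<String> candidates(String token) {
        // Split off an optional context prefix such as "Subject*".
        int star = token.indexOf('*');
        String prefix = star >= 0 ? token.substring(0, star + 1) : "";
        String body = star >= 0 ? token.substring(star + 1) : token;

        // Separate the word from its trailing exclamation points.
        int end = body.length();
        while (end > 0 && body.charAt(end - 1) == '!') {
            end--;
        }
        String word = body.substring(0, end);
        String bangs = body.substring(end);

        // Degrade punctuation: all bangs, then a single bang, then none.
        Set<String> bangForms = new LinkedHashSet<>();
        bangForms.add(bangs);
        if (!bangs.isEmpty()) {
            bangForms.add("!");
        }
        bangForms.add("");

        // Degrade case: original, then Capitalized, then lower case.
        Set<String> caseForms = new LinkedHashSet<>();
        caseForms.add(word);
        if (!word.isEmpty()) {
            caseForms.add(word.substring(0, 1).toUpperCase(Locale.ROOT)
                    + word.substring(1).toLowerCase(Locale.ROOT));
        }
        caseForms.add(word.toLowerCase(Locale.ROOT));

        // Degrade context: with the header prefix first, then without it.
        Set<String> contexts = new LinkedHashSet<>();
        contexts.add(prefix);
        contexts.add("");

        Set<String> ordered = new LinkedHashSet<>();
        for (String ctx : contexts) {
            for (String b : bangForms) {
                for (String w : caseForms) {
                    ordered.add(ctx + w + b);
                }
            }
        }
        ordered.remove(token); // the token itself was already looked up
        return new ArrayList<>(ordered);
    }

    /** Probability of the token or of its first known fallback, or null if none is known. */
    static Double lookup(String token, Map<String, Double> corpus) {
        Double p = corpus.get(token);
        if (p != null) {
            return p;
        }
        for (String candidate : candidates(token)) {
            p = corpus.get(candidate);
            if (p != null) {
                return p;
            }
        }
        return null;
    }
}
```

For "Subject*FREE!!!" the candidates come out in the order listed above, starting with 
Subject*Free!!! and ending with free.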

1.2) Added new probability thresholds in addition to .99: .9999 and .9998.

1.3) The number of tokens considered when calculating the Bayesian probability is no 
longer fixed (it was 15), but is controlled by an "interestingness threshold" 
probability.
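
To illustrate the idea, here is a small sketch of threshold-based selection followed 
by the usual naive Bayes combination from Graham's essays. The names are hypothetical 
and the threshold value used in the usage note is only an example, not the one in my 
code. The sketch assumes token probabilities are already clamped strictly between 0 
and 1, and it combines them in log space so that a long token list does not underflow.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
final class InterestingTokens {

    /**
     * Keeps every probability whose distance from the neutral value .5 is at
     * least the threshold. All qualifying tokens are kept, including several
     * tokens that share the same probability.
     */
    static List<Double> select(List<Double> probabilities, double threshold) {
        List<Double> selected = new ArrayList<>();
        for (double p : probabilities) {
            if (Math.abs(p - 0.5) >= threshold) {
                selected.add(p);
            }
        }
        return selected;
    }

    /**
     * Combined spam probability: prod(p) / (prod(p) + prod(1 - p)), rewritten
     * as 1 / (1 + exp(sum(log(1 - p) - log(p)))).
     * Assumes 0 < p < 1 for every token. An empty list yields .5.
     */
    static double combine(List<Double> probabilities) {
        double eta = 0.0;
        for (double p : probabilities) {
            eta += Math.log(1.0 - p) - Math.log(p);
        }
        return 1.0 / (1.0 + Math.exp(eta));
    }
}
```

With a threshold of .4, the probabilities .99, .5, .45, .01 and .99 reduce to .99, .01 
and .99, which combine to .99.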

2) Found and fixed two subtle but major bugs in managing the probabilities (some 
"important" tokens were ignored: if many tokens had the same probability, only the 
first one was considered when building the Bayesian probability).

3) Added extensive code for safer database access and error handling. As Mordred was 
causing me problems with overly long connections during training, I added optional 
support for using a connection acquired directly from JDBC. It may be unnecessary now 
with DBCP.

4) Also added a prefix to the analysed message and an "X-MessageIsSpamProbability" 
header.

5) The "rebuild spam corpus" command bounces back a message confirming the operation.

6) Some minor things:

6.1) "All digits" tokens are ignored.

6.2) Tokens with digits and containing '.' or ',' are considered as a whole (numbers).

6.3) Tokens like 35$ ?35.35 etc. are considered.

6.4) IP numbers are considered intact (i.e. not broken by the periods).

6.5) After some experiments, I decided to consider tokens of length 1 to 90: I found 
it more effective. The corpus size is somewhere between 2 and 5 MB, which I consider 
quite acceptable.
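
The tokenizing rules above (1.1.1 to 1.1.4 and 6.1 to 6.5) could be sketched roughly 
as follows. This is an illustration, not my actual code: the class name is 
hypothetical, and the set of constituent characters (letters, digits, '!', '$', '-' 
and the apostrophe) is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical tokenizer, for illustration only.
final class TokenizerSketch {

    static final int MIN_LENGTH = 1;
    static final int MAX_LENGTH = 90;

    /**
     * Splits text into tokens, preserving case. If context is non-null
     * (e.g. "Subject"), each token is marked as "Subject*token".
     */
    static List<String> tokenize(String text, String context) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            if (isConstituent(text, i)) {
                current.append(text.charAt(i));
            } else {
                flush(current, context, tokens);
            }
        }
        flush(current, context, tokens);
        return tokens;
    }

    private static boolean isConstituent(String s, int i) {
        char c = s.charAt(i);
        if (Character.isLetterOrDigit(c) || c == '!' || c == '$' || c == '-' || c == '\'') {
            return true;
        }
        // Periods and commas count only between two digits, which keeps
        // prices and IP numbers intact.
        if (c == '.' || c == ',') {
            return i > 0 && i + 1 < s.length()
                    && Character.isDigit(s.charAt(i - 1))
                    && Character.isDigit(s.charAt(i + 1));
        }
        return false;
    }

    private static void flush(StringBuilder current, String context, List<String> tokens) {
        if (current.length() == 0) {
            return;
        }
        String token = current.toString();
        current.setLength(0);
        if (token.length() < MIN_LENGTH || token.length() > MAX_LENGTH) {
            return; // outside the accepted length range
        }
        if (isAllDigits(token)) {
            return; // "all digits" tokens are ignored
        }
        tokens.add(context == null ? token : context + "*" + token);
    }

    private static boolean isAllDigits(String token) {
        for (int i = 0; i < token.length(); i++) {
            if (!Character.isDigit(token.charAt(i))) {
                return false;
            }
        }
        return true;
    }
}
```

For example, tokenizing "FREE!!! 35$ 19.99 555 192.168.0.1" with no context yields 
FREE!!!, 35$, 19.99 and 192.168.0.1; the all-digit token 555 is dropped.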

Danny, what are your changes other than the token size? If there are not too many, I 
could add them to my version; otherwise doing a merge is going to be a problem.

Vincenzo



> -----Original Message-----
> From: Noel J. Bergman [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, August 26, 2003 18:50
> To: James-Dev Mailing List
> Subject: Bayesian filtering and MailAddress validation
> 
> 
> Seems like in Sept., after your crunch and Vincenzo returns from vacation,
> that the two of you should merge your changes (your changes sound
> parameterizable), and maybe get it into CVS.
> 
> If you want to send me a JAR and instructions, I've already got reposting
> from mbox working, although I am finding some real-world issues, e.g.,
> roughly half of the messages in the target set have
> 
>   To: [EMAIL PROTECTED]
> 
> instead of
> 
>   To: <[EMAIL PROTECTED]>
> 
> which is rejected by the MailAddress class.  None of them appear to be user
> messages, but all of them seem to be bounce notices from places like
> CompuServe and apps like CC Mail Server.  I'm curious to know from John Webb
> or Steven Short if they are seeing similar problems with their mailing list
> managers.
> 
> It is possible that the To: field is broken, but the RCPT TO was properly
> formatted.  RFC 822 had some bad examples, although RFC 821 was clear from
> the start, and RFC 2822 is clear.  We might want to account for this in our
> Fetch services.
> 
>         --- Noel
> 
> -----Original Message-----
> From: Danny Angus [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, August 26, 2003 11:23
> To: Noel J. Bergman
> 
> 
> > I can build the current James v2 SAR.  Are you using anything other than
> > what Vincenzo has on his site, or should I just use the download from his
> > site?
> 
> I've got a different take on Chris Means' submission, I've not tried
> Vincenzo's but in theory it should be much the same.
> 
> Mine is optimised to keep the corpus size low by ignoring tokens < 4 chars
> and > 15 and ignoring tokens with probabilities in the range 4-6
> (neutral)
> 
> d.
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 

