http://bugzilla.spamassassin.org/show_bug.cgi?id=2878
------- Additional Comments From [EMAIL PROTECTED] 2004-01-05 01:51 -------
Subject: Re: Identify when plain text and HTML are different in multipart/alternative

> > However, it is just a matter of time before spammers make a larger
> > database of words, and while it can never fool a well-trained
> > Bayesian filter, it may make its signal weaker, so to speak.
>
> BTW, it's important to note that this is *not* the case.
>
> When a spammer adds random dictionary words to a spam as a
> bayes-buster, those words will be quite rare (since people don't
> generally use *all* the words in their language very frequently).
> So they'll most likely have never been seen before in the user's
> training.  Words that are not in the training database are ignored.
> So the bayes poison in that case will have no effect.

Yeah, that's why I wrote "well-trained": unfortunately, there are many
sites that do not allow individual users to train their own filters,
among them my old university.  I've seen Bayes filters successfully
attacked several times, and such sites are probably what spammers are
targeting, since they may represent a large audience of users who never
train their own filters.  There it is not too hard to guess which words
people will use, and most of them will be in the dictionary.

The attack can probably be countered with certain tricks; indeed, it
doesn't touch the most extreme tokens that clearly say "spam" or
clearly say "ham" (the P.S. below sketches why unknown words simply
drop out), but it can flatten the probability distribution somewhat,
which would hurt reliability.  That may also be why spammers put rather
rare words in their dictionaries: the idea is that if a word hits, it
will hit well.  One of their words did hit here, but obviously it
didn't help them too much...

> What the spammers *should* be doing is figuring out what each
> recipient email address has in its training db, and use that text
> instead. ;)

Uhm, there's a fine line between discussing this openly and giving them
ideas, I suppose.  I can imagine ways to do that... :-/

So I think there are reasons to work on many fronts...

Niels: Thanks for the clarification on the efficiency of the algorithm!
When I added "normalized" to my google search, it was because I figured
it would be convenient to have a measurement between 0 and 1, and I
didn't realize that what I found was an algorithm with a slightly
different purpose.  Just doing ED/n, dividing the edit distance by the
string length, would probably give me what I was looking for (sketched
in the P.P.S.)... :-)

Kjetil
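P.S. A minimal sketch of the "unknown words are ignored" point quoted
above, with a toy Bayes-style scorer.  The counts and the averaging
rule are invented purely for illustration; this is not SpamAssassin's
actual Bayes implementation:

    def token_spam_prob(token, spam_counts, ham_counts):
        """P(spam | token), or None if the token was never trained on."""
        s = spam_counts.get(token, 0)
        h = ham_counts.get(token, 0)
        if s + h == 0:
            return None  # unknown token: contributes nothing
        return s / (s + h)

    def classify(tokens, spam_counts, ham_counts):
        probs = []
        for t in tokens:
            p = token_spam_prob(t, spam_counts, ham_counts)
            if p is not None:
                probs.append(p)
        if not probs:
            return 0.5  # nothing known about this message
        return sum(probs) / len(probs)  # crude combination, demo only

    spam_counts = {"viagra": 50, "mortgage": 30}
    ham_counts = {"meeting": 40, "mortgage": 5}

    # "zygote" and "obelisk" were never seen in training, so they drop
    # out; the score is driven entirely by the known spammy tokens.
    print(classify(["viagra", "mortgage", "zygote", "obelisk"],
                   spam_counts, ham_counts))

The poison only starts to bite when the random words *do* occur in the
training data, which is exactly the shared-corpus scenario above.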

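P.P.S. And a sketch of the ED/n idea: plain Levenshtein edit distance
divided by a string length gives a dissimilarity score between 0
(identical) and 1 (completely different), e.g. for comparing the
text/plain part against the tag-stripped text/html part of a
multipart/alternative message.  Taking n as the longer of the two
lengths is my assumption here; the example strings are made up:

    def edit_distance(a, b):
        """Classic dynamic-programming Levenshtein distance."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                # deletion
                               cur[j - 1] + 1,             # insertion
                               prev[j - 1] + (ca != cb)))  # substitution
            prev = cur
        return prev[-1]

    def normalized_edit_distance(a, b):
        n = max(len(a), len(b))
        return edit_distance(a, b) / n if n else 0.0

    plain = "Please review the attached report before Friday."
    html_text = "Cheap meds, no prescription needed, order now!"
    print(normalized_edit_distance(plain, html_text))  # near 1.0: parts differ

Note this is O(len(a) * len(b)) in time, so for long bodies one would
want the more efficient algorithm Niels mentioned, or a cheaper
word-level comparison.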