http://bugzilla.spamassassin.org/show_bug.cgi?id=2878





------- Additional Comments From [EMAIL PROTECTED]  2004-01-05 01:51 -------
Subject: Re:  Identify when plain text and HTML are different in 
multipart/alternative

> >However, it is just a matter of time before spammers make a larger
> > database of words, and while it can never fool a well-trained
> > Bayesian filter, it may make its signal weaker, so to speak.
>
> BTW, it's important to note that this is *not* the case.
>
> When a spammer adds random dictionary words to a spam as a
> bayes-buster, those words will be quite rare (since people don't
> generally use *all* the words in their language very frequently).
> So they'll most likely have never been seen before in the user's
> training.  Words that are not in the training database are ignored.
> So the bayes poison in that case will have no effect.
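
To make the quoted point concrete, here is a minimal sketch in Python of
a Graham-style Bayes combiner (not SpamAssassin's actual code; all the
tokens and probabilities are made up). Tokens that were never seen in
training are simply skipped, so random dictionary words contribute
nothing:

from math import prod

def spam_probability(tokens, token_db):
    """Combine per-token spam probabilities, Graham-style.

    token_db maps token -> P(spam | token) learned from training.
    Tokens not in the database are ignored entirely.
    """
    probs = [token_db[t] for t in tokens if t in token_db]
    if not probs:
        return 0.5  # no evidence either way
    p_spam = prod(probs)
    p_ham = prod(1.0 - p for p in probs)
    return p_spam / (p_spam + p_ham)

# Hypothetical per-user training database.
db = {"viagra": 0.99, "meeting": 0.05, "unsubscribe": 0.90}

spam = ["viagra", "unsubscribe"]
poisoned = spam + ["sesquipedalian", "marmoreal", "gallimaufry"]

print(spam_probability(spam, db))      # ~0.999
print(spam_probability(poisoned, db))  # identical: the poison is ignored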

Yeah, that's why I wrote "well-trained". Unfortunately, many sites do 
not allow individual users to train their own filters; my old 
university is among them. I've seen Bayes filters successfully attacked 
several times, and such sites are probably what spammers are targeting, 
since they offer a large audience of users who never train their own 
filters. 

At such sites, it is not too hard to guess what words people will use, 
and most of those words will be in the dictionary. The attack can 
probably be overcome by certain tricks, because it indeed doesn't 
affect the most extreme tokens, those that clearly say "spam" or 
clearly say "ham", but it can flatten the score distribution somewhat, 
which would hurt reliability. 
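
Reusing the spam_probability sketch from above, here is a hypothetical
illustration of that flattening. In a site-wide database the common
dictionary words *are* present, and since most legitimate mail contains
them, they tend to lean slightly hammy and drag the combined score
toward 0.5. All the numbers here are made up:

# Hypothetical site-wide database: common words lean slightly hammy
# because most legitimate mail contains them.
db_sitewide = {"viagra": 0.99, "unsubscribe": 0.90,
               "house": 0.30, "water": 0.35,
               "letter": 0.25, "friend": 0.30}

spam = ["viagra", "unsubscribe"]
poisoned = spam + ["house", "water", "letter", "friend"]

print(spam_probability(spam, db_sitewide))      # ~0.999
print(spam_probability(poisoned, db_sitewide))  # ~0.967: pulled toward 0.5

One of the "certain tricks" real filters use against this is to combine
only the most extreme tokens (Graham's original scheme keeps the 15 most
interesting ones), which largely restores the lost margin.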

That may be why spammers put rather rare words in their dictionary: the 
idea is that if a word hits the training database at all, it will hit 
hard. One of their words did hit here, but obviously, it didn't help 
them too much...

> What the spammers *should* be doing is figuring out what each
> recipient email address has in its training db, and use that text
> instead. ;)

Uhm, there's a fine line between discussing this openly and giving them 
ideas here, I suppose. I can imagine ways to do that... :-/ So I think 
there are good reasons to work on many fronts... 

Niels: Thanks for the clarification on the efficiency of the algorithm! 
When I added "normalized" to my Google search, it was because I figured 
it would be convenient to have a measurement between 0 and 1, and I 
didn't realize that what I found was an algorithm with a slightly 
different purpose. Just doing ED/n, the plain edit distance divided by 
the string length, would probably satisfy what I was looking for... :-) 
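
For the record, a minimal sketch of that ED/n idea (the function names
are mine, not from any existing code): the Levenshtein edit distance
divided by the length of the longer string gives a dissimilarity score
between 0 and 1, which could then be applied to the text/plain part and
the text extracted from the text/html part:

def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def dissimilarity(a: str, b: str) -> float:
    """ED/n: 0.0 means identical, values near 1.0 mean very different."""
    n = max(len(a), len(b))
    return edit_distance(a, b) / n if n else 0.0

# e.g. compare the text/plain part against text stripped from text/html
print(dissimilarity("Hello, this is the offer",
                    "Hello, this is the offer"))   # 0.0: parts agree
print(dissimilarity("Hello, this is the offer",
                    "Completely unrelated spam"))  # high: parts differ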

Kjetil




