> 1) Has anyone tried classifying, or keeping track of, unknown 
> HTML tag names? Say for instance, some spam has the following 
> html V<asdfsdf><br>I<asdfsdf><br>G<asdfsdf><br>R<asdfsdf><br>A
> What if a token was added named "Unknown:Html" or something 
> like that (because of the <asdfsdf> tags)?

At the moment we basically throw all HTML tags away - the above would result
in a "VIGRA" token, which would probably be pretty spammy (those with
legitimate mail about Viagra would probably spell it correctly).

You could try doing this if you liked.  You'd have to have a list of all
'valid' or 'known' HTML tags, of course.  I suspect that spammers that break
up text like this use valid tags anyway (e.g. 'c<i></i>a<b></b>t'), so it
wouldn't have much effect.  The only way to know is to test, though!
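
As a starting point for that experiment, here is a minimal sketch.  It is
not the SpamBayes tokenizer: the function name is made up for illustration,
the list of known tags is a deliberately short subset, and the token name
"Unknown:Html" is simply the one suggested in the question.

```python
import re

# Illustrative subset only (an assumption); a real experiment would need
# the complete list of HTML element names.
KNOWN_TAGS = frozenset({
    "a", "b", "body", "br", "div", "font", "html", "i", "img",
    "p", "span", "table", "td", "tr",
})

# Matches an opening or closing tag and captures the tag name.
TAG_RE = re.compile(r"<\s*/?\s*([A-Za-z][A-Za-z0-9]*)[^>]*>")

def strip_tags_with_unknown_token(text):
    """Hypothetical helper: remove all tags, as happens now, but also
    return one "Unknown:Html" token per tag whose name is not known."""
    extra_tokens = []

    def replace(match):
        if match.group(1).lower() not in KNOWN_TAGS:
            extra_tokens.append("Unknown:Html")
        return ""  # every tag is thrown away, known or not

    stripped = TAG_RE.sub(replace, text)
    return stripped, extra_tokens
```

With the example from the question this still yields "VIGRA", plus four
"Unknown:Html" tokens; with 'c<i></i>a<b></b>t' it yields "cat" and no
extra tokens, which is the case where the idea wouldn't help.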

> also, I ran across this link today:
> 
> 2) http://mmmservices.web.cern.ch/mmmservices/AntiSpam/
> Basically they filter out the really small fonts between 
> individual characters (look at evolution 3)
> Would a technique like this be beneficial?

Test it, and you'll know <0.5 wink>.  Once you start going down this road,
though (converting to 'eye space'), it'll be hard to stop, and you'll be
essentially writing a mail client.  It's possible that the munging itself
gives us more spammy clues, anyway (who gets munged text like that in
legitimate mail?) - or at least, if the exact token hasn't been seen before,
the word will be ignored, which pushes the message towards unsure rather
than towards ham.
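
If you do want to try it, a very rough sketch of one 'eye space' step
follows.  It assumes the filler text is wrapped in <font size=N> markup with
a small absolute size; the function name and the threshold are assumptions
for illustration, and nested font tags, relative sizes (size=-2) and CSS are
not handled, which is exactly where the "writing a mail client" problem
starts.

```python
import re

# Matches <font ... size=N ...>content</font> with an absolute numeric size.
FONT_RE = re.compile(
    r"<font\b[^>]*\bsize\s*=\s*[\"']?(\d+)[\"']?[^>]*>(.*?)</font\s*>",
    re.IGNORECASE | re.DOTALL,
)

def drop_tiny_font_text(html, min_size=2):
    """Hypothetical helper: delete font spans whose size is below
    min_size, leaving every other font span untouched."""
    def replace(match):
        if int(match.group(1)) < min_size:
            return ""           # too small to read: drop tag and content
        return match.group(0)   # readable: keep exactly as it was

    return FONT_RE.sub(replace, html)
```

The output would then go through the normal tokenizing, so the letters on
either side of a removed span end up in the same word.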

> Also, if I want to test some type of technique, what levels 
> of spam filtering/fp/fn are people getting? What percentage 
> points should I shoot for?

What you should be aiming for is to decrease whatever fp/fn/unsures you get
without the patch.  It doesn't really matter what you were getting before,
just that things improve.  If you find something that does improve results,
then post the patch & your results here, and try and convince other people
to try it out on their corpora - you'll probably get at least one taker.
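
To make "things improve" concrete, a small sketch of the comparison is
below.  The names are made up for illustration and this is not one of the
SpamBayes test scripts; it just compares the fp/fn/unsure counts from a run
without the patch against a run with it on the same corpus.

```python
from collections import namedtuple

# Counts from one test run over a fixed corpus.
Counts = namedtuple("Counts", "fp fn unsure")

def change(before, after):
    """Per-metric change (after - before); negative means fewer errors."""
    return Counts(*(a - b for a, b in zip(after, before)))

def is_improvement(before, after):
    """True if no metric got worse and at least one got better."""
    delta = change(before, after)
    return all(d <= 0 for d in delta) and any(d < 0 for d in delta)
```

A patch that trades a few more unsures for fewer false positives won't pass
this strict check, so that kind of trade-off is worth describing when you
post results.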

If results improve (from whatever to whatever) for several people, then
there's a good chance that we'll add it in (probably as an experimental
option) for the next release.  If it really seems like it works for most
people, then it can become a 'real' option, and possibly one day be on by
default.

=Tony.Meyer

_______________________________________________
spambayes-dev mailing list
spambayes-dev@python.org
http://mail.python.org/mailman/listinfo/spambayes-dev
