> I'm still fiddling around with these spams that have a bunch of
> one-letter words hiding drugs for sale:
>
>     V k I p A m G i R u A v
>     V j A v L s I t U w M g
>     X g A f N a A f X q
>     C x I e A a L g I c S l

I will try your sf patch with newer mail soon, honest! :)

[...]
> I don't think there's much to grab onto in the benign text section,
> however the url tends to vary a lot and the domain name generally seems
> very new.  For instance, according to whois, the above domain was
> created on April 28th.  I received the spam it contained on April 30th.
> The others of this ilk I've looked at were also new domains.  That
> suggests to me a couple possibilities:
>
>     * look up the age of the domains via whois (preferably caching
>       those lookups for a reasonable period - 90 days, one year?)
>
>     * note whether or not you've seen the domain before
>
>     * lookup (and cache) other information about the domain name -
>       registrar, registrant, etc.
>
> The creation date currently seems the hardest to fake, though it's
> expensive to calculate and I suppose eventually the spammers will start
> creating their own registrars (if they haven't already) and back-date
> the information they provide.

One of the things on my to-do list is to store information like this  
in the ham & spam I archive so that these sorts of things can be  
tested with the 'traditional' tools.  I have a script that does a  
bunch of DNS-based information gathering (SURBL lookups, DomainKey,  
SenderID, DNS blacklists - not the things you list above, but that  
wouldn't be that hard to add), and just need to figure out how to get  
fetchmail working properly (on OS X) so that the mail is retrieved  
and piped through it.

If you create a patch for any of the above, I'd be happy to use it  
day-to-day and let you know what appears in the token database.
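
For illustration only, the caching idea in the quoted list could be sketched
roughly as below.  This is not SpamBayes code: the class name, the token
strings, the age thresholds and the `lookup` callable (which stands in for
whatever whois query is eventually used) are all assumptions made up for the
example.

```python
import time

SECONDS_PER_DAY = 86400


class DomainAgeCache:
    """Remember domain creation dates so each domain is looked up rarely."""

    def __init__(self, lookup, ttl_days=90, clock=time.time):
        # lookup: hypothetical callable, domain -> creation time as a Unix
        # timestamp, or None if unknown.  A real one would query whois.
        self._lookup = lookup
        self._ttl = ttl_days * SECONDS_PER_DAY
        self._clock = clock
        self._entries = {}  # domain -> (created_ts, fetched_ts)

    def seen_before(self, domain):
        """True if this domain has already been looked up."""
        return domain.lower() in self._entries

    def age_days(self, domain):
        """Domain age in days, refreshing the cached entry once it expires."""
        key = domain.lower()
        now = self._clock()
        entry = self._entries.get(key)
        if entry is None or now - entry[1] > self._ttl:
            entry = (self._lookup(key), now)
            self._entries[key] = entry
        created = entry[0]
        return None if created is None else (now - created) / SECONDS_PER_DAY


def domain_age_token(age_days):
    """Bucket an age into a synthetic token (thresholds are arbitrary)."""
    if age_days is None:
        return "domain-age:unknown"
    if age_days < 7:
        return "domain-age:under-1-week"
    if age_days < 90:
        return "domain-age:under-90-days"
    return "domain-age:older"
```

Bucketing the age into a handful of tokens, rather than emitting the raw
number, gives the classifier a small set of clues it can actually learn from.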

> I suppose you could start tokenizing these one-letter runs as well and
> see if they contain embedded words:
>
>     C x I e A a L g I c S l ==> CIALIS

This seems a little too specific for me - there are lots of other  
ways to hide the rubbish letters apart from putting them in lower case.
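
To make the quoted suggestion concrete, here is a minimal sketch that handles
only the upper/lower-case pattern shown above, which is exactly the limitation
just mentioned.  The function name and the minimum run length are assumptions
for the example; nothing here comes from the SpamBayes tokenizer.

```python
import re

# A run of at least four single letters, each separated by one space.
_ONE_LETTER_RUN = re.compile(r"(?<!\S)(?:[A-Za-z] ){3,}[A-Za-z](?!\S)")


def embedded_words(text):
    """Return the words spelled by the upper-case letters in one-letter runs."""
    words = []
    for match in _ONE_LETTER_RUN.finditer(text):
        letters = match.group(0).split()
        # Keep only the capitals; the lower-case letters are the filler.
        word = "".join(c for c in letters if c.isupper())
        if len(word) >= 3:
            words.append(word)
    return words


print(embedded_words("C x I e A a L g I c S l"))  # ['CIALIS']
```

A spammer who hides the filler some other way (digits, punctuation, mixed
case) would slip straight past this.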

> Thoughts?  Anybody else seeing lots of this stuff sneak through as
> unsure?

I see a few, although I have more problems with image spam (no  
successful patches there yet).

=Tony.Meyer
_______________________________________________
spambayes-dev mailing list
spambayes-dev@python.org
http://mail.python.org/mailman/listinfo/spambayes-dev
