RE: [spambayes-dev] Re: Evading bayesian spam filtering?

Tony Meyer Wed, 16 Mar 2005 22:17:54 -0800

>> The message will have to be HTML for this to work, though, or the ascii
>> art is too huge to fit anything in (or the ways that it can be drawn
>> are limited).  (Looking at the plain-text version of the message, I
>> can't make out the words at all).
> 
> No? I don't run any HTML engine on my mails - ever (unless I have
> vaguely checked the HTML code and it comes from a clueless family
> member, and even then, it goes through "lynx -dump"), and the ascii
> art was clear and readable to me. Maybe a question of habit (I suppose
> you did use a fixed pitch font to look at the plain-text version?)


Yes - I read email in Outlook, in a fixed-width font (Courier New, size 10).
In the plain-text version of the message, the writing was massive (it used
the default font size, rather than the HTML-specified size) and so was hard
to read.  It also had a blank line between each line, which made it even
worse.

Have you looked at it in HTML?  It really is much clearer (unless your
regular font size is that small, I suppose).

> What does it score if you remove the spammy parts? I mean:
> 
>  - the following line:
> 
>    <a href="http://vietnamese.com.medattuneto.com/?Bggiw/x">more
>    convenience: LOow price meds<BR>
> 
>  - all HTML tags, and the text/html MIME declaration
> 
>  - maybe "satisfaction" from the title?

0.994954235889 - almost no difference at all.

> > (All but five of the spam clues were hapaxes, so this is almost
> > certainly because I received something very similar that was unsure
> > and then trained on).
> 
> I see. Were some of them the "random" words making up the ASCII art?
> Then you may have gotten the very same spam before :)

Yes, many of them were.  It could also have been spam that uses the same
technique and happened to hit on the same sequences of characters.  There's
a huge variety of ways that the image could be composed, but if the same
ASCII art generation engine is creating them all then there will probably be
a few sequences that are commonly used.  It's certainly extremely unlikely
that any of them would appear in ham.
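
For anyone unfamiliar with the term, here is a minimal sketch of what I mean
by a hapax clue: a token that turned up in exactly one trained message.  The
function name and the list-of-token-lists input are made up for illustration;
this is not how SpamBayes stores its counts.

```python
from collections import Counter

def hapax_clues(training_messages, clues):
    """Return the clues that occurred in exactly one training message."""
    doc_counts = Counter()
    for tokens in training_messages:
        # Count each token once per message, not once per occurrence.
        doc_counts.update(set(tokens))
    return [c for c in clues if doc_counts[c] == 1]

# Hypothetical training data: two spam messages, already tokenized.
trained_spam = [
    ["$$$$", "|___|", "meds", "price"],
    ["meds", "price", "viagra"],
]
clues = ["$$$$", "|___|", "meds", "viagra"]
hapaxes = hapax_clues(trained_spam, clues)
```

If nearly every clue for a message comes back from a check like this, the
score is resting on a single earlier message, which is what I suspect
happened here.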

It would be most interesting to know what it scored for someone else who
hasn't trained on a message like this before.  I know that I haven't seen
any false negatives from these (just unsures), but I can't recall what the
scores were.

> Really going to eye space for that kind of thing needs OCR... That's a
> wholly new level of complexity thrown in.

There are intermediate steps, but yes, that would be one way of doing it.  I
haven't used OCR in about a decade, so I really have no idea what good OCR is
like these days.  I would have thought that it would be reasonably fast &
accurate by now (it wasn't that bad then).

Regular OCR probably wouldn't help here, since if it was any good it would
churn out the same ASCII art.  You'd probably need to pre-process the image
first (blurring it, for example, or maybe some loose edge detection).
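
As a rough sketch of the blurring idea (untested against real spam, and all
the names here are made up): treat every non-space character as ink, then
average the ink over small blocks.  That throws away which characters were
used and keeps only the large shapes.  Handing the result to an OCR engine is
not shown.

```python
def to_ink_grid(ascii_art):
    """Map each character to 1 (any non-space) or 0 (space)."""
    lines = ascii_art.splitlines()
    width = max((len(line) for line in lines), default=0)
    # Pad short lines so the grid is rectangular.
    return [[0 if ch == " " else 1 for ch in line.ljust(width)]
            for line in lines]

def box_downsample(grid, block):
    """Average ink over block x block cells (a crude blur plus shrink)."""
    if not grid:
        return []
    height, width = len(grid), len(grid[0])
    out = []
    for top in range(0, height, block):
        row = []
        for left in range(0, width, block):
            cells = [grid[y][x]
                     for y in range(top, min(top + block, height))
                     for x in range(left, min(left + block, width))]
            row.append(sum(cells) / len(cells))
        out.append(row)
    return out

# Toy example: random characters forming two vertical strokes.
art = "\n".join([
    "x$x$  q#",
    "8k    z%",
    "w@r&  p!",
    "9m    b?",
])
blurred = box_downsample(to_ink_grid(art), 2)
```

The block size would need tuning to the size of the art; too small and the
individual characters survive, too large and the letters smear together.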

>> If this sort of thing works, then more of that might be necessary,
>> although I think there are other ways of countering this.
> 
> What ways are you thinking about?

Things like Jonathan A. Zdziarski's Bayesian Noise Reduction, for example.
IIRC the original intent was to counter things like word salad (which the
ASCII art technique sometimes also employs), but I would presume that a
message like this would be almost all noise (although I haven't done any
testing on this).

<http://www.nuclearelephant.com/papers/bnr.html>

You could do other analysis like this, too, like looking at sentence
structure (of which there is none here) - not a dead giveaway (there are
plenty of people sending appallingly written email), but still a clue.
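
A very crude stand-in for that kind of check, assuming nothing more than
whitespace tokenization (the regex and the function name are my own
invention, and this measures "looks like words" rather than real sentence
structure):

```python
import re

# A token counts as word-like if it is letters (with optional apostrophes
# or hyphens) followed by at most one punctuation mark.
WORD = re.compile(r"^[A-Za-z][A-Za-z'-]*[.,;:!?]?$")

def wordlike_ratio(text):
    """Fraction of whitespace-separated tokens that look like words."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if WORD.match(t)) / len(tokens)

prose = "Have you looked at it in HTML? It really is much clearer."
art = "x$x$ q#8k |_| z%w@ /\\ 9m!!"

prose_ratio = wordlike_ratio(prose)
art_ratio = wordlike_ratio(art)
```

A low ratio would only ever be one clue among many, fed to the classifier
like any other token, not a verdict on its own.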

There's also plenty of content in the headers that this trick does nothing
to hide, which can be analysed by simple statistical techniques like those
SpamBayes uses, or with black- or grey-listing, or with social-circle
analysis, or whatever.

<http://spamconference.org/abstracts.html#Boykin>

=Tony.Meyer

_______________________________________________
spambayes-dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/spambayes-dev
