>> The message will have to be HTML for this to work, though, or the
>> ascii art is too huge to fit anything in (or the ways that it can be
>> drawn are limited).  (Looking at the plain-text version of the
>> message, I can't make out the words at all).
>
> No? I don't run any HTML engine on my mails - ever (unless I have
> vaguely checked the HTML code and it comes from a clueless family
> member, and even then, it goes through "lynx -dump"), and the ascii
> art was clear and readable to me. Maybe a question of habit (I suppose
> you did use a fixed pitch font to look at the plain-text version?)

Yes - I read email in Outlook, in a fixed-width font (Courier New, size
10). Looking at the plain-text version of the message, the writing was
massive (with the default font size, rather than the HTML-specified
size) and so hard to read. It also had a blank line between each line,
which made it even worse. Have you looked at it in HTML? It really is
much clearer (unless your regular font size is that small, I suppose).

> What does it score if you remove the spammy parts? I mean:
>
> - the following line:
>
> <a
> href="http://vietnamese.com.medattuneto.com/?Bggiw/x">more
> convenience: LOow price meds<BR>
>
> - all HTML tags, and the text/html MIME declaration
>
> - maybe "satisfaction" from the title?

0.994954235889 - almost no difference at all.

> > (All but five of the spam clues were hapaxes, so this is almost
> > certainly because I received something very similar that was unsure
> > and then trained on).
>
> I see. Were some of them the "random" words making up the ASCII art?
> Then you may have gotten the very same spam before :)

Yes, many of them were. It could also have been spam that uses the same
technique and happened to hit on the same sequences of characters.
There's a huge variety of ways that the image could be composed, but if
the same ASCII art generation engine is creating them all then there
will probably be a few sequences that are commonly used. It's certainly
extremely unlikely that any of them would appear in ham.

It would be most interesting to know what it scored for someone else
who hasn't trained on a message like this before. I know that I haven't
seen any false negatives from these (just unsures), but I can't recall
what the scores were.

> Really going to eye space for that kind of thing needs OCR... That's a
> wholly new level of complexity thrown in.

There are mid steps, but yes, that would be one way of doing it. I
haven't used OCR in about a decade, so I really have no idea what good
OCR is like these days.
I would have thought that it would be reasonably fast & accurate by now
(it wasn't that bad then). Regular OCR probably wouldn't help here,
since if it was any good it would churn out the same ASCII art. You'd
probably need to pre-process the image first (blurring it, for example,
or maybe some loose edge detection).

>> If this sort of thing works, then more of that might be necessary,
>> although I think there are other ways of countering this.
>
> What ways are you thinking about?

Things like Jonathan A. Zdziarski's Bayesian Noise Reduction, for
example. IIRC the original intent was to counter things like word salad
(which the ASCII art technique sometimes also employs), but I would
presume that a message like this would be almost all noise (although I
haven't done any testing on this).

<http://www.nuclearelephant.com/papers/bnr.html>

You could do other analysis like this, too, like looking at sentence
structure (of which there is none here) - not a dead giveaway (there
are plenty of people sending appallingly written email), but still a
clue. There's also plenty of content in the headers that this trick
does nothing to hide, which can be analysed by simple statistical
techniques like SpamBayes uses, or with black- or grey-listing, or with
social-circle analysis, or whatever.

<http://spamconference.org/abstracts.html#Boykin>

=Tony.Meyer
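
A rough sketch of the pre-processing idea mentioned above (blurring the
ASCII art before any OCR step). It is an illustration only, not anything
SpamBayes does: the function name, the cell sizes and the threshold are
all made up for the example, and it works on the text grid directly
rather than on a rendered image. Every non-space character counts as
"ink", blocks of characters are averaged (a crude box blur), and the
result is thresholded, so the particular random letters used to draw the
art stop mattering.

```python
def ascii_art_to_bitmap(text, cell_w=2, cell_h=2, threshold=0.5):
    """Collapse ASCII art into a coarse 0/1 grid (hypothetical helper).

    Assumption: blank rows are mailer artefacts (the "blank line between
    each line" case) and can be dropped; real art with intentionally
    empty rows would need gentler handling.
    """
    rows = [line for line in text.splitlines() if line.strip()]
    if not rows:
        return []
    width = max(len(row) for row in rows)
    # 1 where any non-whitespace character was drawn, 0 elsewhere.
    ink = [[0 if ch.isspace() else 1 for ch in row.ljust(width)]
           for row in rows]

    bitmap = []
    for top in range(0, len(ink), cell_h):
        out_row = []
        for left in range(0, width, cell_w):
            cells = [ink[y][x]
                     for y in range(top, min(top + cell_h, len(ink)))
                     for x in range(left, min(left + cell_w, width))]
            # Average ink in the block, then threshold it.
            density = sum(cells) / len(cells)
            out_row.append(1 if density >= threshold else 0)
        bitmap.append(out_row)
    return bitmap
```

Two messages drawn with different filler characters then reduce to the
same grid, which is the form a later recognition step (or even a plain
tokenizer) would have a chance with.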
