Re: [Spambayes] SpamBayes to Handle Embedded Images

Erik Brown Mon, 03 Oct 2005 14:56:50 -0700

FMJ,

You need to experiment with the following config options. I do not have a problem whatsoever with embedded images. They usually link to a site and it gets all of the related tokens. Try for yourself and report back:

[Classifier]
# Generate both unigrams (words) and bigrams (pairs of words). However,
# extending an idea originally from Gary Robinson, the message is
# 'tiled' into non-overlapping unigrams and bigrams, approximating the
# strongest outcome over all possible tilings. Note that to really test
# this option you need to retrain with it on, so that your database
# includes the bigrams - if you subsequently turn it off, these tokens
# will have no effect. This option will at least double your database
# size given the same training data, and will probably at least triple
# it. You may also wish to increase the max_discriminators (maximum
# number of extreme words) option if you enable this option, perhaps
# doubling or quadrupling it. It's not yet clear. Bigrams create many
# more hapaxes, and that seems to increase the brittleness of minimalist
# training regimes; increasing max_discriminators may help to soften
# that effect. OTOH, max_discriminators defaults to 150 in part because
# that makes it easy to prove that the chi-squared math is immune from
# numeric problems. Increase it too much, and insane results will
# eventually result (including fatal floating-point exceptions on some
# boxes). This option is experimental, and may be removed in a future
# release. We would appreciate feedback about it if you use it - email
# [email protected] with your comments and results.
x-use_bigrams: True
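
To make the 'tiling' idea in the comment above concrete, here is a rough Python sketch. It is not the SpamBayes implementation: the function names, the "bi:" token prefix, and the greedy strongest-first selection are assumptions made for illustration, and `prob` stands in for whatever per-token spam probability the trained database would supply.

```python
def unigrams_and_bigrams(words):
    """Yield (token, positions) for every word and every adjacent word pair.

    The "bi:" prefix is an illustrative choice, not a documented format.
    """
    prev = None
    for i, word in enumerate(words):
        yield word, (i,)
        if prev is not None:
            yield f"bi:{prev} {word}", (i - 1, i)
        prev = word


def tile(words, prob):
    """Greedily pick the strongest clues whose word positions do not overlap.

    `prob` maps a token to a spam probability in [0, 1]; strength is the
    distance from the neutral value 0.5. This approximates, rather than
    guarantees, the strongest outcome over all possible tilings.
    """
    candidates = sorted(
        unigrams_and_bigrams(words),
        key=lambda item: abs(prob(item[0]) - 0.5),
        reverse=True,
    )
    covered = set()
    chosen = []
    for token, positions in candidates:
        if covered.isdisjoint(positions):
            chosen.append(token)
            covered.update(positions)
    return chosen


if __name__ == "__main__":
    # Hypothetical probabilities, made up for this example.
    probs = {"bi:cheap rolex": 0.99, "rolex": 0.9, "watches": 0.8, "cheap": 0.6}
    print(tile(["cheap", "rolex", "watches"], lambda t: probs.get(t, 0.5)))
```

Because a strong bigram claims both of its words, the unigrams it covers are dropped from scoring, which is why the database has to be retrained with the option on before the bigrams can contribute anything.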

[Tokenizer]
# This non-default option is very effective
# at nailing Asian spam with little training and small database burden.
# It should probably be exposed via the GUI, as it's not appropriate
# for people who get "high-bit ham". Asian spam is nailed with this
# False too, but it requires more training and a larger database, since
# a sufficient variety of "8bit%" and "skip" metatokens take longer to
# learn about than strings of question marks.
replace_nonascii_chars: True

# It's helpful for Tim <wink>.
record_header_absence: True

# Recognize 'www.python.org' or ftp.python.org as URLs instead of just
# long words.
x-fancy_url_recognition: True

# Note whether url contains non-standard port or user/password elements.
x-pick_apart_urls: True

basic_header_tokenize: True
basic_header_skip: date x-.* domainkey-signature list-.*
check_octets: True
mine_received_headers: True
summarize_email_prefixes: True
summarize_email_suffixes: True
skip_max_word_size: 50

[URLRetriever]
# So that SpamBayes doesn't need to retrieve the same URL over and over
# again, it stores local copies of the text at the end of the URL. This
# is the directory that will be used for those copies.
x-cache_directory: url-cache

# This is the number of days that local cached copies of the text at the
# URLs will be stored for.
x-cache_expiry_days: 31

# To try and speed things up, and to avoid following unique URLs, if
# this option is enabled, SpamBayes will convert the URL to as basic a
# form as we can. All directory information is removed and the domain is
# reduced to the two (or three for those with a country TLD) top-most
# elements. For example,
#   http://www.massey.ac.nz/~tameyer/index.html?you=me
# would become
#   http://massey.ac.nz
# and
#   http://id.example.com
# would become
#   http://example.com
# This should have two beneficial effects:
#  o It's unlikely that any information could be contained in this
#    'base' url that could identify the user (unless they have a *lot*
#    of domains).
#  o Many urls (both spam and ham) will strip down into the same 'base'
#    url. Since we have a limited form of caching, this means that a lot
#    fewer urls will have to be retrieved.
# However, this does mean that if the 'base' url is hammy and the full
# is spammy, or vice-versa, that the slurp will give back the wrong
# information. Whether or not this is the case would have to be
# determined by testing.
x-only_slurp_base: True
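
The reduction described for x-only_slurp_base can be sketched in a few lines of Python. This is only an illustration of the rule stated in the comment, not the SpamBayes code: the function name `base_url` is hypothetical, and treating any two-letter final label as a country TLD is an assumed heuristic.

```python
from urllib.parse import urlsplit


def base_url(url):
    """Strip path and query, and keep only the top-most domain elements.

    Assumption: a two-letter final label is a country TLD, so three
    labels are kept; otherwise two labels are kept.
    """
    parts = urlsplit(url)
    labels = (parts.hostname or "").split(".")
    keep = 3 if len(labels[-1]) == 2 else 2
    return f"{parts.scheme}://{'.'.join(labels[-keep:])}"


if __name__ == "__main__":
    print(base_url("http://www.massey.ac.nz/~tameyer/index.html?you=me"))
    print(base_url("http://id.example.com"))
```

With the two example URLs from the comment, this yields http://massey.ac.nz and http://example.com respectively.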

# If this option is enabled, when a message normally scores in the
# 'unsure' range, and has fewer tokens than the maximum looked at, and
# contains URLs, then the text at those URLs is obtained and tokenized.
# If those tokens result in the message moving to a score outside the
# 'unsure' range, then they are added to the tokens for the message.
# This should be particularly effective for messages that contain only a
# single URL and no other text.
x-slurp_urls: True

# It may be that what is hammy/spammy for you in email isn't from
# webpages. You can then set this option (to "web:", for example), and
# effectively create an independent (sub)database for tokens derived
# from parsing web pages.
# "x-web_prefix" is a string value that defines a prefix to be added to tokens
# generated from a slurped URL. This would be used if you wanted the tokens
# generated from a web page to be separate from the tokens generated from the
# body of an email message. For example, the config setting
# "x-web_prefix:web:" would generate a token "spambayes" if it appears in an
# email and "web:spambayes" if it appears in a slurped URL.
x-web_prefix:web:
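
As a rough illustration of how x-slurp_urls and x-web_prefix fit together, the conditions in the comments above could be written as follows. This is a sketch under stated assumptions, not SpamBayes code: both function names are hypothetical, and the 0.2 and 0.9 cutoffs are assumed values for the 'unsure' range (only the 150 default for max_discriminators comes from the option text quoted earlier).

```python
def should_slurp(score, n_tokens, urls,
                 ham_cutoff=0.2, spam_cutoff=0.9, max_discriminators=150):
    """Return True when all three conditions from the option text hold.

    The message must score in the 'unsure' range, have fewer tokens than
    the maximum looked at, and contain at least one URL. The cutoff
    defaults are assumptions for this example.
    """
    unsure = ham_cutoff < score < spam_cutoff
    return unsure and n_tokens < max_discriminators and bool(urls)


def prefix_web_tokens(tokens, web_prefix="web:"):
    """Tag tokens that came from a retrieved page so they are stored
    separately from tokens found in the message body."""
    return [web_prefix + token for token in tokens]


if __name__ == "__main__":
    if should_slurp(0.5, 40, ["http://example.com"]):
        print(prefix_web_tokens(["spambayes", "filter"]))
```

Keeping the prefix on slurped tokens means a word can be hammy in mail and spammy on web pages (or the reverse) without the two counts interfering.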

Erik Brown

From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of [EMAIL PROTECTED]
Sent: Sunday, October 02, 2005 6:44 PM
To: 'Herb Martin'
Cc: [email protected]
Subject: Re: [Spambayes] SpamBayes to Handle Embedded Images

Herb,

OCR is probably the only sure-fire way to nail this scourge. As for being resource intensive: like most other people with always-on broadband access now, my e-mail just trickles in a little at a time. And many/most PCs are powerful enough to stream video nowadays; they really shouldn't have a problem with it being added as a feature. It's a lot more disruptive to manage these by hand, if you ask me. And an OCR feature could be made optional, so it could be disabled if it ended up being a performance problem for someone.

It's gotta be done. Now that these spammers have found an easy way to trick these engines into digging through meaningless text, there'll be no slowing them without OCR. I'm getting more and more of this style of Spam. Easy to install/use programs like SpamBayes have to keep up with the times, or they'll die on the vine. Years ago, when we mostly exchanged text-based e-mail, it wasn't an issue. But now, nearly all of the e-mail I receive is HTML; and lots of it has images.

I'm ONLY using SpamBayes with Outlook 2003 (at home, where I'm having all the trouble). I love the easy button-based re-training! And I don't really care for the idea of having to add, train, and administer another layer.

Other than a miraculous OCR feature showing up in SpamBayes soon, I'm out of ideas for a simple way of managing this type of mail on my home PC. (Very frustrating).

Thanks,

FMJ

From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Herb Martin
Sent: Sunday, October 02, 2005 12:43 PM
To: [email protected]
Subject: Re: [Spambayes] SpamBayes to Handle Embedded Images

Back in April, Tony Meyer posted that he was receiving a lot of image-based spam.

I too am having nothing but trouble with embedded images:

- Daily ads for fake Rolex watches

- Daily stock tips

- TONS of drugs for sale.

This style of Spam contains an image at the top, followed by a bunch of totally unrelated text that has been copied from some kind of random composition. I have very large Spam & Ham folders, that I've successfully trained SpamBayes with. It's only these image-based adverts that sneak by EVERY DAY.

Mostly my SpamBayes catches ALL of these when anything gets this far...

Something really needs to be done about this type of Spam within SpamBayes. Are any other Spam engines able to handle this stuff, by scanning the image for text, or something?

Sure, there are others (as well as SpamBayes if you just keep training EVERY ONE of them) but most of the others are either commercial (i.e., cost money) OR they run on the Server (SpamAssassin, greylistd, and other filters.)

There has been talk about filters which would explicitly do OCR or some other type of image content detection but I don't (personally) know of any that are working/available/effective right now.

Such would also likely be "resource (CPU) intensive".

FWIW, greylisting on the server knocks down practically all of this junk and SpamAssassin catches the rest.

The VERY occasional item that slips through our server is caught by SpamBayes. (Defense in depth is our key to ZERO spam -- with practically everything REJECTED, not bounced, at the server during SMTP connect time.)

And some of us DO WISH to get graphical email -- pictures of my grandkid(s) frequently arrive this way.

--
Herb Martin

From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of [EMAIL PROTECTED]
Sent: Sunday, October 02, 2005 1:53 PM
To: [email protected]
Subject: [Spambayes] SpamBayes to Handle Embedded Images

_______________________________________________
[email protected]
http://mail.python.org/mailman/listinfo/spambayes
Check the FAQ before asking: http://spambayes.sf.net/faq.html
