Re: [Spambayes] Analyzing text in image spam

Peter Barker Sun, 20 Aug 2006 16:41:32 -0700

I have installed the CVS version as suggested. A couple of points that may 
help others trying it, especially when building PIL: I am using FC5 on AMD64, 
and had to install tk-devel and tcl-devel as well as tk and tcl (and tkinter 
etc). To get PIL to build with support for everything, I had to 
add /usr/lib64 to the standard library paths in PIL's setup.py. The freetype2 
files required by PIL are in freetype-devel.
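
For anyone checking whether their own PIL build picked up the optional 
libraries, here is a minimal sketch. The module names are assumptions: 
classic PIL 1.1.x installs its C extensions as top-level modules, while newer 
packagings may put them inside the PIL package, so the sketch tries both. The 
helper names (first_importable, report) are made up for illustration.

```python
import importlib

# Assumed module names; adjust them if your PIL build lays things out differently.
CANDIDATES = {
    "core imaging": ("_imaging", "PIL._imaging"),
    "freetype": ("_imagingft", "PIL._imagingft"),
    "tkinter bridge": ("_imagingtk", "PIL._imagingtk"),
}


def first_importable(names):
    """Return the first module name in `names` that imports cleanly, else None."""
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            continue
        return name
    return None


def report(candidates=CANDIDATES):
    """Map each feature label to the module that provides it (or None)."""
    return {feature: first_importable(names) for feature, names in candidates.items()}


if __name__ == "__main__":
    for feature, module in report().items():
        print("%-15s %s" % (feature, module or "MISSING"))
```

A MISSING entry for freetype or the tkinter bridge would suggest that setup.py 
did not find the corresponding -devel files at build time.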


I will report how it performs in a few days. Is there any easy way to test it 
against my current spam collection without creating a new .hammiedb and 
starting again? My email is stored in mbox format, one file per folder. I 
tried feeding it a few messages which had previously been misclassified, and 
they were now classified as spam, but I think that is because I had trained 
on them as spam after I received them (with version 1.1a2). I am using kmail 
with sb_bnfilter.py. Can I tell from the X-Spambayes-Evidence header whether 
the new image code is contributing to the classification?
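
One way to look for this, sketched below, is to scan an mbox file for 
evidence clues that come from the image code. This assumes the header is a 
"; "-separated list of 'token': probability pairs, and that image-derived 
tokens start with prefixes such as "image-text:" or "image:size:". Both are 
assumptions that should be checked against a real header. The function names 
are hypothetical.

```python
import mailbox

# Assumed prefixes for tokens produced by the image code; verify against real headers.
IMAGE_PREFIXES = ("image-text:", "image-text-lines:", "image:size:")


def parse_evidence(header_value):
    """Split an X-Spambayes-Evidence value into (token, probability) pairs."""
    pairs = []
    # Undo header folding by collapsing all runs of whitespace to one space.
    unfolded = " ".join(header_value.split())
    for clue in unfolded.split("; "):
        token, sep, prob = clue.rpartition(": ")
        if not sep:
            continue
        try:
            pairs.append((token.strip("'\""), float(prob)))
        except ValueError:
            continue  # not a 'token': number pair, skip it
    return pairs


def image_clues(header_value, prefixes=IMAGE_PREFIXES):
    """Return only the clues whose token starts with an image-related prefix."""
    return [(t, p) for t, p in parse_evidence(header_value) if t.startswith(prefixes)]


def scan_mbox(path):
    """Print the subject and image clues of every message that has any."""
    for msg in mailbox.mbox(path):
        evidence = msg.get("X-Spambayes-Evidence")
        if evidence:
            clues = image_clues(str(evidence))
            if clues:
                print(msg.get("Subject", "(no subject)"), clues)
```

Calling scan_mbox("/path/to/spam.mbox") on a folder of already-filtered mail 
would list the messages where image tokens showed up as evidence. Such tokens 
can presumably only appear once some image spam has been trained with the new 
code enabled.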

Regards,
Peter Barker

> >>>>> "skip" == skip  <[EMAIL PROTECTED]> writes:
>
> I should have given a bit more complete answer based on your message's more
> general point.  I recently added a fair amount of code to SpamBayes to
> "crack" the content of images.  The new code works very well for me.  If
> you'd like to try it, here's what you'll need to do:
>
>     1. Check out the latest source from the CVS repository.  (There's been
>        no new release since my recent checkins.)  Install it.
>
>     2. Install the Python Imaging Library:
>            http://www.pythonware.com/products/pil/
>
>     3a. (Windows) Grab the ocrad-cygwin package from the
>        SpamBayes Files page:
>            http://sourceforge.net/project/showfiles.php?group_id=61702
>        Unpack the zip file and copy ocrad.exe somewhere on your PATH.
>
>     3b. (Unix/Linux/Mac) Grab the ocrad source distribution from its web
>         site:
>             http://www.gnu.org/software/ocrad/ocrad.html
>         Unpack and install it.
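
A quick way to confirm that step 3 worked is to check that the ocrad binary 
is visible on PATH. This is only a sketch: it assumes the binary is named 
"ocrad" (ocrad.exe on Windows is found the same way) and that it accepts the 
usual GNU --version flag. The function names are hypothetical.

```python
import shutil
import subprocess


def find_ocrad(name="ocrad"):
    """Return the full path of the OCR binary if it is on PATH, else None."""
    return shutil.which(name)


def ocrad_version(path):
    """Return the first line printed by `<path> --version`, or None on failure."""
    try:
        result = subprocess.run(
            [path, "--version"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.SubprocessError):
        return None
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    path = find_ocrad()
    if path is None:
        print("ocrad not found on PATH")
    else:
        print(path, "->", ocrad_version(path))
```
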
>
> I realize this may not be all that straightforward for people who are
> unused to installing open source software.  Once you've done it a couple
> times though, it gets easier.  Hopefully, we can get another SpamBayes
> alpha release out in the next little while.  (Tony, if there's anything I
> can do to help make this happen, let me know.)
>
> Once you're ready to go, add the following to your SpamBayes options:
>
>     x-lookup_ip: True
>     lookup_ip_cache: ~/.dnscache
>
>     x-image_size: True
>
>     x-crack_images: True
>     crack_image_cache: ~/.image_cache.pickle
>
> The first group is unrelated to the image spam, but I find it helps me a
> lot.  It maps hostnames to their IP addresses using DNS and generates
> tokens based on those addresses.  The second records tokens about the size
> of images.  The third enables text extraction from images (OCR, or optical
> character recognition).  This is where PIL and Ocrad come in.
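
To double-check that the options above were picked up, the sketch below reads 
an options file and reports the three groups. It assumes the options live in 
a [Tokenizer] section of an ini-style file (for example bayescustomize.ini). 
The section name, the SAMPLE text and the function name are assumptions for 
illustration, not something stated in the message above.

```python
import configparser
import os

# Sample options text; the [Tokenizer] section header is an assumption.
SAMPLE = """\
[Tokenizer]
x-lookup_ip: True
lookup_ip_cache: ~/.dnscache
x-image_size: True
x-crack_images: True
crack_image_cache: ~/.image_cache.pickle
"""


def image_options(text, section="Tokenizer"):
    """Return the image/DNS related options found in `text` as a dict."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    if not parser.has_section(section):
        return {}
    found = {}
    # Boolean switches default to False when absent.
    for name in ("x-lookup_ip", "x-image_size", "x-crack_images"):
        found[name] = parser.getboolean(section, name, fallback=False)
    # Cache paths get "~" expanded; absent entries are reported as None.
    for name in ("lookup_ip_cache", "crack_image_cache"):
        value = parser.get(section, name, fallback="")
        found[name] = os.path.expanduser(value) if value else None
    return found


if __name__ == "__main__":
    for key, value in image_options(SAMPLE).items():
        print(key, "=", value)
```

To check a real file, pass its contents in place of SAMPLE, for example 
image_options(open(path).read()).
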
>
> I still get the occasional false negative on image spam, but it's
> definitely manageable and should improve as Ocrad (itself still a very
> alpha piece of software) improves.  Even though Ocrad does a poor job of
> text extraction from a human comprehension standpoint, it generates tokens
> that SpamBayes just loves and seems to generate enough unique tokens to tip
> the scales on most image spam.
>
> Skip