    David> The problem I'm having right now, I think, is that in those
    David> messages where the image spam isn't successfully OCR'd, the
    David> garbage words around the image get trained and degrade the
    David> overall performance of my system.  Of course, that's just a
    David> guess, but it sure seems like these days a lot more plain spam
    David> messages that ought to be recognized as such are sneaking through
    David> than used to.

If you run ocrad over some spam text images you can see what it generates.
If it finds nothing, nothing comes out the back end.  If it sees something,
it's almost certain to be some garbage text peculiar to it, unlikely to turn
up in normal text.  For example, here's a pretty clean image:

    http://www.webfast.com/~skip/bogus-5-3.png

Here's what ocrad produces by default:

    COULD THl_ BE THE NEXT IBM_
    ALL _|___ _wow IWAl LllL |_ ABO_| lo EXPLODEl
    WAIIW LllL p_ Ll_E A WAW_ _IARll__ WO_DA_ _EPIEWBER lll

    IomO_n_ __m_ L |_IL IOWP_IER_ |_I (o_h__ OII LllL p_)
    __o__ __mbol LllL
    F_ld__ Ilo__ O Tl (_o s_/_ On F_ld__ Alon_|)
    _ d__ |__o__ __
    I____n_ R__lnO ___onO B__
    \
    ln _h_ Io____ ot _ W___. LllL W____ ______| ___nnlnO Wo___'

    L ln___n__lon_| Anno_n___

    On_lo__h(IW) _P_o_P__ TP_hnoloO_ b_
    B_llP_ p_oo_ Da_a _P___|__ Ba_k_O_ and _P__o_P_
    |__ ____ __n____lon p__Aqco_TM_/P__AID CO_TM_
    _|__a Po__ablP wloh _OPPd _olld __a_P D_|_P TP_hnoloO_
    _h_ W___oOoll_. _hP Wo_ld _ _|___ _g laO_oO ComOrfP_
    _Pa___lnO W_ldla _ Q_a_ll TP_hnoloO_
    \
    L ln___n__lon_| _IOn_ _4 _W E__oO__n Dl___lb__lon AO___m_n_

    Th_ b_Pmo__ __PO b_wa_d _a__|_al _Pn___P |_ amonO o_hP_ p__|__|_P
    dl___lb__lon aO_PPmPn__ ____Pn_|_ _ndP_ nPOo_la_lon ยช_ _P_P_al addl_lonal
    hlOh O_ofi_ _POlon_ and _PO_P_Pn__ a kP_ ___a_POl_ Oa__nP__hlO _ha_ _P___P_
    l ln_P_na_lonal ComO__P__ wl_h ___|_ Olobal ma_kP_ _Pa_h and O_a_an_PPd
    O_P _alP_ and lo_k_ _hP _omOan_ ln hlOhl_ dP_|_ablP p__|__|_P dl___lb__lon
    ma_kP__

    READ MORE ONLINE NOWl

    OPPORl__||_ DOE_ _ol __OI_ o_ IWE DOOR E_ER_ DA_|
    _o _A_E A Wl__IE IOODD LllL lo _O_R RADAR _ow A_D
    WAIIW II _OARl

So, even though ocrad doesn't do a very good job extracting text from the
image, most of the "words" it does produce are unlikely to match anything
unless they turn up in other, similar image spams.  The only drawback I can
see to those extra tokens is a bit of database bloat.

    Seth> Nonetheless, I suspect that Spambayes could improve by creating
    Seth> more synthetic tokens that describe the image better and taking
    Seth> advantage of serendipitous differences between tokens for image
    Seth> spam and those in each user's ham.

Correct on the last part.  It's unlikely that "_IOn__" will turn up in
normal text unless it's Klingon text.  If you record it as a spam clue in
one email and it turns up in another, that's probably a good sign they are
related.
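
To make that concrete, here's a rough sketch of the idea.  It is not
SpamBayes's actual tokenizer: the function names, the "ocr:" prefix and the
three-character minimum are all made up for illustration.  It just splits
OCR output on whitespace and looks for garbage "words" that two messages
have in common.

```python
def ocr_tokens(ocr_text, prefix="ocr:", min_len=3):
    """Turn raw OCR output into a set of prefixed tokens.

    The prefix and minimum length are illustrative assumptions; they keep
    OCR-derived clues separate from ordinary body-text tokens and drop
    very short fragments.
    """
    tokens = set()
    for word in ocr_text.split():
        if len(word) >= min_len:
            tokens.add(prefix + word)
    return tokens


def shared_clues(ocr_a, ocr_b):
    """Return the OCR tokens that appear in both messages."""
    return ocr_tokens(ocr_a) & ocr_tokens(ocr_b)


# Two lines taken from the ocrad output above, standing in for the OCR
# output of two different image spams.
first = "ALL _|___ _wow IWAl LllL |_ ABO_| lo EXPLODEl"
second = "WAIIW LllL p_ Ll_E A WAW_ _IARll__ WO_DA_ _EPIEWBER lll"

common = shared_clues(first, second)
print(sorted(common))
```

A token like "LllL" showing up in both is the sort of coincidence that
should almost never happen with ham.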

As to "creating more synthetic tokens", I'm open to suggestions.
Ignoring its OCR features, I think SpamBayes currently records that an
image is present, its MIME type (distinguishing GIF spams from Grandma's
JPEG photos, for example), and the log of its size.  Maybe it could also
generate clues related to the image's dimensions, the total number of
images in the email, or the number of distinct colors.  Do you have other
suggestions?
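
Something like the following is what I have in mind.  Treat it as a sketch
only: ImageInfo, image_tokens and every token name here are hypothetical,
not what SpamBayes emits today, and it assumes the image properties have
already been extracted by some other means.

```python
import math
from dataclasses import dataclass


@dataclass
class ImageInfo:
    """Hypothetical container for already-extracted image properties."""
    mime_type: str
    size_bytes: int
    width: int
    height: int
    n_colors: int


def log2_bucket(value):
    """Coarse log2 bucket so that similar values share one token."""
    return int(math.log2(max(1, value)))


def image_tokens(images):
    """Generate synthetic tokens describing the images in one message."""
    tokens = ["image:count:%d" % len(images)]
    for img in images:
        tokens.append("image:type:%s" % img.mime_type)
        tokens.append("image:log2-size:%d" % log2_bucket(img.size_bytes))
        tokens.append("image:log2-area:%d" % log2_bucket(img.width * img.height))
        # Aspect ratio rounded to one decimal; guard against zero height.
        aspect = round(img.width / max(1, img.height), 1)
        tokens.append("image:aspect:%s" % aspect)
        tokens.append("image:log2-colors:%d" % log2_bucket(img.n_colors))
    return tokens


example = [ImageInfo("image/gif", 20000, 400, 300, 16)]
for token in image_tokens(example):
    print(token)
```

Bucketing by log keeps the number of distinct tokens small, which matters
given the database bloat mentioned above.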

Skip