[spambayes-bugs] [ spambayes-Patches-1475188 ] note runs of short words

SourceForge.net Sun, 23 Apr 2006 15:29:33 -0700

Patches item #1475188, was opened at 2006-04-23 17:29
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498105&aid=1475188&group_id=61702


Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Skip Montanaro (montanaro)
Assigned to: Tony Meyer (anadelonbrin)
Summary: note runs of short words

Initial Comment:
I've recently been seeing a lot of pharmacy spam with few, if any, clues.
The message bodies look like this:

    X j A m N j A d X h
    M k E z R d I p D u I m A c
    C o I d A t L j I v S j
    A j M w B p I q E s N p
    V a I a A g G z R j A f
    S b O n M u A p
    V f A g L m I h U q M b
    S u A u V o E q n O t V y E n R d r 7 b 0 k % d x W c I n T p H d u O s
    U h R i b S s H q O p P h ! k

    http://www.chilreanno.com

This is followed by some drivel meant to boost "good" words.  The URL
changes frequently and, like most spam, the messages seem to come from
all over the place, so there are very few clues present for SpamBayes
to munch on.

The attached patch pays attention to runs of words that are too
short to be considered otherwise and emits a token that's the base 2
log of the longest run of such words seen in the message.  The result
seems to add an extra useful structural token to the mix and makes
these particular types of spam less likely to score as unsure.
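
A minimal sketch of the idea described above, not the attached patch itself.
The function name short_run_token, the min_len cutoff of 3 characters, and
the choice to emit nothing when a message has no short words are all
assumptions made for illustration.

```python
import math


def short_run_token(text, min_len=3):
    """Return 'short:N' for the longest run of short words, or None.

    min_len is an assumed cutoff: words with fewer characters than
    this count as "short".
    """
    longest = 0
    current = 0
    for word in text.split():
        if len(word) < min_len:
            # Extend the current run of consecutive short words.
            current += 1
            longest = max(longest, current)
        else:
            # A longer word ends the run.
            current = 0
    if longest == 0:
        return None  # assumption: no token when there are no short words
    # Integer part of the base 2 log of the longest run.
    return "short:%d" % int(math.log2(longest))


# First line of the sample body: ten one-letter words in a row.
print(short_run_token("X j A m N j A d X h"))  # short:3
```

Taking the log buckets run lengths coarsely, so runs of 8 through 15 short
words all map to the same token.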

I didn't just check it in for a couple of reasons.  One, I was targeting
just a single kind of message.  I'm not anxious to get into a
SpamAssassin-type escalation of, "hey, this kind of message does
this, let's try that", sort of thing.  I'd prefer it if the concept were
applicable to a broader variety of spams.  Two, I no longer have any
sort of test database other than my current personal collection of ham
and spam (between 300 and 400 messages), so I can't really test it
properly to see if it's a net win.

Like I said, it seemed to help in this instance.  Here's my collection
of short:* tokens:

    token,nspam,nham,spam prob
    short:7,3,0,0.934782608696
    short:6,6,0,0.96511627907
    short:5,2,1,0.5
    short:4,3,2,0.366449889676
    short:3,3,1,0.5
    short:2,19,15,0.319154484346
    short:1,196,69,0.5
    short:0,63,25,0.5

My database is currently a bit unbalanced (5 spams for every 2 hams),
hence (I think) the unusual spamprobs.
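
As a rough check on the table, the two spam-only rows can be reproduced with
a Robinson-style adjusted probability, assuming an unknown-word strength of
0.45 and an unknown-word probability of 0.5.  The function name and the
corpus totals (500 spam, 200 ham) are hypothetical; the totals do not affect
tokens that were never seen in ham.

```python
def spamprob(nspam, nham, total_spam, total_ham, s=0.45, x=0.5):
    """Adjusted spam probability for a token seen at least once.

    s and x are the assumed unknown-word strength and probability.
    """
    spam_ratio = nspam / total_spam
    ham_ratio = nham / total_ham
    # Raw probability from the per-corpus frequencies.
    p = spam_ratio / (spam_ratio + ham_ratio)
    n = nspam + nham
    # Pull the raw estimate toward x when there is little evidence.
    return (s * x + n * p) / (s + n)


# Hypothetical totals; they cancel out when nham == 0.
print(round(spamprob(3, 0, 500, 200), 12))  # short:7 row
print(round(spamprob(6, 0, 500, 200), 12))  # short:6 row
```

Rows with a nonzero ham count depend on the actual corpus totals, which the
message does not give.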

Assigning to Tony just so someone has a chance to give it the once
over during the 1.1 alpha phase.

Skip


_______________________________________________
Spambayes-bugs mailing list
[email protected]
http://mail.python.org/mailman/listinfo/spambayes-bugs
