    >> Their physical and structural proximity is not noted.  Synthetic
    >> tokens based on hostname or IP address in the urls will be generated
    >> if you add x-pick_apart_urls:True to the Tokenizer section of your
    >> config file.

    Dave> That doesn't sound like it's doing what I'm asking about.  

No, it's not.  However, you might be surprised how helpful it can be to
generate tokens for the /8, /16, /24 and /32 address blocks.  What I was
implying is that maybe you don't need the spoof detection you were asking
for if the address tokens generated from the spammer's IP address are
spammy.
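
To make the address-block idea concrete, here is a minimal sketch of how
one token per enclosing block could be derived from a single IPv4 address.
It is not SpamBayes' own code: the function name address_block_tokens and
the "ip-block:" token prefix are made up for illustration.

```python
import ipaddress


def address_block_tokens(ip, prefixes=(8, 16, 24, 32)):
    """Return one synthetic token per enclosing block of an IPv4 address.

    The "ip-block:" prefix is a hypothetical token format, not the one
    SpamBayes actually emits.
    """
    addr = ipaddress.IPv4Address(ip)
    tokens = []
    for prefix in prefixes:
        # strict=False masks off the host bits, giving the enclosing network.
        net = ipaddress.ip_network(f"{addr}/{prefix}", strict=False)
        tokens.append(f"ip-block:{net.network_address}/{prefix}")
    return tokens


if __name__ == "__main__":
    for token in address_block_tokens("192.0.2.55"):
        print(token)
```

If a spammer moves between hosts in the same /24 or /16, the coarser
tokens still recur across messages, which is what lets the classifier
learn that the whole block is spammy.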

    Dave> I want a special token that is generated each time a link's text
    Dave> is just a URL and the link and the URL text don't point to the
    Dave> same place.

That will require actually parsing the HTML at some level.  SpamBayes just
sees a stream of tokens.  It doesn't really know much (if anything) about
compound structure.
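
For illustration only, the kind of HTML-level check being asked for might
look roughly like the sketch below, which uses the standard library's
html.parser.  It is not part of SpamBayes; the class name, the function
link_mismatch_tokens and the "link:text-href-mismatch" token are all
hypothetical, and "point to the same place" is simplified here to
"have the same hostname".

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


def _host(url):
    """Return the lowercased hostname of url, or None if it cannot be parsed."""
    try:
        return urlsplit(url).hostname
    except ValueError:
        return None


class LinkMismatchFinder(HTMLParser):
    """Collect <a> elements whose text is a bare URL on a different host."""

    def __init__(self):
        super().__init__()
        self._href = None      # href of the <a> currently open, if any
        self._text = []        # text chunks seen inside that <a>
        self.mismatches = []   # (link text, href) pairs that disagree

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != "a" or self._href is None:
            return
        text = "".join(self._text).strip()
        # Only consider links whose visible text is just a URL.
        is_bare_url = (text.lower().startswith(("http://", "https://"))
                       and len(text.split()) == 1)
        if is_bare_url:
            shown, actual = _host(text), _host(self._href)
            if shown and actual and shown != actual:
                self.mismatches.append((text, self._href))
        self._href = None


def link_mismatch_tokens(html):
    """Return one hypothetical synthetic token per mismatched link."""
    finder = LinkMismatchFinder()
    finder.feed(html)
    finder.close()
    return ["link:text-href-mismatch"] * len(finder.mismatches)
```

Comparing hostnames rather than full URLs is a deliberate simplification:
it ignores harmless differences in path or query string, at the cost of
missing spoofs that stay on the same host.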

    Dave> Messages with this property are always spam and account for a
    Dave> large percentage of my unsures.  

Try these two settings

    x-pick_apart_urls:True
    x-lookup_ip:True

and see if they help.

    Dave> From what you say above it looks like pick_apart_urls will
    Dave> generate tokens describing different parts of a given URL, but
    Dave> will do nothing to help capture this particular spammy
    Dave> relationship between enclosed text and actual link.

    Dave> Or did I misunderstand you?

No, I probably misunderstood myself.  The IP address hacker is the
x-lookup_ip option, I believe.  They are both helpful, though.

Skip
_______________________________________________
spambayes-dev mailing list
spambayes-dev@python.org
http://mail.python.org/mailman/listinfo/spambayes-dev
