    David> Something that comes up over and over in spam is a link of the
    David> form:

    David>     <a href="http://url/of/spammers/site">
    David>        http://url/of/some/legit/site
    David>     </a>

    David> Does SpamBayes have a token that represents that information and
    David> an option I can set that will use it?

The SpamBayes tokenizer essentially splits the message at word boundaries,
so the two URLs are tokenized separately.  Neither their physical nor their
structural proximity is recorded, so no token captures the mismatch between
the link target and its visible text.  If you add x-pick_apart_urls:True to
the Tokenizer section of your config file, synthetic tokens based on the
hostname or IP address in the URLs will be generated.  For completeness,
here is my current set of tokenizer settings (I haven't changed them in a
long while):

    [Tokenizer]
    record_header_absence:True
    summarize_email_prefixes:True
    summarize_email_suffixes:True
    mine_received_headers:True
    x-pick_apart_urls:True
    x-fancy_url_recognition:False
    x-lookup_ip:True
    lookup_ip_cache:~/tmp/dnscache.pck
    x-image_size:True
    x-crack_images:True
    x-ocr_engine:gocr
    max_image_size:100000
    crack_image_cache:~/tmp/imagecache.pck
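
To make the missing piece concrete, here is a rough sketch of how a
href/anchor-text mismatch could be detected.  This is not SpamBayes code and
does not use its tokenizer API; the class name, the function name and the
token string "url:href-text-mismatch" are all made up for illustration, and
it only uses the Python standard library:

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


def _host(url):
    """Return the lowercased hostname of url, or None if it cannot be parsed."""
    try:
        return urlsplit(url).hostname
    except ValueError:
        return None


class LinkMismatchFinder(HTMLParser):
    """Hypothetical helper: record links whose visible text is a URL
    pointing at a different host than the href."""

    def __init__(self):
        super().__init__()
        self._href = None     # href of the <a> we are inside, if any
        self._text = []       # text fragments seen inside that <a>
        self.mismatches = []  # list of (href_host, text_host) pairs

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip()
            href_host = _host(self._href)
            # Only compare when the anchor text itself looks like a URL.
            text_host = _host(text) if "://" in text else None
            if href_host and text_host and href_host != text_host:
                self.mismatches.append((href_host, text_host))
            self._href = None


def mismatch_tokens(html):
    """Return one made-up synthetic token per mismatched link in html."""
    finder = LinkMismatchFinder()
    finder.feed(html)
    finder.close()
    return ["url:href-text-mismatch" for _ in finder.mismatches]
```

Whether a token like that would actually help classification is something
you would have to test against your own ham and spam; legitimate mail
(click-tracking links in newsletters, for instance) can show the same pattern.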

Skip

_______________________________________________
spambayes-dev mailing list
spambayes-dev@python.org
http://mail.python.org/mailman/listinfo/spambayes-dev
