David> Something that comes up over and over in spam is a link of the
David> form:
David> <a href="http://url/of/spammers/site">
David> http://url/of/some/legit/site
David> </a>
David> Does SpamBayes have a token that represents that information and
David> an option I can set that will use it?
The SpamBayes tokenizer essentially splits the message at word boundaries,
so the two URLs are tokenized separately. Their physical and structural
proximity is not recorded, so nothing notes that the link text and the href
disagree. Synthetic tokens based on the hostname or IP address in the URLs
will be generated if you add x-pick_apart_urls:True to the Tokenizer section
of your config file. For completeness, here is my current set of tokenizer
settings (I haven't changed them in a long while):
[Tokenizer]
record_header_absence:True
summarize_email_prefixes:True
summarize_email_suffixes:True
mine_received_headers:True
x-pick_apart_urls:True
x-fancy_url_recognition:False
x-lookup_ip:True
lookup_ip_cache:~/tmp/dnscache.pck
x-image_size:True
x-crack_images:True
x-ocr_engine:gocr
max_image_size:100000
crack_image_cache:~/tmp/imagecache.pck
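
If you wanted to experiment with the kind of token David describes, a rough
sketch might look like the following. This is not part of SpamBayes and does
not use its tokenizer API. The class name LinkMismatchFinder, the function
mismatch_tokens, and the token string "href-text-mismatch" are all made up
for illustration. It uses only the Python standard library and flags links
whose visible text looks like a URL on a different host than the href.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


def _host(url):
    """Return the lowercased hostname of url, or None if it has none."""
    try:
        return urlsplit(url).hostname
    except ValueError:
        # urlsplit rejects some malformed URLs (e.g. bad IPv6 brackets).
        return None


class LinkMismatchFinder(HTMLParser):
    """Hypothetical helper: record links whose text names another host."""

    def __init__(self):
        super().__init__()
        self._href = None      # href of the <a> we are currently inside
        self._text = []        # text chunks seen inside that <a>
        self.mismatches = []   # list of (href_host, text_host) pairs

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            shown = "".join(self._text).strip()
            href_host = _host(self._href)
            # Only treat the link text as a URL if it looks like one.
            text_host = _host(shown) if "://" in shown else None
            if href_host and text_host and href_host != text_host:
                self.mismatches.append((href_host, text_host))
            self._href = None


def mismatch_tokens(html):
    """Return a one-element list holding a made-up synthetic token if any
    link in html shows one host but points at another, else an empty list."""
    finder = LinkMismatchFinder()
    finder.feed(html)
    finder.close()
    return ["href-text-mismatch"] if finder.mismatches else []
```

Whether such a token would actually help classification is something you
would have to test against your own ham and spam. Links with ordinary text
such as "click here" produce no token in this sketch.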
Skip