on Fri Jul 06 2007, skip-AT-pobox.com wrote:

>     David> Something that comes up over and over in spam is a link of the
>     David> form:
>
>     David> <a href="http://url/of/spammers/site">
>     David>   http://url/of/some/legit/site
>     David> </a>
>
>     David> Does SpamBayes have a token that represents that information and
>     David> an option I can set that will use it?
>
> The SpamBayes tokenizer essentially splits the message at word boundaries,
> so the two urls are considered separately.

Yeah, I know that's the default behavior.

> Their physical and structural proximity is not noted.  Synthetic
> tokens based on hostname or IP address in the urls will be generated
> if you add x-pick_apart_urls:True to the Tokenizer section of your
> config file.  For completeness here is my current set of tokenizer
> settings (haven't changed them in a long while):
>
>     [Tokenizer]
>     record_header_absence:True
>     summarize_email_prefixes:True
>     summarize_email_suffixes:True
>     mine_received_headers:True
>     x-pick_apart_urls:True
>     x-fancy_url_recognition:False
>     x-lookup_ip:True
>     lookup_ip_cache:~/tmp/dnscache.pck
>     x-image_size:True
>     x-crack_images:True
>     x-ocr_engine:gocr
>     max_image_size:100000
>     crack_image_cache:~/tmp/imagecache.pck

That doesn't sound like it does what I'm asking about.  I want a
special token that is generated whenever a link's text is just a URL
and that URL and the link's actual href don't point to the same place.
Messages with this property are always spam, and they account for a
large percentage of my unsures.  No matter how much I train on them,
they keep falling into unsure, so I thought that if SpamBayes could
actually recognize their distinguishing feature, I could easily train
it to consider them spam.

From what you say above, it looks like pick_apart_urls will generate
tokens describing different parts of a given URL, but will do nothing
to capture this particular spammy relationship between the enclosed
text and the actual link.  Or did I misunderstand you?

--
Dave Abrahams
Boost Consulting
http://www.boost-consulting.com

The Astoria Seminar ==> http://www.astoriaseminar.com
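
P.S. To make the request concrete, here is a rough sketch of the kind of check I have in mind. It is not SpamBayes code and does not use the SpamBayes tokenizer API; it relies only on the Python standard library. The names LinkMismatchFinder, host_of and mismatch_tokens and the token string "url:text-href-mismatch" are all made up for illustration. Comparing hostnames rather than whole URLs is also an assumption on my part about what "don't point to the same place" should mean.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class LinkMismatchFinder(HTMLParser):
    """Collect (href, visible_text) pairs for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> we are currently inside
        self._text = []      # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def host_of(url):
    """Return the lowercased hostname of url, or None if there is none."""
    try:
        return urlsplit(url).hostname
    except ValueError:       # malformed URL, e.g. a broken IPv6 literal
        return None


def mismatch_tokens(html):
    """Yield one hypothetical token per link whose text is a bare URL
    naming a different host than the link's href."""
    finder = LinkMismatchFinder()
    finder.feed(html)
    finder.close()
    for href, text in finder.links:
        # Only consider links whose visible text is a single URL.
        if len(text.split()) != 1:
            continue
        if not text.lower().startswith(("http://", "https://")):
            continue
        href_host, text_host = host_of(href), host_of(text)
        if href_host and text_host and href_host != text_host:
            yield "url:text-href-mismatch"
```

Something like list(mismatch_tokens(body)) could then be run over the HTML part of a message, and any tokens it yields would be added to the ones the tokenizer already produces. Where that would hook into the real tokenizer is exactly what I don't know.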