Hi guys,

Based upon some information from others on the list, I have put together a plugin for SA which canonicalises an email into its basic "concepts". Concepts are converted to tags, which Bayes can use as tokens to further help identify spammy/hammy characteristics.
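
To make the idea concrete, here is a minimal sketch in Python of mapping message text to concept tags. It is not the plugin's code (SpamAssassin plugins are written in Perl), and the rule table, patterns, and function names below are all hypothetical, chosen only to illustrate how many different surface phrasings can collapse into one tag.

```python
import re

# Hypothetical concept rules: each tag maps to patterns that signal it.
# Illustrative only; these are NOT taken from the plugin's rule files.
CONCEPT_RULES = {
    "optout": [r"\bunsubscribe\b", r"\bopt[\s-]?out\b"],
    "money": [r"[$£€]\s?\d", r"\bcash\b", r"\bmoney\b"],
    "click": [r"\bclick\s+here\b"],
    "dear": [r"^\s*dear\b"],
    "please": [r"\bplease\b"],
}

# Compile once; MULTILINE lets "^" match at the start of any line.
COMPILED = {
    tag: [re.compile(p, re.IGNORECASE | re.MULTILINE) for p in patterns]
    for tag, patterns in CONCEPT_RULES.items()
}


def extract_concepts(body):
    """Return the sorted list of concept tags whose patterns match the body."""
    return sorted(
        tag
        for tag, patterns in COMPILED.items()
        if any(p.search(body) for p in patterns)
    )


def concepts_header(body):
    """Format the matched tags in the style of an X-SA-Concepts header."""
    return "X-SA-Concepts: " + " ".join(extract_concepts(body))


if __name__ == "__main__":
    sample = "Dear friend,\nPlease click here to claim $500. To opt out, unsubscribe."
    print(concepts_header(sample))
```

Two messages worded differently ("unsubscribe" in one, "opt-out" in the other) yield the same `optout` tag, so a token-based learner sees one shared token instead of two unrelated ones.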

Here are some examples of tags from some emails today:

---8<---
X-SA-Concepts: experience regards money optout time-ref dear great home request member enjoy woman-adj important online click all-rights email-adr please price best hot-adj
X-SA-Concepts: experience contact optout winner time-ref survey dear home privacy prize store thankyou important click gift chance please
X-SA-Concepts: google law search-eng optout amazing order facebook goodtime privacy lotsofmoney request enjoy details service partner linkedin twitter trust contact time-ref great online click shop email-adr please customer newsletter news
X-SA-Concepts: photos view-online money contact optout time-ref cost reply2me service details online click please
X-SA-Concepts: friend hotwords trust experience regards contact time-ref medical woman drugs consultant pill mailto woman-adj secret health earn email-adr please security hot-adj day-of-week
X-SA-Concepts: https mailto re euros regards money youtube invoice email-adr facebook best hair
---8<---

This plugin essentially adds an extra layer between the raw input characteristics and the recognition types, allowing different characteristics to be clustered into a more generic type. In effect, this gives Bayes more of a two-layer neural network approach. When combined with Bayes learning, these email semantics (or Concepts) can be combined with the many other characteristics of that email and then compared with the email that came before it.

https://github.com/fmbla/spamassassin-concepts

I'd be really interested to hear your feedback/thoughts on this system and its approach.

Paul

-- 
Paul Stead
Systems Engineer
Zen Internet