On 17.11.2015 at 18:04, John Hardin wrote:
> On Tue, 17 Nov 2015, Reindl Harald wrote:
>> On 17.11.2015 at 05:15, Eric Abrahamsen wrote:
>>> I used "sa-learn --dump magic --dbpath ...." on several of my
>>> virtual users, and it's hard to tell what's going on -- they seem
>>> to have their own databases, but most have little or nothing in
>>> them, which makes me think the script is not actually recording
>>> the learning properly
>>
>> a per-user bayes doesn't work for most sites, simply because you
>> need enough ham *and* spam to get it working properly, and most
>> users don't care enough or train it wrong (moving newsletters they
>> subscribed to, and are too lazy to unsubscribe from, into the spam
>> folder)
>>
>> a hand-trained site-wide bayes works much better and doesn't demand
>> that *every* user collects enough samples and understands how it
>> works
>
> +1
>
> Being blunt, your userbase can be broadly divided into "clueful" and
> "non-clueful". You probably don't have many clueful users whose
> judgement and responsibility you trust. Their submissions could
> potentially be trained without review.
>
> You can also set up shared misclassified ham and spam folders that
> non-clueful users can copy messages to, but those submissions would
> need to be reviewed before being moved to the *real* training corpora
> by you. Always keep your training corpora.
>
> (This model falls apart at the Large Company and ISP level, of
> course...)
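to judge whether a per-user database has seen enough mail, the counters from "sa-learn --dump magic" can be compared against the minimums SpamAssassin wants before it applies bayes at all (bayes_min_ham_num and bayes_min_spam_num, both 200 by default). a minimal sketch in Python, assuming the usual four-column dump layout; the SAMPLE text is made up for illustration and is not taken from the original question:

```python
import re

# SpamAssassin defaults for bayes_min_ham_num / bayes_min_spam_num.
MIN_HAM = 200
MIN_SPAM = 200

# Assumed line layout of "sa-learn --dump magic":
#   <prob> <spam> <value> <atime>  non-token data: <name>
LINE_RE = re.compile(
    r"\s*\S+\s+\d+\s+(\d+)\s+\d+\s+non-token data: (nspam|nham|ntokens)\s*$"
)


def parse_dump_magic(text):
    """Return the nspam / nham / ntokens counters found in the dump text."""
    counts = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            counts[match.group(2)] = int(match.group(1))
    return counts


def bayes_is_active(counts, min_ham=MIN_HAM, min_spam=MIN_SPAM):
    """True if the database holds enough ham *and* spam to be used."""
    return counts.get("nham", 0) >= min_ham and counts.get("nspam", 0) >= min_spam


# Hypothetical dump of a barely trained per-user database.
SAMPLE = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0         12          0  non-token data: nspam
0.000          0          4          0  non-token data: nham
0.000          0       1530          0  non-token data: ntokens
"""

if __name__ == "__main__":
    counts = parse_dump_magic(SAMPLE)
    print(counts, "active" if bayes_is_active(counts) else "not enough samples")
```

a database like the sample one is recording something, it is just far below the point where bayes results are used, which fits the "little or nothing in them" observation.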
that model even works at ISP level (given that someone takes care of it and has the time to manage the bayes) - i have been doing that here for the last year for a large userbase, with a handful of users submitting samples for review

over the long run it's enough to have a few users and your complete own mail corpus, combined with BCC features
_________________________

postfix 3.0 has a header-based BCC feature; finally, you need a *sieve script* that keeps only messages targeted to *users who agreed* and automatically deletes everything else - that way you collect enough samples over time for a proper site-wide bayes

a few users submit their stuff for review, to catch misclassified mail that did not hit the header filters
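the keep-or-delete decision the sieve script has to make can be sketched like this. it is written in Python only to show the logic, it is not the sieve script itself; the addresses in CONSENTING and the choice of the To and Cc headers are assumptions for illustration, a real setup may need to look at the envelope or another header instead:

```python
from email import message_from_string
from email.utils import getaddresses

# Hypothetical opt-in list; the real list of users who agreed is site-specific.
CONSENTING = {"alice@example.org", "bob@example.org"}


def keep_for_training(raw_message, consenting=CONSENTING):
    """Return True if the BCC copy is addressed to at least one opted-in user."""
    msg = message_from_string(raw_message)
    # Assumption: the original recipients are visible in To / Cc.
    headers = msg.get_all("To", []) + msg.get_all("Cc", [])
    recipients = {addr.lower() for _name, addr in getaddresses(headers) if addr}
    return bool(recipients & consenting)


if __name__ == "__main__":
    sample = "To: Alice <alice@example.org>\nSubject: test\n\nbody\n"
    print("keep" if keep_for_training(sample) else "discard")
```

everything for which the function returns False would be discarded unread, so mail of users who never agreed does not end up in the review corpus.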
_________________________

# postfix header_checks
/^X\-Spam.*Flag: No.*(BAYES_(80|95|99)|RAZOR2_CHECK|IXHASH|URIBL_BLACK)/ BCC target-addr
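to see which messages the header_checks rule selects, the same pattern can be tried against a few headers. a small sketch in Python, assuming Postfix's default matching (case-insensitive, applied to the whole logical header including folded lines); the sample headers are invented, not taken from real mail:

```python
import re

# Same expression as in the header_checks rule above.
PATTERN = re.compile(
    r"^X\-Spam.*Flag: No.*(BAYES_(80|95|99)|RAZOR2_CHECK|IXHASH|URIBL_BLACK)",
    re.IGNORECASE | re.DOTALL,  # assumed to mirror the Postfix defaults
)


def would_bcc(header):
    """True if this header would trigger the BCC action."""
    return PATTERN.search(header) is not None


# Hypothetical headers for illustration only.
SAMPLES = [
    "X-Spam-Report: Flag: No, BAYES_99, SPF_PASS",          # missed spam candidate
    "X-Spam-Report: Flag: Yes, BAYES_99, URIBL_BLACK",      # already flagged
    "X-Spam-Report: Flag: No, BAYES_00, RCVD_IN_DNSWL_MED", # clean ham
    "X-Spam-Report: Flag: No,\n\tRAZOR2_CHECK",             # folded header
]

if __name__ == "__main__":
    for header in SAMPLES:
        print(would_bcc(header), repr(header))
```

in other words the rule only copies mail that was *not* tagged as spam but still hit a suspicious test, which are exactly the samples worth reviewing for training.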
# spamassassin header configuration
clear_headers
fold_headers 1
add_header spam Flag _YESNO_
add_header all Status _YESNO_, score=_SCORE_, tag-level=_REQD_, block-level=8.0, envelope=_SENDERDOMAIN_, from=_AUTHORDOMAIN_
report_safe 0
add_header all Report Flag: _YESNO_, _REPORT_
rewrite_header Subject [SPAM]
_________________________

the result after a year is a large corpus, with a lot of headers stripped using "formail" and widely anonymized with scripts, and a large part of the ham mails get BAYES_00
0      51738   SPAM
0      20976   HAM
0    2306989   TOKEN

BAYES_00        18088   74.71 %
BAYES_05          452    1.86 %
BAYES_20          625    2.58 %
BAYES_40          458    1.89 %
BAYES_50         1868    7.71 %
BAYES_60          254    1.04 %
BAYES_80          237    0.97 %
BAYES_95          171    0.70 %
BAYES_99         2055    8.48 %
BAYES_999        1888    7.79 %

DELIVERED       33215   93.46 %
DNSWL           31874   89.68 %
SPF             22440   63.14 %
SPF/DKIM WL     10607   29.84 %
SHORTCIRCUIT    11301   31.79 %

BLOCKED          3025    8.51 %
SPAMMY           2717    7.64 %   89.81 % (OF TOTAL BLOCKED)
_________________________

most crap is filtered by postscreen long before spamassassin, so only the hard part of the remaining junk needs bayes training
spamhaus.org             113785
thelounge.net             60292
inps.de                   40826
sorbs.net                 26478
barracudacentral.org      21422
psbl.org                    393
junkemailfilter.com         317
manitu.net                  227
spamcop.net                  30
mailspike.net                13
swinog.ch                     2
=================================
Total DNSBL rejections:  263785
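the numbers above can be cross-checked with a little arithmetic. the sketch below (Python) backs out the denominators behind the percentages and re-adds the DNSBL counts; all figures are copied from the statistics above, while the reading that the BAYES rows are counted only over mail that was not shortcircuited is a guess from the numbers, not something the statistics state:

```python
def implied_total(count, percent):
    """Back out the denominator that a count / percent pair was computed against."""
    return count / (percent / 100.0)


def share(part, whole):
    """Percentage of whole represented by part."""
    return 100.0 * part / whole


# Figures from the SpamAssassin statistics.
scanned = implied_total(33215, 93.46)     # DELIVERED row
bayes_base = implied_total(18088, 74.71)  # BAYES_00 row
spammy_of_blocked = share(2717, 3025)     # SPAMMY relative to BLOCKED

# Per-list postscreen rejections.
DNSBL = {
    "spamhaus.org": 113785, "thelounge.net": 60292, "inps.de": 40826,
    "sorbs.net": 26478, "barracudacentral.org": 21422, "psbl.org": 393,
    "junkemailfilter.com": 317, "manitu.net": 227, "spamcop.net": 30,
    "mailspike.net": 13, "swinog.ch": 2,
}
dnsbl_total = sum(DNSBL.values())

if __name__ == "__main__":
    print("messages scanned (approx.):", round(scanned))
    print("bayes base (approx.):", round(bayes_base))
    print("spammy share of blocked: %.2f %%" % spammy_of_blocked)
    print("DNSBL rejections:", dnsbl_total)
```

the scanned total comes out at roughly 35.5k messages and the bayes base at roughly 24.2k, which is close to the scanned total minus the SHORTCIRCUIT count. the per-list counts add up to the stated 263785, so postscreen rejected several times more connections than spamassassin ever had to scan.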