Re: getting the daily learning cronjob right

Reindl Harald Tue, 17 Nov 2015 09:59:49 -0800


On 17.11.2015 at 18:04, John Hardin wrote:

On Tue, 17 Nov 2015, Reindl Harald wrote:

On 17.11.2015 at 05:15, Eric Abrahamsen wrote:

 I used "sa-learn --dump magic --dbpath ...." on several of my virtual
 users, and it's hard to tell what's going on -- they seem to have their
 own databases, but most have little or nothing in them, which makes me
 think the script is not actually recording the learning properly


a per-user bayes doesn't work for most sites, simply because you need
enough ham *and* spam to get it working properly, and most users don't
care enough or train it wrong (they move newsletters they subscribed to,
and are too lazy to unsubscribe from, into the spam folder)

a hand-trained site-wide bayes works much better and doesn't demand that
*every* user collect enough samples and understand how it works


+1

Being blunt, your userbase can be broadly divided into "clueful" and
"non-clueful". You probably don't have many clueful users whose
judgement and responsibility you trust. Their submissions could
potentially be trained without review.

You can also set up shared misclassified ham and spam folders that
non-clueful users can copy messages to, but those submissions would need
to be reviewed before being moved to the *real* training corpora by you.

Always keep your training corpora.

(This model falls apart at the Large Company and ISP level, of course...)

that model even works at ISP level (given someone takes care of it and has the time to manage the bayes) - i have been doing that here for the last year for a large userbase, with a handful of users submitting samples for review

over the long run it's enough to have a few users and your own complete mail corpus, combined with BCC-features

_________________________

postfix 3.0 has a header-based BCC feature; in addition you need a *sieve script* that keeps only messages addressed to *users who agreed* and automatically deletes everything else - that way you collect enough samples over time for a proper site-wide bayes
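
The opt-in filtering itself belongs in sieve; purely to illustrate the logic, here is a minimal Python sketch. The use of the `Delivered-To` header, the `OPTED_IN` addresses, and the `keep_for_training` helper are all assumptions made for the example and are not taken from the setup described here.

```python
from email import message_from_string
from email.utils import getaddresses

# Hypothetical opt-in list: users who agreed to have their mail sampled.
OPTED_IN = {"alice@example.com", "bob@example.com"}


def keep_for_training(raw_message: str, opted_in=OPTED_IN) -> bool:
    """Return True only if all recipients found in the header opted in."""
    msg = message_from_string(raw_message)
    # Assumption: the BCC copy carries the real recipient in Delivered-To.
    values = msg.get_all("Delivered-To", [])
    recipients = {addr.lower() for _, addr in getaddresses(values) if addr}
    # No recipient information at all means "delete", not "keep".
    return bool(recipients) and recipients <= opted_in
```

Anything for which this returns False would be discarded, so a message with no usable recipient header is dropped rather than kept by accident.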

a few users submit their stuff for review, to catch misclassified mail that did not hit the header-filters

_________________________

# postfix header_checks


# spamassassin header configuration
clear_headers
fold_headers 1
add_header spam Flag _YESNO_

add_header all Status _YESNO_, score=_SCORE_, tag-level=_REQD_, block-level=8.0, envelope=_SENDERDOMAIN_, from=_AUTHORDOMAIN_

report_safe 0
add_header all Report Flag: _YESNO_, _REPORT_
rewrite_header Subject [SPAM]
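
With the `add_header all Status ...` line above, each message should carry an `X-Spam-Status` header of the form `Yes/No, key=value, ...`. A minimal Python sketch for splitting such a value into fields follows; the `parse_spam_status` helper is hypothetical and any sample value fed to it is made up, not taken from real mail.

```python
def parse_spam_status(value: str) -> dict:
    """Split an 'X-Spam-Status' value into a dict of fields."""
    # fold_headers 1 may wrap the header, so collapse all whitespace first.
    flat = " ".join(value.split())
    parts = [p.strip() for p in flat.split(",")]
    result = {"is_spam": parts[0].lower() == "yes"}
    for part in parts[1:]:
        key, sep, val = part.partition("=")
        if sep:
            result[key.strip()] = val.strip()
    # Convert the numeric fields defined in the Status template.
    for key in ("score", "tag-level", "block-level"):
        if key in result:
            result[key] = float(result[key])
    return result
```

A script like this could feed per-message scores into whatever statistics you keep, without having to parse the full report.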
_________________________

the result after a year is a large corpus, stripped of a lot of headers with "formail" and widely anonymized with scripts, and a large part of ham mail gets BAYES_00


0      51738    SPAM
0      20976    HAM
0    2306989    TOKEN
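
For checking many per-user databases at once, counts like the ones above can be tested against the minimums SpamAssassin needs before it uses bayes at all (`bayes_min_ham_num` and `bayes_min_spam_num`, both 200 by default). A minimal Python sketch, assuming the condensed three-column layout shown above rather than the raw `sa-learn --dump magic` output; `parse_counts` and `bayes_active` are hypothetical helper names.

```python
def parse_counts(text: str) -> dict:
    """Parse lines like '0  51738  SPAM' into {'spam': 51738, ...}."""
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        # Expect: <ignored column> <count> <label>
        if len(fields) == 3 and fields[1].isdigit():
            counts[fields[2].lower()] = int(fields[1])
    return counts


def bayes_active(counts: dict, min_ham: int = 200, min_spam: int = 200) -> bool:
    """True if both ham and spam counts reach the configured minimums."""
    return counts.get("ham", 0) >= min_ham and counts.get("spam", 0) >= min_spam
```

A per-user database with only a handful of learned messages fails this check, which is the usual reason per-user bayes appears to do nothing.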

BAYES_00        18088   74.71 %
BAYES_05          452    1.86 %
BAYES_20          625    2.58 %
BAYES_40          458    1.89 %
BAYES_50         1868    7.71 %
BAYES_60          254    1.04 %
BAYES_80          237    0.97 %
BAYES_95          171    0.70 %
BAYES_99         2055    8.48 %
BAYES_999        1888    7.79 %

DELIVERED       33215   93.46 %
DNSWL           31874   89.68 %
SPF             22440   63.14 %
SPF/DKIM WL     10607   29.84 %
SHORTCIRCUIT    11301   31.79 %

BLOCKED          3025    8.51 %
SPAMMY           2717    7.64 %    89.81 % (OF TOTAL BLOCKED)
_________________________

most crap is filtered by postscreen long before spamassassin, so only the hard part of the remaining junk needs bayes training


spamhaus.org              113785
thelounge.net              60292
inps.de                    40826
sorbs.net                  26478
barracudacentral.org       21422
psbl.org                     393
junkemailfilter.com          317
manitu.net                   227
spamcop.net                   30
mailspike.net                 13
swinog.ch                      2
=================================
Total DNSBL rejections:    263785
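
To see how the rejections are spread across the lists, the figures above can be totalled and turned into shares with a few lines of Python. This is only a sketch over the numbers quoted in this mail; the `dnsbl_shares` helper is a hypothetical name.

```python
# Rejection counts per DNSBL, copied from the table above.
REJECTIONS = {
    "spamhaus.org": 113785,
    "thelounge.net": 60292,
    "inps.de": 40826,
    "sorbs.net": 26478,
    "barracudacentral.org": 21422,
    "psbl.org": 393,
    "junkemailfilter.com": 317,
    "manitu.net": 227,
    "spamcop.net": 30,
    "mailspike.net": 13,
    "swinog.ch": 2,
}


def dnsbl_shares(rejections: dict) -> dict:
    """Return each list's share of all rejections, in percent."""
    total = sum(rejections.values())
    return {name: 100.0 * count / total for name, count in rejections.items()}


if __name__ == "__main__":
    for name, share in dnsbl_shares(REJECTIONS).items():
        print(f"{name:<24} {share:6.2f} %")
```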
