They cannot (or do not want to, or lack the know-how to) study the e-mails, and therefore they cannot build a reliable corpus. All they can do is trust their users' ability to study their own e-mails well enough to do the job, hence the mess with ham/spam when feeding the Bayesian filter. They need to consult a lawyer, fix their paperwork, hire people who can teach them everything they need to know, and invest at least two years full-time in the process. They cannot just install CentOS and SpamAssassin and hope the Bayesian filter will do the job by magic. It just does not work that way.
Sent from ProtonMail Mobile

On Wed, Feb 14, 2018 at 05:48, Bill Cole <sausers-20150...@billmail.scconsult.com> wrote:

> On 13 Feb 2018, at 9:33, Horváth Szabolcs wrote:
>
>> This is a production mail gateway serving since 2015. I saw that a few
>> messages (both hams and spams) automatically learned by
>> amavisd/spamassassin. Today's statistics:
>>
>> 3616 autolearn=ham
>> 10076 autolearn=no
>> 2817 autolearn=spam
>> 134 autolearn=unavailable
>
> That's quite high for spam, ham, AND "unavailable" (which indicates
> something wrong with the Bayes subsystem, usually transient.) This seems
> like a recipe for a mis-learning disaster. For comparison, my 2018
> autolearn counts:
>
> spam: 418
> ham: 15018
> unavailable: 166
> no: 129555
>
> I also manually train any spam that gets through to me (the biggest spam
> target,) a small number of spams reported by others, and 'trap' hits. A
> wide variety of ham is harder to get for training but I have found it
> useful to give users a well-documented and simple way to help. One way is
> to look at what happens to mail AFTER delivery which can indicate that a
> message is ham without needing an admin to try to make a determination
> based on content. The simplest one is to learn anything users mark as
> $NotJunk as ham. Another is to create an "Archive" mailbox for every user
> and learn anything as ham that has been moved there a day after it is
> moved. The most important factor (especially in jurisdictions where human
> examination of email is a problem) is to tell users how to protect their
> email and then do what you tell them, robotically. In the US, Canada, and
> *SOME* of the EU, this is not risky. However, I have been told by people
> in *SOME* EU countries that they can't even robotically scan ANY mail
> content, so you shouldn't take my advice as authoritative: I'm not even a
> lawyer in the US, much less Hungary...
>
>> I think I have no control over what is learnt automatically.
>
> Yes, you do.
> Run "perldoc Mail::SpamAssassin::Plugin::AutoLearnThreshold" for details.
> You can set the learning thresholds, which control what gets learned. The
> defaults (0.1 and 12) mis-learn far too much spam as ham and not enough
> spam. I use -0.2 and 6, which means I don't autolearn a lot but everything
> I autolearn as ham has at least one hit on a substantial "nice" rule or 2
> hits on weak ones. There's a lot of vehemence against autolearn expressed
> here but not a lot of evidence that it operates poorly when configured
> wisely. The defaults are NOT wise.
>
>> Let's just assume for a moment that 1.4M ham-samples are valid.
>
> Bad assumption. Your Bayes checks are uncertain about mail you've told SA
> is definitely spam. That's broken. It's a sort of breakage that cannot
> exist if you do not have a large quantity of spam that has been learned
> as ham.
>
>> Is there a ham:spam ratio I should stick to it?
>
> No.
>
>> I presume if we have a 1:1 ratio then future messages won't be
>> considered as spam as well.
>
> The ham:spam ratio in the Bayes DB or its autolearning is not a generally
> useful metric. 1:1 is not magically good and neither is any other ratio,
> even with reference to a single site's mailstream. A very large ratio *on
> either side* indicates a likely problem in what is being learned, but you
> can't correlate the ratio to any particularly wrong bias in Bayes
> scoring. It is an inherently chaotic relationship. Factors that actually
> matter are correctness of learning, sample quality, and currency. You can
> control how current your Bayes DB is (USE AUTO-EXPIRE) but the other two
> factors are never going to be perfect.
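For reference, the -0.2/6 thresholds Bill describes correspond to the AutoLearnThreshold plugin options in local.cf. A minimal sketch (option names are from Mail::SpamAssassin::Conf; the exact values should be tuned against your own mailstream, not copied blindly):

```
# /etc/mail/spamassassin/local.cf -- autolearn tuning along the lines above
bayes_auto_learn 1
bayes_auto_learn_threshold_nonspam -0.2   # default 0.1 learns too much spam as ham
bayes_auto_learn_threshold_spam 6.0       # default 12.0 autolearns too little spam
bayes_auto_expire 1                       # keep the Bayes DB current (auto-expire)
```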
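The "Archive mailbox" idea mentioned above could be scripted roughly as follows. This is only a sketch: the Maildir layout under a mail root, the one-day delay via find's -mtime, and the helper name learn_archived_ham are my assumptions, not anything specified in the thread.

```shell
#!/bin/sh
# Sketch: feed anything a user has moved into an "Archive" mailbox to the
# Bayes DB as ham, one day after the move. Maildir layout is an assumption.

SA_LEARN=${SA_LEARN:-sa-learn}   # the real trainer; overridable for dry runs

learn_archived_ham() {
    mailroot=$1                  # e.g. /var/mail
    for archive in "$mailroot"/*/Maildir/.Archive/cur; do
        [ -d "$archive" ] || continue
        # -mtime +0 selects files whose mtime is more than 24 hours old,
        # i.e. messages that have sat in Archive for at least a day.
        find "$archive" -type f -mtime +0 -exec "$SA_LEARN" --ham {} +
    done
}

# Typically run daily from cron, e.g.:
# learn_archived_ham /var/mail
```

The delay matters: a message a user has left in Archive for a day is a much stronger ham signal than one they may still reclassify.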