On 19.03.2015 at 00:54, RW wrote:
> On Wed, 18 Mar 2015 23:57:13 +0100 Reindl Harald wrote:
>> On 18.03.2015 at 23:34, RW wrote:
>>> On Wed, 18 Mar 2015 22:46:14 +0100 Reindl Harald wrote:
>>>> frankly, I trained over months with *hand-chosen* mail samples and
>>>> spent nearly two weeks, day and night, removing bayes poisoning
>>>> from the samples and rebuilding bayes from scratch, reducing the
>>>> token count from 1,700,000 to 1,500,000
>>>
>>> Why did you remove the Bayes poison?
>>
>> because now BAYES_00 hits 87% of all scanned legitimate messages,
>> BAYES_50 dropped from 10% to 4%, and the milter rejects are still
>> at around 8-10%, with just 10 instead of 150 flagged messages on a
>> userbase of 1200 valid RCPTs - because the bayes database finally
>> has a quality where it needs little to no further training at all
>> in combination with the other filters
>>
>> over the long run the poison causes more and more legitimate mail
>> to score higher than it deserves, the FP rate increases, and in the
>> end you have to lower the reject score, passing more junk because
>> of user complaints - at that point the spammers have won, and
>> sooner or later you need to reset bayes and start training from
>> scratch
>>
>> that's not theory - I observed that behavior over many years with
>> commercial appliances using SA behind the scenes with auto-learning
>> enabled
>
> This has nothing to do with auto-learning. There is a difference
> between mis-training and training with spam that contains so-called
> "Bayes poison". Bayes is best trained on what is in real-world
> spam, not what we would prefer that spammers put in spam
it is exactly the same - and it is not a matter of "what we would prefer that spammers put in spam" but of what they put in *in addition* to it, to ruin bayes and the filter results
if you train only manually reviewed messages and don't recognize the hidden poison - often three times larger than the visible part, up to additional MIME parts dedicated to poison, with different crap at the end of the plain-text alternative as well as in hidden layers, span tags and div tags - then *exactly* the same happens as with auto-learning
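as a rough illustration of what spotting that hidden part before training could look like: the sketch below (my own helper, not anything SA ships; the heuristics are deliberately minimal) collects text inside elements styled `display:none` or `visibility:hidden`, which is exactly the text a mail client never renders but a careless trainer feeds into bayes:

```python
import re
from html.parser import HTMLParser

# void elements never get a closing tag, so they must not affect depth
VOID_TAGS = {"br", "img", "hr", "meta", "input", "link", "area"}

class HiddenTextFinder(HTMLParser):
    """Collects text a mail client would never render: content nested
    inside tags styled display:none or visibility:hidden."""
    def __init__(self):
        super().__init__()
        self.hidden_depth = 0    # nesting depth inside a hidden element
        self.hidden_text = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = dict(attrs).get("style", "")
        if re.search(r"display\s*:\s*none|visibility\s*:\s*hidden", style):
            self.hidden_depth += 1
        elif self.hidden_depth:
            self.hidden_depth += 1   # nested tag inside a hidden region

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if self.hidden_depth and data.strip():
            self.hidden_text.append(data.strip())

def hidden_words(html: str) -> list:
    """Return the text chunks hidden from the reader's eyes."""
    parser = HiddenTextFinder()
    parser.feed(html)
    return parser.hidden_text

sample = ('<p>please find the invoice attached</p>'
          '<div style="display:none">rendezvous hallo lottery</div>')
print(hidden_words(sample))   # ['rendezvous hallo lottery']
```

a real poison detector would also have to look at the trailing junk in the plain-text alternative and at extra MIME parts, which this sketch ignores.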
the point is that "Bayes is best trained on what is in real-world spam" - but not when the spam content is only a small part of the message, because then you train the innocent parts as spam at the same time
you can't control that with auto-learning, and the effect is visible:

* BAYES_00 hits are more frequent than before
* BAYES_50 hits for ham are fewer than before
* any of the cleaned messages still hit BAYES_99, and most BAYES_999

the last point is easy to prove by keeping the old, unmodified corpus and running spamc against the cleaned bayes database - and the final result is that you stop training in circles, because otherwise you need a ton of classified ham messages to reduce the poison's impact
if you have users from all over the world speaking different languages, the effect of bayes poisoning becomes much more visible, because the poison contains random words in all sorts of languages and you don't have enough ham to undo that damage
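the mechanism is easy to demonstrate with a toy Graham-style Bayes model - a didactic sketch, NOT SpamAssassin's actual chi-squared combiner, and every corpus, word and count below is invented. Once ordinary words from the poison land in the spam corpus, a harmless message built from such words flips from clearly ham to clearly "spam":

```python
def token_prob(tok, spam_counts, ham_counts, nspam, nham):
    """P(spam | token) from per-corpus frequencies, clamped to [0.01, 0.99]."""
    s = spam_counts.get(tok, 0) / nspam
    h = ham_counts.get(tok, 0) / nham
    if s + h == 0:
        return 0.5                    # unseen token: no evidence either way
    return min(0.99, max(0.01, s / (s + h)))

def message_prob(tokens, spam_counts, ham_counts, nspam, nham):
    """Combine the per-token probabilities the classic Graham way."""
    p_spam = p_ham = 1.0
    for tok in tokens:
        p = token_prob(tok, spam_counts, ham_counts, nspam, nham)
        p_spam *= p
        p_ham *= 1.0 - p
    return p_spam / (p_spam + p_ham)

# 10 spam / 10 ham trained; the ham users write about meetings etc.
ham_counts = {"meeting": 9, "rendezvous": 2}

# careful training: only the visible spam payload was learned
clean_spam = {"viagra": 9, "lottery": 8}

# careless training: the hidden poison words went into the spam corpus too
poisoned_spam = {"viagra": 9, "lottery": 8, "rendezvous": 7, "hallo": 6}

legit = ["rendezvous", "hallo"]       # an ordinary multilingual ham message
print(message_prob(legit, clean_spam, ham_counts, 10, 10))     # ~0.01: ham
print(message_prob(legit, poisoned_spam, ham_counts, 10, 10))  # ~0.997: "spam"
```

note how the damage scales with corpus size: the fewer ham occurrences of a poison word you have trained, the more each poisoned spam occurrence dominates its probability - which is exactly why a multilingual userbase with thin per-language ham suffers most.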
believe it or not - my goal is to train a bayes database once and have a sane system for many, many years. what I often read is "spam samples become outdated, so you need to restart" - no, they don't; frankly, dead-safe subjects to block have existed for 10 years. the cause is auto-learning, not that spam changes all the time - it changes when new templates and types appear, but spam samples from years ago reappear all the time. with the permanent resets and expiry you are running in circles, training on the same crap again and again
it will work out the same way as in 2008, when I said "the virtual servers I just installed will run for the next 10 to 15 years at least" and people told me "until then you will install from scratch multiple times" - well, the first 7 years have passed...