On 19.03.2015 at 00:54, RW wrote:
> On Wed, 18 Mar 2015 23:57:13 +0100 Reindl Harald wrote:
>> On 18.03.2015 at 23:34, RW wrote:
>>> On Wed, 18 Mar 2015 22:46:14 +0100 Reindl Harald wrote:
>>>> frankly, I trained over months with *hand-chosen* mail samples and
>>>> spent nearly two weeks, day and night, removing bayes poisoning
>>>> from the samples and rebuilding bayes from scratch, reducing the
>>>> token count from 1,700,000 to 1,500,000
>>>
>>> Why did you remove the Bayes poison?
>>
>> because now BAYES_00 hits 87% of all scanned legitimate messages,
>> BAYES_50 dropped from 10% to 4%, and the milter rejects are still
>> at around 8-10%, with just 10 instead of 150 flagged messages on a
>> userbase of 1200 valid RCPTs - because the bayes database finally
>> has a quality where it needs little to no further training at all
>> in combination with the other filters
>>
>> over the long run the poison causes more and more legitimate mail
>> to score higher than it deserves, the FP rate increases, and in the
>> end you have to lower the reject score, passing more junk because
>> of user complaints - at that point the spammers have won, and
>> sooner or later you need to reset bayes and start training from
>> scratch
>>
>> that's not theory - I observed that behavior over many years with
>> commercial appliances using SA behind the scenes with auto-learning
>> enabled
>
> This has nothing to do with auto-learning. There is a difference
> between mis-training and training with spam that contains so-called
> "Bayes poison". Bayes is best trained on what is in real-world
> spam, not what we would prefer that spammers put in spam
it is exactly the same - and it is not a matter of "what we would prefer that spammers put in spam" but of what they put in *in addition* to it, to ruin bayes and the filter results
if you train only manually reviewed messages and don't recognize the hidden poison - often three times larger than the visible part, up to additional MIME parts dedicated to poison, with different crap at the end of the plain-text alternative as well as in hidden layers, span tags and div tags - then *exactly* the same happens as with auto-learning
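as a rough illustration of what spotting that hidden part before training could look like: the sketch below (my own helper, not anything SA ships; the heuristics are deliberately minimal) collects text inside elements styled `display:none` or `visibility:hidden`, which is exactly the text a mail client never renders but a careless trainer feeds into bayes:

```python
import re
from html.parser import HTMLParser

# void elements never get a closing tag, so they must not affect depth
VOID_TAGS = {"br", "img", "hr", "meta", "input", "link", "area"}

class HiddenTextFinder(HTMLParser):
    """Collects text a mail client would never render: content nested
    inside tags styled display:none or visibility:hidden."""
    def __init__(self):
        super().__init__()
        self.hidden_depth = 0    # nesting depth inside a hidden element
        self.hidden_text = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = dict(attrs).get("style", "")
        if re.search(r"display\s*:\s*none|visibility\s*:\s*hidden", style):
            self.hidden_depth += 1
        elif self.hidden_depth:
            self.hidden_depth += 1   # nested tag inside a hidden region

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if self.hidden_depth and data.strip():
            self.hidden_text.append(data.strip())

def hidden_words(html: str) -> list:
    """Return the text chunks hidden from the reader's eyes."""
    parser = HiddenTextFinder()
    parser.feed(html)
    return parser.hidden_text

sample = ('<p>please find the invoice attached</p>'
          '<div style="display:none">rendezvous hallo lottery</div>')
print(hidden_words(sample))   # ['rendezvous hallo lottery']
```

a real poison detector would also have to look at the trailing junk in the plain-text alternative and at extra MIME parts, which this sketch ignores.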
the point is that "Bayes is best trained on what is in real-world spam" - but not when the spam content is only a small part of the message, because then you train the innocent parts as spam at the same time
you can't control that with auto-learning, and the effect is visible:

* BAYES_00 hits are more frequent than before
* BAYES_50 hits for ham are fewer than before
* any of the cleaned messages still hit BAYES_99, and most BAYES_999

the last point is easy to prove by keeping the old, unmodified corpus and running spamc against the cleaned bayes database - and the final result is that you stop training in circles, because otherwise you need a ton of classified ham messages to reduce the poison's impact
if you have users from all over the world speaking different languages, the effect of bayes poisoning becomes much more visible, because the poison contains random words in all sorts of languages and you don't have enough ham to undo that damage
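the mechanism is easy to demonstrate with a toy Graham-style Bayes model - a didactic sketch, NOT SpamAssassin's actual chi-squared combiner, and every corpus, word and count below is invented. Once ordinary words from the poison land in the spam corpus, a harmless message built from such words flips from clearly ham to clearly "spam":

```python
def token_prob(tok, spam_counts, ham_counts, nspam, nham):
    """P(spam | token) from per-corpus frequencies, clamped to [0.01, 0.99]."""
    s = spam_counts.get(tok, 0) / nspam
    h = ham_counts.get(tok, 0) / nham
    if s + h == 0:
        return 0.5                    # unseen token: no evidence either way
    return min(0.99, max(0.01, s / (s + h)))

def message_prob(tokens, spam_counts, ham_counts, nspam, nham):
    """Combine the per-token probabilities the classic Graham way."""
    p_spam = p_ham = 1.0
    for tok in tokens:
        p = token_prob(tok, spam_counts, ham_counts, nspam, nham)
        p_spam *= p
        p_ham *= 1.0 - p
    return p_spam / (p_spam + p_ham)

# 10 spam / 10 ham trained; the ham users write about meetings etc.
ham_counts = {"meeting": 9, "rendezvous": 2}

# careful training: only the visible spam payload was learned
clean_spam = {"viagra": 9, "lottery": 8}

# careless training: the hidden poison words went into the spam corpus too
poisoned_spam = {"viagra": 9, "lottery": 8, "rendezvous": 7, "hallo": 6}

legit = ["rendezvous", "hallo"]       # an ordinary multilingual ham message
print(message_prob(legit, clean_spam, ham_counts, 10, 10))     # ~0.01: ham
print(message_prob(legit, poisoned_spam, ham_counts, 10, 10))  # ~0.997: "spam"
```

note how the damage scales with corpus size: the fewer ham occurrences of a poison word you have trained, the more each poisoned spam occurrence dominates its probability - which is exactly why a multilingual userbase with thin per-language ham suffers most.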
believe it or not - my goal is to train a bayes database once and have a sane system for many, many years. what I often read is "spam samples become outdated, so you need to restart" - no, they don't; frankly, dead-safe subjects to block have existed for 10 years. the cause is auto-learning, not that spam changes all the time - it changes when new templates and types appear, but spam samples from years ago reappear all the time. with the permanent resets and expiry you are running in circles, training on the same crap again and again
it will work out the same way as in 2008, when I said "the virtual servers I just installed will run for the next 10 to 15 years at least" and people told me "until then you will install from scratch multiple times" - well, the first 7 years have passed...