On 23.02.2015 at 00:11, RW wrote:
> On Fri, 20 Feb 2015 21:36:38 +0100 Reindl Harald wrote:
>>> And I'd suggest the same for non-spam, train duplicative ham even if it happens to be similarly addressed to different users. More data is (nearly) always better for bayesian learning systems
>> of course
> With the caveat that you keep an eye on retention.
of course, or you disable auto-expire and auto-learning if you maintain bayes by hand
i for myself don't trust any automatism in that case because it easily leads to training false positives as well as false negatives, or destroys the ham/spam balance in one direction or the other
been there, done that, the results can be both:
* spam detection becomes unreliable over time
* most mails, especially newsletters, drift towards spam
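for a hand-maintained bayes, the two switches mentioned above would look roughly like this in the SpamAssassin configuration (a minimal sketch: that it lives in local.cf is an assumption and depends on the installation, the option names are the standard SpamAssassin ones):

```conf
# assumption: placed in local.cf or another file SpamAssassin reads
# do not let SpamAssassin train bayes on its own
bayes_auto_learn 0
# do not expire old tokens automatically
bayes_auto_expire 0
```

with auto-expire off the database only grows, so expiry then has to be run by hand when needed, e.g. with `sa-learn --force-expire`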
>> in doubt the amount of trained ham and spam should be near 50%,
> This is a myth. What's important is to have enough of each, the actual ratio is not important.
true - but you don't have much to measure "enough of each" with, so trying to keep 50/50 is a good starting point - hence i said "in doubt"
in the end you at least get a problem in both of these cases:
* 1% ham samples, 99% spam samples
* 1% spam samples, 99% ham samples
then bayes develops a bias