On 23.02.2015 at 00:11, RW wrote:
On Fri, 20 Feb 2015 21:36:38 +0100
Reindl Harald wrote:
And I'd suggest the same for non-spam, train duplicative ham even
if it happens to be similarly addressed to different users. More
data is (nearly) always better for bayesian learning systems

of course

With the caveat that you keep an eye on retention.

of course, or you disable auto-expire and auto-learning if the bayes database is maintained by hand

i for myself don't trust any automation in that case, because it easily ends up training false positives as well as false negatives, or it tips the ham/spam balance in one direction or the other

been there, done that, and both of these can be the result:

* spam detection becomes unreliable over time
* most mails, especially newsletters, drift towards the spam side

when in doubt, the amounts of trained ham and spam should be close to 50/50,

This is a myth. What's important is to have enough of each, the actual
ratio is not important.

true - but you don't have much to measure "enough of each" with, so trying to keep it at 50/50 is a good starting point - hence i said "when in doubt"

in the end you have a problem in both of these cases, at least:

* 1% ham samples, 99% spam samples
* 1% spam samples, 99% ham samples

the bayes database then develops a bias towards one side
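
To put a rough number on "enough of each" and on the ham/spam balance, something like the following could be run against the output of `sa-learn --dump magic`. This is only a sketch: the helper names `bayes_counts` and `balance_report` are made up for illustration, the parser assumes the counters appear in the third column of the `non-token data: nspam` and `non-token data: nham` lines, and the threshold of 200 assumes the default values of `bayes_min_ham_num` and `bayes_min_spam_num`.

```python
# Sketch only: helper names are hypothetical, and the column layout of
# `sa-learn --dump magic` is assumed (value in the third column).
MAGIC_PREFIX = "non-token data: "


def bayes_counts(dump_text):
    """Return (nspam, nham) parsed from `sa-learn --dump magic` output."""
    counts = {"nspam": 0, "nham": 0}
    for line in dump_text.splitlines():
        # Four numeric columns, then the free-text label.
        parts = line.split(None, 4)
        if len(parts) != 5 or not parts[4].startswith(MAGIC_PREFIX):
            continue
        key = parts[4][len(MAGIC_PREFIX):].strip()
        if key in counts:
            counts[key] = int(parts[2])
    return counts["nspam"], counts["nham"]


def balance_report(nspam, nham, minimum=200):
    """Summarise the ham share and whether each class reaches `minimum`.

    `minimum` is assumed to mirror the default of bayes_min_ham_num and
    bayes_min_spam_num; adjust it if your configuration differs.
    """
    total = nspam + nham
    ham_share = nham / total if total else 0.0
    return {
        "ham_share": ham_share,
        "enough_ham": nham >= minimum,
        "enough_spam": nspam >= minimum,
    }
```

A ham share far away from 0.5 does not prove that anything is broken, but together with the two "enough" flags it gives a quick hint when the database has drifted towards one of the 1%/99% extremes above.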
