Reindl Harald [] wrote:

> > However, that doesn't happen.
> > 0.000          0     338770          0  non-token data: nspam
> > 0.000          0    1460807          0  non-token data: nham

> what do you expect when you train 4 times more ham than spam?
> frankly, you "flooded" your Bayes with 1.4 million ham samples, and I thought 
> our 140k-message total corpus was large - don't forget that ham messages are 
> typically larger than junk, which just tries to point you to a URL with a few words
> 108897   SPAM
> 31492    HAM

This is a production mail gateway that has been in service since 2015. A few 
messages (both ham and spam) are automatically learned by amavisd/spamassassin. 
Today's counts:

   3616 autolearn=ham
  10076 autolearn=no
   2817 autolearn=spam
    134 autolearn=unavailable

I think I have no control over what is learnt automatically.

Let's just assume for a moment that the 1.4M ham samples are valid.
Is there a ham:spam ratio I should stick to? I presume that with the current 
skew, future messages are biased toward ham and won't be classified as spam.
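To see how far the database is skewed, the nspam/nham counters can be pulled out of the `sa-learn --dump magic` output and turned into a ratio. A minimal sketch, with the counts from the dump above embedded as sample input; in practice you would pipe `sa-learn --dump magic` straight into the awk command:

```shell
# Compute the ham:spam ratio from sa-learn's magic dump.
# The two sample lines below reproduce the dump quoted earlier;
# replace the printf with: sa-learn --dump magic
printf '%s\n' \
  '0.000          0     338770          0  non-token data: nspam' \
  '0.000          0    1460807          0  non-token data: nham' |
awk '/nspam/ {spam=$3} /nham/ {ham=$3}
     END {printf "ham:spam ratio = %.1f:1\n", ham/spam}'
# prints: ham:spam ratio = 4.3:1
```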
