On Tue, 13 Feb 2018, Horváth Szabolcs wrote:

3. populate the ham database

That's the tricky part. As I mentioned earlier, I don't really want end-users involved in this.

You might be able to find a few users who are reasonably technically competent and don't mind their ham samples being manually reviewed.

One more question: is there a recommended ham to spam ratio? 1:1?

I suggest "try to match your ham:spam ratio at your MTA before filtering", but others may have different advice. Generally: the more *reliable* data you can feed Bayes, the better it does.
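To make the ratio suggestion concrete, here is a minimal sketch of the arithmetic. The function name and the example counts are made up for illustration; plug in whatever your MTA logs actually show before filtering.

```python
def ham_target(spam_in_corpus: int, mta_ham: int, mta_spam: int) -> int:
    """Return how many ham messages the training corpus needs so that its
    ham:spam ratio matches the ratio observed at the MTA before filtering.

    All three inputs are plain message counts (hypothetical here).
    """
    if mta_spam <= 0:
        raise ValueError("mta_spam must be a positive count")
    if spam_in_corpus < 0 or mta_ham < 0:
        raise ValueError("counts must not be negative")
    # Integer ceiling division: round up so the corpus is never short on ham.
    return -(-(spam_in_corpus * mta_ham) // mta_spam)


# Example with invented numbers: the MTA sees 1 ham for every 3 spam,
# and 6000 spam messages have already been collected for training.
print(ham_target(6000, mta_ham=1, mta_spam=3))
```

With those invented counts, a corpus of about 2000 ham would match the pre-filter ratio.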

If you see my "populating the ham database automatically with the outgoing emails" idea as complete nonsense, then I could find sysadmin resources to collect 2000 legit emails and train on them as ham, but I cannot allocate 2 work hours/day for months. (Also, I'm not sure whether 2000 legit emails are enough for training.)

2000 is enough to start, but it would have to be ongoing as the nature of mail changes over time.

Generally, training on misclassifications is what you do after the initial training. So if a ham message drops into a user's quarantine folder, you'd want to train it as ham.
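A rough sketch of what that retraining step could look like, assuming the misclassified message has been saved to a file and that sa-learn is on the PATH and run as the user who owns the Bayes database. The helper names and the file name are hypothetical; only the sa-learn --ham and --spam options are relied on.

```python
import subprocess
from pathlib import Path
from typing import List


def sa_learn_command(message: Path, is_ham: bool) -> List[str]:
    """Build the sa-learn invocation for one corrected message."""
    flag = "--ham" if is_ham else "--spam"
    return ["sa-learn", flag, str(message)]


def retrain(message: Path, is_ham: bool, dry_run: bool = True) -> List[str]:
    """Retrain Bayes on a misclassified message.

    With dry_run=True (the default) nothing is executed; the command is
    only returned so it can be inspected or logged first.
    """
    cmd = sa_learn_command(message, is_ham)
    if not dry_run:
        # check=True raises CalledProcessError if sa-learn exits non-zero.
        subprocess.run(cmd, check=True)
    return cmd


# A ham message that landed in quarantine (hypothetical file name):
print(retrain(Path("quarantined-ham.eml"), is_ham=True))
```

The dry-run default is a deliberate choice here: it lets an admin review which messages would be learned as ham before anything touches the database.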

