David Jones [mailto:djo...@ena.com] wrote:
> With non-English email flow, it's more challenging. If no RBLs hit, then you
> really must train your Bayes properly which requires some way to accurately
> determine the ham and spam. You must keep a copy of the
> ham and spam corpora and be allowed to review suspicious email.
I really appreciate you taking the time to help with this.
Yes, I can confirm that we usually have issues with Hungarian spam. English
spam is often caught by the default rules.
As far as I understand it now, I need to rebuild the Bayes database from
scratch:
1. turn off autolearning
2. populate the spam database
The people behind the http://artinvoice.hu/spams/ site are doing excellent
work; they publish the spam they catch in mbox format.
I checked, and many of the spam e-mails that were sent in for investigation
are in their mbox.
3. populate the ham database
That's the tricky part. As I mentioned earlier, I don't really want end users
involved in this, and I don't have the resources to do it manually.
I assume I can hack something into the mail flow that copies all outgoing
e-mails to a separate mailbox and, on the assumption that every outgoing e-mail
is ham, learns those mails as ham.
Would that do it?
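
To make the plan concrete, here is a rough, untested sketch of the three steps.
Step 1 corresponds to setting "bayes_auto_learn 0" in local.cf; steps 2 and 3
correspond to running sa-learn with --spam or --ham plus --mbox on each corpus
file. The snippet only assembles and prints the command lines, it does not run
them. The file paths are made-up placeholders, and the initial "sa-learn
--clear" assumes the existing Bayes database really should be wiped first.

```python
import shlex


def sa_learn_command(kind, mbox_path):
    """Build the sa-learn argument list for training on one mbox file."""
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")
    return ["sa-learn", "--" + kind, "--mbox", mbox_path]


# Hypothetical corpus locations; adjust to wherever the mbox files live.
plan = [
    ["sa-learn", "--clear"],  # assumption: start from an empty database
    sa_learn_command("spam", "/var/tmp/corpus/spam.mbox"),
    sa_learn_command("ham", "/var/tmp/corpus/outgoing-ham.mbox"),
]

for cmd in plan:
    # Print a shell-safe version of each command for review before running it.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

The commands would need to run as the user that owns the Bayes database,
otherwise sa-learn trains a different per-user database.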
End users work in a heavily controlled environment (both technically and
legally); in the last ten years we haven't seen any spam sent from inside.
That's why I would blindly trust outgoing e-mails as ham.
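
The "copy every outgoing e-mail to a separate mailbox" part could look roughly
like the sketch below. It uses Python's standard mailbox module to append one
raw message to an mbox file. How the MTA hands the message to such a script
(a BCC to a dedicated account, a pipe transport, a content filter) depends on
the mail server and is not shown; the function name, the sample message and
the file name are all hypothetical.

```python
import mailbox
import os
import tempfile
from email import message_from_bytes


def archive_message(raw_bytes, mbox_path):
    """Append one raw message to an mbox file and return the new message count."""
    box = mailbox.mbox(mbox_path)  # created automatically if it does not exist
    box.lock()  # guard against concurrent deliveries writing to the same file
    try:
        box.add(message_from_bytes(raw_bytes))
        box.flush()
        return len(box)
    finally:
        box.unlock()
        box.close()


# Demo with a made-up outgoing message written to a temporary directory.
sample = (
    b"From: user@example.com\n"
    b"To: partner@example.org\n"
    b"Subject: Monthly report\n"
    b"\n"
    b"Please find the report attached.\n"
)

with tempfile.TemporaryDirectory() as tmp:
    ham_box = os.path.join(tmp, "outgoing-ham.mbox")
    archive_message(sample, ham_box)
    print("messages archived:", archive_message(sample, ham_box))
```

The resulting mbox could then be fed to sa-learn --ham --mbox on a schedule.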
One more question: is there a recommended ham to spam ratio? 1:1?
If you think my idea of populating the ham database automatically from
outgoing e-mails is complete nonsense, I could find sysadmin resources to
collect 2000 legitimate e-mails once and train them as ham, but I cannot
allocate 2 work hours/day for months. (I'm also not sure whether 2000
legitimate e-mails are enough for training.)
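
Whatever the right ratio turns out to be, it is easy to check what the two
corpora actually contain before training. The sketch below counts the messages
in each mbox file and prints the ham:spam ratio. The two thresholds reflect the
documented defaults of bayes_min_ham_num and bayes_min_spam_num (200 each),
below which SpamAssassin does not use Bayes at all; everything else (function
names, demo files) is made up for illustration.

```python
import mailbox
import os
import tempfile

# Defaults of bayes_min_ham_num / bayes_min_spam_num in SpamAssassin.
BAYES_MIN_HAM = 200
BAYES_MIN_SPAM = 200


def count_messages(mbox_path):
    """Return the number of messages stored in an existing mbox file."""
    box = mailbox.mbox(mbox_path, create=False)
    try:
        return len(box)
    finally:
        box.close()


def corpus_ratio(ham_count, spam_count):
    """Return ham divided by spam, or None when there is no spam to compare."""
    if spam_count == 0:
        return None
    return ham_count / spam_count


def write_demo_mbox(path, n):
    """Create a tiny mbox with n placeholder messages (demo data only)."""
    box = mailbox.mbox(path)
    try:
        for i in range(n):
            box.add("From: demo@example.com\nSubject: demo %d\n\nbody\n" % i)
        box.flush()
    finally:
        box.close()


with tempfile.TemporaryDirectory() as tmp:
    ham_path = os.path.join(tmp, "ham.mbox")
    spam_path = os.path.join(tmp, "spam.mbox")
    write_demo_mbox(ham_path, 3)
    write_demo_mbox(spam_path, 6)
    ham, spam = count_messages(ham_path), count_messages(spam_path)
    print("ham=%d spam=%d ratio=%s" % (ham, spam, corpus_ratio(ham, spam)))
    print("above Bayes minimums:", ham >= BAYES_MIN_HAM and spam >= BAYES_MIN_SPAM)
```

After training, "sa-learn --dump magic" reports the nham and nspam counts the
database has actually learned, which is the number that matters.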