On Sat, 9 Nov 2013, Sergio Durigan Junior wrote:
On Saturday, November 09 2013, Karsten Bräckelmann wrote:
You don't have any kind of archive of spam? If you do, train on recent ones,
feel free to exceed the minimum limit, but don't bother too much with
old spam. It changes much faster over time than ham does.
Also, at least until you have reached the minimum required training, do
train with identified spam, too. Same with ham. For now, keep training in
a ratio somewhere between 1:1 and your actual spam-to-ham ratio.
[Note: By ham I assume you mean false-positives, and not just regular
e-mail.]
No, (un)fortunately I don't. I've been running this server for 5 months
now, and only received about 10 spams so far. I decided to start
running SA now because I've received 5 spams in the last 3 days, which
triggered my internal alarm.
Do train. Spam, as well as ham. If you got some recent-ish archives.
Will do. However, I don't have false-positives (ham) to train. As I
said above, I only have about 10 spam messages, which I already used to
train Bayes. Not sure if it is possible/would be good to search for
recent spam archives on the net. I believe not...
For Bayes to work it needs at least 200 examples of Ham (e-mail that
you want) and 200 examples of Spam (e-mail that you don't want).
It doesn't matter whether the rules-based SA engine classified the
messages correctly or incorrectly, only what you consider Ham/Spam
(i.e. correctly classified by -you-).
In essence you are "teaching" the Bayes system to recognize
your preferences in classifying e-mail.
So the messages you've kept in your INBOX should be good for Ham.
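
As a rough illustration of the advice above, here is a minimal Python sketch that counts the messages in a ham folder and a spam folder, reports how many more of each are needed to reach the 200-per-class minimum, and builds the matching sa-learn command lines without running them. It assumes the mail is stored in Maildir format; the function names and folder paths are hypothetical and not taken from the thread, and the sa-learn invocations use only its --ham and --spam options.

```python
import mailbox

MIN_PER_CLASS = 200  # per-class minimum quoted in this thread


def count_maildir(path):
    """Return the number of messages in the Maildir folder at `path`."""
    box = mailbox.Maildir(path, create=False)
    try:
        return len(box)
    finally:
        box.close()


def still_needed(n_ham, n_spam, minimum=MIN_PER_CLASS):
    """Return how many more ham and spam messages are needed to reach the minimum."""
    return {
        "ham": max(0, minimum - n_ham),
        "spam": max(0, minimum - n_spam),
    }


def sa_learn_commands(ham_dir, spam_dir):
    """Build (but do not run) sa-learn invocations for the two folders."""
    return [
        ["sa-learn", "--ham", ham_dir],
        ["sa-learn", "--spam", spam_dir],
    ]


if __name__ == "__main__":
    # Hypothetical counts: plenty of kept mail, very little spam so far.
    print(still_needed(n_ham=350, n_spam=10))  # {'ham': 0, 'spam': 190}
```

With only about 10 spam messages on hand, a check like this would show the spam side well short of the minimum, which is why the kept INBOX mail alone is not enough to get Bayes going.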
--
Dave Funk University of Iowa
<dbfunk (at) engineering.uiowa.edu> College of Engineering
319/335-5751 FAX: 319/384-0549 1256 Seamans Center
Sys_admin/Postmaster/cell_admin Iowa City, IA 52242-1527
#include <std_disclaimer.h>
Better is not better, 'standard' is better. B{