Re: BAYES...sitewide or per-user or not at all?

Gerald V. Livingston II 10 Apr 2005 00:10:33 -0000

Thanks Bob,

On Fri, 8 Apr 2005 17:24:05 -0700 Robert Menschel wrote:

> Hello Gerald,
> 
> Thursday, April 7, 2005, 6:58:55 PM, you wrote:
> 
> GVLI> I'm afraid domain wide bayes would show up as many FPs for the
> GVLI> first two groups or many FNs for the last two -- or the database
> 
> It balances out.  Granny puts the porn into her spam box, and Ginger
> puts a graphic discussion of last night's wet dream into her ham box.
> Over time bayes learns which mails everyone thinks is spam, which
> mails everyone thinks is ham, and which mails are undeterminable.

I guess I need to read more on how bayes works.

I'm looking at which scores I can let my users modify directly. If
individual users can lower the bayes scores somewhat, it might not be so
bad. I'm trying really hard not to alienate any specific group of people,
though. Our userbase leans MUCH more heavily toward the "non-porn-hound" type
(families and businesses), so that's what has me concerned about site-wide
or domain-wide bayes.

> GVLI> I'm not sure how resource efficient per-user BAYES would be. Will it
> GVLI> kill the machine as the user base grows or the spam volume increases?
> 
> per-user Bayes lookups aren't bad -- don't worry about them. The
> question revolves around per-user Bayes database storage (do you have
> enough disk space), and how you manage the sa-learn process.

sa-learn -- does anyone have a way to stat() all the SPAM folders and run
sa-learn only on those that have had new messages added by customers? I
could find them with 'find' by searching on the modification date, but I'd
still need some way to tell sa-learn which username to run as.
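
A minimal sketch of that idea, assuming a hypothetical layout of
<mail_root>/<username>/<spam_folder>/{new,cur} so that the username can be
read straight from the path. The function names, the folder layout and the
timestamp handling are illustrative assumptions. The -u option is the
sa-learn username override documented for SQL-backed virtual users. Note
that a directory's modification time also changes when messages are removed
or renamed, so this over-reports rather than under-reports.

```python
from pathlib import Path


def folders_with_new_spam(mail_root, spam_folder, since_epoch):
    """Yield (username, spam_dir) for each spam folder touched after since_epoch.

    Assumes the hypothetical layout <mail_root>/<username>/<spam_folder>/{new,cur}.
    """
    for user_dir in sorted(Path(mail_root).iterdir()):
        if not user_dir.is_dir():
            continue
        spam_dir = user_dir / spam_folder
        for sub in ("new", "cur"):
            d = spam_dir / sub
            # A directory's mtime changes when an entry is added to it.
            if d.is_dir() and d.stat().st_mtime > since_epoch:
                yield user_dir.name, spam_dir
                break


def sa_learn_command(username, spam_dir):
    """Build (but do not run) the sa-learn call for one user's spam folder."""
    return ["sa-learn", "--spam", "-u", username, str(spam_dir)]
```

Each command list could then be handed to subprocess.run() from a cron job,
with since_epoch set to the time of the previous run.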

Space I'm not worried about. The machine I'm building "everything" on now
has 250 GB of storage (2 x 250 GB drives in RAID1) and will be the primary
location for user mail stores and the SMTP/IMAP/POP3 server for customers
(IMAP only for the webmail interfaces, not direct). At 20 MB per mailbox,
7000 addresses only use 140 GB if every customer stops using POP3 and lets
their online storage fill to max capacity.
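
The worst-case figure can be checked with a few lines. This is only the
arithmetic from the paragraph above, in decimal units, and the variable
names are illustrative:

```python
MAILBOX_QUOTA_MB = 20   # per-mailbox limit quoted above
ADDRESSES = 7000        # number of addresses quoted above
USABLE_GB = 250         # RAID1 mirrors the two drives, so one drive's worth is usable

# Worst case: every mailbox is filled to its quota.
worst_case_gb = MAILBOX_QUOTA_MB * ADDRESSES / 1000
headroom_gb = USABLE_GB - worst_case_gb

print(f"worst case: {worst_case_gb:.0f} GB, headroom: {headroom_gb:.0f} GB")
# prints "worst case: 140 GB, headroom: 110 GB"
```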

When I can take the other server down I will be moving scanning duties to a
dedicated gateway that will have 2 x 250 GB drives in RAID0, striped for
speed rather than data redundancy.

So, now I have to decide where to put the database(s) and how to split them
up. I'm thinking a single database with all required user information would
be best (login, SA prefs, Maildir info, everything) from a configuration
point of view. I'd be able to point all config items to a single database
and relate the tables within.
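
A rough sketch of what "one database, related tables" could look like. It
uses Python's built-in sqlite3 purely as a stand-in for MySQL, and the table
names, column names and sample row are all hypothetical, not the schema that
SpamAssassin or any mail server ships with.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    username TEXT PRIMARY KEY,      -- shared key for every other table
    maildir  TEXT NOT NULL,
    quota_mb INTEGER NOT NULL
);
CREATE TABLE sa_prefs (
    username   TEXT NOT NULL REFERENCES users(username),
    preference TEXT NOT NULL,
    value      TEXT NOT NULL
);
""")
conn.execute("INSERT INTO users VALUES (?, ?, ?)",
             ("alice@example.com", "example.com/alice/", 20))
conn.execute("INSERT INTO sa_prefs VALUES (?, ?, ?)",
             ("alice@example.com", "required_score", "4.0"))


def settings_for(conn, username):
    """Collect mail-store info and SA prefs for one user, or None if unknown."""
    row = conn.execute(
        "SELECT maildir, quota_mb FROM users WHERE username = ?", (username,)
    ).fetchone()
    if row is None:
        return None
    prefs = dict(conn.execute(
        "SELECT preference, value FROM sa_prefs WHERE username = ?", (username,)
    ))
    return {"maildir": row[0], "quota_mb": row[1], "prefs": prefs}
```

With everything keyed on username, each service only needs the one
connection string plus its own query.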

I'm worried about resources though. Will a single machine striping across 2
spindles be able to handle the I/O in a timely fashion? Should I put the
database(s) on the customer mail machine and just waste the extra space
available on the gateway drives? Or should I split the system into multiple
databases with duplicate data for identification (bayes with username, plus
login with username and storage info), putting one database on the gateway
and another on the mail server?

I'm trying hard to determine what I can set up to allow users to modify
just about anything in their SA settings also. Just as if they had a login
account and could create their own .prefs file -- except this is all going
to be virtual with no home directories -- all in MySQL. I have one customer
who doesn't want ANYTHING from overseas. No APNIC, RIPE, etc. He has some
very carefully crafted Perl regex filters on the current mail server that
mostly do the trick for him and he's going to lose the ability to use those
when we move the server. I want him to be able to pinpoint RBL filters that
score on origination point and bump up those that come from the countries
he definitely wants blocked.
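
As an illustration of the per-user score idea, stored overrides could be
rendered as ordinary SpamAssassin "score RULE value" preference lines. The
rule names below are placeholders rather than real RBL rules, and whether a
given setup lets users override a particular rule is a separate question.

```python
def render_score_overrides(overrides):
    """Render {rule_name: score} as SpamAssassin 'score' preference lines."""
    return [f"score {rule} {score:.1f}"
            for rule, score in sorted(overrides.items())]


# Hypothetical rule names standing in for region-based RBL rules.
prefs = render_score_overrides({
    "EXAMPLE_RBL_REGION_A": 5.0,
    "EXAMPLE_RBL_REGION_B": 3.5,
})
print("\n".join(prefs))
```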

I still have to yank Dovecot off the server and go with Courier. Courier is
more resource intensive but Dovecot isn't quite "ready for prime time" yet.
I need full quota support and it's not there in the stable versions. The
test versions tend to break something every time they fix something else.
I'll probably move back to it when it settles out.

Gerald
