Re: OT? Ethics, privacy, and getting a bigger corpus

jdow 7 May 2004 23:28:40 -0000

From: "Ryan Thompson" <[EMAIL PROTECTED]>
> Hi all,
> 
> This is likely off-topic... but I think it's likely an interesting
> topic to many of the people on this list.
> 
> Like many of you, I run SpamAssassin system-wide, for thousands of
> users, whose privacy I have promised to uphold. Thus, we can't just go
> gallivanting around everyone's mailboxes to build a larger corpus of
> spam and ham. Our system is getting *plenty* of email to build a really
> beautiful corpus, upon which I could design rules which would increase
> our accuracy to near-100% (from 92-95% currently). The problem is, we
> only have sanctioned access to about 3% of the email that comes through
> our system, and the only time users forward email to help our filters
> out is when they find an uncaught spam, which doesn't help much.
> 
> To make a long story short, I've been forever toying with the idea of
> allowing users (on a strictly voluntary basis!) to permit duly appointed
> employees of our company to manually scan and classify their email. In
> other words, I'd be asking people for permission to routinely read their
> mail.
> 
> Technically, it's easy to implement. From a business sense, though, I
> definitely have reservations about something like this. Sure, it's
> voluntary, but even having to *ask* for something like this might well
> ruffle the feathers of the more conservative clients we have. Then
> again, many users will probably appreciate the opportunity to have us
> "on the case", especially since the accuracy of our filtering will
> improve measurably as a result.
> 
> Has anyone tried anything like this? How did your clients react? Did you
> have better or worse luck with different wording/approaches?
> 
> Any feedback would be much appreciated. If you feel this is way-OT for
> the list, off-list replies are fine, too.
> 
> Thanks,
> - Ryan


I sympathize with your problem. As a user, I could see allowing you to
use something like a procmail recipe to clone any email that is tagged
as spam with a high enough score, say 10 or 15, for use in maintaining
a corpus. I'd have serious problems with the idea of your pulling
messages to build a ham corpus, though. So you'd end up with a really
huge spam corpus and a puny and likely biased ham corpus.
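
A rough sketch of the selection step described above, written in Python
rather than as a procmail recipe. It assumes SpamAssassin has already
added an X-Spam-Status header containing "score=" (or "hits=" in older
releases); the function names and the default threshold of 10 are
illustrative assumptions, not anything from a real deployment.

```python
import re
from email import message_from_string

# Matches "score=15.2" (newer SpamAssassin) or "hits=15.2" (older releases).
SCORE_RE = re.compile(r"\b(?:score|hits)=(-?\d+(?:\.\d+)?)")


def spam_score(raw_message):
    """Return the score found in X-Spam-Status, or None if there is none."""
    msg = message_from_string(raw_message)
    status = msg.get("X-Spam-Status", "")
    match = SCORE_RE.search(status)
    return float(match.group(1)) if match else None


def should_clone(raw_message, threshold=10.0):
    """True only for messages scored at or above the (assumed) threshold."""
    score = spam_score(raw_message)
    return score is not None and score >= threshold
```

Messages with no score at all are never cloned, so untagged mail (ham or
otherwise) stays out of the corpus by default.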

And then you'd still be faced with the task of processing all these
diverse spam captures to eliminate duplicates.
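
One simple way to approach the duplicate problem, sketched under stated
assumptions: fingerprint each message by hashing its body after
lowercasing and collapsing whitespace, then keep the first message seen
for each fingerprint. The helper names are hypothetical, and this only
catches copies that are identical after that normalization; spam that
carries random filler text would need some form of fuzzy matching.

```python
import hashlib
import re
from email import message_from_string


def body_fingerprint(raw_message):
    """Hash the message body, ignoring headers, case, and whitespace runs."""
    msg = message_from_string(raw_message)
    # Collect the payload of every non-container part.
    chunks = [part.get_payload() for part in msg.walk()
              if not part.is_multipart()]
    text = re.sub(r"\s+", " ", " ".join(chunks).lower()).strip()
    return hashlib.sha256(text.encode("utf-8", errors="replace")).hexdigest()


def deduplicate(raw_messages):
    """Keep the first message for each distinct body fingerprint."""
    seen = set()
    unique = []
    for raw in raw_messages:
        fingerprint = body_fingerprint(raw)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(raw)
    return unique
```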

{O.O}   (Yeah, I am a downer. But somewhere in engineering school I
        learned to make immediate worst-case assessments of ideas. It
        is a hard-to-break habit.)
