Re: OT? Ethics, privacy, and getting a bigger corpus

Ryan Thompson 8 May 2004 00:50:34 -0000

Good thoughts.
I'll reply in-line to both of you in this email. Read through...

Justin Mason wrote to jdow:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> jdow writes:
>
> > I have sympathies for your problem. As a user I can see allowing you
> > to use something like a line in procmail to clone any email that is
> > tagged as a virus with a high enough score, say 10 or 15, to be used
> > to maintain a corpus. I'd have sincere problems with the idea of
> > your pulling messages to build a ham corpus, though.

Yes. I would *expect* users to have a problem with that. I guess the big
issue is whether enough users would find the benefit worth volunteering
their incoming mail to the cause.

Maybe the better question to ask would be *why* would it be a problem
for you? Many concerns can be addressed with a more thorough
explanation, or stricter controls on staff member access.

Like, what if we *only* looked at headers for almost all messages
(because, as you'd probably agree, the vast majority of emails are
hand-classifiable based on headers alone), and (now I'm just thinking
out loud) generated a kind of inverse quarantine made up of the emails
whose full text staff members would like to see? We'd send the user an
email every day (or week, or whatever) listing the sender, subject line,
and a unique hash for each of those messages. Then:
    a) Ask them to classify those messages themselves
or  b) Have them reply to the message, deleting the lines for any
       messages that they'd rather we not see. Our staff will be
       granted access to the remainder.

That might be a reasonable compromise. The "inverse quarantine" for most
users would likely only be a couple of messages here and there, so it
wouldn't be a big pain to classify.
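
To make option (b) a little more concrete, here is a rough Python sketch
of how the digest and the reply handling might work. Everything in it is
an assumption for illustration: the function names are made up, the
12-character truncated SHA-1 is just one way to get a "unique hash", and
a real version would have to deal with MIME encodings and whatever the
user's mail client does to quoted lines.

```python
import hashlib
from email import message_from_string


def message_hash(raw):
    """Short, stable identifier for a raw message (truncated SHA-1; an assumption)."""
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()[:12]


def build_digest(raw_messages):
    """Return (digest_text, index), where index maps each hash to its raw message."""
    index = {}
    lines = []
    for raw in raw_messages:
        msg = message_from_string(raw)
        h = message_hash(raw)
        index[h] = raw
        # One line per message: hash, sender, subject.
        lines.append(f"{h}  {msg.get('From', '?')}  {msg.get('Subject', '')}")
    return "\n".join(lines), index


def permitted_hashes(reply_text, index):
    """Hashes whose lines survive in the user's reply are the ones staff may read."""
    permitted = set()
    for line in reply_text.splitlines():
        # Tolerate quote prefixes such as "> " added by the mail client.
        tokens = line.lstrip("> ").split()
        if tokens and tokens[0] in index:
            permitted.add(tokens[0])
    return permitted
```

The user deletes the lines they want kept private and sends the rest
back; only hashes still present in the reply unlock the full text.
Anything not explicitly returned stays closed, so a lost or ignored
digest grants nothing.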

Then we just make it explicit that our staffers only access the full
text of messages that they've been given explicit permission to read.
Our learning system (sa-learn, plus some other excellent
rule-generators of my own invention) would, of course, read the full
text of all messages, but humans wouldn't.

Admittedly, this isn't perfect, either... It'd be *much* easier and
nicer (for us) to simply FCC: every incoming email to an IMAP folder,
but that wouldn't go over so well. :-)

> > So you'd end up with a really huge spam corpus and a puny and likely
> > biased ham corpus.
>
> Worse -- the spam corpus would be entirely biased towards a particular
> sub-set of spam (the easily identifiable stuff).  That's no good...

Yeah... That's no better (probably worse) than SA's autolearning. The
whole point of doing something like this would be to get a
representative sample of ham and spam... i.e., a good corpus does not
favour any particular subset (score range, for instance) of messages, be
they ham or spam.

> > And then you'd still be faced with the task of processing all these
> > diverse spam captures to eliminate duplicates.
>
> Well, duplicates aren't a huge problem really.  Getting rid of those
> is a best-case scenario.

Right.

I guess the nice part would be that, given enough mail, it would be
pretty easy to take a random sample of it and get a corpus of
essentially any size we want. Statistically, that's *much* better than
taking only a day's worth, for instance.
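
For what it's worth, drawing that sample doesn't require holding all the
mail at once. Below is a minimal sketch using reservoir sampling
(Algorithm R), which keeps a uniform random sample of k messages from a
stream of unknown length. The function name and parameters are
hypothetical; this isn't something SpamAssassin provides, just one way
it could be done.

```python
import random


def reservoir_sample(messages, k, seed=None):
    """Uniform random sample of up to k items from any iterable (Algorithm R)."""
    rng = random.Random(seed)
    sample = []
    for i, msg in enumerate(messages):
        if i < k:
            # Fill the reservoir with the first k messages.
            sample.append(msg)
        else:
            # Keep message i with probability k / (i + 1),
            # evicting a uniformly chosen earlier pick.
            j = rng.randint(0, i)  # both ends inclusive
            if j < k:
                sample[j] = msg
    return sample
```

Running it separately over the ham stream and the spam stream would give
two corpora of whatever sizes we choose, each drawn evenly from the whole
collection period rather than from a single day.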

- Ryan

-- 
  Ryan Thompson <[EMAIL PROTECTED]>

  SaskNow Technologies - http://www.sasknow.com
  901-1st Avenue North - Saskatoon, SK - S7K 1Y4

        Tel: 306-664-3600   Fax: 306-244-7037   Saskatoon
  Toll-Free: 877-727-5669     (877-SASKNOW)     North America
