"Seth Goodman" <[EMAIL PROTECTED]> writes:

> [EMAIL PROTECTED] <> wrote on Saturday, February 03, 2007
> 3:15 PM -0600:
>
>> I'm getting what the title says.  I very rarely see ham classified as
>> unsure, and I get a few hundred unsures per day.  I keep training on
>> the unsures, which means my database accumulates lots more spam than
>> ham over time.  Is there anything I can do to help reduce the number
>> of messages classified as unsure without hurting Spambayes' ability to
>> correctly recognize ham?
>
> If your training set has much more spam than ham, you can train on ham
> that already scores properly.  

That'll help?  Great; it's easy enough.

> Whether you choose ham that scores very low already (typical ham) or
> the highest scoring ham (unusual ham) is your preference.

Are you suggesting that it makes no difference?

> If you use the Outlook plugin, 

No offense to all the Outlook users out there, but I avoid it like the
plague.  I'm using sb_imapfilter and doing the filtering server-side.

> just move the ham you want to train on to the unsure folder and tell
> Spambayes it's not spam.  How much trained ham/spam imbalance is too
> much is also up for debate.  Some people have reported good results
> with 5:1 and even 10:1 imbalance, while others do poorly under those
> conditions.  

Sounds pretty indefinite.  What does "poorly" mean?

> I try to avoid mine going further than 2:1 and train on
> my highest scoring ham to fix it.  This seems to work better for me
> than training only on unsures.

I don't get nearly enough unsures that are ham to correct the
imbalance that way.
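
The rebalancing approach quoted above (keep the trained spam:ham ratio
within about 2:1 by training on the highest-scoring ham) can be sketched
as follows.  This is only an illustration, not Spambayes code: the
function name, the (message_id, score) pairs, and the max_ratio
parameter are all hypothetical, and scores are assumed to run from 0
(ham-like) to 1 (spam-like).

```python
import math


def ham_to_train(scored_ham, n_ham, n_spam, max_ratio=2.0):
    """Pick the highest-scoring ham needed to bring spam:ham within max_ratio.

    scored_ham -- list of (message_id, score) pairs for ham that already
                  scores as ham and has not been trained yet (hypothetical
                  input; score 0.0 is most ham-like, 1.0 most spam-like)
    n_ham      -- number of ham messages already trained
    n_spam     -- number of spam messages already trained
    """
    # Smallest ham count for which n_spam / ham_count <= max_ratio.
    target_ham = math.ceil(n_spam / max_ratio)
    needed = max(0, target_ham - n_ham)

    # "Unusual" ham first: sort by score, highest (most spam-like) first.
    ranked = sorted(scored_ham, key=lambda pair: pair[1], reverse=True)
    return [msg_id for msg_id, _score in ranked[:needed]]


if __name__ == "__main__":
    candidates = [("a", 0.01), ("b", 0.15), ("c", 0.05), ("d", 0.19), ("e", 0.10)]
    # 10 spam vs. 2 ham trained: 3 more ham are needed to reach 2:1.
    print(ham_to_train(candidates, n_ham=2, n_spam=10))
```

Sorting in ascending order instead would select the typical
(lowest-scoring) ham, the other option mentioned earlier in the thread.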

> Another underappreciated issue with all self-learning classifiers is
> that they are very sensitive to training mistakes.  Training a couple of
> messages in the wrong category can really change the outcome, 

I've noticed.

> and the Outlook plugin doesn't tell you which messages are trained
> and whether you trained them as ham or spam.

Fortunately my server-side scheme tells me that easily enough.

> You have to figure this out indirectly, usually by rescoring all
> your messages and looking for obvious errors.  With a large set of
> messages, the likelihood of spotting a training mistake goes down.
> Fortunately, it's not hard to start from scratch, so this is a
> reasonable thing to try if things are not working as well as they
> should.
>
> Please let us know what you try, what helps and what doesn't.
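
The "rescore everything and look for obvious errors" check quoted above
could be partly automated along these lines.  This is a rough sketch
under stated assumptions, not part of Spambayes: the function name and
the (message_id, label, score) tuples are hypothetical, and the cutoffs
of 0.20 and 0.90 are assumed values that should be replaced with
whatever ham and spam cutoffs are actually configured.  A mistrained
message may still score in line with its wrong label, so this can only
narrow down the list to inspect by hand.

```python
def suspicious_training(trained, ham_cutoff=0.20, spam_cutoff=0.90):
    """Return trained messages whose rescored value contradicts their label.

    trained -- iterable of (message_id, label, score) tuples, where label
               is "ham" or "spam" (how the message was trained) and score
               is the current score from 0.0 to 1.0 (hypothetical input)
    """
    suspects = []
    for msg_id, label, score in trained:
        trained_ham_scores_spam = label == "ham" and score >= spam_cutoff
        trained_spam_scores_ham = label == "spam" and score <= ham_cutoff
        if trained_ham_scores_spam or trained_spam_scores_ham:
            suspects.append((msg_id, label, score))
    return suspects


if __name__ == "__main__":
    rescored = [
        ("m1", "ham", 0.02),   # consistent
        ("m2", "ham", 0.97),   # trained as ham, scores as spam
        ("m3", "spam", 0.99),  # consistent
        ("m4", "spam", 0.05),  # trained as spam, scores as ham
        ("m5", "spam", 0.50),  # unsure, not flagged
    ]
    for suspect in suspicious_training(rescored):
        print(suspect)
```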

I will, but aren't you afraid there are just too many levers to pull,
what with all the configuration options and legit approaches to
training?  Seems like it would be hard to learn much from user
feedback.

Thanks,

-- 
Dave Abrahams
Boost Consulting
www.boost-consulting.com

_______________________________________________
[email protected]
http://mail.python.org/mailman/listinfo/spambayes
Check the FAQ before asking: http://spambayes.sf.net/faq.html
