Tim Peters wrote:
> [Brendon Whateley]
>   
>> I've just started using spambayes again after a while away from it.
>> Now, 3 days in, I notice that I've trained on far more spam than ham.
>> (Total emails trained: Spam: 432, Ham: 64.)  I seem to remember that
>> this was also my experience in the past.
>>
>> My question is: has anybody really tested the assertion that leads to
>> the message: "Warning: you have much more spam than ham - SpamBayes
>> works best with approximately even numbers of ham and spam."?
>>     
>
> Yes, but by the time you and Tony wrote your paper, serious
> multi-corpus testing had long since essentially stopped.  The results
> with large imbalances were so dramatically worse that I introduced the
> infamous "experimental ham spam imbalance adjustment" switch, which
> tried to stop "the math" from drawing absurdly confident conclusions
> from wildly unbalanced data (see the thread Mark pointed out).  The
> results of that were a mixed bag, helping some people a little but
> hurting others more, so we dropped it.
>   
Yes, I remember that.  I can also guess why serious multi-corpus testing
stopped... as I recall, the pain of putting the corpora together is not
for the faint of heart :)
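
For concreteness, here is a minimal sketch of one way the class totals can
enter "the math": a Robinson-style smoothed per-token spam probability.
The function name and the constants s=0.45 and x=0.5 are assumptions made
for illustration, not values taken from the SpamBayes source.  The totals
432 and 64 are the training counts quoted above; the token counts (3 spam,
1 ham) are made up.

```python
def token_spam_prob(spam_count, ham_count, nspam, nham, s=0.45, x=0.5):
    """Robinson-style smoothed estimate of P(spam | token).

    spam_count, ham_count: trained messages of each class containing the token.
    nspam, nham: total trained messages of each class.
    s, x: assumed prior strength and prior probability (illustrative only).
    """
    spam_ratio = spam_count / nspam if nspam else 0.0
    ham_ratio = ham_count / nham if nham else 0.0
    if spam_ratio + ham_ratio == 0.0:
        return x  # token never seen in training: fall back to the prior
    raw = spam_ratio / (spam_ratio + ham_ratio)
    n = spam_count + ham_count
    # Pull the raw estimate toward the prior x; the pull weakens as n grows.
    return (s * x + n * raw) / (s + n)


if __name__ == "__main__":
    # Same token counts (3 spam, 1 ham), different training totals.
    print(token_spam_prob(3, 1, nspam=64, nham=64))   # roughly 0.72, leans spam
    print(token_spam_prob(3, 1, nspam=432, nham=64))  # roughly 0.33, leans ham
```

With 432 spam against 64 ham, each ham occurrence of a token counts 6.75
times as much as each spam occurrence, so the same raw counts flip from
spammy to hammy.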
> As I'm sure one of the text files in the project says, /all/ decisions
> "should be" reevaluated periodically.  Alas, a one-corpus test is
> essentially useless, and it was hard even some years ago to arrange
> for multi-corpus tests.
>   
In the worst case, I can satisfy my own curiosity and possibly provide
some insight.  I may be able to gather several different corpora for
some testing.  How many separate corpora would you consider a valid test?
> When the original testing was done, almost all spam was text-heavy,
> meaning lots of tokens were generated.  The paucity of tokens
> generated for more recent image-based spam, and spam hiding in
> attachments, makes SB's basic /approach/ less useful for that kind of
> spam.  No real idea how imbalance affects scoring spam of that kind.
>   
That is the thinking that led to my question about the imbalance effect.
Perhaps some method of generating tokens from images would restore order
to our world.
> The only thing I've done in response to it is lower my "spam
> threshold", down to 70 now, with ham at 5.  My unsure rate is about
> 6%, most of which are spam.  Every now and again I add the 10 most
> recent ham to my ham training data, but even so I've got about a 3:1
> spam:ham training ratio.  I do expect my stats would improve if I
> added more ham (I'm one of the ones the old imbalance option helped),
> but I spend so little time looking at unsures it's just not worth even
> tiny efforts to improve it.
At the very least I can test your approach against what I've been doing,
which is to just let the imbalance grow until some ham gets pulled into
unsure.  At that point I add the unsure ham to my training data and
continue on.  The answer may be of some help to those who find their
training leads to large imbalances.
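
The two cutoffs quoted above (ham at 5, spam at 70) split scores into three
bands.  A minimal sketch, assuming scores on a 0-100 scale; the function
names are hypothetical and the handling of scores that land exactly on a
cutoff is an assumption, not something checked against SpamBayes itself.

```python
def classify(score, ham_cutoff=5.0, spam_cutoff=70.0):
    """Map a 0-100 score to 'ham', 'unsure', or 'spam'.

    Boundary handling (< ham_cutoff, >= spam_cutoff) is an assumption.
    """
    if score < ham_cutoff:
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    return "unsure"


def spam_to_ham_ratio(nspam, nham):
    """Training imbalance as spam:ham; None when no ham has been trained."""
    return nspam / nham if nham else None


if __name__ == "__main__":
    for score in (2.0, 40.0, 69.9, 70.0, 99.0):
        print(score, classify(score))
    print(spam_to_ham_ratio(432, 64))  # training totals quoted earlier: 6.75
```

Lowering the spam cutoff narrows the unsure band from the top, which is
the adjustment described in the quoted text.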

When I get back, I'll start playing with this and see if anything useful
develops.
Brendon.

_______________________________________________
SpamBayes@python.org
http://mail.python.org/mailman/listinfo/spambayes
Check the FAQ before asking: http://spambayes.sf.net/faq.html
