I've had this problem several times when I fed SpamAssassin several hundred messages at once. The only solution I found (admittedly quite an annoying one) was to blow away the old database files and retrain from scratch, in smaller increments than before so as not to break it again.
I have no clue why it happens. Sometimes I can get away with large numbers of messages, sometimes I can't. Generally, I try to stick to no more than one or two hundred at a time so as to avoid the issue entirely.
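For anyone who wants to try the smaller-batches approach, here is a rough sketch of splitting one large mbox into pieces of a hundred or so messages using Python's standard mailbox module. The function name split_mbox, the batch size, and the output file naming are all just illustrative assumptions, not anything SpamAssassin provides, and whether batching actually avoids the problem is still only a guess.

```python
import mailbox


def split_mbox(src_path, batch_size=100, prefix="batch"):
    """Split the mbox at src_path into files of at most batch_size messages.

    Hypothetical helper: output files are named <prefix>-001.mbox,
    <prefix>-002.mbox, and so on. An output file that already exists
    is appended to, so start with a clean directory.
    """
    src = mailbox.mbox(src_path, create=False)
    out_paths = []
    out = None
    try:
        for i, msg in enumerate(src):
            # Start a new output file every batch_size messages.
            if i % batch_size == 0:
                if out is not None:
                    out.flush()
                    out.close()
                path = f"{prefix}-{i // batch_size + 1:03d}.mbox"
                out = mailbox.mbox(path)
                out_paths.append(path)
            out.add(msg)
    finally:
        # Write out the last (possibly partial) batch and release the source.
        if out is not None:
            out.flush()
            out.close()
        src.close()
    return out_paths
```

Each resulting file can then be fed to sa-learn one at a time, e.g. with `sa-learn --ham --mbox` as in the commands below.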
I've been training it with 2000 to 3000 spam and 100 to 300 ham every week for the last ~7 weeks with no problems.
I just broke my ham down into smaller batches, and it didn't learn from any of them:
# sa-learn --ham --mbox Ham-001.mbox
Learned from 0 message(s) (450 message(s) examined).
# sa-learn --ham --mbox Ham-002.mbox
Learned from 0 message(s) (541 message(s) examined).
# sa-learn --ham --mbox Ham-003.mbox
Learned from 0 message(s) (540 message(s) examined).
# sa-learn --ham --mbox Ham-004.mbox
Learned from 0 message(s) (542 message(s) examined).
# sa-learn --ham --mbox Ham-005.mbox
Learned from 0 message(s) (540 message(s) examined).
# sa-learn --ham --mbox Ham-006.mbox
Learned from 0 message(s) (214 message(s) examined).
So if I just delete bayes_seen and bayes_toks and then retrain, will that solve my problems? Should I set use_bayes to 0 while I'm deleting and retraining so that nothing barfs?
Thx.
-JR
