Greg> What I did last night was construct nearly equal size mailboxes
Greg> containing spam and non-spam. Each one had about 1600 messages in
Greg> it.
Why not try a simpler experiment? Start with the five most recent ham and
spam messages you've received. Only add mistakes (false positives or false
negatives) to those sets for a while. Be judicious in what you train on. If
you get a misclassified ham or spam, train on it. On the other hand, if you
get a bounce message from someplace complaining about a spam you purportedly
sent (and which includes the spam itself), just delete it for now. In my
experience, administrative messages like this have a confusing set of hammy
and spammy clues. Deal with them later after the dust has settled a bit.
Try that approach for a while and see if the performance doesn't improve
fairly dramatically over a short period of time.
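To make the train-on-error idea concrete, here's a toy sketch. The classifier below is a made-up word counter, not the real SpamBayes tokenizer or classifier, and the handle() helper is my own invention -- it just shows the policy of training only on misclassified messages:

```python
# Sketch of the train-on-error idea with a toy word-count classifier.
# ToyClassifier and handle() are hypothetical; SpamBayes' real API differs.
class ToyClassifier:
    def __init__(self):
        self.ham_counts = {}
        self.spam_counts = {}

    def train(self, text, is_spam):
        counts = self.spam_counts if is_spam else self.ham_counts
        for word in text.lower().split():
            counts[word] = counts.get(word, 0) + 1

    def is_spam(self, text):
        # Crude score: which corpus shares more word occurrences with
        # this message?
        words = text.lower().split()
        spam_hits = sum(self.spam_counts.get(w, 0) for w in words)
        ham_hits = sum(self.ham_counts.get(w, 0) for w in words)
        return spam_hits > ham_hits


clf = ToyClassifier()

# Seed with a handful of recent messages, as suggested above.
clf.train("meeting agenda for tuesday", is_spam=False)
clf.train("cheap meds buy now", is_spam=True)


def handle(clf, text, actually_spam):
    # Thereafter, train only on mistakes; correctly classified
    # messages are left alone.
    if clf.is_spam(text) != actually_spam:
        clf.train(text, actually_spam)
```

The point is the shape of the loop, not the scoring: the training sets stay small and every message in them earned its place by being misclassified.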
I'm always suspicious when someone says, "I started with N messages ...",
where N is large. I know from my own experience that it's all too easy to
make a classification mistake, just because I'm often not paying close
enough attention when I'm trying to rifle quickly through my inbox. There's
no way I would correctly classify 3200 messages. Some mistake would always
slip in. In addition, a training database can often collect multiples of
essentially the same message. While you might have equal numbers of hams
and spams in your database, if 1500 of the spams (to pick an absurdly
extreme number) are essentially the same spam, the coverage of the overall
spam token space isn't going to be very uniform. I'm not smart enough to
know that this would be a problem, but my intuition tells me it might.
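If you want to sanity-check a mailbox for that kind of skew, a quick hash-based duplicate count might help. This is a sketch of my own, not part of SpamBayes; the whitespace-and-case normalization is arbitrary, and it only catches near-identical bodies:

```python
import hashlib


def body_fingerprint(text):
    # Normalize whitespace and case so trivially-varying copies collide.
    normalized = " ".join(text.lower().split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def duplicate_report(bodies):
    # Map fingerprint -> count; any count over 1 flags likely duplicates.
    counts = {}
    for body in bodies:
        fp = body_fingerprint(body)
        counts[fp] = counts.get(fp, 0) + 1
    return {fp: n for fp, n in counts.items() if n > 1}
```

Feed it the message bodies from your spam mbox and see whether a few fingerprints account for a big chunk of the total.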
If you are running on a Unix-ish system and have your training databases in
Unix mbox-format files, I wonder if you wouldn't indulge my curiosity and
run tte.py over them? Assuming your hams and spams are in ham.mbox and
spam.mbox, something like the following should work:
python /path/to/tte.py -g ham.mbox -s spam.mbox -p tte.pck -c .cull -v
The tte.py script is in the contrib directory of the distribution. The
command line args above identify the ham and spam mailboxes, tell it to
create a pickle output file, and tell it to create two new files,
ham.mbox.cull and spam.mbox.cull, from which all messages that were
properly classified as ham or spam on every pass have been eliminated.
(Running tte.py with the --help flag should give a reasonable amount of
help.)
The output will look something like this:
round: 1, msgs: 376, ham misses: 154, spam misses: 182, 18.1s
round: 2, msgs: 376, ham misses: 32, spam misses: 8, 10.1s
round: 3, msgs: 376, ham misses: 4, spam misses: 2, 8.8s
round: 4, msgs: 376, ham misses: 3, spam misses: 2, 9.0s
round: 5, msgs: 376, ham misses: 5, spam misses: 1, 8.8s
round: 6, msgs: 376, ham misses: 2, spam misses: 1, 8.6s
round: 7, msgs: 376, ham misses: 1, spam misses: 1, 8.6s
round: 8, msgs: 376, ham misses: 0, spam misses: 0, 8.5s
29 untrained spams
writing new ham mbox...
186 of 188
writing new spam mbox...
216 of 218
For your training database it will obviously display many more messages.
I'm mostly interested in knowing what those last few lines look like. How
many fewer messages are written to the .cull files?
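If you'd rather tally the mailboxes by hand, a modern Python's stdlib mailbox module can count messages in an mbox file (a sketch, assuming Unix mbox format):

```python
import mailbox


def message_count(path):
    """Return the number of messages in a Unix mbox file."""
    mbox = mailbox.mbox(path)
    try:
        return len(mbox)
    finally:
        mbox.close()
```

Comparing message_count("ham.mbox") against message_count("ham.mbox.cull"), and likewise for the spam pair, gives the same numbers tte.py prints at the end.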
Skip
_______________________________________________
[email protected]
http://mail.python.org/mailman/listinfo/spambayes
Check the FAQ before asking: http://spambayes.sf.net/faq.html