I put together some test databases today using spam received in the past week or so (about 1800 messages) and a reasonable cross-section of my ham (all saved Python-related mail plus my regular non-specific mailbox, about 2300 messages). I then did some 5x5 cross-validation tests (that's the correct term, right?). For the control test I set all these options False:

    x-lookup_ip
    x-short_runs
    x-image_size
    x-crack_images

but otherwise used my standard configuration. I then made four runs, setting one option True for each run, then compared each test with the control run. The results are summarized briefly below.

control v. x-lookup_ip
----------------------

false positive percentages
    0.000  0.000  tied
    0.217  0.217  tied
    0.000  0.000  tied
    0.219  0.219  tied
    0.000  0.000  tied

won   0 times
tied  5 times
lost  0 times

...

false negative percentages
    4.199  4.199  tied
    1.404  1.404  tied
    4.412  4.412  tied
    4.533  4.533  tied
    4.222  4.222  tied

won   0 times
tied  5 times
lost  0 times

control v. x-short_runs
-----------------------

false positive percentages
    0.000  0.000  tied
    0.217  0.217  tied
    0.000  0.000  tied
    0.219  0.219  tied
    0.000  0.000  tied

won   0 times
tied  5 times
lost  0 times

...

false negative percentages
    4.199  4.199  tied
    1.404  1.404  tied
    4.412  4.412  tied
    4.533  4.533  tied
    4.222  4.222  tied

won   0 times
tied  5 times
lost  0 times

control v. x-image_size
-----------------------

false positive percentages
    0.000  0.000  tied
    0.217  0.434  lost  +100.00%
    0.000  0.000  tied
    0.219  0.219  tied
    0.000  0.000  tied

won   0 times
tied  4 times
lost  1 times

...

false negative percentages
    4.199  4.199  tied
    1.404  1.404  tied
    4.412  4.118  won    -6.66%
    4.533  4.533  tied
    4.222  3.958  won    -6.25%

won   2 times
tied  3 times
lost  0 times

control v. x-crack_images
-------------------------

false positive percentages
    0.000  0.000  tied
    0.217  0.217  tied
    0.000  0.000  tied
    0.219  0.219  tied
    0.000  0.000  tied

won   0 times
tied  5 times
lost  0 times

...

false negative percentages
    4.199  4.199  tied
    1.404  1.404  tied
    4.412  4.118  won    -6.66%
    4.533  3.966  won   -12.51%
    4.222  3.430  won   -18.76%

won   3 times
tied  2 times
lost  0 times

I didn't do anything to verify the accuracy of my spam and ham data. I'm doing that now. Also, the fact that the first two tests were identical to the control seems a bit suspicious, so I'm going to try them again after picking over my training database.
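
For anyone who wants to check the percentages by hand: the won/lost figures in the comparisons above appear to be the relative change from the control rate to the test rate for each fold. Here is a minimal sketch under that assumption. `compare_runs` is a hypothetical helper written for illustration, not part of the spambayes test scripts, and the choice to report no percentage when the control rate is zero is also an assumption.

```python
def compare_runs(control, variant):
    """Compare per-fold error rates (in percent) from two runs.

    Returns (rows, tally). Each row is (control, variant, outcome, change),
    where change is the relative change in percent, or None for a tie or
    when the control rate is zero.
    """
    rows = []
    tally = {"won": 0, "tied": 0, "lost": 0}
    for before, after in zip(control, variant):
        if after == before:
            outcome, change = "tied", None
        else:
            # A lower error rate than the control counts as a win.
            outcome = "won" if after < before else "lost"
            # Relative change is undefined when the control rate is zero.
            change = (after - before) / before * 100.0 if before else None
        tally[outcome] += 1
        rows.append((before, after, outcome, change))
    return rows, tally


# False negative rates from the control v. x-crack_images comparison.
control = [4.199, 1.404, 4.412, 4.533, 4.222]
crack_images = [4.199, 1.404, 4.118, 3.966, 3.430]

rows, tally = compare_runs(control, crack_images)
for before, after, outcome, change in rows:
    suffix = "" if change is None else f" {change:+.2f}%"
    print(f"{before:.3f}  {after:.3f}  {outcome}{suffix}")
print(tally)
```

Run on the x-crack_images false negative column, this gives 3 wins and 2 ties, with changes of -6.66%, -12.51% and -18.76%, matching the figures in the comparison.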

Still, the image_size and crack_images runs look promising, perhaps because my recent spam is so full of these pump-and-dump messages.

Skip
_______________________________________________
spambayes-dev mailing list
spambayes-dev@python.org
http://mail.python.org/mailman/listinfo/spambayes-dev