For some time now, I've had my email system set up with two "trash" folders: one for ham-trash and one for spam-trash. Hourly, my system goes through them and uses them to update the Bayes database. However, the script that does this *also* records the overall spam score each message received, as well as whether it was found in the spam trash or the ham trash.

Now that I have a lot of data, I have written a script that tallies it all up and, rather than making me pick a spam-threshold score, lets me merely indicate the false-positive or false-negative rate that I prefer... and then the script figures out what score I need.
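
As a rough illustration of how a target rate could be turned into a score, here is a minimal Python sketch. It is not the script described above. It assumes the recorded data is a list of (score, is_spam) pairs, that a message counts as spam when its score is at or above the threshold, and that candidate thresholds run from -10.0 to 20.0 in steps of 0.1. The function names and that range are hypothetical.

```python
def threshold_for_fp(records, one_in):
    """Lowest threshold whose false-positive rate is at most 1 in `one_in`.

    records: list of (score, is_spam) pairs (assumed data layout).
    Returns None if there is no ham or no candidate qualifies.
    """
    ham = [score for score, is_spam in records if not is_spam]
    if not ham:
        return None
    # Walk upward in tenths so float rounding never accumulates.
    for tenths in range(-100, 201):
        t = tenths / 10
        false_pos = sum(1 for score in ham if score >= t)
        if false_pos * one_in <= len(ham):
            return t
    return None


def threshold_for_fn(records, one_in):
    """Highest threshold whose false-negative rate is at most 1 in `one_in`."""
    spam = [score for score, is_spam in records if is_spam]
    if not spam:
        return None
    # Walk downward: lowering the threshold lets less spam through.
    for tenths in range(200, -101, -1):
        t = tenths / 10
        false_neg = sum(1 for score in spam if score < t)
        if false_neg * one_in <= len(spam):
            return t
    return None
```

Picking the lowest qualifying threshold for a false-positive target (and the highest for a false-negative target) keeps the other error rate as small as the data allows while still meeting the stated preference.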

The idea is that I would indicate my false-positive or false-negative preference, and then the script could run once a week, for example, and adjust my spam-threshold in my SA user preferences.
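
For the weekly adjustment step, something along these lines could rewrite the threshold in a user's preferences. This is only a sketch: it assumes the preferences are a plain-text file in which the threshold is set by a `required_score` line (or the older `required_hits` spelling), and the function name is made up for illustration. It works on the file's text, so reading and writing the actual file is left to the caller.

```python
import re

# Matches an uncommented "required_score N" or "required_hits N" line.
_REQUIRED = re.compile(
    r"^[ \t]*required_(?:score|hits)[ \t]+\S+.*$", re.MULTILINE
)


def set_required_score(prefs_text, threshold):
    """Return prefs_text with the spam threshold set to `threshold`."""
    new_line = f"required_score {threshold:.1f}"
    if _REQUIRED.search(prefs_text):
        # Replace only the first existing setting; leave the rest alone.
        return _REQUIRED.sub(new_line, prefs_text, count=1)
    # No existing setting: append one, adding a newline first if needed.
    sep = "" if (not prefs_text or prefs_text.endswith("\n")) else "\n"
    return prefs_text + sep + new_line + "\n"
```

A weekly cron job could read the file, call this with the freshly computed score, and write the result back.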

Since I'm considering putting this into a complete "auto-tuning" kit for SA, I'm interested in hearing some suggestions.

Right now, my idea is that it would be used through some user-configuration webpage. As such, the user would need to be presented with some scenarios. For that purpose, the script can show you scenarios for a few false-positive and false-negative rates, like this sample output shows. The first three aim for false-positive rates of 1-in-10, 1-in-100, and 1-in-1000, while the next three aim for the same for false-negatives:

  Spam-Threshold: 0.3
  Ham messages lost: 1 in every 10.02
  Spam messages allowed: 1 in every 241.92

  Spam-Threshold: 8.2
  Ham messages lost: 1 in every 118.20
  Spam messages allowed: 1 in every 29.58

  Spam-Threshold: 15
  Ham messages lost: 1 in every 99999.00
  Spam messages allowed: 1 in every 2.44

  Spam-Threshold: 10
  Ham messages lost: 1 in every 147.75
  Spam messages allowed: 1 in every 10.32

  Spam-Threshold: 5.7
  Ham messages lost: 1 in every 45.46
  Spam messages allowed: 1 in every 87.74

  Spam-Threshold: -5.8
  Ham messages lost: 1 in every 1.04
  Spam messages allowed: 1 in every 266.18
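
A report block like the ones above could be produced for any given threshold roughly as follows. This is a sketch under assumptions: the data is a list of (score, is_spam) pairs, a message is treated as spam when its score is at or above the threshold, and the 99999 figure is taken to be a stand-in for "no errors seen in the data", which is a guess about the sample output rather than something stated in it.

```python
def one_in(errors, total, cap=99999.0):
    """Express an error count as "1 in every X", capped at `cap`."""
    if errors == 0 or total == 0:
        return cap  # assumed sentinel for "never observed"
    return min(total / errors, cap)


def scenario(records, threshold):
    """Format the ham-lost and spam-allowed rates for one threshold."""
    ham = [s for s, is_spam in records if not is_spam]
    spam = [s for s, is_spam in records if is_spam]
    false_pos = sum(1 for s in ham if s >= threshold)   # ham treated as spam
    false_neg = sum(1 for s in spam if s < threshold)   # spam let through
    return (
        f"Spam-Threshold: {threshold:g}\n"
        f"Ham messages lost: 1 in every {one_in(false_pos, len(ham)):.2f}\n"
        f"Spam messages allowed: 1 in every {one_in(false_neg, len(spam)):.2f}"
    )
```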


Now, so that this data would be easy for a CGI script to use in a web page, it also outputs the data as comma-separated values, in this format:
"score,<one-FP-in-every-X-messages>,<one-FN-in-every-X-messages>"
0.3,10,241
8.2,118,29
15.0,99999,2
10.0,147,10
5.7,45,87
-5.8,1,266
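
On the consuming side, a CGI script could read that output with something like the sketch below. It assumes each data line has exactly three fields (score, FP "one in every X", FN "one in every X") and silently skips anything else, such as the quoted format line. The function name and dictionary keys are hypothetical.

```python
import csv
import io


def parse_scenarios(text):
    """Parse "score,fp,fn" lines into a list of dicts."""
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        if len(fields) != 3:
            continue  # blank line or the quoted format description
        try:
            score = float(fields[0])
            fp_one_in = int(fields[1])
            fn_one_in = int(fields[2])
        except ValueError:
            continue  # non-numeric line, e.g. an unquoted header
        rows.append(
            {"score": score, "fp_one_in": fp_one_in, "fn_one_in": fn_one_in}
        )
    return rows
```

Each resulting row can then be rendered as one selectable scenario on the configuration page.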

Now, to get a spam-threshold for, say, one FP in every 500 messages, you might pass it a command-line argument of "FP:500" and it would just spit you back a single number. Same would go for a false-negative... passing something like "FN:500".
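
Parsing that kind of argument is simple enough. The following sketch assumes the argument is always "FP" or "FN", a colon, and a positive whole number; accepting lower-case input is an extra convenience not mentioned above, and the function name is made up.

```python
def parse_target(arg):
    """Split an argument like "FP:500" into ("FP", 500)."""
    kind, sep, value = arg.partition(":")
    kind = kind.strip().upper()
    if sep != ":" or kind not in ("FP", "FN"):
        raise ValueError(f"expected FP:<n> or FN:<n>, got {arg!r}")
    one_in = int(value)  # raises ValueError if not a whole number
    if one_in <= 0:
        raise ValueError(f"rate must be positive, got {one_in}")
    return kind, one_in
```

The returned pair says which error rate to target and how rare it should be, which is all the threshold search needs before printing its single number.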

Does anybody else out there envision other ways to use this script? Are there any other features it should have?

- Joe
