Feature Requests item #802341, was opened at 2003-09-08 21:20
Message generated for change (Settings changed) made by anadelonbrin
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498106&aid=802341&group_id=61702
Please note that this message will contain a full copy of the comment
thread, including the initial issue submission, for this request, not just
the latest update.

Category: None
Group: None
Status: Open
Priority: 5
Submitted By: Tony Meyer (anadelonbrin)
Assigned to: Tony Meyer (anadelonbrin)
>Summary: Auto-balancing of ham & spam numbers

Initial Comment:
From [email protected]:

"""
What about adding a feature to the plug-in that would count the number of
messages in each training folder, then use a random subsample of each
folder (spam or ham) as necessary to create a balanced training corpus?
"""

This seems like a reasonable idea (as an option), and might work better
than the experimental imbalance adjustment, which has caused various people
difficulties (because their training data is *very* imbalanced).

What do you think?

----------------------------------------------------------------------

Comment By: Ryan Malayter (rmalayter)
Date: 2003-09-17 09:23
Message:
Logged In: YES user_id=731834

As I mentioned on the main spambayes user mailing list, I am going to
create a script (in VBA, I guess) that will troll through your folders and
create the desired representative subset of messages (as copies)
automatically. We'll see how it makes the filter perform after training,
and whether people like the feature.

I have to figure out how to automatically strip attachments from the
copies... anyone know how to do that in Outlook VBA without destroying the
headers?

Does anybody have a better idea of how to test this feature?

-ryan-

----------------------------------------------------------------------

Comment By: Ryan Malayter (rmalayter)
Date: 2003-09-17 03:41
Message:
Logged In: YES user_id=731834

I guess my reaction would be: disk space is extraordinarily cheap, and spam
messages are generally small. My folder of 2900 spams takes up only 11.8 MB
of space on my Exchange server. I don't think storing "extra" data is a big
issue in the single-user model.

In fact, I think an auto-rotating training corpus-of-copies like the scheme
used by ASSP (see assp.sf.net) is a good idea. It helps age things
properly, keeps a balanced training set, and lets you empty out your "main"
mailbox.

Of course, there is the problem of making SpamBayes training sets and
databases too large to be "portable". This is a large issue with the
Outlook plug-in, since many companies use Windows roaming profiles so users
can log in to any machine. I also remember similar issues with the NFS/AFS
roaming-user system we had in college for the engineering workstations, so
I'm sure Linux/FreeBSD/UNIX sites could have roaming-user problems, too.

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2003-09-16 14:35
Message:
Logged In: YES user_id=31435

Yup, I agree it's fraught with dangers. Note that we'd also need to
remember which msgs were explicitly trained as mistakes or unsures, to help
prevent them from getting mistreated again. For example, I have a few
strange friends I hear from maybe twice a year, and the stuff they send is
so bizarre I have to keep several years' worth of their msgs in my ham
training set (and, yes, I do think it's ham <wink>).

----------------------------------------------------------------------

Comment By: Tony Meyer (anadelonbrin)
Date: 2003-09-16 13:56
Message:
Logged In: YES user_id=552329

Another problem is that these schemes require either keeping spam around or
storing a *lot* more data.

Ryan's scheme below is really two separate things - one is aging out old
data, which has been discussed a few times; the other is randomly selecting
from what's left.

I tend to agree with Mark. I think this might end up like the
experimental_ham_spam_imbalance option and confuse people. Why doesn't x
get a ham score, they ask? Because it was randomly chosen not to be
included in your training data, we answer.

The more I think about it, the more I think that (unless someone comes up
with a new, better experimental_ham_spam_imbalance option), the best option
is simply to warn users if they reach a certain level of imbalance, so that
their attention is drawn to the problem.

If I find the time, I might play around with setting up a test script to
train, then retrain on balanced data, and see how that goes.
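A warning like that could be driven by a simple ratio check on the trained
ham and spam counts. A minimal sketch, assuming those counts are available
from the classifier; the helper name and the 10:1 threshold are just
illustrative guesses, not an existing option:

    def training_imbalance_warning(nham, nspam, max_ratio=10.0):
        # nham/nspam would come from the classifier's trained-message
        # counts; max_ratio is an arbitrary threshold for this sketch.
        if min(nham, nspam) == 0:
            # Trained on only one kind of message (or nothing at all).
            return max(nham, nspam) > 0
        return max(nham, nspam) / float(min(nham, nspam)) > max_ratio

The plug-in could run a check like this after each training session and
only nag the user when it flips to true.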
----------------------------------------------------------------------

Comment By: Mark Hammond (mhammond)
Date: 2003-09-16 01:16
Message:
Logged In: YES user_id=14198

My problem is more with missing ham, and I fear that missing a single ham
could make the difference. Our low false-positive rate is a feature we
should keep :)

It all gets back to the test framework. As Tim is fond of saying, intuition
is a poor guide here.

----------------------------------------------------------------------

Comment By: Ryan Malayter (rmalayter)
Date: 2003-09-16 00:37
Message:
Logged In: YES user_id=731834

The last sentence under part 1) below should read "So we choose our cutoff
date to be 5/13/2003."

----------------------------------------------------------------------

Comment By: Ryan Malayter (rmalayter)
Date: 2003-09-16 00:35
Message:
Logged In: YES user_id=731834

Since I initially came up with this possible feature on the mailing list,
let me add my two cents.

I don't think throwing out any "super-spam" is the right approach, since
there might be some useful "almost-spam" information in there. A spam might
score 100% because it contains 'viagra' and 'lowest' and 'price', fine, and
we already know about those tokens. But the same "super-spammy" message
might contain a new domain name, or a new word like "silagra"; basically
any other information that is useful in the training database.

That said, I think a good algorithm might be based on dates, to make sure
the sampling is representative. I suggest looking at the received date of
the oldest message in each corpus, and choosing the most recent of these
dates. Then we can count all messages from each corpus that are newer than
this date, and finally, take a random subsample of the messages from the
corpus which has "more" new messages. The subsampling can be done on the
fly by using an RNG; you might get an error of a few messages in each
direction, but it won't affect the statistics materially, and it will be
easier to implement than keeping track of a bunch of message-ids.

An example of my proposal:

1) Spam corpus: 1342 messages, oldest is dated 5/13/2003; ham corpus: 6203
messages, oldest is dated 6/19/2002. So we choose our cutoff date to be
5/13/2002.

2) We already know there are 1342 messages in the spam corpus newer than
this date. We also count up 2897 messages in the ham corpus newer than this
date. So we want to choose 1342/2897 = 46.324% of the messages from the ham
corpus newer than 5/13/2003.

3) We tokenize and train on the whole spam corpus. Then we start through
the ham corpus, skipping all messages older than 5/13/2003. If we come
across a message newer than that, we choose a random number between 0 and
1. If the random number is less than 0.46324, we train with the message. At
most we should be off by a few dozen messages from the desired 1342 trained
ham.

This method gives us a balanced training set, with representative spam and
ham messages from the same time-frame.

What do you think?

Regards,
-Ryan-
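A rough sketch of that date-cutoff subsampling, assuming each corpus is
just a list of (received_date, tokens) pairs and that training happens
through some train(tokens, is_spam) callable; those names are illustrative
for the sketch, not the plug-in's actual interface:

    import random

    def train_balanced(spam_msgs, ham_msgs, train):
        # spam_msgs / ham_msgs: non-empty lists of (received_date, tokens)
        # pairs; train: a callable taking (tokens, is_spam).
        # Cutoff = the more recent of the two corpora's oldest dates.
        cutoff = max(min(d for d, _ in spam_msgs),
                     min(d for d, _ in ham_msgs))
        recent_spam = [m for m in spam_msgs if m[0] >= cutoff]
        recent_ham = [m for m in ham_msgs if m[0] >= cutoff]

        # Train on every message from the smaller side...
        smaller, larger, larger_is_spam = recent_spam, recent_ham, False
        if len(recent_ham) < len(recent_spam):
            smaller, larger, larger_is_spam = recent_ham, recent_spam, True
        for _, tokens in smaller:
            train(tokens, not larger_is_spam)

        # ...and an on-the-fly random subsample of the larger side
        # (the 46.324% in the example above).
        fraction = float(len(smaller)) / len(larger)
        for _, tokens in larger:
            if random.random() < fraction:
                train(tokens, larger_is_spam)

As Ryan notes, the subsample will only be approximately the size of the
smaller corpus, but over a few thousand messages that error is noise.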
----------------------------------------------------------------------

Comment By: Leonid (leobru)
Date: 2003-09-13 15:02
Message:
Logged In: YES user_id=790676

I don't know if it is a generally good idea or not, but I forward
everything that scores as 1.00 spam directly to /dev/null (this way there
is no way to train on it). This effectively implements the idea "do not
train on VERY spammy spam". Works for me; about 80% of all messages (or 90%
of all spam) are immediately thrown away, and the ham/spam numbers do not
get skewed. Three months, and not a single non-spam mass mailing in my spam
box (in "unsure" in the worst case).

----------------------------------------------------------------------

Comment By: Mark Hammond (mhammond)
Date: 2003-09-09 01:09
Message:
Logged In: YES user_id=14198

This isn't Outlook specific, so you can have it back :)

The big problem I see is: *which* ones to choose? Skipping spam may be
possible, but skipping a single ham to train on could be a huge problem.
Maybe we could train on all spam, then score all spam, then re-train using
only the least spammy spam - but I think the answer to
http://spambayes.sourceforge.net/faq.html#why-don-t-you-implement-cool-tokenizer-trick-x
may be relevant <wink>
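That two-pass idea would be easy to prototype for the test framework. A
sketch, assuming a classifier object exposing learn(tokens, is_spam) and
spamprob(tokens) along the lines of spambayes.classifier.Classifier; the
function and its keep_fraction knob are invented for illustration:

    def retrain_on_least_spammy(make_classifier, ham_tokens, spam_tokens,
                                keep_fraction=0.5):
        # Pass 1: train a throwaway classifier on everything, then use it
        # to score every spam message.
        scout = make_classifier()
        for toks in ham_tokens:
            scout.learn(toks, False)
        for toks in spam_tokens:
            scout.learn(toks, True)
        scored = [(scout.spamprob(toks), i)
                  for i, toks in enumerate(spam_tokens)]
        scored.sort()

        # Pass 2: rebuild from all ham plus only the least spammy spam.
        n_keep = max(1, int(len(scored) * keep_fraction))
        final = make_classifier()
        for toks in ham_tokens:
            final.learn(toks, False)
        for _, i in scored[:n_keep]:
            final.learn(spam_tokens[i], True)
        return final

Whether the resulting scores actually improve is exactly the kind of
question the test framework would have to answer.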
