Feature Requests item #802341, was opened at 2003-09-08 21:20
Message generated for change (Settings changed) made by anadelonbrin
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498106&aid=802341&group_id=61702
Please note that this message will contain a full copy of the comment
thread, including the initial issue submission, for this request, not just
the latest update.

Category: None
Group: None
Status: Open
Priority: 5
Submitted By: Tony Meyer (anadelonbrin)
Assigned to: Tony Meyer (anadelonbrin)
>Summary: Auto-balancing of ham & spam numbers

Initial Comment:
From [email protected]:

"""
What about adding a feature to the plug-in that would count the number of
messages in each training folder, then use a random subsample of each
folder (spam or ham) as necessary to create a balanced training corpus?
"""

This seems like a reasonable idea (as an option), and might work better
than the experimental imbalance adjustment, which has caused various people
difficulties (because their training data is *very* imbalanced).

What do you think?

----------------------------------------------------------------------

Comment By: Ryan Malayter (rmalayter)
Date: 2003-09-17 09:23
Message:
Logged In: YES user_id=731834

As I mentioned on the main spambayes user mailing list, I am going to
create a script (in VBA, I guess) that will troll through your folders and
create the desired representative subset of messages (as copies)
automatically. We'll see how it makes the filter perform after training,
and whether people like the feature.

I have to figure out how to automatically strip attachments from the
copies... anyone know how to do that in Outlook VBA without destroying the
headers?

Does anybody have a better idea of how to test this feature?

-ryan-

----------------------------------------------------------------------

Comment By: Ryan Malayter (rmalayter)
Date: 2003-09-17 03:41
Message:
Logged In: YES user_id=731834

I guess my reaction would be: disk space is extraordinarily cheap, and spam
messages are generally small. My folder of 2900 spams takes up only 11.8 MB
of space on my Exchange server. I don't think storing "extra" data is a big
issue in the single-user model.

In fact, I think an auto-rotating training corpus-of-copies like the scheme
used by ASSP (see assp.sf.net) is a good idea. It helps age things
properly, keeps a balanced training set, and lets you empty out your "main"
mailbox.

Of course, there is the problem of making SpamBayes training sets and
databases too large to be "portable". This is a large issue with the
Outlook plug-in, since many companies use Windows roaming profiles so users
can log in to any machine. I also remember similar issues with the NFS/AFS
roaming-user system we had in college for the engineering workstations, so
I'm sure Linux/FreeBSD/UNIX sites could have roaming-user problems, too.

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2003-09-16 14:35
Message:
Logged In: YES user_id=31435

Yup, I agree it's fraught with dangers. Note that we'd also need to
remember which msgs were explicitly trained as mistakes or unsures, to help
prevent them from getting mistreated again. For example, I have a few
strange friends I hear from maybe twice a year, and the stuff they send is
so bizarre I have to keep several years' worth of their msgs in my ham
training set (and, yes, I do think it's ham <wink>).

----------------------------------------------------------------------

Comment By: Tony Meyer (anadelonbrin)
Date: 2003-09-16 13:56
Message:
Logged In: YES user_id=552329

Another problem is that these schemes require either keeping spam around or
storing a *lot* more data.

Ryan's scheme below is really two separate things - one is aging out old
data, which has been discussed a few times; the other is randomly selecting
from what's left.

I tend to agree with Mark. I think this might end up like the
experimental_ham_spam_imbalance option and confuse people. Why doesn't x
get a ham score, they ask? Because it was randomly chosen not to be
included in your training data, we answer.

The more I think about it, the more I think that (unless someone comes up
with a new, better experimental_ham_spam_imbalance option), the best option
is simply to warn users if they reach a certain level of imbalance, so that
their attention is drawn to the problem.

If I find the time, I might play around with setting up a test script to
train, then retrain on balanced data, and see how that goes.
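A warning like that could be driven by a simple ratio check on the trained
ham and spam counts. A minimal sketch, assuming those counts are available
from the classifier; the helper name and the 10:1 threshold are just
illustrative guesses, not an existing option:

    def training_imbalance_warning(nham, nspam, max_ratio=10.0):
        # nham/nspam would come from the classifier's trained-message
        # counts; max_ratio is an arbitrary threshold for this sketch.
        if min(nham, nspam) == 0:
            # Trained on only one kind of message (or nothing at all).
            return max(nham, nspam) > 0
        return max(nham, nspam) / float(min(nham, nspam)) > max_ratio

The plug-in could run a check like this after each training session and
only nag the user when it flips to true.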
----------------------------------------------------------------------

Comment By: Mark Hammond (mhammond)
Date: 2003-09-16 01:16
Message:
Logged In: YES user_id=14198

My problem is more with missing ham, and I fear that missing a single ham
could make the difference. Our low false-positive rate is a feature we
should keep :)

It all gets back to the test framework. As Tim is fond of saying, intuition
is a poor guide here.

----------------------------------------------------------------------

Comment By: Ryan Malayter (rmalayter)
Date: 2003-09-16 00:37
Message:
Logged In: YES user_id=731834

The last sentence under part 1) below should read "So we choose our cutoff
date to be 5/13/2003."

----------------------------------------------------------------------

Comment By: Ryan Malayter (rmalayter)
Date: 2003-09-16 00:35
Message:
Logged In: YES user_id=731834

Since I initially came up with this possible feature on the mailing list,
let me add my two cents.

I don't think throwing out any "super-spam" is the right approach, since
there might be some useful "almost-spam" information in there. A spam might
score 100% because it contains 'viagra' and 'lowest' and 'price', fine, and
we already know about those tokens. But the same "super-spammy" message
might contain a new domain name, or a new word like "silagra"; basically
any other information that is useful in the training database.

That said, I think a good algorithm might be based on dates, to make sure
the sampling is representative. I suggest looking at the received date of
the oldest message in each corpus, and choosing the most recent of these
dates. Then we can count all messages from each corpus that are newer than
this date, and finally, take a random subsample of the messages from the
corpus which has "more" new messages. The subsampling can be done on the
fly by using an RNG; you might get an error of a few messages in each
direction, but it won't affect the statistics materially, and it will be
easier to implement than keeping track of a bunch of message-ids.

An example of my proposal:

1) Spam corpus: 1342 messages, oldest is dated 5/13/2003; ham corpus: 6203
messages, oldest is dated 6/19/2002. So we choose our cutoff date to be
5/13/2002.

2) We already know there are 1342 messages in the spam corpus newer than
this date. We also count up 2897 messages in the ham corpus newer than this
date. So we want to choose 1342/2897 = 46.324% of the messages from the ham
corpus newer than 5/13/2003.

3) We tokenize and train on the whole spam corpus. Then we start through
the ham corpus, skipping all messages older than 5/13/2003. If we come
across a message newer than that, we choose a random number between 0 and
1. If the random number is less than 0.46324, we train with the message. At
most we should be off by a few dozen messages from the desired 1342 trained
ham.

This method gives us a balanced training set, with representative spam and
ham messages from the same time-frame.

What do you think?

Regards,
-Ryan-
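A rough sketch of that date-cutoff subsampling, assuming each corpus is
just a list of (received_date, tokens) pairs and that training happens
through some train(tokens, is_spam) callable; those names are illustrative
for the sketch, not the plug-in's actual interface:

    import random

    def train_balanced(spam_msgs, ham_msgs, train):
        # spam_msgs / ham_msgs: non-empty lists of (received_date, tokens)
        # pairs; train: a callable taking (tokens, is_spam).
        # Cutoff = the more recent of the two corpora's oldest dates.
        cutoff = max(min(d for d, _ in spam_msgs),
                     min(d for d, _ in ham_msgs))
        recent_spam = [m for m in spam_msgs if m[0] >= cutoff]
        recent_ham = [m for m in ham_msgs if m[0] >= cutoff]

        # Train on every message from the smaller side...
        smaller, larger, larger_is_spam = recent_spam, recent_ham, False
        if len(recent_ham) < len(recent_spam):
            smaller, larger, larger_is_spam = recent_ham, recent_spam, True
        for _, tokens in smaller:
            train(tokens, not larger_is_spam)

        # ...and an on-the-fly random subsample of the larger side
        # (the 46.324% in the example above).
        fraction = float(len(smaller)) / len(larger)
        for _, tokens in larger:
            if random.random() < fraction:
                train(tokens, larger_is_spam)

As Ryan notes, the subsample will only be approximately the size of the
smaller corpus, but over a few thousand messages that error is noise.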
----------------------------------------------------------------------

Comment By: Leonid (leobru)
Date: 2003-09-13 15:02
Message:
Logged In: YES user_id=790676

I don't know if it is a generally good idea or not, but I forward
everything that scores as 1.00 spam directly to /dev/null (this way there
is no way to train on it). This effectively implements the idea "do not
train on VERY spammy spam". Works for me; about 80% of all messages (or 90%
of all spam) are immediately thrown away, and the ham/spam numbers do not
get skewed. Three months, and not a single non-spam mass mailing in my spam
box (in "unsure" in the worst case).

----------------------------------------------------------------------

Comment By: Mark Hammond (mhammond)
Date: 2003-09-09 01:09
Message:
Logged In: YES user_id=14198

This isn't Outlook specific, so you can have it back :)

The big problem I see is: *which* ones to choose? Skipping spam may be
possible, but skipping a single ham to train on could be a huge problem.
Maybe we could train on all spam, then score all spam, then re-train using
only the least spammy spam - but I think the answer to
http://spambayes.sourceforge.net/faq.html#why-don-t-you-implement-cool-tokenizer-trick-x
may be relevant <wink>
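That two-pass idea would be easy to prototype for the test framework. A
sketch, assuming a classifier object exposing learn(tokens, is_spam) and
spamprob(tokens) along the lines of spambayes.classifier.Classifier; the
function and its keep_fraction knob are invented for illustration:

    def retrain_on_least_spammy(make_classifier, ham_tokens, spam_tokens,
                                keep_fraction=0.5):
        # Pass 1: train a throwaway classifier on everything, then use it
        # to score every spam message.
        scout = make_classifier()
        for toks in ham_tokens:
            scout.learn(toks, False)
        for toks in spam_tokens:
            scout.learn(toks, True)
        scored = [(scout.spamprob(toks), i)
                  for i, toks in enumerate(spam_tokens)]
        scored.sort()

        # Pass 2: rebuild from all ham plus only the least spammy spam.
        n_keep = max(1, int(len(scored) * keep_fraction))
        final = make_classifier()
        for toks in ham_tokens:
            final.learn(toks, False)
        for _, i in scored[:n_keep]:
            final.learn(spam_tokens[i], True)
        return final

Whether the resulting scores actually improve is exactly the kind of
question the test framework would have to answer.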
