On Thu, 15 Feb 2018, RW wrote:

> On Thu, 15 Feb 2018 11:56:55 -0600 (CST)
> sha...@shanew.net wrote:
> > So, the sample size doesn't matter when calculating the probability of
> > a message being spam based on individual tokens, but it can matter
> > when we bring them all together to make a final calculation.
>
> It's not a matter of how they combine, smaller counts just lead to
> less accurate token probabilities.
>
> I'm not saying that it doesn't matter how much you train, I'm saying
> that if you have enough spam and enough ham Bayes is insensitive to
> the ratio.

I agree that past a certain minimum threshold, the ratio doesn't
matter much.  But as I understand it, a larger sample size still
makes a difference.

I haven't checked the math in the Bayes plugin, but it explicitly
mentions using the "chi-square probability combiner", which is
described at http://www.linuxjournal.com/print.php?sid=6467
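
For what it's worth, the combining step that article describes is
Fisher's method run in both directions (once over the token
probabilities, once over their complements), folded into a single
score.  A rough Python sketch, just to make the shape of it concrete --
the names and exact conventions here are mine, not SpamAssassin's:

import math

def chi2_tail(x2, df):
    # Upper-tail probability of a chi-square variable; df must be even,
    # which is the only case Fisher's method needs here.
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def combine(probs):
    # Fisher-style combining of per-token spam probabilities: test the
    # probabilities and their complements separately, then fold the two
    # tails into one score in [0, 1] (near 1 = spammy, near 0 = hammy,
    # near 0.5 = no strong evidence either way).
    n = len(probs)
    probs = [min(max(p, 1e-10), 1.0 - 1e-10) for p in probs]  # avoid log(0)
    h = chi2_tail(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    s = chi2_tail(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    return (1.0 + h - s) / 2.0

The property that matters for this discussion is the middle ground:
when the individual token probabilities hover around 0.5, the two
tails roughly cancel and the score stays near 0.5 rather than being
forced to a verdict.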

Maybe I'm misunderstanding what that article describes, but I'm pretty
sure it boils down to this: when a token's occurrence count is too
small (he uses the phrase "rare words"), the raw calculation produces
probabilities at the extremes (a token that occurs only once, and only
in spam, gets a probability of 1).  The article deals with those
extremely low or extremely high probabilities by smoothing rare tokens
back toward a neutral 0.5, and then combines the adjusted probabilities
using the Fisher calculation (described on the second page of the
article).
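
Concretely, the per-token part looks something like this (the variable
names are mine; s = 1 and x = 0.5 are, if I remember right, the
defaults he suggests for the strength of the prior and the assumed
probability of an unseen token):

def token_prob(spam_count, ham_count, nspam, nham, s=1.0, x=0.5):
    # p(w): how spammy a token looks from its per-corpus frequencies
    # (frequencies, not raw counts, so unequal corpus sizes don't skew it).
    spam_freq = spam_count / nspam if nspam else 0.0
    ham_freq = ham_count / nham if nham else 0.0
    if spam_freq + ham_freq == 0.0:
        return x  # token never seen at all: nothing but the prior to go on
    p = spam_freq / (spam_freq + ham_freq)
    # f(w): blend p(w) with the prior x, weighted by how many times the
    # token has actually been seen (n) against the prior's strength (s).
    n = spam_count + ham_count
    return (s * x + n * p) / (s + n)

So a token seen exactly once, and only in spam, doesn't get a
probability of 1; it gets (0.5 + 1)/2 = 0.75.  Seen 20 times, still
only in spam, it climbs to about 0.98.  That's where sample size comes
back in: the more observations behind a token, the less the prior
waters it down.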

Maybe this is where I'm making a logical leap that I shouldn't, but I
think that "non-rare words" increasingly outnumber "rare words" as the
sample size of messages (and thus tokens) increases.
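
As a toy illustration of that (made-up counts, using the two sketches
above): the same ten spam-only tokens, seen once each versus fifty
times each, give a much more decisive combined score in the
well-trained case.

# Ten tokens seen once each, only in spam: each f(w) is just 0.75, so
# the combined verdict is softer than the evidence really deserves.
rarely_seen = [token_prob(1, 0, 1000, 1000) for _ in range(10)]

# The same tokens after much more training, still only in spam: each
# f(w) is about 0.99, and the combined score is pushed hard toward 1.
well_sampled = [token_prob(50, 0, 1000, 1000) for _ in range(10)]

print(combine(rarely_seen), combine(well_sampled))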


--
Public key #7BBC68D9 at            |                 Shane Williams
http://pgp.mit.edu/                |      System Admin - UT CompSci
=----------------------------------+-------------------------------
All syllogisms contain three lines |              sha...@shanew.net
Therefore this is not a syllogism  | www.ischool.utexas.edu/~shanew
