On Thu, 15 Feb 2018, RW wrote:
On Thu, 15 Feb 2018 00:01:18 +0100
Reindl Harald wrote:
Am 14.02.2018 um 23:07 schrieb RW:
My point is that an imbalance doesn't create a bias
Wrong - what you meant to say was "doesn't necessarily create a bias"
- but in fact, when the imbalance is too big, *it does*
Simply thinking about how Bayes works makes that clear: each word is
a token with a ham/spam counter - when you have 1 million of one type
and 10,000 of the other type, guess how those counters start to get
biased
As I said, Bayes is based on frequencies.
If a token occurs in 10% of ham and 0.5% of spam based on 10,000 hams
and 10,000 spams, what do you think is likely to happen to those
percentages with 10,000 hams and 1,000,000 spams?
Perhaps it would help to state Bayes' formula explicitly.
The probability that a message is spam given a specific token is equal
to:
(the probability of the token occurring in spam) times (the probability
that a message is spam) divided by (the probability of that token
occurring in all messages)
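Or, in symbols (the standard form of Bayes' theorem):

    P(spam | token) = P(token | spam) * P(spam) / P(token)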
The important feature of this formula is that every value being
operated on is a probability, so if a given token occurs in .5% of
10,000 spams, we would expect it to occur in .5% of 100,000 or
1,000,000 spams as well. If that assumption holds, and the .5%
probability doesn't change, the resulting calculated probability
doesn't change either.
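To see why the corpus size cancels out, here's a minimal sketch in
Python (hypothetical names; the rates are the ones from the example
above, and I'm assuming a training corpus that is half spam):

    # A minimal sketch showing that Bayes' formula depends only on the
    # observed rates, not on the absolute corpus size.
    def p_spam_given_token(p_token_given_spam, p_spam, p_token):
        # Bayes' theorem: P(spam|token) = P(token|spam) * P(spam) / P(token)
        return p_token_given_spam * p_spam / p_token

    p_spam = 0.5           # assume half the training corpus is spam
    p_token_spam = 0.005   # token seen in .5% of spams
    p_token_ham = 0.10     # token seen in 10% of hams
    # Overall token rate, by the law of total probability:
    p_token = p_token_spam * p_spam + p_token_ham * (1 - p_spam)

    # Whether these rates came from 10,000 messages or 1,000,000, the
    # result is identical, because only the percentages enter the formula.
    print(p_spam_given_token(p_token_spam, p_spam, p_token))  # ~0.048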
For actual spam detection, this is complicated by the fact that we end
up with a whole stack of calculated probabilities, one per token
(including the probabilities that a message is non-spam given specific
tokens), and we have to take all of them into account to calculate a
final probability. In this process, it's not unusual for some
individual calculated probabilities to "matter" more than others, and
one basis for how much weight a particular probability gets is how
much we can trust it. Here's where the 10,000 vs. 1,000,000 comes into
play: we can rely on a .5% probability estimated from 1,000,000
samples more than on one estimated from 10,000 samples, and both are
better than a .5% probability out of 100 samples (that said, trust
grows more between 100 samples and 10,000 samples than between 10,000
and 1,000,000, due to diminishing returns).
So, the sample size doesn't matter when calculating the probability of
a message being spam based on individual tokens, but it can matter
when we bring them all together to make a final calculation.
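As a rough illustration of that last point (this is just the standard
error of an estimated proportion, not the actual combining scheme any
particular filter uses), the uncertainty of the .5% estimate shrinks
with the square root of the sample size:

    import math

    # Standard error of an observed proportion p over n samples: a rough
    # proxy for how much we can trust the estimate. Shrinks as sqrt(n).
    def std_error(p, n):
        return math.sqrt(p * (1 - p) / n)

    for n in (100, 10_000, 1_000_000):
        print(f"n={n:>9,}: 0.5% +/- {std_error(0.005, n):.5f}")

    # n=      100: 0.5% +/- 0.00705  (the error swamps the estimate)
    # n=   10,000: 0.5% +/- 0.00071
    # n=1,000,000: 0.5% +/- 0.00007  (100x the data, only 10x the precision)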
--
Public key #7BBC68D9 at | Shane Williams
http://pgp.mit.edu/ | System Admin - UT CompSci
=----------------------------------+-------------------------------
All syllogisms contain three lines | sha...@shanew.net
Therefore this is not a syllogism | www.ischool.utexas.edu/~shanew