Michael Parker wrote:
Question is, is using that value gonna work in the long run
for dbs with 3-4 million tokens?

substr(sha1($token), -5) and CHAR(5), i.e. a 40-bit hash, is good to about 2 million tokens using my no-brainer criterion of expecting no collisions at all.
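For anyone who wants to play with the truncated hash outside of Perl, here is a minimal Python sketch of the same idea. It assumes that sha1() above returns the raw 20-byte digest (not the hex string) and that tokens are encoded as UTF-8; the function name short_token_hash is made up for illustration.

```python
import hashlib

def short_token_hash(token: str, nbytes: int = 5) -> bytes:
    """Return the last nbytes of the raw SHA-1 digest of token.

    Assumption: the token is encoded as UTF-8 before hashing.
    """
    digest = hashlib.sha1(token.encode("utf-8")).digest()  # 20 raw bytes
    return digest[-nbytes:]  # 5 bytes = 40 bits, fits a CHAR(5) column

key = short_token_hash("abc")
print(len(key), key.hex())
```

Passing nbytes=6 gives the 48-bit variant that would go in a CHAR(6) column.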


Ok, time for me to do some math for the non-no-brainer case :-)

[looking things up... writing simulations in Lisp... ok, done]

When you get to 3 million tokens in the db you can expect 1 or 2 collisions. At 4 million, 8 to 16 collisions. A collision means that two tokens which are different are treated as the same token.
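As a rough cross-check, the expected number of colliding pairs can also be estimated in closed form. The sketch below is not the Lisp simulation mentioned above; it is the textbook birthday approximation, assuming the hash values are uniformly distributed, so its figures need not match the simulation results exactly. The function name is hypothetical.

```python
def expected_colliding_pairs(n_tokens: int, hash_bits: int) -> float:
    """Birthday approximation for the expected number of colliding pairs.

    There are n*(n-1)/2 token pairs, and each pair collides with
    probability 2**-hash_bits if hash values are uniform (an assumption).
    """
    pairs = n_tokens * (n_tokens - 1) / 2
    return pairs / 2**hash_bits

for n in (2_000_000, 3_000_000, 4_000_000):
    print(n, round(expected_colliding_pairs(n, 40), 2))
```

The estimate grows with the square of the number of tokens, which is why the collision count climbs so quickly between 3 and 4 million.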

So in the worst case, with a db of 4 million tokens, 32 tokens (16 colliding pairs) are being classified incorrectly. This can only make a difference if one token of a colliding pair is a significant ham or spam sign and the other is either not significant or is significant in the opposite way.

If you pick 150 tokens at random to evaluate a message, the probability of choosing one of the colliding tokens is 150 * (32 / 4000000) = 0.0012

That means that you have roughly a 1 in a thousand chance of having one token out of the 150 contributing an inaccurate number to the calculation of the Bayes score of a message, and about 1 in a million that two of them will be wrong. In practice the probability is even lower, since the 150 most significant tokens are selected and there is a low probability that any of the 32 colliding tokens is one that has a high significance.
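The arithmetic above can be sketched in a few lines of Python. This assumes the 150 tokens are picked independently and uniformly at random, which is the simplification used in the estimate and not how the most significant tokens are really chosen. The function name is made up for illustration.

```python
def colliding_token_odds(picked: int, colliding: int, total: int):
    """Chance that randomly picked tokens include colliding ones.

    Assumes independent, uniform picks (a simplification).
    Returns (linear estimate, P(at least one), P(at least two)).
    """
    p = colliding / total                      # chance a single pick is a colliding token
    linear = picked * p                        # the back-of-envelope figure from the text
    none = (1 - p) ** picked                   # no colliding token among the picks
    exactly_one = picked * p * (1 - p) ** (picked - 1)
    return linear, 1 - none, 1 - none - exactly_one

linear, at_least_one, at_least_two = colliding_token_odds(150, 32, 4_000_000)
print(linear, at_least_one, at_least_two)
```

The linear estimate and the exact "at least one" figure agree closely here because the per-token probability is so small.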

Now if that chance of collision is too high for you or you want to stick to no-brainer numbers, you only need two more bits in the hash to double the no-brainer limit of 2 million up to 4 million. Of course we don't have two bits available without going up another whole byte, which makes it:

substr(sha1($token), -6) and CHAR(6) is no-brainer good to about 32 million tokens.
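The scaling rule behind that number is that the no-brainer limit doubles for every two extra bits of hash. A small sketch, taking the 2 million figure for 40 bits as the given baseline (it is the estimate from this message, not something the code derives); the function name is hypothetical:

```python
def no_brainer_limit(hash_bits: int,
                     base_bits: int = 40,
                     base_limit: int = 2_000_000) -> float:
    """Scale the no-collision token limit with hash width.

    The limit grows with the square root of the hash space, so it
    doubles for every two extra bits. The 40-bit baseline of 2 million
    is taken as given from the discussion above.
    """
    return base_limit * 2 ** ((hash_bits - base_bits) / 2)

for bits in (40, 42, 48):
    print(bits, int(no_brainer_limit(bits)))
```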

Can you try your benchmark with that? I expect that it would get you the same performance as the 40-bit version and still give enough reduction in db size to be worthwhile.

But I think that my numbers show that 40-bit should be ok at 4 million and certainly at 3 million.

-- sidney
