I suspect if you used the first 32 bits or first 64 bits of the SHA1 you'd get equally good (perhaps better vs. CRC32) collision rates with the same size.
Some theoretical results: The Birthday Paradox says that you can expect to look at about 2 billion (2^31) hashes before you see any collision with a good random 32 bit hash. So you should not see the 14 collisions that are shown for CRC32 if you used a good cryptographic hash, such as something based on cooking MD5 or SHA-1 down to 32 bits.
Since you are not going to have more than two billion unique tokens in one Bayes database, and even if you did one collision out of that many will not be disastrous, it seems to me that we could use a 32 bit hash if it was a good one. A fixed four bytes per token should really decrease storage requirements and increase speed.
There is no reason to use the slower SHA-1 instead of MD5 when you are not concerned about cryptographic strength. We just want a collision resistant hash. Truncating MD5 to 64 or 32 bits should be fine. There are also the functions at http://burtleburtle.net/bob/hash/ which may be just as good in terms of a non-cryptographic 32 bit hash and might be faster than MD5.
In any case, if we can accept a collision rate of less than one in two billion, we should test using a good 32 bit hash that is not CRC32, rather than just a 64 bit hash.
-- sidney
