So it's worth noting -- CRM-114 has had to adopt special training strategies to avoid collisions when using hashed multiword tokens.
I looked up the details of CRM-114's tokenization and hashing. It generates 16 multiword tokens per parsed token (as the piece you quoted said), and the hash is only 32 bits. The birthday paradox says to expect collisions starting at around twice the square root of the hash space, which here means once you have 2^17 multiword tokens. Since each parsed token expands into 16 multiword tokens, that's equivalent to having only 1/16 as many single tokens in a database, or 2^13.
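
To make the arithmetic concrete, here's a quick back-of-the-envelope sketch in Python (my own illustration, not CRM-114 code; the helper name is made up):

    import math

    def collision_threshold(hash_bits):
        # Collisions become likely around 2 * sqrt(2^bits) items
        # (the 2*sqrt rule of thumb used above).
        return 2 * math.isqrt(2 ** hash_bits)

    threshold = collision_threshold(32)
    print(threshold)        # 131072 == 2^17 multiword tokens
    print(threshold // 16)  # 8192 == 2^13 distinct parsed tokens
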
Think what it would be like if we ran into collisions once we hit a little over 8,000 distinct tokens in the database. No wonder CRM-114 has problems with it.
This isn't rocket science. I don't know what those people are thinking. If you want to use a hash with a low probability of collisions, you use enough bits in the hash function. A good rule of thumb is at least twice as many bits in the hash as it takes to represent the number of items being hashed (i.e., at least 2 * log2(n) bits). So the 40-bit hash I was using allows for a database of around a million unique tokens, and it's my understanding that a large Bayes db would be under half a million. If we go to multiword hashes like CRM-114's that multiply the number of tokens by 16, log2(n) grows by 4 bits, so we should add eight bits to the hash size. The db would then go from 5n bytes for the token hashes (40 bits each) to (n * 16) * 6 = 96n bytes (48 bits each), so it would be 96/5 or almost 20 times bigger. Something to keep in mind if we consider multiword tokens.
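
Again just as an illustration (the token count and helper name are mine, not from any actual db code), the sizing rule and the 96n vs. 5n comparison work out like this:

    import math

    def hash_bits_needed(n_items):
        # Rule of thumb: at least 2 * log2(n) bits in the hash.
        return 2 * math.ceil(math.log2(n_items))

    n = 500_000                      # a large Bayes db, under half a million tokens
    print(hash_bits_needed(n))       # 38 -- a 40-bit hash has a little headroom
    print(hash_bits_needed(n * 16))  # 46 -- so 48 bits, i.e. 8 more
    print(5 * n)                     # 2,500,000 bytes for single-token hashes
    print(16 * n * 6)                # 48,000,000 bytes for multiword hashes
    print((16 * n * 6) / (5 * n))    # 19.2 -- almost 20 times bigger
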
-- sidney
