Justin Mason wrote:
> The bogofilter/CRM-114 forward was pretty clear that collisions in
> multiword token use caused FPs: 'the hash collisions quickly caused
> outrageously bad classification mistakes'.

Yes, but they are using 4 fewer bits in the hash function, and by generating multiword tokens they are multiplying their number of tokens by 16. The collision count grows quadratically with the token count (the birthday effect), so those two changes compound. My numbers show us getting something like 16 collisions (32 tokens) out of 4 million using 40 bits. The same calculation shows on the order of two million colliding tokens when you use a 32-bit hash on a multiword database generated from a 4 million single-token base. No wonder they have a problem. They are ignoring the proper use of hash functions.
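For anyone who wants to check the arithmetic, here is a rough sketch using the standard birthday approximation (expected colliding pairs ~ n^2 / 2^(bits+1), assuming uniformly distributed hashes). The function name and the exact figures are mine, plugged in to mirror the numbers above:

```python
# Birthday-bound estimate of expected hash collisions, assuming a
# uniform hash. Expected colliding pairs ~= n^2 / 2^(bits + 1);
# each pair involves two tokens when the collision rate is small.

def expected_colliding_pairs(n_tokens: int, hash_bits: int) -> float:
    """Approximate expected number of colliding pairs among n_tokens
    random values of a hash_bits-bit hash."""
    return n_tokens ** 2 / 2 ** (hash_bits + 1)

# 4 million single tokens with a 40-bit hash:
pairs_40 = expected_colliding_pairs(4_000_000, 40)
print(pairs_40)   # roughly 7 pairs, i.e. ~15 tokens involved

# 16x the tokens (multiword) with a 32-bit hash:
pairs_32 = expected_colliding_pairs(64_000_000, 32)
print(pairs_32)   # hundreds of thousands of pairs
```

Dropping 8 bits multiplies the pair count by 2^8 = 256, and multiplying the token count by 16 multiplies it by 16^2 = 256 again, so the combined effect is a factor of about 65,000 over the single-token 40-bit case.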


 -- sidney
