Sidney Markowitz writes:
> Justin Mason wrote:
> > The bogofilter/CRM-114 forward was pretty clear that collisions in
> > multiword token use caused FPs: 'the hash collisions quickly caused
> > outrageously bad classification mistakes'.
> 
> Yes, but they are using 4 bits less in the hash function and multiplying 
> their number of tokens by 16 by generating multiword tokens. And there 
> is an exponential effect. My numbers show us getting something like 16 
> collisions (32 tokens) out of 4 million using 40 bits.
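
For anyone checking the arithmetic: with n tokens and a b-bit hash, the
usual birthday approximation puts the expected number of colliding pairs
at about n*(n-1)/2^(b+1).  A quick, illustrative Python sketch of that
estimate, plugging in the 4-million/40-bit figures quoted above (whether
a colliding pair counts as one collision or two tokens accounts for the
factor-of-two wiggle against Sidney's numbers):

    # Birthday approximation: expected number of colliding pairs when
    # n tokens are hashed into b bits is roughly n*(n-1) / 2^(b+1).
    def expected_pairs(n, bits):
        return n * (n - 1) / 2.0 ** (bits + 1)

    pairs = expected_pairs(4_000_000, 40)
    print(pairs, 2 * pairs)   # ~7 pairs, ~15 tokens -- same ballpark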

Is that using multiword tokens?  Just wondering because you add the
qualification below...

> The same 
> calculations show on the order of two million colliding tokens when you 
> use a 32 bit hash on a multiword database generated from a 4 million 
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

> single token base. No wonder they have a problem. They are ignoring the 
> proper use of hash functions.

I'm curious, because we have been discussing the use of multiword tokens.
(I agree that the numbers make sense, but I'm wondering how much *worse*
the problem gets when you move from single-word to multi-word tokens).
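
For what it's worth, the same birthday approximation answers the "how much
worse" question directly: expected collisions scale as n^2/2^b, so 16x the
tokens at the same hash width costs a factor of 16^2 = 256, and dropping
from 40 to 32 bits on top of that costs another 2^8, for about 65536x
overall.  A quick sketch, using the figures quoted above (the helper is
illustrative, not anyone's actual code):

    # Same estimate, applied to both scenarios from the thread:
    def expected_pairs(n, bits):
        return n * (n - 1) / 2.0 ** (bits + 1)

    single = expected_pairs(4_000_000, 40)       # single-word, 40 bits: ~7 pairs
    multi = expected_pairs(16 * 4_000_000, 32)   # multiword, 32 bits: ~477,000 pairs
    print(multi / single)                        # ~65536x: 16^2 for tokens, 2^8 for bits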

--j.