Sidney Markowitz <[EMAIL PROTECTED]> writes:

> Think of it this way: Lets pretend that a typical person's vocabulary in 
> their native language consists of 30,000 words. A database of all words 
> that they use would have 30,000 entries.

I've read that 30,000 words was roughly Shakespeare's written vocabulary.
A college graduate might know that many (or more?), but I doubt many
authors (let alone your average email) will use that many in their
writing.  Of course, we're dealing with more than one person here.

The number of combinations won't be that high either: grammar rules out
far more phrases than it allows, and the set of grammatical phrases is
in turn much larger than the set of phrases people actually use.

So, the empirical approach makes a huge amount of sense to me.  Trying
to figure this out any other way is going to be like plugging numbers
into the Drake equation.  Either you'll calculate zero collisions or
100,000, depending on which wild-ass guess you use.  :-)

If you figure out where collisions start happening on average, you'll
be able to pick a better number.  If we go multi-word, then the number
might need to change, of course, but the DB can be broken at that
point, I think.
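
Something like this quick Python sketch would give a rough idea of where
collisions kick in.  The word list path and the SHA-1 truncation are just
stand-ins, not what the DB actually hashes:

    # Count collisions when token hashes are truncated to various bit widths.
    import hashlib

    def collisions(words, bits):
        mask = (1 << bits) - 1          # keep only the low `bits` bits
        seen = set()
        hits = 0
        for w in words:
            h = int(hashlib.sha1(w.encode("utf-8")).hexdigest(), 16) & mask
            if h in seen:
                hits += 1
            else:
                seen.add(h)
        return hits

    words = [line.strip() for line in open("/usr/share/dict/words") if line.strip()]
    for bits in (24, 28, 32, 40):
        print(bits, collisions(words, bits))

Run it against a real token dump instead of a dictionary file and you'd
get actual numbers to pick the width from, rather than a guess.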

-- 
Daniel Quinlan                     anti-spam (SpamAssassin), Linux,
http://www.pathname.com/~quinlan/    and open source consulting
