Daniel Quinlan wrote:
I've read that 30,000 words was the written
vocabulary of Shakespeare.

Yes, I picked a number vaguely in the right range because what I said next shows that the exact number doesn't matter. I just wanted something concrete for an example.


So, the empirical approach makes a huge amount of sense to me. Trying
to figure this out any other way is going to be like plugging numbers
into the Drake equation

That was the point I was trying to make. Any reasonable number you pick for an average vocabulary, raised to the fifth power, is too big, so we can't use the combinatorial maximum as a useful upper bound. We would have to get the numbers empirically, by looking at real corpora.
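To make that concrete, here is a rough Python sketch (Python is just my choice for illustration, and "mail_corpus.txt" is a hypothetical file of message bodies, one per line). It compares the combinatorial ceiling for five-word phrases against what a real corpus would actually produce:

    # Combinatorial maximum vs. empirical count of distinct 5-grams.
    from itertools import islice

    VOCAB = 30_000   # assumed average vocabulary size, per the example above
    N = 5            # phrase length in words

    print(f"combinatorial maximum: {VOCAB ** N:.3e}")   # about 2.4e22

    def ngrams(words, n=N):
        # consecutive n-word windows over one message
        return zip(*(islice(words, i, None) for i in range(n)))

    seen = set()
    with open("mail_corpus.txt") as corpus:   # hypothetical corpus file
        for line in corpus:
            seen.update(ngrams(line.lower().split()))

    print(f"distinct 5-grams actually observed: {len(seen)}")

On any real mail corpus the second number will be many orders of magnitude below the first, which is the point: only the empirical count tells us anything useful about database size.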


There is a problem with that, though. If we count on there being a limited number of phrases that show up in email, a much smaller number than the number of random combinations of our vocabulary, then we are vulnerable to spammers inserting random nonsense combinations of words in spam to inflate the database. These random phrases would be hapaxes, but how would we know to ignore them without first recording them to find out that they only appear once?
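Here's a small simulation of that attack, just to show the shape of the problem (Python again, with made-up parameters; the "ham" and "spam" generators below are stand-ins, not real mail):

    # Sketch: injected random word salad fills the database with hapaxes,
    # and we only learn they are hapaxes after we've stored them.
    import random
    from collections import Counter

    vocab = [f"w{i}" for i in range(30_000)]   # stand-in vocabulary
    counts = Counter()

    def add_message(words, n=5):
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1

    # Legitimate mail reuses a small pool of phrasings...
    ham_phrases = [random.choices(vocab[:2000], k=30) for _ in range(20)]
    for _ in range(1000):
        add_message(random.choice(ham_phrases))
    # ...while the spammer pads each message with random word combinations.
    for _ in range(1000):
        add_message(random.choices(vocab, k=30))

    hapaxes = sum(1 for c in counts.values() if c == 1)
    print(f"{len(counts)} distinct 5-grams stored, {hapaxes} of them hapaxes")

Nearly everything stored ends up being spammer-injected one-offs, and there is no way to tell which entries those were until after they have already taken up space in the db.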

If you figure out where collisions start happening on average and then
you'll be able to pick a better number. If we go multi-word, then the

Oh, I think the point where collisions start is clear: you expect roughly one collision when the number of unique items reaches the square root of twice the hash space size (about 1.4 times the square root of the hash space), and the expected number of collisions goes up by a factor of 4 with every additional bit in the number of unique items. We already have a good feel for how many items we can expect in the db when using single tokens. It's empirical data for the n-gram token case that we don't have.
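For the record, the approximation I'm leaning on is the birthday bound: with n unique items hashed into a space of size m, the expected number of collisions is roughly n*(n-1)/(2*m), so it crosses 1 around n = sqrt(2*m) and quadruples each time n doubles. A quick sanity check (Python; the 32-bit hash space is only an example size):

    # Birthday-bound sanity check: expected collisions ~ n*(n-1) / (2*m)
    M = 2 ** 32   # example hash space size (32-bit hash)

    def expected_collisions(n, m=M):
        return n * (n - 1) / (2 * m)

    for bits in (16, 17, 18, 19, 20):
        n = 2 ** bits
        print(f"n = 2^{bits}: expected collisions ~ {expected_collisions(n):.2f}")

    # n = 2^16: ~0.50, 2^17: ~2.00, 2^18: ~8.00, 2^19: ~32.00, 2^20: ~128.00
    # i.e. a factor of 4 for every additional bit in n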


-- sidney
