Justin Mason wrote:
I'm curious, because we have been discussing the use of multi-word tokens.
(I agree that the numbers make sense, but I'm wondering how much *worse*
the problem gets when you move from single-word to multi-word tokens.)

I did my calculations for our stuff assuming single tokens. If you combine tokens the way CRM114 does, you generate 16 combined tokens for every token that you parse.
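To make that 16 concrete: CRM114's sparse combining slides a 5-token window over the text and, for each newest token, emits that token together with every subset of the 4 tokens before it, which is 2^4 = 16 features per token parsed. Here is a minimal Python sketch of the idea (my reconstruction; the function name, the <skip> marker, and the exact feature format are illustrative, not CRM114's actual internals):

SKIP = "<skip>"

def sbph_features(tokens, window=5):
    """Yield 16 combined features per parsed token: the newest
    token plus every subset of the 4 tokens preceding it."""
    for i in range(window - 1, len(tokens)):
        context = tokens[i - window + 1 : i]       # the 4 preceding tokens
        newest = tokens[i]
        for mask in range(2 ** (window - 1)):      # 2**4 = 16 subsets
            parts = [context[j] if mask & (1 << j) else SKIP
                     for j in range(window - 1)]
            yield " ".join(parts + [newest])

feats = list(sbph_features("the quick brown fox jumps over".split()))
print(len(feats))   # 32: 16 features each for "jumps" and "over"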


Umm, I just thought of a flaw in my analysis... What CRM114 is doing is not the same as having 16 times as many tokens in the database, which is how I calculated their case; the impact is a _lot_ bigger than that.

Think of it this way: Let's pretend that a typical person's vocabulary in their native language consists of 30,000 words. A database of all the words they use would have 30,000 entries. Now look at all the one-, two-, three-, four-, and five-word phrases that they use. The database of phrases potentially has on the order of 30,000 to the fifth power entries (about 2.4 x 10^22), but it will actually have far fewer than that. How many will it have? I don't know how to answer that question except empirically, by collecting phrases and counting them up. What matters is how many phrases a person actually uses, not how many can be formed by combining words willy-nilly.
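Just to put a number on that upper bound (a back-of-the-envelope calculation, using the 30,000-word vocabulary assumed above):

V = 30_000                                 # assumed vocabulary size
upper = sum(V ** n for n in range(1, 6))   # all 1- to 5-word phrases
print(f"{upper:.2e}")                      # ~2.43e+22, dominated by V**5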

CRM114-style combining of tokens generates 16 times as many combined tokens per message as single tokens. But that is not the same as saying that 16 times as many unique combined tokens go into the database. When the same single token appears near different tokens in each use, you can get an incredible explosion in the number of unique entries in the database, as the toy example below shows. You would have to do something to limit what actually gets saved in the database.
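A toy demonstration of that explosion, reusing the sbph_features sketch from above on a made-up three-message corpus: the same handful of words, seen in different contexts, already yields over sixty distinct combined features.

# Toy corpus (made up for illustration); uses sbph_features()
# from the sketch earlier in this message.
msgs = [
    "buy cheap pills online now",
    "buy cheap watches online today",
    "cheap pills shipped to you now",
]
singles, combined = set(), set()
for m in msgs:
    toks = m.split()
    singles.update(toks)
    combined.update(sbph_features(toks))
print(len(singles), len(combined))   # 10 single tokens, 63 combined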

I have to take a break from this right now, but I want to send this out in its unfinished state to bring up the point that the issue of combining tokens is more complicated than the CRM114 people are making it out to be. I don't think it makes sense to gloss over it by using a hash that is far too small, ignoring the effect of collisions, and then using TOE (train-on-error) to cover up the problems.

For now I think we are fine with single tokens and a 40-bit hash.
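For what it's worth, a quick birthday-paradox estimate (my own back-of-the-envelope, assuming a uniformly distributed hash and illustrative database sizes) shows why 40 bits holds up while a small 32-bit hash gets swamped once combining multiplies the key count by 16:

def expected_collisions(n, bits):
    """Expected colliding pairs among n keys in a bits-bit space:
    C(n, 2) / 2**bits, i.e. roughly n*(n-1) / 2**(bits+1)."""
    return n * (n - 1) / 2 ** (bits + 1)

for bits in (32, 40):
    for n in (1_000_000, 16_000_000):   # single tokens vs. 16x combined
        print(f"{bits}-bit hash, {n:>10,} keys: "
              f"~{expected_collisions(n, bits):,.0f} expected collisions")

At 32 bits, sixteen million keys already give tens of thousands of expected collisions; at 40 bits the same load yields only around a hundred.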

More analysis later...

 -- sidney


