First, is there an open bug on Bugzilla that anyone would like to see this discussion moved to? If not, I think whoever gets there first (possibly me) should open one so this is easier to track.
Second, I ran a test last night in which I moved the hashing of tokens from SQL.pm to Bayes.pm. To simplify things, I changed from using a BIGINT holding the low-order 60 bits of the SHA1 to a CHAR(8) holding the Base64 encoding of the low-order 42 bits. That should take the same amount of space, still provides the benefits of a fixed-width field, and keeps enough hash bits to allow about 4 million unique tokens before any collisions are expected. I only ran one test, which also had the SELECT ... token IN (...) optimization. It ran in 17 minutes, which is the same time I got with the BIGINT field without the IN, compared to 14 minutes with BIGINT and IN, and 25 minutes before I made any changes at all.
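
To be concrete, the token hashing would look roughly like the sketch below. This is not the actual patch; it assumes Digest::SHA1 and MIME::Base64, and the number of low-order bytes kept here is just illustrative, not necessarily the exact width I used:

    use Digest::SHA1 qw(sha1);
    use MIME::Base64 qw(encode_base64);

    # Hash a token down to a short fixed-width string suitable for a
    # CHAR(8) column.  sha1() returns the 20-byte binary digest; we keep
    # only the trailing (low-order) bytes and Base64-encode them.
    sub hash_token {
        my ($token) = @_;
        my $digest = sha1($token);         # 20 bytes of binary SHA1
        my $low    = substr($digest, -6);  # low-order 6 bytes (48 bits) -- illustrative width
        return encode_base64($low, '');    # 6 bytes -> exactly 8 Base64 chars, no trailing newline
    }

Doing this in Bayes.pm means SQL.pm only ever sees the short fixed-width strings, which is what lets the token column be a fixed CHAR(8) rather than a varchar.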
So my question is whether a BIGINT token really could be that much faster than a CHAR(8), or whether there is enough variability between test runs that the difference between 17 and 14 minutes is not significant without running these tests many more times.
I'll check that out empirically when I have time, but that will take a while, and maybe you already know the answer.
-- sidney
