Theo Van Dinter <[EMAIL PROTECTED]> writes:

> Well, ok, but I was talking about using hash tokens in the code we
> have now.  For 3.0, we're not going to be replacing DB_File, and we're
> not going to write our own DB module

Perhaps with DB_File, and perhaps not--there are other options like SQL.
In addition, Michael has been experimenting with QDBM, which we could
easily use in this way.

> (frankly I don't think we should do that at all...)

In addition, using hash tokens is more of a requirement due to size
reasons if we do multiple token stuff like CRM114 or DSPAM.

> BTW: I did a little more testing...  Took my 440k token bayes db and
> ran through it using DB_File in a while(...= each ...) loop.  Took 11.4
> seconds.  I then converted the DB to use crc64 hashed keys instead,
> but everything else exactly the same.  Then ran through the read-only
> loop from up above.  11.25 seconds.
>
> So if we combined the read time decrease with the CPU time increase
> from the hashing function, we end up taking an extra 0.2 seconds,

This is a simple benchmark; it could still be much better or much
worse, and you're still neglecting the major disk space benefits of a
32-bit key.

> so it's still not worthwhile given the current code.

That seems like a premature conclusion, although if you want to
conclude we cannot simply slap hashing into our code, then I agree
*that* would not be worth it.  I think the likely lack of significant
overhead shows that this idea is still quite worthy of serious
investigation.

Daniel

-- 
Daniel Quinlan                     anti-spam (SpamAssassin), Linux,
http://www.pathname.com/~quinlan/    and open source consulting
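
To make the key-size argument concrete, here is a minimal sketch of the
hashed-key idea. It is not SpamAssassin code: the function names and the
sample tokens are hypothetical, and it uses CRC-32 from Python's zlib
(standing in for the crc64 used in the benchmark above) simply because
it is readily available and yields a 32-bit key. Distinct tokens can
collide under any 32-bit hash; the sketch ignores that.

```python
import struct
import zlib


def hash_token(token: str) -> bytes:
    """Map a token to a fixed 4-byte key (CRC-32, big-endian).

    Hypothetical helper for illustration; collisions are not handled.
    """
    return struct.pack(">I", zlib.crc32(token.encode("utf-8")) & 0xFFFFFFFF)


def key_bytes(tokens, hashed: bool) -> int:
    """Total bytes of key material needed to store the given tokens."""
    if hashed:
        return sum(len(hash_token(t)) for t in tokens)
    return sum(len(t.encode("utf-8")) for t in tokens)


# Made-up sample tokens, not taken from a real bayes db.
tokens = ["viagra", "unsubscribe", "H*r:mail.example.com", "mortgage"]
raw_size = key_bytes(tokens, hashed=False)
hashed_size = key_bytes(tokens, hashed=True)
print(f"raw keys: {raw_size} bytes, hashed keys: {hashed_size} bytes")
```

Every hashed key costs the same four bytes however long the token is,
so the saving grows with token length, which is why it matters more for
multi-word tokens of the CRM114 or DSPAM kind.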
