On Tue, Apr 22, 2008 at 04:19:57AM -0400, Daniel Veillard wrote: > On Sun, Apr 20, 2008 at 10:51:38AM +0200, Stefan Behnel wrote: > > Hi again (and sorry for all the noise),
Sorry for getting back to this issue. > > numbers, it works pretty well over a wide range of vocabulary sizes from > > small > > to medium, and as I've shown before, it performs much (much!) better for > > larger sizes. > > > > BTW, Bob Jenkins did a comparison of a couple of hash functions, including > > the > > additive hash (a variant of which is currently used) and the hash function > > used in the patch. > > > > http://burtleburtle.net/bob/hash/doobs.html > > > > The hash function itself was written by Paul Hsieh and published on his web > > site. According to Bob Jenkins, it's public domain (although I didn't ask > > directly). > > > > http://www.azillionmonkeys.com/qed/hash.html > > > > Any objections to getting this patch merged? > > None, looks very good, thanks a lot for assembling all of this and providing > full and convincing numbers :-) > > I will apply this to SVN head now, After looking at some weird regression failures raised by libxslt, I'm afraid I will have to revert this patch at least temporary until we fix 2 problems I found recently: - the first one is a bug in the change, and affects only the case where you use subdictionaries, in the 3 lookup functions we reused the same okey when accessing the subdictionary, but both dictionaries may use different functions depending on their side. So sometimes the key value need to be recomputed when looking for the string in the subdict. the fix is by doing the recomputation if needed as in: if (dict->subdict) { - key = okey % dict->subdict->size; + unsigned long skey; + + /* we cannot always reuse the same okey for the subdict */ + if (((dict->size == MIN_DICT_SIZE) && + (dict->subdict->size != MIN_DICT_SIZE)) || + ((dict->size != MIN_DICT_SIZE) && + (dict->subdict->size == MIN_DICT_SIZE))) + skey = xmlDictComputeKey(dict->subdict, name, len); + else + skey = okey; + + key = skey % dict->subdict->size; - the second one is unfortunately not fixeable as is it comes from the key hash definitions themselves: -#define xmlDictComputeKey(dict, name, len) \ - (((dict)->size == MIN_DICT_SIZE) ? \ - xmlDictComputeFastKey(name, len) : \ - xmlDictComputeBigKey(name, len, len)) \ - -#define xmlDictComputeQKey(dict, prefix, name, len) \ - (((prefix) == NULL) ? \ - (xmlDictComputeKey(dict, name, len)) : \ - (((dict)->size == MIN_DICT_SIZE) ? \ - xmlDictComputeFastQKey(prefix, name, len) : \ - xmlDictComputeBigKey(name, len, len)))) the problem is that basically if you compute the key for a QName as "a:b" you can get 2 different answers, one if you accessed it using "a:b" directly and hence xmlDictComputeKey() or if using "a" prefix and "b" name, given the algorithm the key are not the same, and it is a key property of the dictionary to always return the same exact pointer for the same string. This breaks that property. The old algorithm allows to make that computation without building the assembled full string. In the new one that's harder (doable but harder) I'm afraid there would be quite a bit of nastyness implementing this as the new algorithm really like to access things 4bytes by 4bytes, we risk unaligned accesses, and a very complex algorithm to build this. I will try to look at this again tomorrow but that's not trivial. Daniel -- Red Hat Virtualization group http://redhat.com/virtualization/ Daniel Veillard | virtualization library http://libvirt.org/ [EMAIL PROTECTED] | libxml GNOME XML XSLT toolkit http://xmlsoft.org/ http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/ _______________________________________________ xml mailing list, project page http://xmlsoft.org/ [email protected] http://mail.gnome.org/mailman/listinfo/xml
