Re: [Zope-dev] KeywordIndex performance / multiunion
Tim Peters wrote: [Seb Bacon] But main the reason I'm posting is to wonder if there any reason not to use the multiunion operator instead of the union operator in UnIndex.py... it should be faster, right? It seems a touch faster in some informal tests. [Casey Duncan] Yes, it probably should be used. I think it is much faster when unioning very small and very large sets as well. Tim would know better, tho. time.clock() knows best wink. I asked time.clock(), and it says the difference between union and multiunion is not statistically significant in my scenario (7 sets of 27k catalog data record ids, with about 25k in the intersection of all 7). So I'm not going to bother changing this in UnIndex. Seb ___ Zope-Dev maillist - [EMAIL PROTECTED] http://mail.zope.org/mailman/listinfo/zope-dev ** No cross posts or HTML encoding! ** (Related lists - http://mail.zope.org/mailman/listinfo/zope-announce http://mail.zope.org/mailman/listinfo/zope )
Re: [Zope-dev] KeywordIndex performance / multiunion
Casey Duncan wrote: On Thu, 06 Nov 2003 19:11:55 + Seb Bacon [EMAIL PROTECTED] wrote: A simple query for [A or B or C] against a KeywordIndex containing 27k objects is taking about 7 seconds on a Celeron 1.6Ghz, which seems an absurdly long time to me. guess This time may be caused by fetching from the database. If so, then the only way to speed it up is increase the ZODB cache or get faster disks. Try the former and see if it helps. /guess Yup, absolutely right. Upping the cache speeds it up to something sane. However, I don't understand why. The code does something like: set1 = self.index.get(1) set2 = self.index.get(2) sets = [set1, set2] ...so the sets will have come from the ZODB. But the bit which takes the time is the following line: result = multiunion(sets) At which point the sets have already been fetched, no? looking forward to the day I understand ZODB caches...;-) seb ___ Zope-Dev maillist - [EMAIL PROTECTED] http://mail.zope.org/mailman/listinfo/zope-dev ** No cross posts or HTML encoding! ** (Related lists - http://mail.zope.org/mailman/listinfo/zope-announce http://mail.zope.org/mailman/listinfo/zope )
Re: [Zope-dev] KeywordIndex performance / multiunion
On Fri, 07 Nov 2003 12:02:07 + Seb Bacon [EMAIL PROTECTED] wrote: Casey Duncan wrote: On Thu, 06 Nov 2003 19:11:55 + Seb Bacon [EMAIL PROTECTED] wrote: A simple query for [A or B or C] against a KeywordIndex containing 27k objects is taking about 7 seconds on a Celeron 1.6Ghz, which seems an absurdly long time to me. guess This time may be caused by fetching from the database. If so, then the only way to speed it up is increase the ZODB cache or get faster disks. Try the former and see if it helps. /guess Yup, absolutely right. Upping the cache speeds it up to something sane. However, I don't understand why. The code does something like: set1 = self.index.get(1) set2 = self.index.get(2) sets = [set1, set2] ...so the sets will have come from the ZODB. But the bit which takes the time is the following line: These are TreeSets most likely. The actual members of the sets are stored in separate persistent objects. This is done so that large sets can be fetched in chunks rather than all at once. The ZODB tries to be lazy with fetching objects. If an object is very large it often makes sense to split it up between many persistent objects so that each part can be loaded and unloaded separately. This is what BTrees and TreeSets do. When you fetch a value from a BTree or test for an element in a set it only needs to load the part (in this case a Bucket) that contains the element, rather than the whole enchilada which could be huge (in reality its not quite that simple, but that's the general idea). This is why BTrees and TreeSets are used when you need to store and manipulate arbitarily large numbers of elements. result = multiunion(sets) This iterates the sets which loads all the buckets from the database if they are not in the cache. At which point the sets have already been fetched, no? looking forward to the day I understand ZODB caches...;-) Actually this is not so much a function of the cache but a function of the organization of the set objects themselves. -Casey ___ Zope-Dev maillist - [EMAIL PROTECTED] http://mail.zope.org/mailman/listinfo/zope-dev ** No cross posts or HTML encoding! ** (Related lists - http://mail.zope.org/mailman/listinfo/zope-announce http://mail.zope.org/mailman/listinfo/zope )
Re: [Zope-dev] KeywordIndex performance / multiunion
Casey Duncan wrote: guess This time may be caused by fetching from the database. If so, then the only way to speed it up is increase the ZODB cache or get faster disks. Try the former and see if it helps. /guess Upping the cache speeds it up to something sane. However, I don't understand why. These are TreeSets most likely. The actual members of the sets are stored in separate persistent objects. But of course :-) Thanks, seb ___ Zope-Dev maillist - [EMAIL PROTECTED] http://mail.zope.org/mailman/listinfo/zope-dev ** No cross posts or HTML encoding! ** (Related lists - http://mail.zope.org/mailman/listinfo/zope-announce http://mail.zope.org/mailman/listinfo/zope )
Re: [Zope-dev] KeywordIndex performance / multiunion
On Thu, 06 Nov 2003 19:11:55 + Seb Bacon [EMAIL PROTECTED] wrote: A simple query for [A or B or C] against a KeywordIndex containing 27k objects is taking about 7 seconds on a Celeron 1.6Ghz, which seems an absurdly long time to me. guess This time may be caused by fetching from the database. If so, then the only way to speed it up is increase the ZODB cache or get faster disks. Try the former and see if it helps. /guess The bit of the operation that is taking the time is the part of the index which does a union of the results of each of the 3 query parameters. Reproducing this BTree union operation outside Zope in a unit test takes milliseconds. Very occasionally it takes milliseconds inside the KeywordIndex too. Something to do with memory usage perhaps? I'll continue looking into it. Are you saying that sometimes it's faster? If so, that probably supports my guess above. In those times it's all or mostly in memory. But main the reason I'm posting is to wonder if there any reason not to use the multiunion operator instead of the union operator in UnIndex.py... it should be faster, right? It seems a touch faster in some informal tests. Yes, it probably should be used. I think it is much faster when unioning very small and very large sets as well. Tim would know better, tho. -Casey ___ Zope-Dev maillist - [EMAIL PROTECTED] http://mail.zope.org/mailman/listinfo/zope-dev ** No cross posts or HTML encoding! ** (Related lists - http://mail.zope.org/mailman/listinfo/zope-announce http://mail.zope.org/mailman/listinfo/zope )
RE: [Zope-dev] KeywordIndex performance / multiunion
[Seb Bacon] But main the reason I'm posting is to wonder if there any reason not to use the multiunion operator instead of the union operator in UnIndex.py... it should be faster, right? It seems a touch faster in some informal tests. [Casey Duncan] Yes, it probably should be used. I think it is much faster when unioning very small and very large sets as well. Tim would know better, tho. time.clock() knows best wink. multiunion() has a systemic algorithmic advantage when applied to at least 3 input sets: its runtime is proportional to the sum of the sizes of the input sets. A plain union of two sets also has runtime proportional to the sum of the sizes of its input sets, but if all you've got is plain two-argument union, it's not possible to unite N sets in time proportional to the sum of their sizes. You have to pick some pair to unite first, and at worst that result has a size equal to the sum of the sizes of its inputs, and tnat larger intermediate result has to feed in again to a later union (etc). OTOH, two-argument union crawls over its input sets once each. multiunion() crawls over each of its inputs essentially six or seven times each (once to extract the values, once for each of 4 passes of a bytewise radix sort, once over the sorted result to weed out duplicates, and once again over the duplicate-free sorted result to stuff the integers back into a set object). That implies the relative timing results can be data-dependent -- and they are. The more input sets you have, though, the harder it is for a chain of two-input unions to overcome multiunion's ever-growing algorithmic advantage. Of course, if the data isn't already in memory, none of that matters unless you've got many sets to unite (data fetch time will dominate). ___ Zope-Dev maillist - [EMAIL PROTECTED] http://mail.zope.org/mailman/listinfo/zope-dev ** No cross posts or HTML encoding! ** (Related lists - http://mail.zope.org/mailman/listinfo/zope-announce http://mail.zope.org/mailman/listinfo/zope )