Re: [Zope-dev] KeywordIndex performance / multiunion

2003-11-07 Thread Seb Bacon
Tim Peters wrote:
[Seb Bacon]

But main the reason I'm posting is to wonder if there any reason not
to use the multiunion operator instead of the union operator in
UnIndex.py... it should be faster, right?  It seems a touch faster in
some informal tests.


[Casey Duncan]

Yes, it probably should be used. I think it is much faster when
unioning very small and very large sets as well. Tim would know
better, tho.


time.clock() knows best wink.
I asked time.clock(), and it says the difference between union and 
multiunion is not statistically significant in my scenario (7 sets of 
27k catalog data record ids, with about 25k in the intersection of all 7).

So I'm not going to bother changing this in UnIndex.

Seb



___
Zope-Dev maillist  -  [EMAIL PROTECTED]
http://mail.zope.org/mailman/listinfo/zope-dev
**  No cross posts or HTML encoding!  **
(Related lists - 
http://mail.zope.org/mailman/listinfo/zope-announce
http://mail.zope.org/mailman/listinfo/zope )


Re: [Zope-dev] KeywordIndex performance / multiunion

2003-11-07 Thread Seb Bacon
Casey Duncan wrote:
On Thu, 06 Nov 2003 19:11:55 +
Seb Bacon [EMAIL PROTECTED] wrote:
A simple query for [A or B or C] against a KeywordIndex containing 
27k objects is taking about 7 seconds on a Celeron 1.6Ghz, which seems 
an absurdly long time to me.
guess
This time may be caused by fetching from the database. If so, then the
 only way to speed it up is increase the ZODB cache or get faster disks.
 Try the former and see if it helps. /guess
Yup, absolutely right.  Upping the cache speeds it up to something sane. 
 However, I don't understand why.  The code does something like:

 set1 = self.index.get(1)
 set2 = self.index.get(2)
 sets = [set1, set2]
...so the sets will have come from the ZODB.  But the bit which takes 
the time is the following line:

 result = multiunion(sets)

At which point the sets have already been fetched, no?

looking forward to the day I understand ZODB caches...;-)

seb



___
Zope-Dev maillist  -  [EMAIL PROTECTED]
http://mail.zope.org/mailman/listinfo/zope-dev
**  No cross posts or HTML encoding!  **
(Related lists - 
http://mail.zope.org/mailman/listinfo/zope-announce
http://mail.zope.org/mailman/listinfo/zope )


Re: [Zope-dev] KeywordIndex performance / multiunion

2003-11-07 Thread Casey Duncan
On Fri, 07 Nov 2003 12:02:07 +
Seb Bacon [EMAIL PROTECTED] wrote:

 Casey Duncan wrote:
  On Thu, 06 Nov 2003 19:11:55 +
  Seb Bacon [EMAIL PROTECTED] wrote:
 A simple query for [A or B or C] against a KeywordIndex containing 
 27k objects is taking about 7 seconds on a Celeron 1.6Ghz, which seems 
 an absurdly long time to me.
  
  guess
  This time may be caused by fetching from the database. If so, then the
   only way to speed it up is increase the ZODB cache or get faster disks.
   Try the former and see if it helps. /guess
 
 Yup, absolutely right.  Upping the cache speeds it up to something sane. 
   However, I don't understand why.  The code does something like:
 
   set1 = self.index.get(1)
   set2 = self.index.get(2)
   sets = [set1, set2]
 
 ...so the sets will have come from the ZODB.  But the bit which takes 
 the time is the following line:

These are TreeSets most likely. The actual members of the sets are stored in separate 
persistent objects. This is done so that large sets can be fetched in chunks rather 
than all at once.

The ZODB tries to be lazy with fetching objects. If an object is very large it often 
makes sense to split it up between many persistent objects so that each part can be 
loaded and unloaded separately. This is what BTrees and TreeSets do. When you fetch a 
value from a BTree or test for an element in a set it only needs to load the part (in 
this case a Bucket) that contains the element, rather than the whole enchilada which 
could be huge (in reality its not quite that simple, but that's the general idea).

This is why BTrees and TreeSets are used when you need to store and manipulate 
arbitarily large numbers of elements.

   result = multiunion(sets)

This iterates the sets which loads all the buckets from the database if they are not 
in the cache.
 
 At which point the sets have already been fetched, no?
 
 looking forward to the day I understand ZODB caches...;-)

Actually this is not so much a function of the cache but a function of the 
organization of the set objects themselves.

-Casey

___
Zope-Dev maillist  -  [EMAIL PROTECTED]
http://mail.zope.org/mailman/listinfo/zope-dev
**  No cross posts or HTML encoding!  **
(Related lists - 
 http://mail.zope.org/mailman/listinfo/zope-announce
 http://mail.zope.org/mailman/listinfo/zope )


Re: [Zope-dev] KeywordIndex performance / multiunion

2003-11-07 Thread Seb Bacon
Casey Duncan wrote:
guess
This time may be caused by fetching from the database. If so, then the
 only way to speed it up is increase the ZODB cache or get faster disks.
 Try the former and see if it helps. /guess
 Upping the cache speeds it up to something sane. 
 However, I don't understand why. 
These are TreeSets most likely. The actual members of the sets are stored in 
 separate persistent objects.

But of course :-)

Thanks,

seb

___
Zope-Dev maillist  -  [EMAIL PROTECTED]
http://mail.zope.org/mailman/listinfo/zope-dev
**  No cross posts or HTML encoding!  **
(Related lists - 
http://mail.zope.org/mailman/listinfo/zope-announce
http://mail.zope.org/mailman/listinfo/zope )


Re: [Zope-dev] KeywordIndex performance / multiunion

2003-11-06 Thread Casey Duncan
On Thu, 06 Nov 2003 19:11:55 +
Seb Bacon [EMAIL PROTECTED] wrote:

 A simple query for [A or B or C] against a KeywordIndex containing 
 27k objects is taking about 7 seconds on a Celeron 1.6Ghz, which seems 
 an absurdly long time to me.

guess
This time may be caused by fetching from the database. If so, then the only way to 
speed it up is increase the ZODB cache or get faster disks. Try the former and see if 
it helps.
/guess
 
 The bit of the operation that is taking the time is the part of the 
 index which does a union of the results of each of the 3 query 
 parameters.  Reproducing this BTree union operation outside Zope in a 
 unit test takes milliseconds.  Very occasionally it takes milliseconds 
 inside the KeywordIndex too.  Something to do with memory usage perhaps? 
   I'll continue looking into it.

Are you saying that sometimes it's faster? If so, that probably supports my guess 
above. In those times it's all or mostly in memory.
 
 But main the reason I'm posting is to wonder if there any reason not to 
 use the multiunion operator instead of the union operator in 
 UnIndex.py... it should be faster, right?  It seems a touch faster in 
 some informal tests.

Yes, it probably should be used. I think it is much faster when unioning very small 
and very large sets as well. Tim would know better, tho.

-Casey

___
Zope-Dev maillist  -  [EMAIL PROTECTED]
http://mail.zope.org/mailman/listinfo/zope-dev
**  No cross posts or HTML encoding!  **
(Related lists - 
 http://mail.zope.org/mailman/listinfo/zope-announce
 http://mail.zope.org/mailman/listinfo/zope )


RE: [Zope-dev] KeywordIndex performance / multiunion

2003-11-06 Thread Tim Peters
[Seb Bacon]
 But main the reason I'm posting is to wonder if there any reason not
 to use the multiunion operator instead of the union operator in
 UnIndex.py... it should be faster, right?  It seems a touch faster in
 some informal tests.

[Casey Duncan]
 Yes, it probably should be used. I think it is much faster when
 unioning very small and very large sets as well. Tim would know
 better, tho.

time.clock() knows best wink.  multiunion() has a systemic algorithmic
advantage when applied to at least 3 input sets:  its runtime is
proportional to the sum of the sizes of the input sets.  A plain union of
two sets also has runtime proportional to the sum of the sizes of its input
sets, but if all you've got is plain two-argument union, it's not possible
to unite N sets in time proportional to the sum of their sizes.  You have to
pick some pair to unite first, and at worst that result has a size equal to
the sum of the sizes of its inputs, and tnat larger intermediate result has
to feed in again to a later union (etc).

OTOH, two-argument union crawls over its input sets once each.  multiunion()
crawls over each of its inputs essentially six or seven times each (once to
extract the values, once for each of 4 passes of a bytewise radix sort, once
over the sorted result to weed out duplicates, and once again over the
duplicate-free sorted result to stuff the integers back into a set object).

That implies the relative timing results can be data-dependent -- and they
are.  The more input sets you have, though, the harder it is for a chain of
two-input unions to overcome multiunion's ever-growing algorithmic
advantage.

Of course, if the data isn't already in memory, none of that matters unless
you've got many sets to unite (data fetch time will dominate).


___
Zope-Dev maillist  -  [EMAIL PROTECTED]
http://mail.zope.org/mailman/listinfo/zope-dev
**  No cross posts or HTML encoding!  **
(Related lists - 
 http://mail.zope.org/mailman/listinfo/zope-announce
 http://mail.zope.org/mailman/listinfo/zope )