Jim Fulton wrote at 2007-3-26 15:55 -0400:
>On Mar 26, 2007, at 3:28 PM, Dieter Maurer wrote:
>> Jim Fulton wrote at 2007-3-25 09:53 -0400:
>>> On Mar 25, 2007, at 3:01 AM, Adam Groszer wrote:
>>>> MF> I think one of the main limitations of the current catalog (and
>>>> MF> hurry.query) is efficient support for sorting and batching the
>>>> MF> results. The Zope 3 catalog returns all matching results, which
>>>> can then
>>>> MF> be sorted and batched. This will stop being scalable for large
>>>> MF> collections. A relational database is able to do this
>>>> internally, and is
>>>> MF> potentially able to use optimizations there.
>>> What evidence to you have to support this assertion? We did some
>>> literature search on this a few years ago and found no special trick
>>> to avoid sorting costs.
>>> I know of 2 approaches to reducing sort cost:
>>> 1. Sort your results based on the "primary key" and therefore, pick
>>> your primary key to match your sort results. In terms of the Zope
>>> catalog framework, the primary keys are the document IDs, which are
>>> traditionally chosen randomly. You can pick your primary keys based
>>> on a desired sort order instead. A variation on this theme is to use
>>> multiple sets of document ids, storing multiple sets of ids in each
>>> index. Of course, this approach doesn't help with something like
>>> relevance ranks.
>>> 2. Use an N-best algorithm. If N is the size of the batch and M is
>>> the corpus size, then this is O(M*ln(N)) rather than O(M*ln(M)) which
>>> is a significant improvement if N << M, but still quite expensive.
>> The major costs in sorting are usually not the "log(n)" but
>> the very high linear costs fetching the sort keys (although for
>> huge n,
>> we will reach the asymptotic limits).
>Right. The problem is the "N" not the log(N). :)
>> Under normal conditions, a relational database can be far more
>> to fetch values either from index structures or the data records
>> than Zope -- as
>> * its data representation is much more compact
>> * it often supports direct access
>> * the server itself can access and process all data.
>> With the ZODB, the data is hidden in pickles (less compact), there is
>> no direct access (instead the complete pickle need to be decoded)
>The catalog sort index mechanism uses the un-index data structure in
>the sort index to get sort keys. This is a pretty compact data
The data usually is in IOBuckets which contain 45 values on the average.
In a corresponding relational index structure, you could have several hundreds
>> all operations are done in the client (rather than in the server).
>Which is often fine if the desired data are in the client cache. It
>avoids making the storage a bottleneck.
Yes, but the "if" is important. Quite often, some operations flush
almost all objects from the cache (partly because the cache is controlled
by the number of objects and not by their size) and after that,
filling the cache again takes ages.
Moreover, a relational database can (and usually does) use caching as
well. It is not restricted to a cliend side only technique.
I know that most relational database backends have a different architecture
than ZEO: use one process (or at least one thread) per connection
such that several activities can interleave the high IO wait times.
The one thread ZEO architecture must take more care not to become the
Zope3-dev mailing list