Re: [ZODB-Dev] Re: [Zope3-dev] Re: Community opinion about search+filter
Lennart Regebro wrote at 2007-3-28 18:25 +0200: On 3/27/07, Dieter Maurer [EMAIL PROTECTED] wrote: However, this approach is only efficient when the sort index size is small compared to the result size. Sure. But with incremental searching, the result size is always one, right? ;-) No. You want to have one (the best one) hit from a (potentially) large set of hits. The size of this set is essential whether or not a sort index is efficient. The principle is like this: Let S be the hit set. Assume you sort index consists of values i1 i2 and use D(i) for the set of documents indexed under i, Then D(i1) intersect S preceeds D(i2) intersects S preceeds D(i3) intersects S, etc in the result ordered by the index i. If the index size is small compared to S (and if we assume the hits are uniformly distributed over the indexed values), then each intersection can determine (on average) a significant amount of sorted hits. You can efficiently determine the first hits. Assume on the other hand that S contains a single element and that the index is large, then almost all intersections are a vaste of time (as the result is the empty set). -- Dieter ___ For more information about ZODB, see the ZODB Wiki: http://www.zope.org/Wikis/ZODB/ ZODB-Dev mailing list - ZODB-Dev@zope.org http://mail.zope.org/mailman/listinfo/zodb-dev
Re: [ZODB-Dev] Re: [Zope3-dev] Re: Community opinion about search+filter
Jim Fulton wrote at 2007-3-26 15:55 -0400: ... On Mar 26, 2007, at 3:28 PM, Dieter Maurer wrote: Jim Fulton wrote at 2007-3-25 09:53 -0400: On Mar 25, 2007, at 3:01 AM, Adam Groszer wrote: MF I think one of the main limitations of the current catalog (and MF hurry.query) is efficient support for sorting and batching the query MF results. The Zope 3 catalog returns all matching results, which can then MF be sorted and batched. This will stop being scalable for large MF collections. A relational database is able to do this internally, and is MF potentially able to use optimizations there. What evidence to you have to support this assertion? We did some literature search on this a few years ago and found no special trick to avoid sorting costs. I know of 2 approaches to reducing sort cost: 1. Sort your results based on the primary key and therefore, pick your primary key to match your sort results. In terms of the Zope catalog framework, the primary keys are the document IDs, which are traditionally chosen randomly. You can pick your primary keys based on a desired sort order instead. A variation on this theme is to use multiple sets of document ids, storing multiple sets of ids in each index. Of course, this approach doesn't help with something like relevance ranks. 2. Use an N-best algorithm. If N is the size of the batch and M is the corpus size, then this is O(M*ln(N)) rather than O(M*ln(M)) which is a significant improvement if N M, but still quite expensive. The major costs in sorting are usually not the log(n) but the very high linear costs fetching the sort keys (although for huge n, we will reach the asymptotic limits). Right. The problem is the N not the log(N). :) Under normal conditions, a relational database can be far more efficient to fetch values either from index structures or the data records than Zope -- as * its data representation is much more compact * it often supports direct access * the server itself can access and process all data. With the ZODB, the data is hidden in pickles (less compact), there is no direct access (instead the complete pickle need to be decoded) The catalog sort index mechanism uses the un-index data structure in the sort index to get sort keys. This is a pretty compact data structure. The data usually is in IOBuckets which contain 45 values on the average. In a corresponding relational index structure, you could have several hundreds of values. and all operations are done in the client (rather than in the server). Which is often fine if the desired data are in the client cache. It avoids making the storage a bottleneck. Yes, but the if is important. Quite often, some operations flush almost all objects from the cache (partly because the cache is controlled by the number of objects and not by their size) and after that, filling the cache again takes ages. Moreover, a relational database can (and usually does) use caching as well. It is not restricted to a cliend side only technique. I know that most relational database backends have a different architecture than ZEO: use one process (or at least one thread) per connection such that several activities can interleave the high IO wait times. The one thread ZEO architecture must take more care not to become the bottleneck. -- Dieter ___ For more information about ZODB, see the ZODB Wiki: http://www.zope.org/Wikis/ZODB/ ZODB-Dev mailing list - ZODB-Dev@zope.org http://mail.zope.org/mailman/listinfo/zodb-dev
Re: [ZODB-Dev] Re: [Zope3-dev] Re: Community opinion about search+filter
On Mar 26, 2007, at 3:28 PM, Dieter Maurer wrote: Jim Fulton wrote at 2007-3-25 09:53 -0400: On Mar 25, 2007, at 3:01 AM, Adam Groszer wrote: MF I think one of the main limitations of the current catalog (and MF hurry.query) is efficient support for sorting and batching the query MF results. The Zope 3 catalog returns all matching results, which can then MF be sorted and batched. This will stop being scalable for large MF collections. A relational database is able to do this internally, and is MF potentially able to use optimizations there. What evidence to you have to support this assertion? We did some literature search on this a few years ago and found no special trick to avoid sorting costs. I know of 2 approaches to reducing sort cost: 1. Sort your results based on the primary key and therefore, pick your primary key to match your sort results. In terms of the Zope catalog framework, the primary keys are the document IDs, which are traditionally chosen randomly. You can pick your primary keys based on a desired sort order instead. A variation on this theme is to use multiple sets of document ids, storing multiple sets of ids in each index. Of course, this approach doesn't help with something like relevance ranks. 2. Use an N-best algorithm. If N is the size of the batch and M is the corpus size, then this is O(M*ln(N)) rather than O(M*ln(M)) which is a significant improvement if N M, but still quite expensive. The major costs in sorting are usually not the log(n) but the very high linear costs fetching the sort keys (although for huge n, we will reach the asymptotic limits). Right. The problem is the N not the log(N). :) Under normal conditions, a relational database can be far more efficient to fetch values either from index structures or the data records than Zope -- as * its data representation is much more compact * it often supports direct access * the server itself can access and process all data. With the ZODB, the data is hidden in pickles (less compact), there is no direct access (instead the complete pickle need to be decoded) The catalog sort index mechanism uses the un-index data structure in the sort index to get sort keys. This is a pretty compact data structure. and all operations are done in the client (rather than in the server). Which is often fine if the desired data are in the client cache. It avoids making the storage a bottleneck. Jim -- Jim Fulton mailto:[EMAIL PROTECTED]Python Powered! CTO (540) 361-1714 http://www.python.org Zope Corporationhttp://www.zope.com http://www.zope.org ___ For more information about ZODB, see the ZODB Wiki: http://www.zope.org/Wikis/ZODB/ ZODB-Dev mailing list - ZODB-Dev@zope.org http://mail.zope.org/mailman/listinfo/zodb-dev
Re: [ZODB-Dev] Re: [Zope3-dev] Re: Community opinion about search+filter
On Mar 25, 2007, at 3:01 AM, Adam Groszer wrote: MF I think one of the main limitations of the current catalog (and MF hurry.query) is efficient support for sorting and batching the query MF results. The Zope 3 catalog returns all matching results, which can then MF be sorted and batched. This will stop being scalable for large MF collections. A relational database is able to do this internally, and is MF potentially able to use optimizations there. What evidence to you have to support this assertion? We did some literature search on this a few years ago and found no special trick to avoid sorting costs. I know of 2 approaches to reducing sort cost: 1. Sort your results based on the primary key and therefore, pick your primary key to match your sort results. In terms of the Zope catalog framework, the primary keys are the document IDs, which are traditionally chosen randomly. You can pick your primary keys based on a desired sort order instead. A variation on this theme is to use multiple sets of document ids, storing multiple sets of ids in each index. Of course, this approach doesn't help with something like relevance ranks. 2. Use an N-best algorithm. If N is the size of the batch and M is the corpus size, then this is O(M*ln(N)) rather than O(M*ln(M)) which is a significant improvement if N M, but still quite expensive. I don't think relational databases have any magic bullet to get around sorting costs. Sorting is expensive. In many ways, I think the sorting support in the catalog gave people a false sense of security. Jim -- Jim Fulton mailto:[EMAIL PROTECTED]Python Powered! CTO (540) 361-1714 http://www.python.org Zope Corporationhttp://www.zope.com http://www.zope.org ___ For more information about ZODB, see the ZODB Wiki: http://www.zope.org/Wikis/ZODB/ ZODB-Dev mailing list - ZODB-Dev@zope.org http://mail.zope.org/mailman/listinfo/zodb-dev
Re: [ZODB-Dev] Re: [Zope3-dev] Re: Community opinion about search+filter
On Mar 25, 2007, at 11:08 AM, Lennart Regebro wrote: ... 2. Use an N-best algorithm. If N is the size of the batch and M is the corpus size, then this is O(M*ln(N)) rather than O(M*ln(M)) which is a significant improvement if N M, but still quite expensive. I don't think relational databases have any magic bullet to get around sorting costs. Sorting is expensive. In many ways, I think the sorting support in the catalog gave people a false sense of security. I don't know if relational databases typically does this internally (I don't think so). However, some search engines do it, like Lucene. And supposedly also Dieters IncrementalSearch (haven't used it yet). Our catalog framework also has N-best support. JIm -- Jim Fulton mailto:[EMAIL PROTECTED]Python Powered! CTO (540) 361-1714 http://www.python.org Zope Corporationhttp://www.zope.com http://www.zope.org ___ For more information about ZODB, see the ZODB Wiki: http://www.zope.org/Wikis/ZODB/ ZODB-Dev mailing list - ZODB-Dev@zope.org http://mail.zope.org/mailman/listinfo/zodb-dev