On Mar 25, 2007, at 3:01 AM, Adam Groszer wrote:
MF> I think one of the main limitations of the current catalog (and
MF> hurry.query) is efficient support for sorting and batching the query MF> results. The Zope 3 catalog returns all matching results, which can then
MF> be sorted and batched. This will stop being scalable for large
MF> collections. A relational database is able to do this internally, and is
MF> potentially able to use optimizations there.

What evidence to you have to support this assertion? We did some literature search on this a few years ago and found no special trick to avoid sorting costs.

I know of 2 approaches to reducing sort cost:

1. Sort your results based on the "primary key" and therefore, pick your primary key to match your sort results. In terms of the Zope catalog framework, the primary keys are the document IDs, which are traditionally chosen randomly. You can pick your primary keys based on a desired sort order instead. A variation on this theme is to use multiple sets of document ids, storing multiple sets of ids in each index. Of course, this approach doesn't help with something like relevance ranks.

2. Use an N-best algorithm. If N is the size of the batch and M is the corpus size, then this is O(M*ln(N)) rather than O(M*ln(M)) which is a significant improvement if N << M, but still quite expensive.

I don't think relational databases have any magic bullet to get around sorting costs. Sorting is expensive. In many ways, I think the sorting support in the catalog gave people a false sense of security.


Jim Fulton                      mailto:[EMAIL PROTECTED]                Python 
CTO                             (540) 361-1714                  
Zope Corporation        http://www.zope.com             http://www.zope.org

For more information about ZODB, see the ZODB Wiki:

ZODB-Dev mailing list  -  ZODB-Dev@zope.org

Reply via email to