Re: [ZODB-Dev] Re: [Zope3-dev] Re: Community opinion about search+filter

2007-03-29 Thread Dieter Maurer
Lennart Regebro wrote at 2007-3-28 18:25 +0200:
On 3/27/07, Dieter Maurer [EMAIL PROTECTED] wrote:
 However, this approach is only efficient when the sort index size
 is small compared to the result size.

Sure. But with incremental searching, the result size is always one, right? ;-)

No. You want to have one (the best one) hit from a (potentially) large
set of hits. The size of this set is essential whether or not
a sort index is efficient.

The principle is like this:

  Let S be the hit set.
  Assume you sort index consists of values i1  i2   and
  use D(i) for the set of documents indexed under i,
  Then D(i1) intersect S preceeds D(i2) intersects S preceeds
  D(i3) intersects S, etc in the result ordered by the index i.

  If the index size is small compared to S (and if we assume the
  hits are uniformly distributed over the indexed values),
  then each intersection can determine (on average) a significant
  amount of sorted hits. You can efficiently determine the first hits.

  Assume on the other hand that S contains a single element
  and that the index is large, then almost all intersections are
  a vaste of time (as the result is the empty set).


-- 
Dieter
___
For more information about ZODB, see the ZODB Wiki:
http://www.zope.org/Wikis/ZODB/

ZODB-Dev mailing list  -  ZODB-Dev@zope.org
http://mail.zope.org/mailman/listinfo/zodb-dev


Re: [ZODB-Dev] Re: [Zope3-dev] Re: Community opinion about search+filter

2007-03-27 Thread Dieter Maurer
Jim Fulton wrote at 2007-3-26 15:55 -0400:
 ...

On Mar 26, 2007, at 3:28 PM, Dieter Maurer wrote:

 Jim Fulton wrote at 2007-3-25 09:53 -0400:

 On Mar 25, 2007, at 3:01 AM, Adam Groszer wrote:
 MF I think one of the main limitations of the current catalog (and
 MF hurry.query) is efficient support for sorting and batching the
 query
 MF results. The Zope 3 catalog returns all matching results, which
 can then
 MF be sorted and batched. This will stop being scalable for large
 MF collections. A relational database is able to do this
 internally, and is
 MF potentially able to use optimizations there.

 What evidence to you have to support this assertion?  We did some
 literature search on this a few years ago and found no special trick
 to avoid sorting costs.

 I know of 2 approaches to reducing sort cost:

 1. Sort your results based on the primary key and therefore, pick
 your primary key to match your sort results.  In terms of the Zope
 catalog framework, the primary keys are the document IDs, which are
 traditionally chosen randomly.  You can pick your primary keys based
 on a desired sort order instead. A variation on this theme is to use
 multiple sets of document ids,  storing multiple sets of ids in each
 index.  Of course, this approach doesn't help with something like
 relevance ranks.

 2. Use an N-best algorithm.  If N is the size of the batch and M is
 the corpus size, then this is O(M*ln(N)) rather than O(M*ln(M)) which
 is a significant improvement if N  M, but still quite expensive.

 The major costs in sorting are usually not the log(n) but
 the very high linear costs fetching the sort keys (although for  
 huge n,
 we will reach the asymptotic limits).

Right. The problem is the N not the log(N). :)


 Under normal conditions, a relational database can be far more  
 efficient
 to fetch values either from index structures or the data records
 than Zope -- as

   * its data representation is much more compact

   * it often supports direct access

   * the server itself can access and process all data.


 With the ZODB, the data is hidden in pickles (less compact), there is
 no direct access (instead the complete pickle need to be decoded)

The catalog sort index mechanism uses the un-index data structure in  
the sort index to get sort keys. This is a pretty compact data  
structure.

The data usually is in IOBuckets which contain 45 values on the average.
In a corresponding relational index structure, you could have several hundreds
of values.

 and
 all operations are done in the client (rather than in the server).

Which is often fine if the desired data are in the client cache.  It  
avoids making the storage a bottleneck.

Yes, but the if is important. Quite often, some operations flush
almost all objects from the cache (partly because the cache is controlled
by the number of objects and not by their size) and after that,
filling the cache again takes ages.

Moreover, a relational database can (and usually does) use caching as
well. It is not restricted to a cliend side only technique.

I know that most relational database backends have a different architecture
than ZEO: use one process (or at least one thread) per connection
such that several activities can interleave the high IO wait times.
The one thread ZEO architecture must take more care not to become the
bottleneck.



-- 
Dieter
___
For more information about ZODB, see the ZODB Wiki:
http://www.zope.org/Wikis/ZODB/

ZODB-Dev mailing list  -  ZODB-Dev@zope.org
http://mail.zope.org/mailman/listinfo/zodb-dev


Re: [ZODB-Dev] Re: [Zope3-dev] Re: Community opinion about search+filter

2007-03-26 Thread Jim Fulton


On Mar 26, 2007, at 3:28 PM, Dieter Maurer wrote:


Jim Fulton wrote at 2007-3-25 09:53 -0400:


On Mar 25, 2007, at 3:01 AM, Adam Groszer wrote:

MF I think one of the main limitations of the current catalog (and
MF hurry.query) is efficient support for sorting and batching the
query
MF results. The Zope 3 catalog returns all matching results, which
can then
MF be sorted and batched. This will stop being scalable for large
MF collections. A relational database is able to do this
internally, and is
MF potentially able to use optimizations there.


What evidence to you have to support this assertion?  We did some
literature search on this a few years ago and found no special trick
to avoid sorting costs.

I know of 2 approaches to reducing sort cost:

1. Sort your results based on the primary key and therefore, pick
your primary key to match your sort results.  In terms of the Zope
catalog framework, the primary keys are the document IDs, which are
traditionally chosen randomly.  You can pick your primary keys based
on a desired sort order instead. A variation on this theme is to use
multiple sets of document ids,  storing multiple sets of ids in each
index.  Of course, this approach doesn't help with something like
relevance ranks.

2. Use an N-best algorithm.  If N is the size of the batch and M is
the corpus size, then this is O(M*ln(N)) rather than O(M*ln(M)) which
is a significant improvement if N  M, but still quite expensive.


The major costs in sorting are usually not the log(n) but
the very high linear costs fetching the sort keys (although for  
huge n,

we will reach the asymptotic limits).


Right. The problem is the N not the log(N). :)


Under normal conditions, a relational database can be far more  
efficient

to fetch values either from index structures or the data records
than Zope -- as

  * its data representation is much more compact

  * it often supports direct access

  * the server itself can access and process all data.


With the ZODB, the data is hidden in pickles (less compact), there is
no direct access (instead the complete pickle need to be decoded)


The catalog sort index mechanism uses the un-index data structure in  
the sort index to get sort keys. This is a pretty compact data  
structure.



and
all operations are done in the client (rather than in the server).


Which is often fine if the desired data are in the client cache.  It  
avoids making the storage a bottleneck.


Jim

--
Jim Fulton  mailto:[EMAIL PROTECTED]Python 
Powered!
CTO (540) 361-1714  
http://www.python.org
Zope Corporationhttp://www.zope.com http://www.zope.org



___
For more information about ZODB, see the ZODB Wiki:
http://www.zope.org/Wikis/ZODB/

ZODB-Dev mailing list  -  ZODB-Dev@zope.org
http://mail.zope.org/mailman/listinfo/zodb-dev


Re: [ZODB-Dev] Re: [Zope3-dev] Re: Community opinion about search+filter

2007-03-25 Thread Jim Fulton


On Mar 25, 2007, at 3:01 AM, Adam Groszer wrote:

MF I think one of the main limitations of the current catalog (and
MF hurry.query) is efficient support for sorting and batching the  
query
MF results. The Zope 3 catalog returns all matching results, which  
can then

MF be sorted and batched. This will stop being scalable for large
MF collections. A relational database is able to do this  
internally, and is

MF potentially able to use optimizations there.


What evidence to you have to support this assertion?  We did some  
literature search on this a few years ago and found no special trick  
to avoid sorting costs.


I know of 2 approaches to reducing sort cost:

1. Sort your results based on the primary key and therefore, pick  
your primary key to match your sort results.  In terms of the Zope  
catalog framework, the primary keys are the document IDs, which are  
traditionally chosen randomly.  You can pick your primary keys based  
on a desired sort order instead. A variation on this theme is to use  
multiple sets of document ids,  storing multiple sets of ids in each  
index.  Of course, this approach doesn't help with something like  
relevance ranks.


2. Use an N-best algorithm.  If N is the size of the batch and M is  
the corpus size, then this is O(M*ln(N)) rather than O(M*ln(M)) which  
is a significant improvement if N  M, but still quite expensive.


I don't think relational databases have any magic bullet to get  
around sorting costs.  Sorting is expensive.  In many ways, I think  
the sorting support in the catalog gave people a false sense of  
security.


Jim

--
Jim Fulton  mailto:[EMAIL PROTECTED]Python 
Powered!
CTO (540) 361-1714  
http://www.python.org
Zope Corporationhttp://www.zope.com http://www.zope.org



___
For more information about ZODB, see the ZODB Wiki:
http://www.zope.org/Wikis/ZODB/

ZODB-Dev mailing list  -  ZODB-Dev@zope.org
http://mail.zope.org/mailman/listinfo/zodb-dev


Re: [ZODB-Dev] Re: [Zope3-dev] Re: Community opinion about search+filter

2007-03-25 Thread Jim Fulton


On Mar 25, 2007, at 11:08 AM, Lennart Regebro wrote:
...

2. Use an N-best algorithm.  If N is the size of the batch and M is
the corpus size, then this is O(M*ln(N)) rather than O(M*ln(M)) which
is a significant improvement if N  M, but still quite expensive.

I don't think relational databases have any magic bullet to get
around sorting costs.  Sorting is expensive.  In many ways, I think
the sorting support in the catalog gave people a false sense of
security.


I don't know if relational databases typically does this internally (I
don't think so). However, some search engines do it, like Lucene. And
supposedly also Dieters IncrementalSearch (haven't used it yet).


Our catalog framework also has N-best support.

JIm

--
Jim Fulton  mailto:[EMAIL PROTECTED]Python 
Powered!
CTO (540) 361-1714  
http://www.python.org
Zope Corporationhttp://www.zope.com http://www.zope.org



___
For more information about ZODB, see the ZODB Wiki:
http://www.zope.org/Wikis/ZODB/

ZODB-Dev mailing list  -  ZODB-Dev@zope.org
http://mail.zope.org/mailman/listinfo/zodb-dev