Re: [ZODB-Dev] Re: [Zope3-dev] Re: Community opinion about search+filter

2007-03-30 Thread Jim Washington
Dieter Maurer wrote:
 Jim Washington wrote at 2007-3-27 16:28 -0400:
   
 ...
 Yes, I think so, at least in the implementation/algorithm I am using. 
 There may be other implementations that do not need this.  Note,
 however, that the canonical list does not have to be complex objects. 
 The canonical list is just a representation of the unsorted state.  It
 can, for example, be a proxy list of iids, list indexes, or OOBTree
 keys.  The algorithm does not care what it is reordering.
 

 If you need to keep the canonical list around, then sort them
 and then keep the sorted result around (i.e. cache the sorted list).

 This way, you could avoid the factoradic index.


   
Yes, if there is only one ordering that will be used.  Factoradic
indices may be obtained and used for any arbitrary ordering of the
canonical list.  It may be useful, for example, to keep an index that
has books sorted by title, another one sorted by author and title, and
another one in publication date order.

- Jim Washington
___
Zope3-dev mailing list
Zope3-dev@zope.org
Unsub: http://mail.zope.org/mailman/options/zope3-dev/archive%40mail-archive.com



Re: [ZODB-Dev] Re: [Zope3-dev] Re: Community opinion about search+filter

2007-03-29 Thread Dieter Maurer
Lennart Regebro wrote at 2007-3-28 18:25 +0200:
On 3/27/07, Dieter Maurer [EMAIL PROTECTED] wrote:
 However, this approach is only efficient when the sort index size
 is small compared to the result size.

Sure. But with incremental searching, the result size is always one, right? ;-)

No. You want to have one (the best one) hit from a (potentially) large
set of hits. The size of this set is essential whether or not
a sort index is efficient.

The principle is like this:

  Let S be the hit set.
  Assume you sort index consists of values i1  i2   and
  use D(i) for the set of documents indexed under i,
  Then D(i1) intersect S preceeds D(i2) intersects S preceeds
  D(i3) intersects S, etc in the result ordered by the index i.

  If the index size is small compared to S (and if we assume the
  hits are uniformly distributed over the indexed values),
  then each intersection can determine (on average) a significant
  amount of sorted hits. You can efficiently determine the first hits.

  Assume on the other hand that S contains a single element
  and that the index is large, then almost all intersections are
  a vaste of time (as the result is the empty set).


-- 
Dieter
___
Zope3-dev mailing list
Zope3-dev@zope.org
Unsub: http://mail.zope.org/mailman/options/zope3-dev/archive%40mail-archive.com



Re: [ZODB-Dev] Re: [Zope3-dev] Re: Community opinion about search+filter

2007-03-27 Thread Lennart Regebro

On 3/26/07, Dieter Maurer [EMAIL PROTECTED] wrote:

When IncrementalSearch does it, it works roughly as Jim described it
under 1 above (although there is no need that it works the the primary
key directly).


Right.


Unless the indexes are able to return sorted (partial results), we
need to determine the results first and then sort them. And that
takes time at least linear in the number of hits (to determine the sort
values for the documents).


OK, at least this avoids the big intermediate results when searching
over several indexes. But you still have to get all of the results,
and sort them before you can return the X first. I have the impression
that Lucene somehow solves this with their sorting indexes, but I'm
not sure, and I haven't tried to understand the code.

--
Lennart Regebro: Zope and Plone consulting.
http://www.colliberty.com/
+33 661 58 14 64
___
Zope3-dev mailing list
Zope3-dev@zope.org
Unsub: http://mail.zope.org/mailman/options/zope3-dev/archive%40mail-archive.com



Re: [ZODB-Dev] Re: [Zope3-dev] Re: Community opinion about search+filter

2007-03-27 Thread Jim Fulton


On Mar 26, 2007, at 8:45 PM, Martijn Faassen wrote:
...

I think we can safely conclude that:

* there is no silver bullet in all this (your point)


Yes


* there is probably room for improvement.


Yes

...


My hopes:

* is that there is some low-hanging fruit in improving things.


Possibly, including through education.

...


* that we may be able to provide some more infrastructure to help
developers in scaling particular catalog usage scenarios (batching
with sorting).


Possibly, although if this is really a goal, it will require:

- Educating people on how hard this is and

- Providing infrastructure that let's people record use alternate  
document ids or supplementary document ids.  (Not even thinking about  
relevance ranks.)  I suspect that such an infrastructure will be too  
hard to use for many people.


I still get the impression that you think that batching is more of an  
opportunity than it really is.  Yes, it does reduce time  
significantly, but still not enough to achieve scalability.  (It  
really only lets you deal with somewhat larger small result sets.)



I argue these points as you initially gave me the strong impression
saying that there's no point in even talking about all this.


I never said any such thing, In fact, I spent quite a bit of effort  
talking about it.



After
that I thought we were actually going somewhere with this discussion,
but you now strengthen this impression by apparently giving up in
exasparation. That is what is making *me* slightly exasparated. :)


Fine. I gave up because you dismiss literature and theory. You seem  
to imply that in practice there is some kind of magic that will  
somehow make sorts go fast despite any theoretical basis. This  
convinced me that further discussion was a waste of time.


Jim

--
Jim Fulton  mailto:[EMAIL PROTECTED]Python 
Powered!
CTO (540) 361-1714  
http://www.python.org
Zope Corporationhttp://www.zope.com http://www.zope.org



___
Zope3-dev mailing list
Zope3-dev@zope.org
Unsub: http://mail.zope.org/mailman/options/zope3-dev/archive%40mail-archive.com



Re: [ZODB-Dev] Re: [Zope3-dev] Re: Community opinion about search+filter

2007-03-27 Thread Jim Washington
Hi, Martijn

I have a suggestion, only because I have played around with the idea a bit.

Google for python factoradics, and you will get my blog entry about
factoradics.

I see the problem statement as How to obtain batching without
re-sorting multiple times.

If you see a sort order as one permutation of a list, the factoradic
technique provides a key to that permutation.  So, in theory, one would
sort the list, and store the factoradic index for that permutation.  The
next time the particular sort order is requested, it's a simple matter
of unpacking the factoradic. I have not done any tests to see whether
unpacking a factoradic is significantly less expensive than re-sorting. 
Intuitively, it should be.  In practice, I am not so sure.

Anyway, this is FWIW.  :)

-Jim Washington

___
Zope3-dev mailing list
Zope3-dev@zope.org
Unsub: http://mail.zope.org/mailman/options/zope3-dev/archive%40mail-archive.com



Re: [ZODB-Dev] Re: [Zope3-dev] Re: Community opinion about search+filter

2007-03-27 Thread Martijn Faassen

Hey Jim,

On 3/27/07, Jim Fulton [EMAIL PROTECTED] wrote:
[snip]

 After
 that I thought we were actually going somewhere with this discussion,
 but you now strengthen this impression by apparently giving up in
 exasparation. That is what is making *me* slightly exasparated. :)

Fine. I gave up because you dismiss literature and theory. You seem
to imply that in practice there is some kind of magic that will
somehow make sorts go fast despite any theoretical basis. This
convinced me that further discussion was a waste of time.


I can see how you interpreted my statements that way. My apologies; I
did not intend to dismiss literature and theory nor did I wish to give
the impression that I do.

All I tried to express is that algorithmic scalability is just one
aspect of scalability issues, and lower-level design choices and
constant factors do matter in the final execution. I don't know how
much - getting a rough idea requires the comparisons we talked about,
and I intend to do this at some point. :) And theory clearly puts
upper limits on what is possible, and the upper limits (which are not
as high as people would naively expect). I was assuming you understood
that I understood the fact that theory provides a limit as a given. :)

I realize that there is no way to improve things fundamentally in all
cases, and I realize that writing an easy-enough API that helps
matters in at least some cases will be a challenge.

Regards,

Martijn
___
Zope3-dev mailing list
Zope3-dev@zope.org
Unsub: http://mail.zope.org/mailman/options/zope3-dev/archive%40mail-archive.com



Re: [ZODB-Dev] Re: [Zope3-dev] Re: Community opinion about search+filter

2007-03-27 Thread Dieter Maurer
Jim Washington wrote at 2007-3-27 08:24 -0400:
 ...
If you see a sort order as one permutation of a list, the factoradic
technique provides a key to that permutation.  So, in theory, one would
sort the list, and store the factoradic index for that permutation.  The
next time the particular sort order is requested, it's a simple matter
of unpacking the factoradic. I have not done any tests to see whether
unpacking a factoradic is significantly less expensive than re-sorting. 
Intuitively, it should be.  In practice, I am not so sure.

In what way is it different from caching?
The packed factoradic seems like a cache to me.


-- 
Dieter
___
Zope3-dev mailing list
Zope3-dev@zope.org
Unsub: http://mail.zope.org/mailman/options/zope3-dev/archive%40mail-archive.com



Re: [ZODB-Dev] Re: [Zope3-dev] Re: Community opinion about search+filter

2007-03-27 Thread Dieter Maurer
Lennart Regebro wrote at 2007-3-27 11:59 +0200:
 ...
OK, at least this avoids the big intermediate results when searching
over several indexes. But you still have to get all of the results,
and sort them before you can return the X first. I have the impression
that Lucene somehow solves this with their sorting indexes, but I'm
not sure, and I haven't tried to understand the code.

The ZCatalog, too, can use sorting indexes (and AdvancedQuery
which stole the idea from ZCatalog).

However, this approach is only efficient when the sort index size
is small compared to the result size.

If the result size is large, the sorting index can only help you
to get the sorting values quite fast as they are stored compactly
together.



-- 
Dieter
___
Zope3-dev mailing list
Zope3-dev@zope.org
Unsub: http://mail.zope.org/mailman/options/zope3-dev/archive%40mail-archive.com



Re: [ZODB-Dev] Re: [Zope3-dev] Re: Community opinion about search+filter

2007-03-27 Thread Dieter Maurer
Jim Fulton wrote at 2007-3-26 15:55 -0400:
 ...

On Mar 26, 2007, at 3:28 PM, Dieter Maurer wrote:

 Jim Fulton wrote at 2007-3-25 09:53 -0400:

 On Mar 25, 2007, at 3:01 AM, Adam Groszer wrote:
 MF I think one of the main limitations of the current catalog (and
 MF hurry.query) is efficient support for sorting and batching the
 query
 MF results. The Zope 3 catalog returns all matching results, which
 can then
 MF be sorted and batched. This will stop being scalable for large
 MF collections. A relational database is able to do this
 internally, and is
 MF potentially able to use optimizations there.

 What evidence to you have to support this assertion?  We did some
 literature search on this a few years ago and found no special trick
 to avoid sorting costs.

 I know of 2 approaches to reducing sort cost:

 1. Sort your results based on the primary key and therefore, pick
 your primary key to match your sort results.  In terms of the Zope
 catalog framework, the primary keys are the document IDs, which are
 traditionally chosen randomly.  You can pick your primary keys based
 on a desired sort order instead. A variation on this theme is to use
 multiple sets of document ids,  storing multiple sets of ids in each
 index.  Of course, this approach doesn't help with something like
 relevance ranks.

 2. Use an N-best algorithm.  If N is the size of the batch and M is
 the corpus size, then this is O(M*ln(N)) rather than O(M*ln(M)) which
 is a significant improvement if N  M, but still quite expensive.

 The major costs in sorting are usually not the log(n) but
 the very high linear costs fetching the sort keys (although for  
 huge n,
 we will reach the asymptotic limits).

Right. The problem is the N not the log(N). :)


 Under normal conditions, a relational database can be far more  
 efficient
 to fetch values either from index structures or the data records
 than Zope -- as

   * its data representation is much more compact

   * it often supports direct access

   * the server itself can access and process all data.


 With the ZODB, the data is hidden in pickles (less compact), there is
 no direct access (instead the complete pickle need to be decoded)

The catalog sort index mechanism uses the un-index data structure in  
the sort index to get sort keys. This is a pretty compact data  
structure.

The data usually is in IOBuckets which contain 45 values on the average.
In a corresponding relational index structure, you could have several hundreds
of values.

 and
 all operations are done in the client (rather than in the server).

Which is often fine if the desired data are in the client cache.  It  
avoids making the storage a bottleneck.

Yes, but the if is important. Quite often, some operations flush
almost all objects from the cache (partly because the cache is controlled
by the number of objects and not by their size) and after that,
filling the cache again takes ages.

Moreover, a relational database can (and usually does) use caching as
well. It is not restricted to a cliend side only technique.

I know that most relational database backends have a different architecture
than ZEO: use one process (or at least one thread) per connection
such that several activities can interleave the high IO wait times.
The one thread ZEO architecture must take more care not to become the
bottleneck.



-- 
Dieter
___
Zope3-dev mailing list
Zope3-dev@zope.org
Unsub: http://mail.zope.org/mailman/options/zope3-dev/archive%40mail-archive.com



Re: [ZODB-Dev] Re: [Zope3-dev] Re: Community opinion about search+filter

2007-03-27 Thread Jim Washington
Dieter Maurer wrote:
 Jim Washington wrote at 2007-3-27 08:24 -0400:
   
 ...
 If you see a sort order as one permutation of a list, the factoradic
 technique provides a key to that permutation.  So, in theory, one would
 sort the list, and store the factoradic index for that permutation.  The
 next time the particular sort order is requested, it's a simple matter
 of unpacking the factoradic. I have not done any tests to see whether
 unpacking a factoradic is significantly less expensive than re-sorting. 
 Intuitively, it should be.  In practice, I am not so sure.
 

 In what way is it different from caching?
 The packed factoradic seems like a cache to me.


   
Hi, Dieter

A factoradic index is representable as a long integer.  Given that
integer and the canonical list, you can regenerate the permutation
represented by that integer.  So, instead of caching the sorted list
itself, you find and keep this integer, which is all the information
needed to algorithmically re-obtain the sorted list.

So, this would be slower than caching, but (I think) faster than re-sorting.

-Jim Washington

___
Zope3-dev mailing list
Zope3-dev@zope.org
Unsub: http://mail.zope.org/mailman/options/zope3-dev/archive%40mail-archive.com



Re: [ZODB-Dev] Re: [Zope3-dev] Re: Community opinion about search+filter

2007-03-26 Thread Jim Fulton


On Mar 26, 2007, at 3:28 PM, Dieter Maurer wrote:


Jim Fulton wrote at 2007-3-25 09:53 -0400:


On Mar 25, 2007, at 3:01 AM, Adam Groszer wrote:

MF I think one of the main limitations of the current catalog (and
MF hurry.query) is efficient support for sorting and batching the
query
MF results. The Zope 3 catalog returns all matching results, which
can then
MF be sorted and batched. This will stop being scalable for large
MF collections. A relational database is able to do this
internally, and is
MF potentially able to use optimizations there.


What evidence to you have to support this assertion?  We did some
literature search on this a few years ago and found no special trick
to avoid sorting costs.

I know of 2 approaches to reducing sort cost:

1. Sort your results based on the primary key and therefore, pick
your primary key to match your sort results.  In terms of the Zope
catalog framework, the primary keys are the document IDs, which are
traditionally chosen randomly.  You can pick your primary keys based
on a desired sort order instead. A variation on this theme is to use
multiple sets of document ids,  storing multiple sets of ids in each
index.  Of course, this approach doesn't help with something like
relevance ranks.

2. Use an N-best algorithm.  If N is the size of the batch and M is
the corpus size, then this is O(M*ln(N)) rather than O(M*ln(M)) which
is a significant improvement if N  M, but still quite expensive.


The major costs in sorting are usually not the log(n) but
the very high linear costs fetching the sort keys (although for  
huge n,

we will reach the asymptotic limits).


Right. The problem is the N not the log(N). :)


Under normal conditions, a relational database can be far more  
efficient

to fetch values either from index structures or the data records
than Zope -- as

  * its data representation is much more compact

  * it often supports direct access

  * the server itself can access and process all data.


With the ZODB, the data is hidden in pickles (less compact), there is
no direct access (instead the complete pickle need to be decoded)


The catalog sort index mechanism uses the un-index data structure in  
the sort index to get sort keys. This is a pretty compact data  
structure.



and
all operations are done in the client (rather than in the server).


Which is often fine if the desired data are in the client cache.  It  
avoids making the storage a bottleneck.


Jim

--
Jim Fulton  mailto:[EMAIL PROTECTED]Python 
Powered!
CTO (540) 361-1714  
http://www.python.org
Zope Corporationhttp://www.zope.com http://www.zope.org



___
Zope3-dev mailing list
Zope3-dev@zope.org
Unsub: http://mail.zope.org/mailman/options/zope3-dev/archive%40mail-archive.com



Re: [ZODB-Dev] Re: [Zope3-dev] Re: Community opinion about search+filter

2007-03-25 Thread Jim Fulton


On Mar 25, 2007, at 3:01 AM, Adam Groszer wrote:

MF I think one of the main limitations of the current catalog (and
MF hurry.query) is efficient support for sorting and batching the  
query
MF results. The Zope 3 catalog returns all matching results, which  
can then

MF be sorted and batched. This will stop being scalable for large
MF collections. A relational database is able to do this  
internally, and is

MF potentially able to use optimizations there.


What evidence to you have to support this assertion?  We did some  
literature search on this a few years ago and found no special trick  
to avoid sorting costs.


I know of 2 approaches to reducing sort cost:

1. Sort your results based on the primary key and therefore, pick  
your primary key to match your sort results.  In terms of the Zope  
catalog framework, the primary keys are the document IDs, which are  
traditionally chosen randomly.  You can pick your primary keys based  
on a desired sort order instead. A variation on this theme is to use  
multiple sets of document ids,  storing multiple sets of ids in each  
index.  Of course, this approach doesn't help with something like  
relevance ranks.


2. Use an N-best algorithm.  If N is the size of the batch and M is  
the corpus size, then this is O(M*ln(N)) rather than O(M*ln(M)) which  
is a significant improvement if N  M, but still quite expensive.


I don't think relational databases have any magic bullet to get  
around sorting costs.  Sorting is expensive.  In many ways, I think  
the sorting support in the catalog gave people a false sense of  
security.


Jim

--
Jim Fulton  mailto:[EMAIL PROTECTED]Python 
Powered!
CTO (540) 361-1714  
http://www.python.org
Zope Corporationhttp://www.zope.com http://www.zope.org



___
Zope3-dev mailing list
Zope3-dev@zope.org
Unsub: http://mail.zope.org/mailman/options/zope3-dev/archive%40mail-archive.com



Re: [ZODB-Dev] Re: [Zope3-dev] Re: Community opinion about search+filter

2007-03-25 Thread Lennart Regebro

On 3/25/07, Jim Fulton [EMAIL PROTECTED] wrote:

On Mar 25, 2007, at 3:01 AM, Adam Groszer wrote:
 MF I think one of the main limitations of the current catalog (and
 MF hurry.query) is efficient support for sorting and batching the
 query
 MF results. The Zope 3 catalog returns all matching results, which
 can then
 MF be sorted and batched. This will stop being scalable for large
 MF collections. A relational database is able to do this
 internally, and is
 MF potentially able to use optimizations there.

What evidence to you have to support this assertion?  We did some
literature search on this a few years ago and found no special trick
to avoid sorting costs.

I know of 2 approaches to reducing sort cost:

1. Sort your results based on the primary key and therefore, pick
your primary key to match your sort results.  In terms of the Zope
catalog framework, the primary keys are the document IDs, which are
traditionally chosen randomly.  You can pick your primary keys based
on a desired sort order instead. A variation on this theme is to use
multiple sets of document ids,  storing multiple sets of ids in each
index.  Of course, this approach doesn't help with something like
relevance ranks.

2. Use an N-best algorithm.  If N is the size of the batch and M is
the corpus size, then this is O(M*ln(N)) rather than O(M*ln(M)) which
is a significant improvement if N  M, but still quite expensive.

I don't think relational databases have any magic bullet to get
around sorting costs.  Sorting is expensive.  In many ways, I think
the sorting support in the catalog gave people a false sense of
security.


I don't know if relational databases typically does this internally (I
don't think so). However, some search engines do it, like Lucene. And
supposedly also Dieters IncrementalSearch (haven't used it yet).

--
Lennart Regebro: Zope and Plone consulting.
http://www.colliberty.com/
+33 661 58 14 64
___
Zope3-dev mailing list
Zope3-dev@zope.org
Unsub: http://mail.zope.org/mailman/options/zope3-dev/archive%40mail-archive.com



Re: [ZODB-Dev] Re: [Zope3-dev] Re: Community opinion about search+filter

2007-03-25 Thread Jim Fulton


On Mar 25, 2007, at 11:08 AM, Lennart Regebro wrote:
...

2. Use an N-best algorithm.  If N is the size of the batch and M is
the corpus size, then this is O(M*ln(N)) rather than O(M*ln(M)) which
is a significant improvement if N  M, but still quite expensive.

I don't think relational databases have any magic bullet to get
around sorting costs.  Sorting is expensive.  In many ways, I think
the sorting support in the catalog gave people a false sense of
security.


I don't know if relational databases typically does this internally (I
don't think so). However, some search engines do it, like Lucene. And
supposedly also Dieters IncrementalSearch (haven't used it yet).


Our catalog framework also has N-best support.

JIm

--
Jim Fulton  mailto:[EMAIL PROTECTED]Python 
Powered!
CTO (540) 361-1714  
http://www.python.org
Zope Corporationhttp://www.zope.com http://www.zope.org



___
Zope3-dev mailing list
Zope3-dev@zope.org
Unsub: http://mail.zope.org/mailman/options/zope3-dev/archive%40mail-archive.com