Re: [Zope-dev] catalog performance: query plan

2008-11-10 Thread Lennart Regebro
On Sun, Nov 9, 2008 at 19:58, Roché Compaan [EMAIL PROTECTED] wrote:
 Since I'm in full agreement that we need to fix indexes that are
 problematic, I started doing some benchmarks on the large data set that
 gave us so many headaches. It is probably not surprising that the more
 complex indexes are performing badly. DateRangeIndex, KeywordIndex and
 Plone's ExtendedPathIndex performed the worst. Below are some stats
 showing timings around the apply_index call in Catalog.py that was
 done while testing the application with real data:

ExtendedPathIndex doesn't need fixing, but we need to stop using it.
It's done to support navigation trees from the catalog, but navigation
should not be done via the same catalog as you do other things, but a
dedicated tool. That would simplify and speed things up a lot. But OK,
that's off-topic.

-- 
Lennart Regebro: Zope and Plone consulting.
http://www.colliberty.com/
+33 661 58 14 64
___
Zope-Dev maillist  -  Zope-Dev@zope.org
http://mail.zope.org/mailman/listinfo/zope-dev
**  No cross posts or HTML encoding!  **
(Related lists - 
 http://mail.zope.org/mailman/listinfo/zope-announce
 http://mail.zope.org/mailman/listinfo/zope )


Re: [Zope-dev] catalog performance: query plan

2008-11-10 Thread Lennart Regebro
On Sun, Nov 9, 2008 at 22:29, Matt Hamilton [EMAIL PROTECTED] wrote:
 Lennart Regebro regebro at gmail.com writes:

 I would be interested in seeing a bunch of Gurus sit down at some
 sprint and trying to come up with a catalog engine that is incremental
 and uses query plans. There is no reason that would not be stupidly
 fast. :) We can then make a new catalog that uses this engine but has
 the same API as the old one, to ship with some future version of Zope,
 say 2.12.

 There is the Plone Performance sprint we are hosting in Bristol, UK on the 
 11th
 - 14th Dec.

 http://plone.org/events/sprints/bristol-performance-sprint

 Whilst it is billed as a Plone sprint, of course much of the speedups can be
 done at the Zope level, so Zope-only developers are more than welcome :)

 This is exactly the kind of thing that I like hacking on personally, so would
 love to see it worked on at the sprint.

Cool. I do not have time in December though, so some other time. And
if we could get Dieter Maurer and Helge Tesdal in on this, as they has
experience and understanding of the issues that would be great. That's
probably going to take even more planning, so maybe for a future
performance sprint somewhere?

-- 
Lennart Regebro: Zope and Plone consulting.
http://www.colliberty.com/
+33 661 58 14 64
___
Zope-Dev maillist  -  Zope-Dev@zope.org
http://mail.zope.org/mailman/listinfo/zope-dev
**  No cross posts or HTML encoding!  **
(Related lists - 
 http://mail.zope.org/mailman/listinfo/zope-announce
 http://mail.zope.org/mailman/listinfo/zope )


Re: [Zope-dev] catalog performance: query plan

2008-11-10 Thread Martin Aspeli
Hi Tres,

 Index Name   |Type |Avg Time |Calls/second
 ==
 object_implements|KeywordIndex |0.2172234| 4.6
 
 This is clearly not the same issue as the other KeywordIndexes:  in
 fact, I am astonished that anybody would be using a KeywordIndex for
 this at all.  I would suspect that the real problem here is in the
 appliation, rather than the index itself.

Why? object_implements indexes a list of interface dotted names. Would 
another type of index be more appropriate?

 UID  |FieldIndex   |0.0003070|  3257.1
 
 Note that this is the worst-case scenario for a FieldIndex:  there is
 exactly one value for every key.  This shouldn't be indexed at all, in
 fact, beyond a simple BTree (UID - rid).

Good point. I wonder how many places we use a UID index. UID *metadata* 
is quite important, of course.

 targetUID|FieldIndex   |0.0002287| 4372.12
 
 I don't know what this one is used for, but it should probably be
 scrapped as well.

Me neither ... sounds bogus.

 Title|ZCTextIndex  |0.128|77809.46
 
 This should be removed:  there is no valid use case for doing a
 full-text search restricted only to the title.

I'm pretty amazed that this is a ZCTextIndex as well. I always thought 
it was a FieldIndex.

 Description  |ZCTextIndex  |0.116|86241.39
 
 Again, should be removed.

Right.

 getEmail |ZCTextIndex  |0.113|87849.05
 
 Should *definitely* be removed:  how can you do full-text search on an
 e-mail address?

Surely this is application specific too? I don't think Plone has such an 
index.


 SearchableText   |TextIndex|0.113|88466.69
 
 Where did this one come from?  The 'SearchableText' above is a ZCTextIndex.

It certainly is in vanilla Plone.

Martin


-- 
Author of `Professional Plone Development`, a book for developers who
want to work with Plone. See http://martinaspeli.net/plone-book

___
Zope-Dev maillist  -  Zope-Dev@zope.org
http://mail.zope.org/mailman/listinfo/zope-dev
**  No cross posts or HTML encoding!  **
(Related lists - 
 http://mail.zope.org/mailman/listinfo/zope-announce
 http://mail.zope.org/mailman/listinfo/zope )


Re: [Zope-dev] catalog performance: query plan

2008-11-10 Thread Roché Compaan
On Mon, 2008-11-10 at 18:38 +0200, Hedley Roos wrote:
 Kinda pointless for me to continue since this is turning into a
 Plone-specific discussion on zope-dev. But at least the whole exercise
 has forced us to look in detail into how all these indexes affect
 performance with a zodb with many many objects.
 
 Roche investigated Tesdal's queryplan today end it seems to solve
 nearly all our performance problems. He'll have to elaborate.

Well that is not really true. What solved our performance problems is
not querying on object_implements and getEffective_or_created. I have
previously done benchmarks with query plan and it didn't make any
noticeable difference.

What might be true or is becoming more likely is that indexes are used
where they don't fit the use case rather than that the indexes
themselves need optimisation.

-- 
Roché Compaan
Upfront Systems   http://www.upfrontsystems.co.za

___
Zope-Dev maillist  -  Zope-Dev@zope.org
http://mail.zope.org/mailman/listinfo/zope-dev
**  No cross posts or HTML encoding!  **
(Related lists - 
 http://mail.zope.org/mailman/listinfo/zope-announce
 http://mail.zope.org/mailman/listinfo/zope )


Re: [Zope-dev] catalog performance: query plan

2008-11-10 Thread Roché Compaan
On Mon, 2008-11-10 at 11:08 -0500, Tres Seaver wrote:
  Index Name   |Type |Avg Time |Calls/second
  ==
  object_implements|KeywordIndex |0.2172234| 4.6
 
 This is clearly not the same issue as the other KeywordIndexes:  in
 fact, I am astonished that anybody would be using a KeywordIndex for
 this at all.  I would suspect that the real problem here is in the
 appliation, rather than the index itself.
 
  getEffective_or_creat|DateIndex|0.1941770|5.15
  effectiveRange   |DateRangeIndex   |0.0086295|  115.88
  allowedRolesAndUsers |KeywordIndex |0.0069754|  143.36
 
 Hmm, I'm surprised there:  what query is being passed to 'apply_index'
 for this call?

Well it is not really performing badly at 6ms?

 
  path |ExtendedPathIndex|0.0040614|  246.22
 
 I don't trust the EPI implementation at all.
 
  portal_type  |FieldIndex   |0.0025984|  384.84
 
 This one is surprising:  its performance should be pretty similar to
 the
  other FieldIndexes (e.g., 'review_state') which map a controlled
 vocabulary onto the entire corpus.  Was the query different than
 'review_state' (e.g., multi-valued vs. single-valued)?

It's still not bad at 2ms. It has a lot more keys than review_state
though.

 
  SearchableText   |ZCTextIndex  |0.0007645| 1308.04
  sourceUID|FieldIndex   |0.0004886| 2046.31
 
 Probably bogus, but I don't know how it is used.

I'm not really worried about indexes beyond this point - they're all
returning results in less than a millisecond.

 Can you provide information on the corpus / configuration / test plan
 you used to generate these results?

It's basically a Plone site with 300,000 remember based users and
roughly 150,000 documents and images indexed.

-- 
Roché Compaan
Upfront Systems   http://www.upfrontsystems.co.za

___
Zope-Dev maillist  -  Zope-Dev@zope.org
http://mail.zope.org/mailman/listinfo/zope-dev
**  No cross posts or HTML encoding!  **
(Related lists - 
 http://mail.zope.org/mailman/listinfo/zope-announce
 http://mail.zope.org/mailman/listinfo/zope )


Re: [Zope-dev] catalog performance: query plan

2008-11-10 Thread Laurence Rowe
Lennart Regebro wrote:
 On Sun, Nov 9, 2008 at 19:58, Roché Compaan [EMAIL PROTECTED] wrote:
 Since I'm in full agreement that we need to fix indexes that are
 problematic, I started doing some benchmarks on the large data set that
 gave us so many headaches. It is probably not surprising that the more
 complex indexes are performing badly. DateRangeIndex, KeywordIndex and
 Plone's ExtendedPathIndex performed the worst. Below are some stats
 showing timings around the apply_index call in Catalog.py that was
 done while testing the application with real data:
 
 ExtendedPathIndex doesn't need fixing, but we need to stop using it.
 It's done to support navigation trees from the catalog, but navigation
 should not be done via the same catalog as you do other things, but a
 dedicated tool. That would simplify and speed things up a lot. But OK,
 that's off-topic.
 

I wander if this could be replaced by zc.relationship / plone.relations?

There is potential for removing the five.intid / zope.app.keyreference 
layer of indirection if the actual oid was stored instead, with an index 
to a list of database names packed into the first byte. There would even 
be room to store a reference to the objects class (using the pickle 
protocol 2 registry to convert this to an integer) in the next two or 
three bytes if creating ghosts were useful. This would still leave at 
least 32 bits of space (4 billion) for the actual object id.

Without storing the aq_chain explicitly we would need to ensure that 
__parent__ pointers were pickled for all content objects. The objects 
themselves could be used instead of metadata rows (without a security 
check it would be as simple as loading the oid from the relevant db 
connection). So long as all the required metadata was stored on the 
object itself only one load would be required for each object.

If this same keyreference were used in the indexes of the catalog 
instead of rowids then result sets could be merged.

The downside is that the set intersections would require double the 
memory of the current 32 bit ids.

Laurence

___
Zope-Dev maillist  -  Zope-Dev@zope.org
http://mail.zope.org/mailman/listinfo/zope-dev
**  No cross posts or HTML encoding!  **
(Related lists - 
 http://mail.zope.org/mailman/listinfo/zope-announce
 http://mail.zope.org/mailman/listinfo/zope )


Re: [Zope-dev] catalog performance: query plan

2008-11-09 Thread Matt Hamilton
Lennart Regebro regebro at gmail.com writes:

 I would be interested in seeing a bunch of Gurus sit down at some
 sprint and trying to come up with a catalog engine that is incremental
 and uses query plans. There is no reason that would not be stupidly
 fast. :) We can then make a new catalog that uses this engine but has
 the same API as the old one, to ship with some future version of Zope,
 say 2.12.

There is the Plone Performance sprint we are hosting in Bristol, UK on the 11th
- 14th Dec.

http://plone.org/events/sprints/bristol-performance-sprint

Whilst it is billed as a Plone sprint, of course much of the speedups can be
done at the Zope level, so Zope-only developers are more than welcome :)

This is exactly the kind of thing that I like hacking on personally, so would
love to see it worked on at the sprint.

-Matt

-- 
Matt Hamilton   [EMAIL PROTECTED]
Netsight Internet Solutions, Ltd.   Understand. Develop. Deliver
http://www.netsight.co.uk +44 (0)117 9090901
Web Design | Zope/Plone Development  Consulting | Co-location | Hosting




___
Zope-Dev maillist  -  Zope-Dev@zope.org
http://mail.zope.org/mailman/listinfo/zope-dev
**  No cross posts or HTML encoding!  **
(Related lists - 
 http://mail.zope.org/mailman/listinfo/zope-announce
 http://mail.zope.org/mailman/listinfo/zope )


Re: [Zope-dev] catalog performance: query plan

2008-11-09 Thread Roché Compaan
On Mon, 2008-10-27 at 11:32 -0500, Alan Runyan wrote:
 I agree with Tres.  A lot more can be done with Indexes and Catalog
 without caching.
 
 The most exiciting development in Catalog optimizations comes out
 Jarn.  Helge Tesdal (iirc) did a buncha work  at a RDBMS company when
 he was in college.  He has a protoype of a query plan for ZCatalog.
 
 http://www.jarn.com/blog/catalog-query-plan
 
 I would like to ask Roche and others to look at the Query Plan.

We looked at query plan but it didn't help us in any way. Some catalog
indexes are performing very badly and most of our content is in a
published state which doesn't help the query plan much.

 Caching is a total PITA because invalidation machinery becomes
 overwhelming complex and unwieldly quickly in production.
 

I agree but this was the only thing that we could do to even go into
production.

Since I'm in full agreement that we need to fix indexes that are
problematic, I started doing some benchmarks on the large data set that
gave us so many headaches. It is probably not surprising that the more
complex indexes are performing badly. DateRangeIndex, KeywordIndex and
Plone's ExtendedPathIndex performed the worst. Below are some stats
showing timings around the apply_index call in Catalog.py that was
done while testing the application with real data:

Index Name   |Type |Avg Time |Calls/second
==
object_implements|KeywordIndex |0.2172234| 4.6
getEffective_or_creat|DateIndex|0.1941770|5.15
effectiveRange   |DateRangeIndex   |0.0086295|  115.88
allowedRolesAndUsers |KeywordIndex |0.0069754|  143.36
path |ExtendedPathIndex|0.0040614|  246.22
portal_type  |FieldIndex   |0.0025984|  384.84
SearchableText   |ZCTextIndex  |0.0007645| 1308.04
sourceUID|FieldIndex   |0.0004886| 2046.31
UID  |FieldIndex   |0.0003070|  3257.1
targetUID|FieldIndex   |0.0002287| 4372.12
exact_getUserId  |FieldIndex   |0.0001931| 5177.79
exact_getUserName|FieldIndex   |0.0001816| 5504.39
relationship |FieldIndex   |0.822| 12153.1
id   |FieldIndex   |0.822|12161.81
end  |DateIndex|0.623|16027.48
getGroups|FieldIndex   |0.278|35973.45
getArtistTitle   |FieldIndex   |0.259|38495.53
review_state |FieldIndex   |0.259|38582.22
Subject  |KeywordIndex |0.253|39413.57
getDaysOfTheWeek |KeywordIndex |0.247|40465.98
meta_type|FieldIndex   |0.199|50116.64
exact_getGroupId |FieldIndex   |0.162|61417.51
getVideoURL  |FieldIndex   |0.155| 64447.5
year |FieldIndex   |0.155|64460.43
Title|FieldIndex   |0.136|73381.01
getId|FieldIndex   |0.131|76056.97
Title|ZCTextIndex  |0.128|77809.46
startendrange|DateRangeIndex   |0.127|78485.82
expires  |DateIndex|0.126|79001.59
getObjPositionInParen|FieldIndex   |0.124| 80675.9
targetId |FieldIndex   |0.122|81418.68
effective|DateIndex|0.121| 82651.7
getProvince  |FieldIndex   |0.117|85198.54
month|FieldIndex   |0.116|85762.56
Description  |ZCTextIndex  |0.116|86241.39
Type |FieldIndex   |0.115|86345.17
getLast_login_time   |DateIndex|0.115|86698.98
Creator  |FieldIndex   |0.113|87840.03
getEmail |ZCTextIndex  |0.113|87849.05
cmf_uid  |FieldIndex   |0.113|88352.13
getDuration  |FieldIndex   |0.113|88454.29
SearchableText   |TextIndex|0.113|88466.69
sortable_title   |FieldIndex   |0.112|88698.49
getRating|FieldIndex   |0.112| 88747.5
getGenres|KeywordIndex |0.112|88796.55
object_provides  |KeywordIndex |0.112|88919.43
getEventType |KeywordIndex |0.112| 88953.9
in_reply_to  |FieldIndex   |0.112|89057.46
getReview_state  |FieldIndex   |0.112|89124.63
is_folderish |FieldIndex   |0.112|89240.51
getRawRelatedItems   |KeywordIndex |0.111|89568.91
getThumbSize |FieldIndex   |0.111|89653.89
getStudioCamURL  |FieldIndex   |0.111|89678.92
Date |DateIndex|0.111|89799.23
getHash  |FieldIndex   |0.111|90111.54
getNumberOfComments  |FieldIndex   |0.110|90141.88
start   

Re: [Zope-dev] catalog performance: query plan

2008-11-06 Thread Lennart Regebro
On Mon, Oct 27, 2008 at 17:32, Alan Runyan [EMAIL PROTECTED] wrote:
 I agree with Tres.  A lot more can be done with Indexes and Catalog
 without caching.

 The most exiciting development in Catalog optimizations comes out
 Jarn.  Helge Tesdal (iirc) did a buncha work  at a RDBMS company when
 he was in college.  He has a protoype of a query plan for ZCatalog.

 http://www.jarn.com/blog/catalog-query-plan

 I would like to ask Roche and others to look at the Query Plan.

 Caching is a total PITA because invalidation machinery becomes
 overwhelming complex and unwieldly quickly in production.

I don't know very much about searching, but this definitely sounds
like a good idea. Also, especially when doing free text searching that
has large result sets, incremental searching is very beneficial. I
know Dieter Maurer has made a Zope2 implementation of this.

I would be interested in seeing a bunch of Gurus sit down at some
sprint and trying to come up with a catalog engine that is incremental
and uses query plans. There is no reason that would not be stupidly
fast. :) We can then make a new catalog that uses this engine but has
the same API as the old one, to ship with some future version of Zope,
say 2.12.

-- 
Lennart Regebro: Zope and Plone consulting.
http://www.colliberty.com/
+33 661 58 14 64
___
Zope-Dev maillist  -  Zope-Dev@zope.org
http://mail.zope.org/mailman/listinfo/zope-dev
**  No cross posts or HTML encoding!  **
(Related lists - 
 http://mail.zope.org/mailman/listinfo/zope-announce
 http://mail.zope.org/mailman/listinfo/zope )