-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Roché Compaan wrote: > On Mon, 2008-10-27 at 11:32 -0500, Alan Runyan wrote: >> I agree with Tres. A lot more can be done with Indexes and Catalog >> without caching. >> >> The most exiciting development in Catalog optimizations comes out >> Jarn. Helge Tesdal (iirc) did a buncha work at a RDBMS company when >> he was in college. He has a protoype of a query plan for ZCatalog. >> >> http://www.jarn.com/blog/catalog-query-plan >> >> I would like to ask Roche and others to look at the Query Plan. > > We looked at query plan but it didn't help us in any way. Some catalog > indexes are performing very badly and most of our content is in a > published state which doesn't help the query plan much. > >> Caching is a total PITA because invalidation machinery becomes >> overwhelming complex and unwieldly quickly in production. >> > > I agree but this was the only thing that we could do to even go into > production. > > Since I'm in full agreement that we need to fix indexes that are > problematic, I started doing some benchmarks on the large data set that > gave us so many headaches. It is probably not surprising that the more > complex indexes are performing badly. DateRangeIndex, KeywordIndex and > Plone's ExtendedPathIndex performed the worst. Below are some stats > showing timings around the "apply_index" call in Catalog.py that was > done while testing the application with real data: > > Index Name |Type |Avg Time |Calls/second > ============================================================== > object_implements |KeywordIndex |0.2172234| 4.6
This is clearly not the same issue as the other KeywordIndexes: in fact, I am astonished that anybody would be using a KeywordIndex for this at all. I would suspect that the real problem here is in the appliation, rather than the index itself. > getEffective_or_creat|DateIndex |0.1941770| 5.15 > effectiveRange |DateRangeIndex |0.0086295| 115.88 > allowedRolesAndUsers |KeywordIndex |0.0069754| 143.36 Hmm, I'm surprised there: what query is being passed to 'apply_index' for this call? > path |ExtendedPathIndex|0.0040614| 246.22 I don't trust the EPI implementation at all. > portal_type |FieldIndex |0.0025984| 384.84 This one is surprising: its performance should be pretty similar to the other FieldIndexes (e.g., 'review_state') which map a controlled vocabulary onto the entire corpus. Was the query different than 'review_state' (e.g., multi-valued vs. single-valued)? > SearchableText |ZCTextIndex |0.0007645| 1308.04 > sourceUID |FieldIndex |0.0004886| 2046.31 Probably bogus, but I don't know how it is used. > UID |FieldIndex |0.0003070| 3257.1 Note that this is the worst-case scenario for a FieldIndex: there is exactly one value for every key. This shouldn't be "indexed" at all, in fact, beyond a simple BTree (UID -> rid). > targetUID |FieldIndex |0.0002287| 4372.12 I don't know what this one is used for, but it should probably be scrapped as well. > exact_getUserId |FieldIndex |0.0001931| 5177.79 > exact_getUserName |FieldIndex |0.0001816| 5504.39 I don't know how the application uses either of those indexes, but they are almost certainly bogus in any normal catalog. > relationship |FieldIndex |0.0000822| 12153.1 > id |FieldIndex |0.0000822| 12161.81 > end |DateIndex |0.0000623| 16027.48 > getGroups |FieldIndex |0.0000278| 35973.45 This is almost certainly bogus: FieldIndex is not supposed to be used with multi-valued terms. > getArtistTitle |FieldIndex |0.0000259| 38495.53 > review_state |FieldIndex |0.0000259| 38582.22 > Subject |KeywordIndex |0.0000253| 39413.57 This is the use-case for which KeywordIndex is designed. Was the query just a single term, by chance? > getDaysOfTheWeek |KeywordIndex |0.0000247| 40465.98 > meta_type |FieldIndex |0.0000199| 50116.64 > exact_getGroupId |FieldIndex |0.0000162| 61417.51 > getVideoURL |FieldIndex |0.0000155| 64447.5 > year |FieldIndex |0.0000155| 64460.43 > Title |FieldIndex |0.0000136| 73381.01 > getId |FieldIndex |0.0000131| 76056.97 > Title |ZCTextIndex |0.0000128| 77809.46 This should be removed: there is no valid use case for doing a full-text search restricted only to the title. > startendrange |DateRangeIndex |0.0000127| 78485.82 > expires |DateIndex |0.0000126| 79001.59 > getObjPositionInParen|FieldIndex |0.0000124| 80675.9 > targetId |FieldIndex |0.0000122| 81418.68 > effective |DateIndex |0.0000121| 82651.7 > getProvince |FieldIndex |0.0000117| 85198.54 > month |FieldIndex |0.0000116| 85762.56 > Description |ZCTextIndex |0.0000116| 86241.39 Again, should be removed. > Type |FieldIndex |0.0000115| 86345.17 > getLast_login_time |DateIndex |0.0000115| 86698.98 > Creator |FieldIndex |0.0000113| 87840.03 > getEmail |ZCTextIndex |0.0000113| 87849.05 Should *definitely* be removed: how can you do full-text search on an e-mail address? > cmf_uid |FieldIndex |0.0000113| 88352.13 > getDuration |FieldIndex |0.0000113| 88454.29 > SearchableText |TextIndex |0.0000113| 88466.69 Where did this one come from? The 'SearchableText' above is a ZCTextIndex. > sortable_title |FieldIndex |0.0000112| 88698.49 > getRating |FieldIndex |0.0000112| 88747.5 > getGenres |KeywordIndex |0.0000112| 88796.55 > object_provides |KeywordIndex |0.0000112| 88919.43 > getEventType |KeywordIndex |0.0000112| 88953.9 > in_reply_to |FieldIndex |0.0000112| 89057.46 > getReview_state |FieldIndex |0.0000112| 89124.63 > is_folderish |FieldIndex |0.0000112| 89240.51 > getRawRelatedItems |KeywordIndex |0.0000111| 89568.91 > getThumbSize |FieldIndex |0.0000111| 89653.89 > getStudioCamURL |FieldIndex |0.0000111| 89678.92 > Date |DateIndex |0.0000111| 89799.23 > getHash |FieldIndex |0.0000111| 90111.54 > getNumberOfComments |FieldIndex |0.0000110| 90141.88 > start |DateIndex |0.0000110| 90400.59 > getPercentage |FieldIndex |0.0000110| 90420.94 > is_default_page |FieldIndex |0.0000110| 90446.4 > modified |DateIndex |0.0000106| 93506.29 > created |DateIndex |0.0000106| 93678.59 > getGroupId |ZCTextIndex |0.0000105| 94962.39 > getUserId |ZCTextIndex |0.0000105| 95165.88 > getFullname |ZCTextIndex |0.0000104| 95313.06 Scrap these. > getRoles |FieldIndex |0.0000104| 95385.31 > getUserName |ZCTextIndex |0.0000103| 96692.46 Scrap this one. Can you provide information on the corpus / configuration / test plan you used to generate these results? Tres. - -- =================================================================== Tres Seaver +1 540-429-0999 [EMAIL PROTECTED] Palladion Software "Excellence by Design" http://palladion.com -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.6 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFJGFxi+gerLs4ltQ4RAikKAJ93J3XyhecUq6kogTFsJnLXxgjh8QCgrZEw xdycjo0+4WVdSGFKhVtenms= =kYd+ -----END PGP SIGNATURE----- _______________________________________________ Zope-Dev maillist - Zope-Dev@zope.org http://mail.zope.org/mailman/listinfo/zope-dev ** No cross posts or HTML encoding! ** (Related lists - http://mail.zope.org/mailman/listinfo/zope-announce http://mail.zope.org/mailman/listinfo/zope )