Excellent analysis, many thanks Sean! This is much-needed info for people who are attempting to scale.
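For anyone skimming the thread, here is a toy sketch of the pattern in Sean's message quoted below. It is not Zope code: ToyFolder, its counters, and bulk_add() are made-up names, a plain dict stands in for the BTree, and the checks inside its _setObject() are only my assumption about the kind of per-object work (id validation, membership checks, post-add hooks) that the real ObjectManager._setObject() layers on top of _setOb(). It only illustrates why calling the low-level store directly does less work per object.

```python
class ToyFolder:
    """Toy stand-in for a BTreeFolder-like container (NOT Zope code)."""

    def __init__(self):
        self._tree = {}       # a dict stands in for the BTree
        self.lookups = 0      # membership checks made against the tree
        self.hook_calls = 0   # per-object post-add hooks that were run

    def _has(self, id):
        # Every membership check against the tree is counted.
        self.lookups += 1
        return id in self._tree

    def _setOb(self, id, obj):
        # Low-level store: a single write, no checks, no hooks.
        self._tree[id] = obj

    def _after_add(self, id):
        # Hypothetical post-add hook that looks at the container again.
        self.hook_calls += 1
        self._has(id)

    def _setObject(self, id, obj):
        # High-level add: assumed extra work wrapped around _setOb().
        if not id or id.startswith("_"):
            raise ValueError("bad id: %r" % (id,))
        if self._has(id):
            raise KeyError("duplicate id: %r" % (id,))
        self._setOb(id, obj)
        self._after_add(id)
        return id


def bulk_add(folder, n, fast):
    """Add n placeholder records, via _setOb() if fast, else _setObject()."""
    add = folder._setOb if fast else folder._setObject
    for i in range(n):
        add("rec%06d" % i, {"row": i})
    return folder


slow = bulk_add(ToyFolder(), 1000, fast=False)
quick = bulk_add(ToyFolder(), 1000, fast=True)
print("checked path:", slow.lookups, "lookups,", slow.hook_calls, "hooks")
print("direct path: ", quick.lookups, "lookups,", quick.hook_calls, "hooks")
```

Both folders end up holding the same ids; only the bookkeeping differs. Whatever _setObject() really does beyond the store is also what you give up by bypassing it, which may be related to the objectIds() caveat Sean mentions, so treat the numbers here as an illustration and not a measurement of Zope.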
----- Original Message -----
From: <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Sunday, December 09, 2001 10:36 PM
Subject: [Zope-dev] 100k+ objects, or...Improving Performance of BTreeFolder...

> Interesting FYI for those looking to support lots of cataloged objects in
> ZODB and Zope (Chris W., et al)... I'm working on a project to put ~350k
> Cataloged objects (customer database) in a single BTreeFolder-derived
> container; these objects are 'proxy' objects which each expose a single
> record in a relational dataset, and allow about 8 fields to be indexed (2 of
> which are TextIndexes).
>
> Some informal stress tests using 100k+ _Cataloged_ objects in a BTreeFolder
> in Zope 2.3.3 on my PIII/500/256 MB laptop are proving to be successful, but
> not without some stubborn investigation and a few caveats.
>
> BTreeFolder, using ObjectManager APIs, frankly, just won't scale for
> bulk-adds of objects to folders. I was adding CatalogAware objects to my
> folder (including index_object()). After waiting 2 days for the bulk-add
> processes to finish, I killed Zope and started trying to optimize, figuring
> that the problem was related to Catalog and my own RDB access code, and got
> nowhere (well, I tuned my app, but this didn't solve my problem). I then
> went to #zope, got a few ideas, and ended up with the conclusion that my
> problem was not Catalog-related, but related to BTreeFolder. I initially
> thought it was a problem with the C-based generic BTree implementation
> scaling well past 10k objects, but felt I couldn't point the finger at that
> before some more basic stuff was ruled out.
>
> The easiest thing to do in this case was to figure out what was heavily
> accessing the BTree via its dictionary-like interface, and the thought
> occurred to me that there might be multiple has_key checks, security stuff,
> and the like called by ObjectManager._setObject(), and I was right.
> I figured a switch to use the simple BasicBTreeFolder._setOb() for my stress
> tests might reveal an increase in speed, and...
>
> ...it works, acceptably, no less, on my slow laptop for 100,000 objects. It
> took ~50 minutes to do this on meager hardware with a 4200 RPM IDE disk, and
> I figure a bulk add process like this on fast, new hardware (i.e. something
> with upwards of 22k pystones and lots of RAM) with a dedicated server for my
> RDB would likely take 1/5th this time, or about 10 minutes (by increasing
> both MySQL performance and Zope performance); combine this with ZEO and
> have a dedicated node do this, and I think this is a small amount of proof
> of Zope's ability to scale to many objects. (See my caveats at the bottom of
> this message, though.)
>
> After days of frustration, I'm actually impressed by what I found: my
> data-access APIs are very computationally expensive, since they establish a
> MySQLdb cursor object for each call and execute a query. These data-access
> methods are called while bulk-adding the 100k objects, after _setOb(),
> during Cataloging via index_object() (the transaction is done all in memory
> for now, but will likely move to subtransactions soon to support up to 4x
> that data).
>
> So far, the moral of the story: use _setOb(), not _setObject(), for this
> many objects!
>
> I haven't seen any material documenting anything like this for BTreeFolder,
> so I figured I would share with zope-dev what I found in the hopes that
> developers creating products with BTreeFolder and/or future implementations
> of BTreeFolder might take this into account, in docs, if nothing else.
>
> Caveats:
> - I'm using FileStorage and an old version of Zope (2.3.3). I can't say how
> this will perform with Python 2.1/Zope 2.[4/5]. I imagine that one would
> want to pack the storage between full rebuilds or have very, very fast
> storage hardware.
>
> - Catalog searches without any limiting queries to indexes will simply be
> too slow for practical use with this many objects, so they need to be
> forbidden with a permission to prevent accidental over-utilization of system
> resources or DoS-style attacks. Otherwise, Catalog searches on my slow hard
> drive seem acceptable.
>
> - I'm not too concerned with BTreeFolder __getattr__() performance
> penalties, though I modified BTreeFolder.__getattr__ just in case to remove
> the 'if tree and tree.has_key(name)', replacing it with try/except; I'm not
> sure if this helps/hinders, because my stress-test code uses _getOb()
> instead.
>
> - objectIds() doesn't work; or, more accurately, at first glance,
> <dtml-var "_.len(objectIds())"> doesn't work; I haven't tested anything
> else. I would like to find out why this is, and fix it. I suppose that
> there is something done in ObjectManager that BTreeFolder's simple _setOb()
> doesn't do. If anyone wants to help me figure out the obvious here, I'd
> appreciate it. ;)
>
> - I don't think un-indexed access of records is likely to be very practical
> with this many objects, esp. if things like objectIds() are broken, which
> increases the value of Catalog, and I think that what my experiences here
> with this project are showing is that Catalog indexing isn't as
> expensive/slow as I initially thought it would be. That said, I'm sure
> there can be improvements in Catalog, as has often been discussed here
> recently, but for now, I think I'm happy. :)
>
> - I haven't compared these results with OFS.Folder.Folder yet. I'm too
> lazy/busy to comparison test.
>
> - I'm relatively sure that, in my app, the text index BTrees in the Catalog
> are very 'bushy' (more so than normal) because I am indexing people's full
> names and street addresses, which means there are fewer common words than
> when indexing, say, an every-day document.
>
> - Also, I want to make it clear that if I had a data access API that needed
> more than simple information about my datasets (i.e. I was trying to do
> reporting on patterns, like CRM-ish types of applications), I would likely
> wrap a function around indexes done in the RDB, not in Catalog. My app
> requires no reporting functionality, and thus really needs no indexes, other
> than for finding a record for customer service purposes and account
> validation purposes. The reason, however, that I chose ZCatalog was for full
> text indexing that I could control/hack/customize easily. My slightly
> uninformed belief now is that for big datasets or "enterprise" applications
> (whatever that means), I would use a hybrid set of (faster) indexes using
> the RDB's indexes where appropriate (heavily queried fields), and ZCatalog
> for TextIndexes (convenient). I'm sure inevitable improvements to ZCatalog
> (there seems to be community interest in such) will help here.
>
> - I wonder if "directory-storage" combined with ReiserFS might make for an
> interesting future ZODB choice for this sort of app.
>
> Sean
>
> =========================
> Sean Upton
> Senior Programmer/Analyst
> SignOnSanDiego.com
> The San Diego Union-Tribune
> 619.718.5241
> [EMAIL PROTECTED]
> =========================
>
> _______________________________________________
> Zope-Dev maillist - [EMAIL PROTECTED]
> http://lists.zope.org/mailman/listinfo/zope-dev
> ** No cross posts or HTML encoding! **
> (Related lists -
> http://lists.zope.org/mailman/listinfo/zope-announce
> http://lists.zope.org/mailman/listinfo/zope )