Interesting FYI for those looking to support lots of cataloged objects in ZODB and Zope (Chris W., et al)...

I'm working on a project to put ~350k Cataloged objects (a customer database) in a single BTreeFolder-derived container. These are 'proxy' objects, each of which exposes a single record in a relational dataset and allows about 8 fields to be indexed (2 of which are TextIndexes).
Some informal stress tests using 100k+ _Cataloged_ objects in a BTreeFolder in Zope 2.3.3 on my PIII/500/256MB laptop are proving successful, but not without some stubborn investigation and a few caveats.

BTreeFolder, using the ObjectManager APIs, frankly just won't scale for bulk-adds of objects to folders. I was adding CatalogAware objects to my folder (including index_object()). After letting the bulk-add process run for 2 days without finishing, I killed Zope and started trying to optimize, figuring that the problem was related to Catalog and my own RDB access code, and got nowhere (well, I tuned my app, but this didn't solve my problem). I then went to #zope, got a few ideas, and ended up concluding that my problem was not Catalog-related, but related to BTreeFolder. I initially thought the C-based generic BTree implementation was failing to scale well past 10k objects, but felt I couldn't point the finger at that before some more basic stuff was ruled out.

The easiest thing to do in this case was to figure out what was heavily accessing the BTree via its dictionary-like interface. The thought occurred to me that there might be multiple has_key checks, security stuff, and the like called by ObjectManager._setObject(), and I was right. I figured a switch to the simple BasicBTreeFolder._setOb() for my stress tests might reveal an increase in speed, and...

...it works, acceptably, no less, on my slow laptop for 100,000 objects. It took ~50 minutes on meager hardware with a 4200 RPM IDE disk. I figure a bulk-add process like this on fast, new hardware (i.e. something with upwards of 22k pystones and lots of RAM), with a dedicated server for my RDB, would likely take 1/5th of this time, or about 10 minutes (by increasing both MySQL performance and Zope performance). Combine this with ZEO and have a dedicated node do this, and I think this is a small amount of proof of Zope's ability to scale to many objects.
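
To illustrate the kind of overhead described above, here is a toy model in modern Python. It is not Zope's actual code: ToyFolder, CountingTree, and bulk_add are hypothetical names, a counting dict stands in for the BTree, and the checks in the toy _setObject() are only assumed stand-ins for the has_key and security work mentioned above. It shows how a thin _setOb()-style path avoids per-object lookups during a bulk add.

```python
class CountingTree(dict):
    """Dict standing in for the folder's BTree; counts membership checks."""

    def __init__(self):
        super().__init__()
        self.lookups = 0

    def __contains__(self, key):
        self.lookups += 1
        return super().__contains__(key)


class ToyFolder:
    """Hypothetical container with a checked and an unchecked add path."""

    def __init__(self):
        self._tree = CountingTree()

    def _setOb(self, ob_id, obj):
        # Thin path: store the object and do nothing else.
        self._tree[ob_id] = obj

    def _setObject(self, ob_id, obj):
        # Heavier path: illustrative (assumed) validation around the store.
        if not ob_id or ob_id.startswith("_"):
            raise ValueError("bad id: %r" % (ob_id,))
        if ob_id in self._tree:            # duplicate check before the add
            raise KeyError("duplicate id: %r" % (ob_id,))
        self._setOb(ob_id, obj)
        if ob_id not in self._tree:        # sanity check after the add
            raise RuntimeError("add failed: %r" % (ob_id,))
        return ob_id


def bulk_add(folder, records, checked=False):
    """Add (id, object) pairs via the chosen path; return the folder size."""
    add = folder._setObject if checked else folder._setOb
    for ob_id, obj in records:
        add(ob_id, obj)
    return len(folder._tree)


records = [("cust%06d" % i, {"row": i}) for i in range(1000)]

fast, slow = ToyFolder(), ToyFolder()
bulk_add(fast, records)
bulk_add(slow, records, checked=True)
print("thin path lookups:   ", fast._tree.lookups)
print("checked path lookups:", slow._tree.lookups)
```

The trade-off is that the caller of the thin path has to guarantee unique, valid ids itself, and any bookkeeping the checked path would have done is skipped.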
(See my caveats at the bottom of this message, though.)

After days of frustration, I'm actually impressed by what I found. My data-access APIs are very computationally expensive, since they establish a MySQLdb cursor object for each call and execute a query. These data-access methods were called while bulk-adding the 100k objects with _setOb() and cataloging each one via index_object() (the transaction is done all in memory for now, but will likely move to subtransactions soon to support up to 4x that data).

So far, the moral of the story: use _setOb(), not _setObject(), for this many objects! I haven't seen any material documenting anything like this for BTreeFolder, so I figured I would share what I found with zope-dev, in the hope that developers creating products with BTreeFolder and/or future implementations of BTreeFolder might take this into account, in docs if nothing else.

Caveats:
- I'm using FileStorage and an old version of Zope (2.3.3). I can't say how this will perform with Python 2.1/Zope 2.[4/5]. I imagine that one would want to pack the storage between full rebuilds or have very, very fast storage hardware.
- Catalog searches without any limiting queries to indexes will simply be too slow for practical use with this many objects, so they need to be forbidden with a permission to prevent accidental over-utilization of system resources or DoS-style attacks. Otherwise, Catalog searches on my slow hard drive seem acceptable.
- I'm not too concerned with BTreeFolder __getattr__() performance penalties, though just in case I modified BTreeFolder.__getattr__ to remove the 'if tree and tree.has_key(name)' check, replacing it with try/except. I'm not sure if this helps or hinders, because my stress-test code uses _getOb() instead.
- objectIds() doesn't work; or, more accurately, at first glance <dtml-var "_.len(objectIds())"> doesn't work. I haven't tested anything else. I would like to find out why this is, and fix it.
I suppose that ObjectManager does something that BTreeFolder's simple _setOb() doesn't. If anyone wants to help me figure out the obvious here, I'd appreciate it. ;)
- I don't think un-indexed access of records is likely to be very practical with this many objects, especially if things like objectIds() are broken. That increases the value of Catalog, and what my experience with this project is showing is that Catalog indexing isn't as expensive/slow as I initially thought it would be. That said, I'm sure there can be improvements in Catalog, as has often been discussed here recently, but for now, I think I'm happy. :)
- I haven't compared these results with OFS.Folder.Folder yet. I'm too lazy/busy to comparison test.
- I'm relatively sure that, in my app, the text index BTrees in the Catalog are very 'bushy' (more so than normal) because I am indexing people's full names and street addresses, which means there are fewer common words than when indexing, say, an every-day document.
- Also, I want to make it clear that if I had a data access API that needed more than simple information about my datasets (i.e. I was trying to do reporting on patterns, as in CRM-ish types of applications), I would likely wrap a function around indexes done in the RDB, not in Catalog. My app requires no reporting functionality, and thus really needs no indexes other than for finding a record for customer service and account validation purposes. The reason I chose ZCatalog, however, was for full text indexing that I could control/hack/customize easily. My slightly uninformed belief now is that for big datasets or "enterprise" applications (whatever that means), I would use a hybrid set of indexes: the RDB's (faster) indexes where appropriate (heavily queried fields), and ZCatalog for TextIndexes (convenient). I'm sure inevitable improvements to ZCatalog (there seems to be community interest in such) will help here.
- I wonder if "directory-storage" combined with ReiserFS might make for an interesting future ZODB choice for this sort of app.

Sean

=========================
Sean Upton
Senior Programmer/Analyst
SignOnSanDiego.com
The San Diego Union-Tribune
619.718.5241
[EMAIL PROTECTED]
=========================