Re: [Zope-dev] 100k+ objects, or...Improving Performance of BTreeFolder...
At 04:08 PM 12/10/01 +, Tony McDonald wrote:

>On 10/12/01 2:54 pm, "Phillip J. Eby" <[EMAIL PROTECTED]> wrote:
>
>> I'm not sure if this is taken into consideration in your work so
>> far/future plans... but just in case you were unaware, it is not
>> necessary for you to persistently store objects in the ZODB that you
>> intend to index in a ZCatalog. All that is required is that the object
>> to be cataloged is accessible via a URL path. ZSQL methods can be set
>> up to be URL-traversable, and to wrap a class around the returned row.
>> To load the items into the catalog, you can use a PythonScript or
>> similar to loop over a multi-row query, passing the objects directly
>> to the catalog along with a path that matches the one they'll be
>> retrievable from. This approach would eliminate the need for
>> BTreeFolder altogether, although of course it requires access to the
>> RDBMS for retrievals. This should reduce the number of writes and
>> allow for bigger subtransactions in a given quantity of memory.
>
>Gad! - are you saying you don't need to store a 1Mb .doc file in the
>ZODB, but can still index the thing, store the index information in the
>ZCatalog (presumably a lot smaller than 1Mb), and have the actual file
>accessible from a file system URL? If so, that's really neat!

Yep. By "URL path", though, I meant a *Zope* path. However, it would be
straightforward to create a Zope object that represents a filesystem
path and does traversal/retrieval, assuming that one of the 'FS'
products out there doesn't already do this for you.

Chris Withers has pointed out that technically you don't even need the
path string to be valid; it just has to be unique. However, the standard
tools and the method for getting the "real object" referred to by the
catalog record do expect it to be a valid path, IIRC. I personally find
it most convenient, therefore, to use a real Zope path.
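The technique described above - cataloging lightweight row objects under
a path without persisting them in the ZODB - can be sketched in
miniature. This is a stand-in, not real Zope code: `FakeCatalog` mimics
only the shape of ZCatalog's `catalog_object(obj, uid)` call, `Row`
stands in for a ZSQL result brain, and the paths and field names are
illustrative assumptions.

```python
class Row:
    """Stand-in for a ZSQL result brain: one relational record."""
    def __init__(self, customer_id, name):
        self.customer_id = customer_id
        self.name = name

class FakeCatalog:
    """Stand-in for ZCatalog: stores only indexed fields keyed by path,
    never the object itself."""
    def __init__(self):
        self.records = {}

    def catalog_object(self, obj, path):
        # Only the indexed metadata is kept; the object is retrievable
        # later by traversing the path.
        self.records[path] = {'name': obj.name}

    def search_name(self, name):
        return [p for p, rec in self.records.items()
                if rec['name'] == name]

# Bulk-load: loop over a multi-row query result, cataloging each row
# under the path it will later be retrievable from via URL traversal.
rows = [Row(1, 'alice'), Row(2, 'bob')]
catalog = FakeCatalog()
for row in rows:
    catalog.catalog_object(row, '/customers/by_id/%s' % row.customer_id)

print(catalog.search_name('bob'))  # -> ['/customers/by_id/2']
```

The point of the pattern: writes are confined to small index records,
while the full data stays in the RDBMS and is fetched only on retrieval.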
___ Zope-Dev maillist - [EMAIL PROTECTED] http://lists.zope.org/mailman/listinfo/zope-dev ** No cross posts or HTML encoding! ** (Related lists - http://lists.zope.org/mailman/listinfo/zope-announce http://lists.zope.org/mailman/listinfo/zope )
Re: [Zope-dev] 100k+ objects, or...Improving Performance of BTreeFolder...
Excellent analysis, many thanks Sean! This is much-needed info for
people who are attempting to scale.

- Original Message -
From: <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Sunday, December 09, 2001 10:36 PM
Subject: [Zope-dev] 100k+ objects, or...Improving Performance of BTreeFolder...

> Interesting FYI for those looking to support lots of cataloged objects
> in ZODB and Zope (Chris W., et al)... I'm working on a project to put
> ~350k Cataloged objects (customer database) in a single
> BTreeFolder-derived container; these objects are 'proxy' objects which
> each expose a single record in a relational dataset, and allow about 8
> fields to be indexed (2 of which, TextIndexes).
>
> Some informal stress tests using 100k+ _Cataloged_ objects in a
> BTreeFolder in Zope 2.3.3 on my PIII/500/256mb laptop are proving to
> be successful, but not without some stubborn investigation and a few
> caveats.
>
> BTreeFolder, using ObjectManager APIs, frankly, just won't scale for
> bulk-adds of objects to folders. I was adding CatalogAware objects to
> my folder (including index_object()). After waiting for bulk-add
> processes to finish after running for 2 days, I killed Zope and
> started trying to optimize, figuring that the problem was related to
> Catalog and my own RDB access code, and got nowhere (well, I tuned my
> app, but this didn't solve my problem). I then went to #zope, got a
> few ideas, and ended up with the conclusion that my problem was not
> Catalog-related, but related to BTreeFolder; I initially thought it
> was a problem with the C-based generic BTree implementation scaling
> well past 10k objects, but felt I couldn't point the finger at that
> before some more basic stuff was ruled out.
>
> The easiest thing to do in this case was to figure out what was
> heavily accessing the BTree via its dictionary-like interface, and the
> thought occurred to me that there might be multiple has_key checks,
> security stuff, and the like called by ObjectManager._setObject(), and
> I was right. I figured a switch to the simple
> BasicBTreeFolder._setOb() for my stress tests might reveal an increase
> in speed, and...
>
> ...it works, acceptably, no less, on my slow laptop for 100,000
> objects. It took ~50 minutes to do this on meager hardware with a 4200
> RPM IDE disk, and I figure a bulk-add process like this on fast, new
> hardware (i.e. something with upwards of 22k pystones and lots of RAM)
> with a dedicated server for my RDB would likely take 1/5th this time,
> or about 10 minutes (by improving both MySQL performance and Zope
> performance); combine this with ZEO and have a dedicated node do this,
> and I think this is a small amount of proof of Zope's ability to scale
> to many objects. (See my caveats at the bottom of this message,
> though.)
>
> After days of frustration, I'm actually impressed by what I found: my
> data-access APIs are very computationally expensive, since they
> establish a MySQLdb cursor object for each call and execute a query;
> these data-access methods were used in bulk-adding 100k objects with
> _setOb(), with cataloging via index_object() (the transaction done all
> in memory for now, but likely moved to subtransactions soon to support
> up to 4x that data).
>
> So far, the moral of the story: use _setOb(), not _setObject(), for
> this many objects!
>
> I haven't seen any material documenting anything like this for
> BTreeFolder, so I figured I would share with zope-dev what I found in
> the hopes that developers creating products with BTreeFolder and/or
> future implementations of BTreeFolder might take this into account, in
> docs, if nothing else.
>
> Caveats:
>
> - I'm using FileStorage and an old version of Zope (2.3.3). I can't
> say how this will perform with Python 2.1/Zope 2.[4/5]. I imagine that
> one would want to pack the storage between full rebuilds or have very,
> very fast storage hardware.
>
> - Catalog searches without any limiting queries to indexes will simply
> be too slow for practical use with this many objects, so they need to
> be forbidden with a permission to prevent accidental over-utilization
> of system resources or DOS-style attacks. Otherwise, Catalog searches
> on my slow hard drive seem acceptable.
>
> - I'm not too concerned with BTreeFolder __getattr__() performance
> penalties, though I modified BTreeFolder.__getattr__ just in case to
> remove the 'if tree and tree.has_key(name)', replacing it with
> try/except; I'm not sure if this helps/hinders, because my stress-test
> code uses _getOb() instead.
>
> - objectIds() doesn't work; or, more accurately, at first glance,
> <dtml-var "_.len(objectIds())"> doesn't work; I haven't tested
> anything else. I would like to find out why this is, and fix it. I
> suppose that there is something done in ObjectManager that
> BTreeFolder's simple _setOb() doesn't do. If anyone wants to help me
> figure out the obvious here, I'd appreciate it. ;)
>
> - I don't think un-indexed access of records is likely to be very
> practical with this many, esp. if things like objectIds() are broken,
> which increases the value of Catalog; and I think that what my
> experiences with this project are showing is that Catalog indexing
> isn't as expensive/slow as I initially thought it would be. That said,
> I'm sure there can be improvements in Catalog, as is often discussed
> here recently, but for now, I think I'm happy. :)
>
> - I haven't compared these results with OFS.Folder.Folder.
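The __getattr__ caveat quoted above - swapping the
`if tree and tree.has_key(name)` guard for a try/except - is the classic
LBYL-to-EAFP change. A minimal stand-in showing the EAFP variant (this
is not the real BTreeFolder code; the class and its attributes are
illustrative):

```python
class FolderAttrSketch:
    """Toy folder whose contained objects live in a tree mapping and
    are exposed as attributes, roughly as BTreeFolder does."""
    def __init__(self):
        self._tree = {'doc1': 'first object'}

    def __getattr__(self, name):
        # EAFP: attempt the lookup and let KeyError signal a miss,
        # instead of a has_key() check followed by a second lookup.
        # (__dict__ access avoids recursing back into __getattr__.)
        try:
            return self.__dict__['_tree'][name]
        except KeyError:
            raise AttributeError(name)

f = FolderAttrSketch()
print(f.doc1)  # -> first object
```

Whether EAFP is actually faster here depends on the hit/miss ratio:
try/except is cheap on the success path but pays for exception setup on
every miss, which matches the poster's "not sure if this
helps/hinders".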
Re: [Zope-dev] 100k+ objects, or...Improving Performance of BTreeFolder...
I'm not sure if this is taken into consideration in your work so
far/future plans... but just in case you were unaware, it is not
necessary for you to persistently store objects in the ZODB that you
intend to index in a ZCatalog. All that is required is that the object
to be cataloged is accessible via a URL path. ZSQL methods can be set up
to be URL-traversable, and to wrap a class around the returned row. To
load the items into the catalog, you can use a PythonScript or similar
to loop over a multi-row query, passing the objects directly to the
catalog along with a path that matches the one they'll be retrievable
from. This approach would eliminate the need for BTreeFolder altogether,
although of course it requires access to the RDBMS for retrievals. This
should reduce the number of writes and allow for bigger subtransactions
in a given quantity of memory.

At 07:36 PM 12/9/01 -0800, [EMAIL PROTECTED] wrote:
>Interesting FYI for those looking to support lots of cataloged objects
>in ZODB and Zope (Chris W., et al)... I'm working on a project to put
>~350k Cataloged objects (customer database) in a single
>BTreeFolder-derived container; these objects are 'proxy' objects which
>each expose a single record in a relational dataset, and allow about 8
>fields to be indexed (2 of which, TextIndexes).
>
>...
>
>- Also, I want to make it clear that if I had a data access API that
>needed more than simple information about my datasets (i.e. I was
>trying to do reporting on patterns, like CRM-ish types of
>applications), I would likely wrap a function around indexes done in
>the RDB, not in Catalog. My app requires no reporting functionality,
>and thus really needs no indexes, other than for finding a record for
>customer service purposes and account validation purposes. The reason,
>however, that I chose ZCatalog was for full-text indexing that I could
>control/hack/customize easily. My slightly uninformed belief now is
>that for big datasets or "enterprise" applications (whatever that
>means), I would use a hybrid set of (faster) indexes using the RDB's
>indexes where appropriate (heavily queried fields), and ZCatalog for
>TextIndexes (convenient). I'm sure inevitable improvements to ZCatalog
>(there seems to be community interest in such) will help here.
[Zope-dev] 100k+ objects, or...Improving Performance of BTreeFolder...
Interesting FYI for those looking to support lots of cataloged objects
in ZODB and Zope (Chris W., et al)... I'm working on a project to put
~350k Cataloged objects (customer database) in a single
BTreeFolder-derived container; these objects are 'proxy' objects which
each expose a single record in a relational dataset, and allow about 8
fields to be indexed (2 of which, TextIndexes).

Some informal stress tests using 100k+ _Cataloged_ objects in a
BTreeFolder in Zope 2.3.3 on my PIII/500/256mb laptop are proving to be
successful, but not without some stubborn investigation and a few
caveats.

BTreeFolder, using ObjectManager APIs, frankly, just won't scale for
bulk-adds of objects to folders. I was adding CatalogAware objects to my
folder (including index_object()). After waiting for bulk-add processes
to finish after running for 2 days, I killed Zope and started trying to
optimize, figuring that the problem was related to Catalog and my own
RDB access code, and got nowhere (well, I tuned my app, but this didn't
solve my problem). I then went to #zope, got a few ideas, and ended up
with the conclusion that my problem was not Catalog-related, but related
to BTreeFolder; I initially thought it was a problem with the C-based
generic BTree implementation scaling well past 10k objects, but felt I
couldn't point the finger at that before some more basic stuff was ruled
out.

The easiest thing to do in this case was to figure out what was heavily
accessing the BTree via its dictionary-like interface, and the thought
occurred to me that there might be multiple has_key checks, security
stuff, and the like called by ObjectManager._setObject(), and I was
right. I figured a switch to the simple BasicBTreeFolder._setOb() for my
stress tests might reveal an increase in speed, and...

...it works, acceptably, no less, on my slow laptop for 100,000 objects.
It took ~50 minutes to do this on meager hardware with a 4200 RPM IDE
disk, and I figure a bulk-add process like this on fast, new hardware
(i.e. something with upwards of 22k pystones and lots of RAM) with a
dedicated server for my RDB would likely take 1/5th this time, or about
10 minutes (by improving both MySQL performance and Zope performance);
combine this with ZEO and have a dedicated node do this, and I think
this is a small amount of proof of Zope's ability to scale to many
objects. (See my caveats at the bottom of this message, though.)

After days of frustration, I'm actually impressed by what I found: my
data-access APIs are very computationally expensive, since they
establish a MySQLdb cursor object for each call and execute a query;
these data-access methods were used in bulk-adding 100k objects with
_setOb(), with cataloging via index_object() (the transaction done all
in memory for now, but likely moved to subtransactions soon to support
up to 4x that data).

So far, the moral of the story: use _setOb(), not _setObject(), for this
many objects!

I haven't seen any material documenting anything like this for
BTreeFolder, so I figured I would share with zope-dev what I found in
the hopes that developers creating products with BTreeFolder and/or
future implementations of BTreeFolder might take this into account, in
docs, if nothing else.

Caveats:

- I'm using FileStorage and an old version of Zope (2.3.3). I can't say
how this will perform with Python 2.1/Zope 2.[4/5]. I imagine that one
would want to pack the storage between full rebuilds or have very, very
fast storage hardware.

- Catalog searches without any limiting queries to indexes will simply
be too slow for practical use with this many objects, so they need to be
forbidden with a permission to prevent accidental over-utilization of
system resources or DOS-style attacks. Otherwise, Catalog searches on my
slow hard drive seem acceptable.

- I'm not too concerned with BTreeFolder __getattr__() performance
penalties, though I modified BTreeFolder.__getattr__ just in case to
remove the 'if tree and tree.has_key(name)', replacing it with
try/except; I'm not sure if this helps/hinders, because my stress-test
code uses _getOb() instead.

- objectIds() doesn't work; or, more accurately, at first glance,
<dtml-var "_.len(objectIds())"> doesn't work; I haven't tested anything
else. I would like to find out why this is, and fix it. I suppose that
there is something done in ObjectManager that BTreeFolder's simple
_setOb() doesn't do. If anyone wants to help me figure out the obvious
here, I'd appreciate it. ;)

- I don't think un-indexed access of records is likely to be very
practical with this many, esp. if things like objectIds() are broken,
which increases the value of Catalog; and I think that what my
experiences with this project are showing is that Catalog indexing isn't
as expensive/slow as I initially thought it would be. That said, I'm
sure there can be improvements in Catalog, as is often discussed here
recently, but for now, I think I'm happy. :)

- I haven't compared these results with OFS.Folder.Folder.
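The bulk-add lesson above can be sketched as a generic pattern: write
through the cheap `_setOb()`-style path and commit a subtransaction
every N objects to bound memory. The classes and the `commit` hook below
are stand-ins, not the real BTreeFolder or Zope transaction APIs (in
Zope 2.x the subtransaction commit would be
`get_transaction().commit(1)`):

```python
class FolderSketch:
    """Toy model of a BTreeFolder: _setOb() writes straight into the
    tree, while _setObject() (not shown) would add per-call id checks,
    security machinery, etc. -- cheap once, costly 100,000 times."""
    def __init__(self):
        self._tree = {}  # stand-in for the OOBTree

    def _setOb(self, id, obj):
        self._tree[id] = obj

def bulk_add(folder, items, commit_every=1000, commit=lambda: None):
    """Add many objects via the direct path, invoking `commit` (e.g. a
    subtransaction commit) every `commit_every` objects so the whole
    load need not be held in memory at once."""
    for i, (id, obj) in enumerate(items, 1):
        folder._setOb(id, obj)
        if i % commit_every == 0:
            commit()

commits = []
folder = FolderSketch()
bulk_add(folder, (('id%d' % n, n) for n in range(5000)),
         commit_every=1000, commit=lambda: commits.append(1))
print(len(folder._tree), len(commits))  # -> 5000 5
```

Note the trade-off the caveats imply: bypassing _setObject() also skips
the ObjectManager bookkeeping, which is presumably why objectIds()
stopped working - the speed comes from omitting work, not from doing it
faster.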