Excellent analysis, many thanks Sean!  This is much-needed info for
people who are attempting to scale.

----- Original Message -----
Sent: Sunday, December 09, 2001 10:36 PM
Subject: [Zope-dev] 100k+ objects, or...Improving Performance of

> Interesting FYI for those looking to support lots of cataloged
> objects in
> ZODB and Zope (Chris W., et al)... I'm working on a project to put
> Cataloged objects (customer database) in a single
> container; these objects are 'proxy' objects which each expose a
> record in a relational dataset, and allow about 8 fields to be
> indexed (2 of which are TextIndexes).
> Some informal stress tests using 100k+ _Cataloged_ objects in a
> in Zope 2.3.3 on my PIII/500/256mb laptop are proving to be
> successful, but not without some stubborn investigation and a few
> caveats.
> BTreeFolder, using ObjectManager APIs, frankly, just won't scale for
> bulk-adds of objects to folders.  I was adding CatalogAware objects
> to my folder (including index_object()). After waiting 2 days for the
> bulk-add processes to finish, I killed Zope and started trying to
> optimize, figuring that the problem was related to Catalog and my
> own RDB access code, and got nowhere (well, I tuned my app, but this
> didn't solve my problem), then went to #zope, got a few ideas, and
> ended up with the
> conclusion that my problem was not Catalog-related, but related to
> BTreeFolder; I initially thought it was a problem with the C-Based
> BTree implementation scaling well past 10k objects, but felt I
> point the finger at that before some more basic stuff was ruled out.
> The easiest thing to do in this case, was to figure out what was
> accessing the BTree via its dictionary-like interface, and the
> occurred to me that there might be multiple has_key checks, security
> and the like called by ObjectManager._setObject(), and I was right.
> figured a switch to use the simple BasicBTreeFolder._setOb() for my
> tests might reveal an increase in speed, and...
> ...it works, acceptably, no less, on my slow laptop for 100,000
> objects.  It took ~50 minutes to do this on meager hardware with a
> 4200 RPM IDE disk, and
> I figure a bulk add process like this on fast, new hardware (i.e.
> with upwards of 22k pystones and lots of RAM) with a dedicated
> server for my RDB, would likely take 1/5th this time, or about 10
> minutes (by
> both MySQL performance, and Zope performance); combine this with ZEO
> have a dedicated node do this, and I think this is a small amount of
> of Zope's ability to scale to many objects. (See my caveats at the
> bottom of this message, though).
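
To make the _setObject()/_setOb() difference concrete, here is a toy
model in plain Python. It is not Zope code: ToyFolder, CountingTree, and
the two membership checks inside this _setObject() are hypothetical
stand-ins for the has_key and security checks Sean mentions, and the
real ObjectManager almost certainly does more and different work. The
sketch only shows how per-add checks multiply over a bulk load.

```python
class CountingTree(dict):
    """Stand-in for the folder's BTree; counts has_key-style checks."""

    def __init__(self):
        super().__init__()
        self.membership_checks = 0

    def has_key(self, key):
        self.membership_checks += 1
        return key in self


class ToyFolder:
    """Hypothetical folder, NOT the real BTreeFolder/ObjectManager."""

    def __init__(self):
        self._tree = CountingTree()

    def _setOb(self, id, obj):
        # Direct store: no validation, no extra lookups.
        self._tree[id] = obj

    def _setObject(self, id, obj):
        # Assumed pre-add check (duplicate id).
        if self._tree.has_key(id):
            raise KeyError("duplicate id: %r" % (id,))
        self._setOb(id, obj)
        # Assumed post-add check (object really stored).
        if not self._tree.has_key(id):
            raise RuntimeError("object was not stored: %r" % (id,))
        return id


records = [("cust%06d" % i, {"row": i}) for i in range(1000)]

slow = ToyFolder()
for id, obj in records:
    slow._setObject(id, obj)

fast = ToyFolder()
for id, obj in records:
    fast._setOb(id, obj)

print(slow._tree.membership_checks, fast._tree.membership_checks)
```

Bypassing _setObject() also skips whatever bookkeeping it performs,
which may be related to the objectIds() caveat further down.
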
> After days of frustration, I'm actually impressed by what I found:
> data-access APIs are very computationally expensive, since they
> establish a MySQLdb cursor object for each call and execute a query;
> these data
> methods used in bulk adding 100k objects after using _setOb() during
> Cataloging via index_object() (the transaction done all in memory
> for now, but likely moved to subtransactions soon to support up to
> 4x that
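
For anyone wondering what "a cursor object for each call" looks like
next to a single bulk query, here is a rough sketch. It uses the stdlib
sqlite3 module as a stand-in for MySQLdb so that it runs anywhere; the
customers table and both function names are made up for illustration.
The single-query variant is my suggestion, not something Sean describes
doing.

```python
import sqlite3

# In-memory stand-in for the relational customer database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(i, "customer %d" % i) for i in range(1000)],
)


def fetch_one_per_call(conn, ids):
    """One cursor and one query per record, as in the quoted description."""
    rows = []
    for cid in ids:
        cur = conn.cursor()
        cur.execute("SELECT id, name FROM customers WHERE id = ?", (cid,))
        rows.append(cur.fetchone())
        cur.close()
    return rows


def fetch_in_one_query(conn):
    """Single cursor and single query for the whole bulk load."""
    cur = conn.cursor()
    cur.execute("SELECT id, name FROM customers ORDER BY id")
    rows = cur.fetchall()
    cur.close()
    return rows


per_call = fetch_one_per_call(conn, range(1000))
batched = fetch_in_one_query(conn)
print(len(per_call), len(batched), per_call == batched)
```

Both paths return the same rows; the difference is 1000 round trips
versus one, which matters more against a networked RDB than it does
here.
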
> So far, the moral of the story: use _setOb(), not _setObject() for
> this many objects!
> I haven't seen any material documenting anything like this for
> so I figured I would share with zope-dev what I found in the hopes
> developers creating products with BTreeFolder and/or future
> of BTreeFolder might take this into account, in docs, if nothing
> Caveats:
> - I'm using FileStorage and an old version of Zope (2.3.3).  I can't
> say how this will perform with Python 2.1/Zope 2.[4/5].  I imagine
> that one
> want to pack the storage between full rebuilds or have very, very
> storage hardware.
> - Catalog searches without any limiting queries to indexes will
> simply be too slow for practical use with this many objects, so they
> need to
> forbidden with a permission to prevent accidental over-utilization
> of system resources or DoS-style attacks.  Otherwise, Catalog
> searches on my slow hard drive seem acceptable.
> - I'm not too concerned with BTreeFolder __getattr__() performance
> penalties, though I modified BTreeFolder.__getattr__ just in case to
> the 'if tree and tree.has_key(name)', replacing with try/except; I'm
> sure if this helps/hinders, because my stress-test code uses
> instead.
> - objectIds() doesn't work; or, more accurately, at first glance,
> "_.len(objectIds())"> doesn't work; I haven't tested anything else.
> I would
> like to find out why this is, and fix it.  I suppose that there is
> done in ObjectManager that BTreeFolder's simple _setOb() doesn't do.
> anyone wants to help me figure out the obvious here, I'd appreciate
> it. ;)
> - I don't think un-indexed access of records is likely to be very
> with this many, esp. if things like objectIds() are broken, which
> the value of Catalog, and I think that what my experiences here with
> project are showing is that Catalog indexing isn't as expensive/slow
> as I initially thought it would be.  That said, I'm sure there can be
> improvements in Catalog, as is often discussed here recently, but for
> now, I think I'm happy. :)
> - I haven't compared these results with OFS.Folder.Folder yet.  I'm
> lazy/busy to comparison test.
> - I'm relatively sure that, in my app, the text index BTrees in the
> are very 'bushy' (more so than normal) because I am indexing
> people's full
> names, and street addresses, which means there are less common words
> indexing, say, an every-day document.
> - Also, I want to make it clear that if I had a data access API that
> more than simple information about my datasets (i.e. I was trying to
> reporting on patterns, like CRM-ish types of applications), I would
> wrap a function around indexes done in the RDB, not in Catalog.  My
> no reporting functionality, and thus really needs no indexes, other
> than for
> finding a record for customer service purposes and account
> purposes.  The reason, however, that I chose ZCatalog was for full
> indexing that I could control/hack/customize easily.  My slightly
> belief now is that for big datasets or "enterprise" applications
> that means), I would use a hybrid set of (faster) indexes using the
> indexes where appropriate (heavily queried fields), and ZCatalog for
> TextIndexes (convenient).   I'm sure inevitable improvements to
> (there seems to be community interest in such) will help here.
> - I wonder if "directory-storage" combined with ReiserFS might make
> for an
> interesting future ZODB choice for this sort of app.
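
On the __getattr__ caveat above: the change from a has_key check to
try/except can be sketched roughly as below. This is a guess at the
shape of the change in modern Python, not the actual BTreeFolder source;
only the 'if tree and tree.has_key(name)' condition comes from Sean's
message, and the class names are hypothetical.

```python
class ToyFolderBase:
    """Hypothetical folder holding its children in a dict-like _tree."""

    def __init__(self, items):
        self._tree = dict(items)


class CheckFirstFolder(ToyFolderBase):
    def __getattr__(self, name):
        # Check membership first, then fetch: two lookups on a hit.
        tree = self.__dict__.get("_tree")
        if tree and name in tree:
            return tree[name]
        raise AttributeError(name)


class TryExceptFolder(ToyFolderBase):
    def __getattr__(self, name):
        # Fetch directly and translate a miss: one lookup on a hit.
        try:
            return self.__dict__["_tree"][name]
        except KeyError:
            raise AttributeError(name) from None


checked = CheckFirstFolder({"alice": 1})
eafp = TryExceptFolder({"alice": 1})
print(checked.alice, eafp.alice)
print(hasattr(checked, "missing"), hasattr(eafp, "missing"))
```

The try/except form saves a lookup on hits but pays for exception
handling on misses, so which one wins depends on the hit rate. That fits
Sean's remark that he is not sure whether the change helps or hinders.
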
> Sean
> =========================
> Sean Upton
> Senior Programmer/Analyst
> SignOnSanDiego.com
> The San Diego Union-Tribune
> 619.718.5241
> =========================
> _______________________________________________
> Zope-Dev maillist  -  [EMAIL PROTECTED]
> http://lists.zope.org/mailman/listinfo/zope-dev
> **  No cross posts or HTML encoding!  **
> (Related lists -
>  http://lists.zope.org/mailman/listinfo/zope-announce
>  http://lists.zope.org/mailman/listinfo/zope )
