Re: [Zope-dev] 100k+ objects, or...Improving Performance of BTreeFolder...

2001-12-10 Thread Phillip J. Eby

At 04:08 PM 12/10/01 +, Tony McDonald wrote:
>On 10/12/01 2:54 pm, "Phillip J. Eby" <[EMAIL PROTECTED]> wrote:
>
> > I'm not sure if this is taken into consideration in your work so far/future
> > plans...  but just in case you were unaware, it is not necessary for you to
> > persistently store objects in the ZODB that you intend to index in a
> > ZCatalog.  All that is required is that the object to be cataloged is
> > accessible via a URL path.  ZSQL methods can be set up to be
> > URL-traversable, and to wrap a class around the returned row.  To load the
> > items into the catalog, you can use a PythonScript or similar to loop over
> > a multi-row query, passing the objects directly to the catalog along with a
> > path that matches the one they'll be retrievable from.  This approach would
> > eliminate the need for BTreeFolder altogether, although of course it
> > requires access to the RDBMS for retrievals.  This should reduce the number
> > of writes and allow for bigger subtransactions in a given quantity of 
> memory.
>
>Gad! - are you saying you don't need to store a 1Mb .doc file into the ZODB,
>but can still index the thing, store the index information in the Zcatalog
>(presumably a lot smaller than 1Mb) and have the actual file accessible from
>a file system URL? If so, that's really neat!

Yep.  By "URL path", though, I meant a *Zope* path.  However it would be 
straightforward to create a Zope object that represents a filesystem path 
and does traversal/retrieval, assuming that one of the 'FS'-products out 
there doesn't already do this for you.

Chris Withers has pointed out that technically you don't even need the path 
string to be valid; it just has to be unique.  However, the standard tools 
and the method for getting the "real object" referred to by the catalog 
record do expect it to be a valid path, IIRC.  I personally find it most 
convenient, therefore, to use a real Zope path.
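
For example, the usual way back from a search result to the object is 
something like this (sketch only; 'Catalog' is a made-up catalog id, and 
getPath()/getObject() are the brain helpers I'm assuming):

    results = context.Catalog(last_name='McDonald')
    for brain in results:
        path = brain.getPath()    # the uid string given to catalog_object()
        obj = brain.getObject()   # traverses that path, so it must be valid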





Re: [Zope-dev] 100k+ objects, or...Improving Performance of BTreeFolder...

2001-12-10 Thread Chris McDonough

Excellent analysis, many thanks Sean!  This is much-needed info for
people who are attempting to scale.

- Original Message -
From: <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Sunday, December 09, 2001 10:36 PM
Subject: [Zope-dev] 100k+ objects, or...Improving Performance of
BTreeFolder...


> Interesting FYI for those looking to support lots of cataloged objects in
> ZODB and Zope (Chris W., et al)... I'm working on a project to put ~350k
> Cataloged objects (customer database) in a single BTreeFolder-derived
> container; these objects are 'proxy' objects which each expose a single
> record in a relational dataset, and allow about 8 fields to be indexed (2 of
> which, TextIndexes).
>
> ...

Re: [Zope-dev] 100k+ objects, or...Improving Performance of BTreeFolder...

2001-12-10 Thread Phillip J. Eby

I'm not sure if this is taken into consideration in your work so far/future 
plans...  but just in case you were unaware, it is not necessary for you to 
persistently store objects in the ZODB that you intend to index in a 
ZCatalog.  All that is required is that the object to be cataloged is 
accessible via a URL path.  ZSQL methods can be set up to be 
URL-traversable, and to wrap a class around the returned row.  To load the 
items into the catalog, you can use a PythonScript or similar to loop over 
a multi-row query, passing the objects directly to the catalog along with a 
path that matches the one they'll be retrievable from.  This approach would 
eliminate the need for BTreeFolder altogether, although of course it 
requires access to the RDBMS for retrievals.  This should reduce the number 
of writes and allow for bigger subtransactions in a given quantity of memory.
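
Roughly, the loading step could look like this in a Python Script (an 
untested sketch; 'customer_list', 'Catalog', the column name, and the 
/db/customers path are all invented names):

    catalog = context.Catalog

    for row in context.customer_list():        # multi-row ZSQL query
        # the path the row will later be traversable at
        path = '/db/customers/%s' % row.customer_id
        # hand the row straight to the catalog; nothing gets stored in ZODB
        catalog.catalog_object(row, path)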


At 07:36 PM 12/9/01 -0800, [EMAIL PROTECTED] wrote:
>Interesting FYI for those looking to support lots of cataloged objects in
>ZODB and Zope (Chris W., et al)... I'm working on a project to put ~350k
>Cataloged objects (customer database) in a single BTreeFolder-derived
>container; these objects are 'proxy' objects which each expose a single
>record in a relational dataset, and allow about 8 fields to be indexed (2 of
>which, TextIndexes).
>
>...
>
>- Also, I want to make it clear that if I had a data access API that needed
>more than simple information about my datasets (i.e. I was trying to do
>reporting on patterns, like CRM-ish types of applications), I would likely
>wrap a function around indexes done in the RDB, not in Catalog.  My app requires
>no reporting functionality, and thus really needs no indexes, other than for
>finding a record for customer service purposes and account validation
>purposes.  The reason, however, that I chose ZCatalog was for full text
>indexing that I could control/hack/customize easily.  My slightly uninformed
>belief now is that for big datasets or "enterprise" applications (whatever
>that means), I would use a hybrid set of (faster) indexes using the RDB's
>indexes where appropriate (heavily queried fields), and ZCatalog for
>TextIndexes (convenient).   I'm sure inevitable improvements to ZCatalog
>(there seems to be community interest in such) will help here.





[Zope-dev] 100k+ objects, or...Improving Performance of BTreeFolder...

2001-12-09 Thread sean.upton

Interesting FYI for those looking to support lots of cataloged objects in
ZODB and Zope (Chris W., et al)... I'm working on a project to put ~350k
Cataloged objects (customer database) in a single BTreeFolder-derived
container; these objects are 'proxy' objects which each expose a single
record in a relational dataset, and allow about 8 fields to be indexed (2 of
which, TextIndexes).
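
For illustration, such a proxy might look roughly like this (a simplified, 
untested sketch with invented field names, not the actual product code):

    from OFS.SimpleItem import SimpleItem
    from Products.ZCatalog.CatalogAwareness import CatalogAware

    class CustomerProxy(CatalogAware, SimpleItem):
        """One customer record; about 8 attributes get indexed."""
        meta_type = 'Customer Proxy'
        default_catalog = 'Catalog'        # the ZCatalog to index into

        def __init__(self, id, account_no, last_name, email, notes):
            self.id = id
            self.account_no = account_no   # FieldIndex
            self.last_name = last_name     # FieldIndex
            self.email = email             # FieldIndex
            self.notes = notes             # one of the two TextIndexes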

Some informal stress tests using 100k+ _Cataloged_ objects in a BTreeFolder
in Zope 2.3.3 on my PIII/500/256mb laptop are proving to be successful, but
not without some stubborn investigation and a few caveats.  

BTreeFolder, using the ObjectManager APIs, frankly just won't scale for
bulk-adds of objects to folders.  I was adding CatalogAware objects to my
folder (including index_object()).  After waiting two days for the bulk-add
processes to finish, I killed Zope and started trying to optimize, figuring
that the problem was related to Catalog and my own RDB access code, and got
nowhere (well, I tuned my app, but this didn't solve my problem).  I then
went to #zope, got a few ideas, and ended up with the conclusion that my
problem was not Catalog-related, but related to BTreeFolder.  I initially
thought it was a problem with the C-based generic BTree implementation not
scaling well past 10k objects, but felt I couldn't point the finger at that
before some more basic stuff was ruled out.

The easiest thing to do in this case was to figure out what was heavily
accessing the BTree via its dictionary-like interface.  The thought occurred
to me that there might be multiple has_key checks, security checks, and the
like called by ObjectManager._setObject(), and I was right.  I figured a
switch to the simple BasicBTreeFolder._setOb() for my stress tests might
reveal an increase in speed, and...

...it works, acceptably no less, on my slow laptop for 100,000 objects.  It
took ~50 minutes to do this on meager hardware with a 4200 RPM IDE disk, and
I figure a bulk-add process like this on fast, new hardware (i.e. something
with upwards of 22k pystones and lots of RAM) with a dedicated server for my
RDB would likely take 1/5th this time, or about 10 minutes (by improving
both MySQL and Zope performance); combine this with ZEO and a dedicated node
to do the work, and I think this is modest proof of Zope's ability to scale
to many objects.  (See my caveats at the bottom of this message, though.)

After days of frustration, I'm actually impressed by what I found: my
data-access APIs are very computationally expensive, since they establish a
MySQLdb cursor object for each call and execute a query, and those data
access methods were the dominant cost in bulk-adding 100k objects once I
switched to _setOb() and did the Cataloging via index_object() (the
transaction is done all in memory for now, but will likely move to
subtransactions soon to support up to 4x that data).
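
Concretely, each per-object lookup is doing something like this 
(illustration only; the table, column, and credentials are invented):

    import MySQLdb

    # one shared connection (in reality held by the data-access layer)
    conn = MySQLdb.connect(db='crm', user='zope', passwd='secret')

    def get_record(customer_id):
        cur = conn.cursor()       # new cursor and round trip for every call
        cur.execute("SELECT * FROM customers WHERE id = %s", (customer_id,))
        return cur.fetchone()

So a bulk add of 100k objects issues on the order of 100k separate queries; 
that, not BTreeFolder or Catalog, is where most of the remaining time goes.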

So far, the moral of the story: use _setOb(), not _setObject() for this many
objects!
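
In code, the difference is roughly this (an untested sketch, run from an 
External Method or similar; it assumes a BTreeFolder instance 'folder', 
proxy objects like the CustomerProxy sketch above, and a 'rows' sequence 
from a bulk RDB query):

    count = 0
    for (cid, account_no, last_name, email, notes) in rows:
        ob = CustomerProxy(str(cid), account_no, last_name, email, notes)
        # folder._setObject(ob.id, ob)  # ObjectManager path: has_key checks,
        #                               # id validation, security hooks
        folder._setOb(ob.id, ob)        # just stores the object in the BTree
        # acquisition-wrap so index_object() catalogs the right path
        ob.__of__(folder).index_object()
        count = count + 1
        if count % 5000 == 0:
            get_transaction().commit(1) # subtransaction, keeps memory bounded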

I haven't seen any material documenting anything like this for BTreeFolder,
so I figured I would share with zope-dev what I found in the hopes that
developers creating products with BTreeFolder and/or future implementations
of BTreeFolder might take this into account, in docs, if nothing else.

Caveats:
- I'm using FileStorage and an old version of Zope (2.3.3).  I can't say how
this will perform with Python 2.1/Zope 2.[4/5].  I imagine that one would
want to pack the storage between full rebuilds or have very, very fast
storage hardware.

- Catalog searches without any limiting queries against the indexes will
simply be too slow for practical use with this many objects, so they need to
be restricted via a permission to prevent accidental over-utilization of
system resources or DoS-style attacks.  Otherwise, Catalog searches on my
slow hard drive seem acceptable.

- I'm not too concerned with BTreeFolder __getattr__() performance
penalties, though I modified BTreeFolder.__getattr__ just in case, removing
the 'if tree and tree.has_key(name)' check and replacing it with try/except;
I'm not sure whether this helps or hinders, because my stress-test code uses
_getOb() instead.
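
Roughly, the replacement looks like this (a sketch, not the actual 
BTreeFolder source; '_tree' just stands in for whatever attribute holds the 
BTree):

    def __getattr__(self, name):
        # one BTree lookup instead of a has_key() probe plus a second lookup
        tree = self.__dict__.get('_tree')
        try:
            return tree[name]
        except (KeyError, TypeError):   # TypeError if the tree isn't there yet
            raise AttributeError(name)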

- objectIds() doesn't work; or, more accurately, at first glance,
<dtml-var "_.len(objectIds())"> doesn't work; I haven't tested anything
else.  I would like to find out why this is, and fix it.  I suppose that
there is something done in ObjectManager that BTreeFolder's simple _setOb()
doesn't do.  If anyone wants to help me figure out the obvious here, I'd
appreciate it. ;)

- I don't think un-indexed access of records is likely to be very practical
with this many objects, especially if things like objectIds() are broken,
which increases the value of Catalog.  What my experience with this project
is showing is that Catalog indexing isn't as expensive/slow as I initially
thought it would be.  That said, I'm sure there can be improvements in
Catalog, as is often discussed here lately, but for now, I think I'm
happy. :)

- I haven't compared these results with OFS.Folder.Fold