I converted my code to use LOBTrees holding LLTreeSets and it sticks to the
memory bounds and performs admirably throughout the whole process.
Unfortunately, opening the database afterwards seems to be really, really
slow. Here's what I'm doing:

from ZODB.FileStorage import FileStorage
from ZODB.DB import DB
storage = FileStorage('attempt3_wordid_to_docset', pack_keep_old=False)

I think the file in question is about 7 GB in size. It's using 100 percent
of a core and I've never seen it get past the FileStorage object creation.
Is there something I'm doing wrong when I initially fill this storage that
makes it so hard to index, or is there something wrong with the way I'm
creating the new FileStorage?
Thanks for everything, you guys have really been great.
On Wed, May 12, 2010 at 3:48 AM, Jim Fulton <j...@zope.com> wrote:
> On Tue, May 11, 2010 at 7:37 PM, Ryan Noon <rmn...@gmail.com> wrote:
> > Hi Jim,
> > I'm really sorry for the miscommunication, I thought I made that clear in
> > my last email:
> > "I'm wrapping ZODB in a 'ZMap' class that just forwards all the
> > methods to the ZODB root and allows easy interchangeability with my old
> > sqlite OODB abstraction."
> Perhaps I should have picked up on this, but it wasn't clear that you
> were referring to word_id_docset. I couldn't see that in the code and I
> didn't get an answer to my question.
> > wordid_to_docset is a "ZMap", which just wraps the ZODB
> > boilerplate/connection and forwards dictionary methods to the root.
> This is the last piece to the puzzle. The root object is a persistent
> mapping object that is a single database object and is thus not a
> scalable data structure. As Lawrence pointed out, this, together with
> the fact that you're using non-persistent arrays as mapping values
> means that all your data is in a single object.
> > but I'm still sorta worried because in my experimentation with ZODB
> > so far I've never been able to observe it sticking to any cache limits,
> > no matter how often I tell it to garbage collect (even when storing very
> > small values that should give it adequate granularity...see my experiment
> > at the end of my last email).
> The unit of granularity is the persistent object. It is persistent
> objects that are managed by the cache, not individual Python objects
> like strings. If your entire database is in a single persistent
> object, then your entire database will be in memory.
> If you want a scalable mapping and your keys are stably ordered (as
> are strings and numbers) then you should use a BTree. BTrees spread
> their data over multiple data records, so you can have massive
> mappings without storing massive amounts of data in memory.
> If you want a set and the items are stably ordered, then use a TreeSet
> (or a Set if the set is known to be small).
> There are built-in BTrees and sets that support compact storage of
> signed 32-bit or 64-bit ints.
> Jim Fulton
Stanford Computer Science
BS '09, MS '10