Hi everyone (I just registered for the list and resent so as to not bug the

I recently moved some of my home-rolled, sqlite-backed object
databases over to ZODB, based on what I'd read and some cool performance
numbers I'd seen.  I'm really happy with the entire system so far except for
one really irritating problem: memory usage.

I'm doing a rather intensive operation where I'm inverting a mapping of the
form (docid => [wordid]) for about 3 million documents (and about 8 million
unique words).  I thought about doing it on Hadoop, but it's a one-time
thing, and it'd be nice if I didn't have to load the data back into an object
database for my application at the end anyway.
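
To make the shape of that operation concrete, here is a minimal in-memory sketch of the inversion. It assumes a plain dict stands in for the database root and uses made-up sample data; the name `invert_mapping` is hypothetical and this is not the actual ZODB-backed code.

```python
from array import array


def invert_mapping(docid_to_wordset):
    """Invert docid => [wordid] into wordid => array of docids."""
    wordid_to_docset = {}
    for docid, wordset in docid_to_wordset.items():
        for wordid in wordset:
            # setdefault returns the existing array, or stores a new empty one
            wordid_to_docset.setdefault(wordid, array('L')).append(docid)
    return wordid_to_docset


# Tiny made-up input: three documents, three distinct words.
sample = {1: [10, 20], 2: [20, 30], 3: [10]}
inverted = invert_mapping(sample)
```

Each word id ends up mapped to a compact unsigned-long array of the document ids that contain it, which is the structure the real run builds at a much larger scale.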

Anyhoo, in the process of this operation (which performs much faster than my
sqlite+python cache solution), memory usage never really drops.  I'm
currently doing a commit every 25k documents, but the python process just
gobbles up RAM.  I made it through 750k documents before my 8GB
Ubuntu 10.04 server choked and killed the process (at about 80 percent memory
usage).  (The same thing happens on Windows and OS X, btw.)
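
The commit-every-N-documents pattern described above can be sketched generically as follows. The names `process_in_batches`, `handle`, and `commit` are hypothetical, the commit callback here only records a count instead of touching a real database, and the sketch illustrates the batching only; it does not address the memory growth.

```python
def process_in_batches(items, handle, commit, batch_size=25000):
    """Call handle(item) for each item and commit() after every batch_size items."""
    n_done = 0
    n_commits = 0
    for item in items:
        handle(item)
        n_done += 1
        if n_done % batch_size == 0:
            commit()
            n_commits += 1
    if n_done % batch_size != 0:
        commit()  # flush the final partial batch
        n_commits += 1
    return n_done, n_commits


# Stand-ins for the real work: record items and the count seen at each commit.
seen = []
commits = []
result = process_in_batches(range(7), seen.append,
                            lambda: commits.append(len(seen)), batch_size=3)
```

With a batch size of 3 and 7 items, the commit callback fires after items 3 and 6, and once more for the leftover item.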

I figure either there's a really tremendous bug in ZODB (unlikely given its
age and venerability) or I'm really doing it wrong.  Here's my code:

        self.storage = FileStorage(self.dbfile, pack_keep_old=False)
        cache_size = 512 * 1024 * 1024

        self.db = DB(self.storage, pool_size=1, cache_size_bytes=cache_size,
                     historical_cache_size_bytes=cache_size,
                     database_name=self.name)
        self.connection = self.db.open()
        self.root = self.connection.root()

and the actual insertions...

            # I can be kinda pathological with loop operations
            set_default = wordid_to_docset.root.setdefault
            array_append = array.append
            # docid_to_wordset is one of my older sqlite oodb's; it doesn't
            # maintain a cache, just iterates (small constant mem usage)
            for docid, wordset in docid_to_wordset.iteritems():
                for wordid in wordset:
                    docset = set_default(wordid, array('L'))
                    array_append(docset, docid)

                n_docs_traversed += 1
                if n_docs_traversed % 1000 == 1:
                if n_docs_traversed % 25000 == 1:
                    self.do_commit() #just commits the oodb by calling

The DB from the choked process is perfectly good up to the last commit before
it choked.  I've even tried extremely small values of cache_size_bytes and
cache_size, just to see if I can get it to stop allocating memory, but
nothing seems to work.  I've also used string values ('128mb') for
cache-size-bytes, etc.

Can somebody help me out?


Ryan Noon
Stanford Computer Science
BS '09, MS '10
For more information about ZODB, see the ZODB Wiki:

ZODB-Dev mailing list  -  ZODB-Dev@zope.org