I think that moving to an LLTreeSet for the docset will significantly reduce your memory usage. Non-persistent objects are stored as part of their parent persistent object's record. Each LOBTree bucket contains up to 60 (key, value) pairs. When the values are non-persistent objects, they are stored as part of the bucket's record, so accessing any key of a bucket in a transaction brings up to 60 docsets into memory. I would not be surprised if your program forces most of your data into memory in each batch, as most words are in most documents.
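To make the point about bucket records concrete, here is a rough sketch using only the standard library. It is not ZODB code: a plain dict pickled in one call stands in for a single bucket record, on the assumption that non-persistent values are serialized inline with their parent. The names BUCKET_SIZE and DOCSET_LEN are illustrative, with the docset length chosen to be close to the largest docset reported in the thread.

```python
import pickle
from array import array

# Assumption: one pickled dict models one bucket record, because
# non-persistent values are written as part of their parent's record.
BUCKET_SIZE = 60     # up to 60 (key, value) pairs per bucket
DOCSET_LEN = 10000   # roughly the largest docset mentioned in the thread

# Each value is a plain (non-persistent) array of docids.
bucket = {wordid: array('L', range(DOCSET_LEN)) for wordid in range(BUCKET_SIZE)}

record = pickle.dumps(bucket, protocol=pickle.HIGHEST_PROTOCOL)
one_docset = pickle.dumps(bucket[0], protocol=pickle.HIGHEST_PROTOCOL)

# itemsize of 'L' is platform dependent, so compute it instead of assuming 8.
raw_bytes_per_docset = DOCSET_LEN * bucket[0].itemsize

print("one docset pickled:  %d bytes" % len(one_docset))
print("whole bucket record: %d bytes" % len(record))

# Reading a single key still means loading the whole record,
# which materializes every docset stored in it.
loaded = pickle.loads(record)
wanted = loaded[7]
```

If each docset were instead its own persistent object (an LLTreeSet, for example), the bucket record would hold only references, and a docset would be loaded only when it is actually touched.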
At the very least you should move to an LLSet (essentially a single BTree bucket). An LLTreeSet has the additional advantage of being scalable to many values, and under load from multiple clients you are far less likely to see conflicts.

Laurence

On 11 May 2010 01:20, Ryan Noon <rmn...@gmail.com> wrote:
> P.S. About the data structures:
> wordset is a freshly unpickled python set from my old sqlite oodb thingy.
> The new docsets I'm keeping are 'L' arrays from the stdlib array module.
> I'm up for using ZODB's builtin persistent data structures if it makes a
> lot of sense to do so, but it sorta breaks my abstraction a bit and I feel
> like the memory issues I'm having are somewhat independent of the container
> data structures (as I'm having the same issue just with fixed size strings).
> Thanks!
> -Ryan
>
> On Mon, May 10, 2010 at 5:16 PM, Ryan Noon <rmn...@gmail.com> wrote:
>>
>> Hi all,
>> I've incorporated everybody's advice, but I still can't get memory to obey
>> cache-size-bytes. I'm using the new 3.10 from pypi (but the same behavior
>> happens on the server where I was using 3.10 from the new lucid apt repos).
>> I'm going through a mapping where we take one long integer "docid" and map
>> it to a collection of long integers ("wordset") and trying to invert it into
>> a mapping for each "wordid" in those wordsets to a set of the original
>> docids ("docset").
>> I've even tried calling cacheMinimize after every single docset append,
>> but reported memory to the OS never goes down and the process continues to
>> allocate like crazy.
>> I'm wrapping ZODB in a "ZMap" class that just forwards all the dictionary
>> methods to the ZODB root and allows easy interchangeability with my old
>> sqlite OODB abstraction.
>> Here's the latest version of my code, (minorly instrumented...see below):
>> try:
>>     max_docset_size = 0
>>     for docid, wordset in docid_to_wordset.iteritems():
>>         for wordid in wordset:
>>             if wordid_to_docset.has_key(wordid):
>>                 docset = wordid_to_docset[wordid]
>>             else:
>>                 docset = array('L')
>>             docset.append(docid)
>>             if len(docset) > max_docset_size:
>>                 max_docset_size = len(docset)
>>                 print 'Max docset is now %d (owned by wordid %d)' % (max_docset_size, wordid)
>>             wordid_to_docset[wordid] = docset
>>             wordid_to_docset.garbage_collect()
>>             wordid_to_docset.connection.cacheMinimize()
>>
>>         n_docs_traversed += 1
>>
>>         if n_docs_traversed % 100 == 1:
>>             status_tick()
>>         if n_docs_traversed % 50000 == 1:
>>             self.do_commit()
>>
>>     self.do_commit()
>> except KeyboardInterrupt, ex:
>>     self.log_write('Caught keyboard interrupt, committing...')
>>     self.do_commit()
>> I'm keeping track of the greatest docset (which would be the largest
>> possible thing not able to be paged out) and it's only 10,152 longs (at 8
>> bytes each according to the array module's documentation) at the point 75
>> seconds into the operation when the process has allocated 224 MB (on a
>> cache_size_bytes of 64*1024*1024).
>>
>> On a lark I just made an empty ZMap in the interpreter and filled it with
>> 1M unique strings. It took up something like 190mb. I committed it and mem
>> usage went up to 420mb. I then ran cacheMinimize (memory stayed at 420mb).
>> Then I inserted another 1M entries (strings keyed on ints) and mem usage
>> went up to 820mb. Then I committed and memory usage dropped to ~400mb and
>> went back up to 833mb. Then I ran cacheMinimize again and memory usage
>> stayed there. Does this example (totally decoupled from any other
>> operations by me) make sense to experienced ZODB people? I have really no
>> functional mental model of ZODB's memory usage patterns. I love using it,
>> but I really want to find some way to get its allocations under control.
>> I'm currently running this on a Macbook Pro, but it seems to be behaving
>> the same way on Windows and Linux.
>> I really appreciate all of the help so far, and if there're any other
>> pieces of my code that might help please let me know.
>> Cheers,
>> Ryan
>> On Mon, May 10, 2010 at 3:18 PM, Jim Fulton <j...@zope.com> wrote:
>>>
>>> On Mon, May 10, 2010 at 5:39 PM, Ryan Noon <rmn...@gmail.com> wrote:
>>> > First off, thanks everybody. I'm implementing and testing the
>>> > suggestions now. When I said ZODB was more complicated than my
>>> > solution I meant that the system was abstracting a lot more from me
>>> > than my old code (because I wrote it and knew exactly how to make the
>>> > cache enforce its limits!).
>>> >
>>> >> > The first thing to understand is that options like cache-size and
>>> >> > cache-size-bytes are suggestions, not limits. :) In particular,
>>> >> > they are only enforced:
>>> >> >
>>> >> > - at transaction boundaries,
>>> >
>>> > If it's already being called at transaction boundaries how come memory
>>> > usage doesn't go back down to the quota after the commit (which is
>>> > only every 25k documents?).
>>>
>>> Because Python generally doesn't return memory back to the OS. :)
>>>
>>> It's also possible you have a problem with one of your data
>>> structures. For example if you have an array that grows effectively
>>> without bound, the array will have to be in memory, no matter how big
>>> it is. Also, if the persistent object holding the array isn't seen as
>>> changed, because you're appending to the array, then the size of the
>>> array won't be reflected in the cache size. (The size of objects in
>>> the cache is estimated from their pickle sizes.)
>>>
>>> I assume you're using ZODB 3.9.5 or later. If not, there's a bug in
>>> handling new objects that prevents cache suggestions from working
>>> properly.
>>>
>>> If you don't need list semantics, and set semantics will do, you might
>>> consider using a BTrees.LLBTree.LLTreeSet, which provides compact
>>> scalable persistent sets. (If your word ids can be signed, you could
>>> use the IIBTree variety, which is more compact.) Given the variable
>>> name is wordset, then I assume you're dealing with sets. :)
>>>
>>> What is wordid_to_docset? You don't show its creation.
>>>
>>> Jim
>>>
>>> --
>>> Jim Fulton
>>
>>
>>
>> --
>> Ryan Noon
>> Stanford Computer Science
>> BS '09, MS '10
>
>
>
> --
> Ryan Noon
> Stanford Computer Science
> BS '09, MS '10
>
> _______________________________________________
> For more information about ZODB, see the ZODB Wiki:
> http://www.zope.org/Wikis/ZODB/
>
> ZODB-Dev mailing list - zodb-...@zope.org
> https://mail.zope.org/mailman/listinfo/zodb-dev