Re: [ZODB-Dev] ZODB Ever-Increasing Memory Usage (even with cache-size-bytes)
On Tue, May 18, 2010 at 1:14 PM, Ryan Noon rmn...@gmail.com wrote:

Hi All, I converted my code to use LOBTrees holding LLTreeSets and it sticks to the memory bounds and performs admirably throughout the whole process. Unfortunately, opening the database afterwards seems to be really, really slow. Here's what I'm doing:

    from ZODB.FileStorage import FileStorage
    from ZODB.DB import DB

    storage = FileStorage('attempt3_wordid_to_docset', pack_keep_old=False)

I think the file in question is about 7 GB in size. It's using 100 percent of a core and I've never seen it get past the FileStorage object creation. Is there something I'm doing wrong when I initially fill this storage that makes it so hard to index, or is there something wrong with the way I'm creating the new FileStorage?

Is there an 'index' file being created? It would be in the same directory as the database file. How are you closing the application? If you see the index file changing when you start up, the storage is probably rebuilding its index.

-alan
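For reference, a sketch of the clean-shutdown pattern that produces the index file, reusing the storage name from the message above (the actual work is elided). FileStorage keeps an in-memory index of object positions; a clean close saves it next to the data file, so the next open can load it instead of scanning the whole 7 GB file:

    from ZODB.FileStorage import FileStorage
    from ZODB.DB import DB

    storage = FileStorage('attempt3_wordid_to_docset', pack_keep_old=False)
    db = DB(storage)
    try:
        pass  # ... open connections, do work, commit transactions ...
    finally:
        # a clean close writes 'attempt3_wordid_to_docset.index', which
        # lets the next FileStorage() call skip the full-file scan
        db.close()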
Re: [ZODB-Dev] ZODB Ever-Increasing Memory Usage (even with cache-size-bytes)
On Tue, May 11, 2010 at 7:37 PM, Ryan Noon rmn...@gmail.com wrote: ... (a pointer to relevant documentation would be really useful)

A major deficiency of ZODB is that there is effectively no standard documentation. I'm working on fixing this.

Jim
--
Jim Fulton
Re: [ZODB-Dev] ZODB Ever-Increasing Memory Usage (even with cache-size-bytes)
On Tue, May 11, 2010 at 7:37 PM, Ryan Noon rmn...@gmail.com wrote:

Hi Jim, I'm really sorry for the miscommunication, I thought I made that clear in my last email: I'm wrapping ZODB in a 'ZMap' class that just forwards all the dictionary methods to the ZODB root and allows easy interchangeability with my old sqlite OODB abstraction.

Perhaps I should have picked up on this, but it wasn't clear that you were referring to wordid_to_docset. I couldn't see that in the code and I didn't get an answer to my question.

wordid_to_docset is a ZMap, which just wraps the ZODB boilerplate/connection and forwards dictionary methods to the root.

This is the last piece of the puzzle. The root object is a persistent mapping object that is a single database object and is thus not a scalable data structure. As Lawrence pointed out, this, together with the fact that you're using non-persistent arrays as mapping values, means that all your data is in a single object.

but I'm still sorta worried because in my experimentation with ZODB so far I've never been able to observe it sticking to any cache limits, no matter how often I tell it to garbage collect (even when storing very small values that should give it adequate granularity...see my experiment at the end of my last email).

The unit of granularity is the persistent object. It is persistent objects that are managed by the cache, not individual Python objects like strings. If your entire database is in a single persistent object, then your entire database will be in memory.

If you want a scalable mapping and your keys are stably ordered (as strings and numbers are), then you should use a BTree. BTrees spread their data over multiple database records, so you can have massive mappings without holding massive amounts of data in memory. If you want a set and the items are stably ordered, then use a TreeSet (or a Set if the set is known to be small). There are built-in BTrees and sets that support compact storage of signed 32-bit or 64-bit ints.

Jim
--
Jim Fulton
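As a sketch of what Jim describes, using the 64-bit integer variants (the mapping name and the sample ids below are illustrative, not from his message):

    from BTrees.LOBTree import LOBTree    # 64-bit int keys -> object values
    from BTrees.LLBTree import LLTreeSet  # scalable sets of 64-bit ints

    wordid_to_docset = LOBTree()
    docset = wordid_to_docset.get(7)
    if docset is None:
        docset = wordid_to_docset[7] = LLTreeSet()
    # TreeSets use insert(); each bucket is its own persistent record,
    # so the cache can load and evict pieces of the set independently
    docset.insert(12345)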
Re: [ZODB-Dev] ZODB Ever-Increasing Memory Usage (even with cache-size-bytes)
Hi Jim, I'm really sorry for the miscommunication, I thought I made that clear in my last email: I'm wrapping ZODB in a 'ZMap' class that just forwards all the dictionary methods to the ZODB root and allows easy interchangeability with my old sqlite OODB abstraction. wordid_to_docset is a ZMap, which just wraps the ZODB boilerplate/connection and forwards dictionary methods to the root. If this seems superfluous, it was just to maintain backwards compatibility with all of the code I'd already written for the sqlite OODB I was using before I switched to ZODB. Whenever you see something like wordid_to_docset[id] it's just doing self.root[id] behind the scenes in a __setitem__ call inside the ZMap class, which I've pasted below. The db is just storing longs mapped to array('L')'s with a few thousand longs in 'em.

I'm going to try switching to the persistent data structure that Laurence suggested (a pointer to relevant documentation would be really useful), but I'm still sorta worried because in my experimentation with ZODB so far I've never been able to observe it sticking to any cache limits, no matter how often I tell it to garbage collect (even when storing very small values that should give it adequate granularity...see my experiment at the end of my last email). If the memory reported to the OS by Python 2.6 is the problem I'd understand, but memory usage goes up the second I start adding new things (which indicates that Python is asking for more and not actually freeing internally, no?).

If you feel there's something pathological about my memory access patterns in this operation I can just do the actual inversion step in Hadoop and load the output into ZODB for my application later, I was just hoping to keep all of my data in OODB's the entire time. Thanks again all of you for your collective time. I really like ZODB so far, and it bugs me that I'm likely screwing it up somewhere.

Cheers, Ryan

    class ZMap(object):

        def __init__(self, name=None, dbfile=None, cache_size_mb=512, autocommit=True):
            self.name = name
            self.dbfile = dbfile
            self.autocommit = autocommit
            self.__hash__ = None  # can't hash this

            # first things first, figure out if we need to make up a name
            if self.name == None:
                self.name = make_up_name()
            if sep in self.name:
                if self.name[-1] == sep:
                    self.name = self.name[:-1]
                self.name = self.name.split(sep)[-1]
            if self.dbfile == None:
                self.dbfile = self.name + '.zdb'

            self.storage = FileStorage(self.dbfile, pack_keep_old=False)
            self.cache_size = cache_size_mb * 1024 * 1024
            self.db = DB(self.storage, pool_size=1,
                         cache_size_bytes=self.cache_size,
                         historical_cache_size_bytes=self.cache_size,
                         database_name=self.name)
            self.connection = self.db.open()
            self.root = self.connection.root()
            print 'Initializing ZMap %s in file %s with %dmb cache. Current %d items' % (
                self.name, self.dbfile, cache_size_mb, len(self.root))

        # basic operators
        def __eq__(self, y):  # x == y
            return self.root.__eq__(y)

        def __ge__(self, y):  # x >= y
            return len(self) >= len(y)

        def __gt__(self, y):  # x > y
            return len(self) > len(y)

        def __le__(self, y):  # x <= y
            return not self.__gt__(y)

        def __lt__(self, y):  # x < y
            return not self.__ge__(y)

        def __len__(self):  # len(x)
            return len(self.root)

        # dictionary stuff
        def __getitem__(self, key):  # x[key]
            return self.root[key]

        def __setitem__(self, key, value):  # x[key] = value
            self.root[key] = value
            self.__commit_check()  # write back if necessary

        def __delitem__(self, key):  # del x[key]
            del self.root[key]

        def get(self, key, default=None):  # x[key] if key in x, else default
            return self.root.get(key, default)

        def has_key(self, key):  # True if x has key, else False
            return self.root.has_key(key)

        def items(self):  # list of key/val pairs
            return self.root.items()

        def keys(self):
            return self.root.keys()

        def pop(self, key, default=None):
            return self.root.pop(key, default)

        def popitem(self):  # remove and return an arbitrary key/val pair
            return self.root.popitem()

        def setdefault(self, key, default=None):
            # D.setdefault(k[,d]) -> D.get(k,d), also set D[k]=d if k not in D
            return self.root.setdefault(key, default)

        def values(self):
            return self.root.values()

        def copy(self):  # copy it? dubiously necessary at the moment
            NOT_IMPLEMENTED('copy')

        # iteration
        def __iter__(self):  # iter(x)
            return self.root.iterkeys()

        def iteritems(self):  # iterator over items, this can be hellaoptimized
            return self.root.iteritems()

        def itervalues(self):
            return self.root.itervalues()

        def iterkeys(self):
            return self.root.iterkeys()

        # practical
Re: [ZODB-Dev] ZODB Ever-Increasing Memory Usage (even with cache-size-bytes)
I think this means that you are storing all of your data in a single persistent object, the database root PersistentMapping. You need to break up your data into persistent objects (instances of classes that inherit from persistent.Persistent) for the ZODB to have a chance of managing memory. You want to do something like:

    import transaction
    from ZODB import FileStorage, DB
    from BTrees.LOBTree import BTree, TreeSet

    storage = FileStorage.FileStorage('/tmp/test-filestorage.fs')
    db = DB(storage)
    conn = db.open()
    root = conn.root()

    transaction.begin()
    index = root['index'] = BTree()
    values = index[1] = TreeSet()
    values.add(42)
    transaction.commit()

You should probably read http://www.zodb.org/documentation/guide/modules.html#btrees-package. Since that was written, L variants of the BTree types have been introduced for storing 64-bit integers. I'm using an LOBTree because it maps 64-bit integers to Python objects. For values I'm using an LOTreeSet, though you could also use an LLTreeSet (which has larger buckets).

Laurence

On 12 May 2010 00:37, Ryan Noon rmn...@gmail.com wrote: ...
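For completeness, reading the data back out of that storage might look like this sketch (reusing the imports and the hypothetical file name from the snippet above):

    storage = FileStorage.FileStorage('/tmp/test-filestorage.fs')
    db = DB(storage)
    conn = db.open()
    index = conn.root()['index']
    print 1 in index       # key lookup loads only the buckets on the path
    print list(index[1])   # [42] -- only this TreeSet's records are loaded
    db.close()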
Re: [ZODB-Dev] ZODB Ever-Increasing Memory Usage (even with cache-size-bytes)
Thanks Laurence, this looks really helpful. The simplicity of ZODB's concept and the joy of using it apparently hide some of the complexity necessary to use it efficiently. I'll check this out when I circle back to data stuff tomorrow. Have a great morning/day/evening!

-Ryan

On Tue, May 11, 2010 at 5:44 PM, Laurence Rowe l...@lrowe.co.uk wrote: ...
Re: [ZODB-Dev] ZODB Ever-Increasing Memory Usage (even with cache-size-bytes)
Thanks for your quick reply! So, the best place to call those would be during my commit break (whenever I decide to take it? [which would be less often if I could be sure of no crashing]). Are there any other problems with the way I was using ZODB in my code? I really like it, but I recognize that it's a lot more complicated than my old system.

Cheers, Ryan

On Mon, May 10, 2010 at 12:48 PM, Alan Runyan runy...@gmail.com wrote:

The DB on the choked process is perfectly good up to the last commit when it choked, and I've even tried extremely small values of cache_size_bytes and cache_size, just to see if I can get it to stop allocating memory, and nothing seems to work. I've also used string values ('128mb') for cache-size-bytes, etc.

On the connection object there are two methods you want to use:

- cacheMinimize: the heavier hand; it attempts to deactivate *all* non-modified objects in the cache.
- cacheGC: cleans up the internal cache according to the cache-size-bytes parameter.

If you are not calling these in your code (I do not believe they are called in transaction.commit), then they are probably not being called at all.

cheers
alan

--
Ryan Noon
Stanford Computer Science
BS '09, MS '10
Re: [ZODB-Dev] ZODB Ever-Increasing Memory Usage (even with cache-size-bytes)
So, the best place to call those would be during my commit break (whenever I decide to take it? [which would be less often if I could be sure of no crashing]). Are there any other problems with the way I was using ZODB in my code? I really like it, but I recognize that it's a lot more complicated than my old system.

Correct. Pick an appropriate place where you are finished with a batch of objects and they are possibly no longer referenced; then you can call cacheMinimize. You can also call cacheGC to reduce memory usage.

Your code looks straightforward. I do not see it doing anything strange.

It's more complicated than your old system? Ugh. That sucks. ZODB should be LESS complicated than sqlite + custom-rolled caching. If not, then you may be doing something wrong, or ZODB is not living up to its promise *wink*

cheers
alan
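That batching pattern might look like the following sketch (process(), the batch size, and the open connection are placeholders, not anything prescribed in the thread):

    import transaction

    BATCH = 25000
    for n, (docid, wordset) in enumerate(docid_to_wordset.iteritems()):
        process(docid, wordset)      # hypothetical per-document work
        if n % BATCH == 0:
            transaction.commit()     # cache suggestions are enforced here
            connection.cacheGC()     # trim back toward cache-size-bytes
            # or, once the batch's objects are no longer referenced:
            # connection.cacheMinimize()  # evict *all* unmodified objects
    transaction.commit()             # don't forget the final commit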
Re: [ZODB-Dev] ZODB Ever-Increasing Memory Usage (even with cache-size-bytes)
On Mon, May 10, 2010 at 3:27 PM, Ryan Noon rmn...@gmail.com wrote:

Hi everyone, I recently switched some of my home-rolled sqlite-backed object databases over to ZODB based on what I'd read and some cool performance numbers I'd seen. I'm really happy with the entire system so far except for one really irritating problem: memory usage. I'm doing a rather intensive operation where I'm inverting a mapping of the form (docid -> [wordid]) for about 3 million documents (for about 8 million unique words). I thought about doing it on Hadoop, but it's a one-time thing and it'd be nice if I didn't have to load the data back into an object database for my application at the end anyway. Anyhoo, in the process of this operation (which performs much faster than my sqlite+python cache solution) memory usage never really drops. I'm currently doing a commit every 25k documents. The python process just gobbles up RAM, though. I made it through 750k documents before my 8GB Ubuntu 10.04 server choked and killed the process (at about 80 percent mem usage). (The same thing happens on Windows and OSX, btw.) I figure either there's a really tremendous bug in ZODB (unlikely given its age and venerability) or I'm really doing it wrong. Here's my code:

    self.storage = FileStorage(self.dbfile, pack_keep_old=False)
    cache_size = 512 * 1024 * 1024
    self.db = DB(self.storage, pool_size=1,
                 cache_size_bytes=cache_size,
                 historical_cache_size_bytes=cache_size,
                 database_name=self.name)
    self.connection = self.db.open()
    self.root = self.connection.root()

and the actual insertions...

    # i can be kinda pathological with loop operations
    set_default = wordid_to_docset.root.setdefault
    array_append = array.append

    for docid, wordset in docid_to_wordset.iteritems():
        # docid_to_wordset is one of my older sqlite oodb's, not maintaining
        # a cache... just iterating (small constant mem usage)
        for wordid in wordset:
            docset = set_default(wordid, array('L'))
            array_append(docset, docid)
        n_docs_traversed += 1
        if n_docs_traversed % 1000 == 1:
            status_tick()
        if n_docs_traversed % 25000 == 1:
            self.do_commit()  # just commits the oodb by calling transaction.commit()

The DB on the choked process is perfectly good up to the last commit when it choked, and I've even tried extremely small values of cache_size_bytes and cache_size, just to see if I can get it to stop allocating memory, and nothing seems to work. I've also used string values ('128mb') for cache-size-bytes, etc. Can somebody help me out?

The first thing to understand is that options like cache-size and cache-size-bytes are suggestions, not limits. :) In particular, they are only enforced:

- at transaction boundaries,
- when an application creates a savepoint,
- or when an application invokes garbage collection explicitly via the cacheGC or cacheMinimize methods.

Note that objects that have been modified but not committed won't be freed even if the suggestions are exceeded. The reason that ZODB never frees objects on its own is that doing so could lead to surprising changes to object state and subtle bugs. Consider:

    def append(self, item):
        self._data.append(item)  # self._data is just a Python list
        # At this point, ZODB doesn't know that self has changed.
        # If ZODB was willing to free an object whenever it wanted to,
        # self could be freed here, losing the change to self._data.
        self._length += 1
        # Now self is marked as changed, but too late if self was
        # freed above.

Also note that memory allocated by Python is generally not returned to the OS when freed.

Calling cacheGC at transaction boundaries won't buy you anything. It's already called then. :) In your script, I'd recommend calling cacheGC after processing each document:

    root._p_jar.cacheGC()

This will keep the cache full, which will hopefully help performance without letting it grow far out of bounds.

Jim
--
Jim Fulton
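Concretely, Jim's suggestion amounts to the following in the loop from the quoted code (a sketch, with set_default and array_append as defined above, not verbatim from his post):

    for docid, wordset in docid_to_wordset.iteritems():
        for wordid in wordset:
            docset = set_default(wordid, array('L'))
            array_append(docset, docid)
        # after each document: trim toward the cache suggestion without
        # emptying the cache (root is the connection's root object)
        root._p_jar.cacheGC()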
Re: [ZODB-Dev] ZODB Ever-Increasing Memory Usage (even with cache-size-bytes)
On Mon, May 10, 2010 at 4:58 PM, Jim Fulton j...@zope.com wrote: ... The first thing to understand is that options like cache-size and cache-size-bytes are suggestions, not limits. ...

I meant to add that it might be interesting to error or warn when a cache gets much larger than a suggestion.

Jim
--
Jim Fulton
Re: [ZODB-Dev] ZODB Ever-Increasing Memory Usage (even with cache-size-bytes)
Jim Fulton wrote:
On Mon, May 10, 2010 at 3:27 PM, Ryan Noon rmn...@gmail.com wrote:
[snip]

    docset = set_default(wordid, array('L'))

Note that you are creating the array willy-nilly in the inner loop here (the default argument is evaluated whether or not the key is already present). I would nearly always write that as::

    docset = wordid_to_docset.root.get(wordid)
    if docset is None:
        docset = array('L')
        wordid_to_docset.root[wordid] = docset

    array_append(docset, docid)

Why are you using an unbound method here? The following would be clearer, and almost certainly not noticeably slower:

    docset.append(docid)

    n_docs_traversed += 1
    if n_docs_traversed % 1000 == 1:
        status_tick()
    if n_docs_traversed % 25000 == 1:
        self.do_commit()  # just commits the oodb by calling transaction.commit()

Don't forget the final commit. ;)

Also, I don't know what the 'array' type is here, but if it doesn't manage its own persistence, then you have a bug: mutating a non-persistent sub-object doesn't automatically cause the persistent container to register as dirty with the transaction, which means you may lose changes after the object is evicted from the RAM cache, or at shutdown.

[snip]

Also note that memory allocated by Python is generally not returned to the OS when freed.

Python's own internal heap management has gotten noticeably better about returning reclaimed chunks to the OS in 2.6.

Tres.
--
===
Tres Seaver          +1 540-429-0999         tsea...@palladion.com
Palladion Software   "Excellence by Design"  http://palladion.com
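To make the registration issue concrete, a sketch of the failure mode and the two standard fixes (using the names from the thread, with wordid and docid assumed bound):

    root = wordid_to_docset.root        # a PersistentMapping
    docset = root[wordid]               # an array('L'), not Persistent
    docset.append(docid)                # root does NOT notice this mutation

    root._p_changed = True              # fix 1: mark the container dirty by hand
    root[wordid] = docset               # fix 2: reassign, so __setitem__
                                        # registers the change with the transaction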
Re: [ZODB-Dev] ZODB Ever-Increasing Memory Usage (even with cache-size-bytes)
On Mon, May 10, 2010 at 5:20 PM, Tres Seaver tsea...@palladion.com wrote: ... Python's own internal heap management has gotten noticeably better about returning reclaimed chunks to the OS in 2.6.

Yeah, I've heard that. I tried to verify this, but have since repressed the result. ;)

Jim
--
Jim Fulton
Re: [ZODB-Dev] ZODB Ever-Increasing Memory Usage (even with cache-size-bytes)
First off, thanks everybody. I'm implementing and testing the suggestions now. When I said ZODB was more complicated than my solution, I meant that the system was abstracting a lot more from me than my old code (because I wrote it and knew exactly how to make the cache enforce its limits!).

The first thing to understand is that options like cache-size and cache-size-bytes are suggestions, not limits. :) In particular, they are only enforced: - at transaction boundaries, ...

If it's already being called at transaction boundaries, how come memory usage doesn't go back down to the quota after the commit (which is only every 25k documents)? With regards to returning memory to the OS, I don't really care if it reports less, but it really seems like it's overallocating if the OS kills it on an 8GB machine with a 512mb quota.

Tres: On your first point: yeah, I wrote that late last night, and I just realized that the default is getting evaluated on every setdefault call. I was trying to be cute with Python dict methods that I hadn't used before. Stupid me.

On your second point: I've read the loop optimization wiki page over at python.org too many times, and I get itchy whenever there's method lookup inside of a loop. I need to remember I'm dealing with a database here and IO is gonna be the bottleneck anyway.

On your third point: I actually ran into the same change-notification problem when I was rolling my own OODB, and I assumed ZODB had done something tricky, because my changes were showing up upon reopening the db even when I'd done the append and not told ZODB about the change. I'll fix it to make that more explicit... I think the magical effects I'd seen were related to my problem with too much damn caching.

The array type is from the stdlib array module; I'm just appending my IDs to it as longs. I figured it'd be more compact and would serialize faster. Btw, the final commit is outside the loop (not shown). =)

Cheers, Ryan
--
Ryan Noon
Stanford Computer Science
BS '09, MS '10
Re: [ZODB-Dev] ZODB Ever-Increasing Memory Usage (even with cache-size-bytes)
On Mon, May 10, 2010 at 5:39 PM, Ryan Noon rmn...@gmail.com wrote: ... If it's already being called at transaction boundaries, how come memory usage doesn't go back down to the quota after the commit (which is only every 25k documents)?

Because Python generally doesn't return memory back to the OS. :)

It's also possible you have a problem with one of your data structures. For example, if you have an array that grows effectively without bound, the array will have to be in memory, no matter how big it is. Also, if the persistent object holding the array isn't seen as changed, because you're appending to the array, then the size of the array won't be reflected in the cache size. (The size of objects in the cache is estimated from their pickle sizes.)

I assume you're using ZODB 3.9.5 or later. If not, there's a bug in handling new objects that prevents cache suggestions from working properly.

If you don't need list semantics, and set semantics will do, you might consider using a BTrees.LLBTree.TreeSet, which provides compact, scalable persistent sets. (If your word ids fit in signed 32-bit ints, you could use the IIBTree variety, which is more compact.) Given that the variable name is wordset, I assume you're dealing with sets. :)

What is wordid_to_docset? You don't show its creation.

Jim
--
Jim Fulton
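For illustration, a sketch of the two set varieties Jim names (whether 32-bit ids suffice is an assumption about the data, not something established in the thread):

    from BTrees.LLBTree import LLTreeSet  # signed 64-bit values
    from BTrees.IIBTree import IITreeSet  # signed 32-bit values, more compact

    docset = LLTreeSet()
    docset.insert(2 ** 40)   # fine for 64-bit doc ids

    small = IITreeSet()
    small.insert(42)         # values must fit in a signed 32-bit int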
Re: [ZODB-Dev] ZODB Ever-Increasing Memory Usage (even with cache-size-bytes)
Hi all, I've incorporated everybody's advice, but I still can't get memory to obey cache-size-bytes. I'm using the new 3.10 from pypi (but the same behavior happens on the server where I was using 3.10 from the new lucid apt repos). I'm going through a mapping where we take one long integer docid and map it to a collection of long integers (wordset) and trying to invert it into a mapping from each wordid in those wordsets to a set of the original docids (docset). I've even tried calling cacheMinimize after every single docset append, but memory reported to the OS never goes down and the process continues to allocate like crazy. I'm wrapping ZODB in a ZMap class that just forwards all the dictionary methods to the ZODB root and allows easy interchangeability with my old sqlite OODB abstraction. Here's the latest version of my code (minorly instrumented... see below):

    try:
        max_docset_size = 0
        for docid, wordset in docid_to_wordset.iteritems():
            for wordid in wordset:
                if wordid_to_docset.has_key(wordid):
                    docset = wordid_to_docset[wordid]
                else:
                    docset = array('L')
                docset.append(docid)
                if len(docset) > max_docset_size:
                    max_docset_size = len(docset)
                    print 'Max docset is now %d (owned by wordid %d)' % (
                        max_docset_size, wordid)
                wordid_to_docset[wordid] = docset
                wordid_to_docset.garbage_collect()
                wordid_to_docset.connection.cacheMinimize()
            n_docs_traversed += 1
            if n_docs_traversed % 100 == 1:
                status_tick()
            if n_docs_traversed % 5 == 1:
                self.do_commit()
        self.do_commit()
    except KeyboardInterrupt, ex:
        self.log_write('Caught keyboard interrupt, committing...')
        self.do_commit()

I'm keeping track of the greatest docset (which would be the largest possible thing not able to be paged out) and it's only 10,152 longs (at 8 bytes each, according to the array module's documentation) at the point 75 seconds into the operation when the process has allocated 224 MB (on a cache_size_bytes of 64*1024*1024).

On a lark I just made an empty ZMap in the interpreter and filled it with 1M unique strings. It took up something like 190mb. I committed it and mem usage went up to 420mb. I then ran cacheMinimize (memory stayed at 420mb). Then I inserted another 1M entries (strings keyed on ints) and mem usage went up to 820mb. Then I committed and memory usage dropped to ~400mb and went back up to 833mb. Then I ran cacheMinimize again and memory usage stayed there.

Does this example (totally decoupled from any other operations by me) make sense to experienced ZODB people? I have really no functional mental model of ZODB's memory usage patterns. I love using it, but I really want to find some way to get its allocations under control. I'm currently running this on a Macbook Pro, but it seems to be behaving the same way on Windows and Linux. I really appreciate all of the help so far, and if there're any other pieces of my code that might help please let me know.

Cheers, Ryan

On Mon, May 10, 2010 at 3:18 PM, Jim Fulton j...@zope.com wrote: ...
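One hypothetical way to separate the two effects (assuming `connection` is the open ZODB connection) is to watch the object count in the ZODB cache rather than the process RSS:

    # count non-ghost objects across the DB's connection caches
    print 'objects in cache:', connection.db().cacheSize()
    connection.cacheMinimize()
    print 'after minimize:  ', connection.db().cacheSize()
    # if the count drops but OS-reported memory does not, the growth is
    # CPython holding freed heap memory, not live ZODB objects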
Re: [ZODB-Dev] ZODB Ever-Increasing Memory Usage (even with cache-size-bytes)
P.S. About the data structures: wordset is a freshly unpickled Python set from my old sqlite oodb thingy. The new docsets I'm keeping are 'L' arrays from the stdlib array module. I'm up for using ZODB's builtin persistent data structures if it makes a lot of sense to do so, but it sorta breaks my abstraction a bit, and I feel like the memory issues I'm having are somewhat independent of the container data structures (as I'm having the same issue just with fixed-size strings).

Thanks!
-Ryan

On Mon, May 10, 2010 at 5:16 PM, Ryan Noon rmn...@gmail.com wrote: ...
Re: [ZODB-Dev] ZODB Ever-Increasing Memory Usage (even with cache-size-bytes)
I think that moving to an LLTreeSet for the docset will significantly reduce your memory usage. Non-persistent objects are stored as part of their parent persistent object's record. Each LOBTree bucket contains up to 60 (key, value) pairs. When the values are non-persistent objects, they are stored as part of the bucket object's record, so accessing any key of a bucket in a transaction brings up to 60 docsets into memory. I would not be surprised if your program forces most of your data into memory each batch, as most words are in most documents.

At the very least you should move to an LLSet (essentially a single BTree bucket). An LLTreeSet has the additional advantage of being scalable to many values, and if under load from multiple clients you are far less likely to see conflicts.

Laurence

On 11 May 2010 01:20, Ryan Noon rmn...@gmail.com wrote: ...
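Putting Laurence's advice together, the inversion loop might look like the following sketch (names taken from the thread; `root` and the batch size are assumptions for illustration):

    import transaction
    from BTrees.LOBTree import LOBTree
    from BTrees.LLBTree import LLTreeSet

    wordid_to_docset = root['wordid_to_docset'] = LOBTree()
    for n, (docid, wordset) in enumerate(docid_to_wordset.iteritems()):
        for wordid in wordset:
            docset = wordid_to_docset.get(wordid)
            if docset is None:
                docset = wordid_to_docset[wordid] = LLTreeSet()
            docset.insert(docid)  # persistent buckets register their own changes
        if n % 25000 == 0:
            transaction.commit()  # cache suggestions are enforced here, so
                                  # clean buckets can be evicted from memory
    transaction.commit()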