Re: [ZODB-Dev] ZODB Ever-Increasing Memory Usage (even with cache-size-bytes)

2010-05-18 Thread Alan Runyan
On Tue, May 18, 2010 at 1:14 PM, Ryan Noon rmn...@gmail.com wrote:
 Hi All,
 I converted my code to use LOBTrees holding LLTreeSets and it sticks to the
 memory bounds and performs admirably throughout the whole process.
  Unfortunately opening the database afterwards seems to be really really
 slow.  Here's what I'm doing:
 from ZODB.FileStorage import FileStorage
 from ZODB.DB import DB
 storage = FileStorage('attempt3_wordid_to_docset',pack_keep_old=False)
 I think the file in question is about 7 GB in size.  It's using 100 percent
 of a core and I've never seen it get past the FileStorage object creation.
  Is there something I'm doing wrong when I initially fill this storage that
 makes it so hard to index, or is there something wrong with the way I'm
 creating the new FileStorage?

Is there an 'index' file being created?  It would be in the
same directory as the database file.

How are you closing the application?

If you see the index file changing when you start up, it is probably
rebuilding the index.
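
One thing to check is whether the DB is being closed cleanly.  A rough
sketch of the clean-shutdown pattern (same FileStorage arguments as in
your snippet) that lets FileStorage write its .index file, so the next
open can load it instead of rescanning the whole 7 GB data file:

from ZODB.FileStorage import FileStorage
from ZODB.DB import DB

storage = FileStorage('attempt3_wordid_to_docset', pack_keep_old=False)
db = DB(storage)
conn = db.open()
try:
    # ... work with conn.root() ...
    pass
finally:
    conn.close()
    db.close()   # saves the FileStorage index alongside the data file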

-alan


Re: [ZODB-Dev] ZODB Ever-Increasing Memory Usage (even with cache-size-bytes)

2010-05-12 Thread Jim Fulton
On Tue, May 11, 2010 at 7:37 PM, Ryan Noon rmn...@gmail.com wrote:
...
 (a pointer to relevant documentation would be really
 useful)

A major deficiency of ZODB is that there is effectively no standard
documentation.

I'm working on fixing this.

Jim

-- 
Jim Fulton


Re: [ZODB-Dev] ZODB Ever-Increasing Memory Usage (even with cache-size-bytes)

2010-05-12 Thread Jim Fulton
On Tue, May 11, 2010 at 7:37 PM, Ryan Noon rmn...@gmail.com wrote:
 Hi Jim,
 I'm really sorry for the miscommunication, I thought I made that clear in my
 last email:
 I'm wrapping ZODB in a 'ZMap' class that just forwards all the dictionary
 methods to the ZODB root and allows easy interchangeability with my old
 sqlite OODB abstraction.

Perhaps I should have picked up on this, but it wasn't clear that you
were referring to wordid_to_docset. I couldn't see that in the code and I
didn't get an answer to my question.

 wordid_to_docset is a ZMap, which just wraps the ZODB
 boilerplate/connection and forwards dictionary methods to the root.

This is the last piece of the puzzle.  The root object is a persistent
mapping object that is a single database object and is thus not a
scalable data structure.  As Laurence pointed out, this, together with
the fact that you're using non-persistent arrays as mapping values,
means that all your data is in a single object.

 but I'm still sorta worried because in my experimentation with ZODB
 so far I've never been able to observe it sticking to any cache limits, no
 matter how often I tell it to garbage collect (even when storing very small
 values that should give it adequate granularity...see my experiment at the
 end of my last email).

The unit of granularity is the persistent object.  It is persistent
objects that are managed by the cache, not individual Python objects
like strings.  If your entire database is in a single persistent
object, then your entire database will be in memory.

If you want a scalable mapping and your keys are stably ordered (as
are strings and numbers), then you should use a BTree.  BTrees spread
their data over multiple data records, so you can have massive
mappings without storing massive amounts of data in memory.

If you want a set and the items are stably ordered, then use a TreeSet
(or a Set if the set is known to be small).

There are built-in BTrees and sets that support compact storage of
signed 32-bit or 64-bit ints.
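
For example, a minimal sketch (the storage path and the ids here are just
placeholders):

import transaction
from ZODB.FileStorage import FileStorage
from ZODB.DB import DB
from BTrees.LOBTree import LOBTree
from BTrees.LLBTree import LLTreeSet

db = DB(FileStorage('/tmp/wordid_to_docset.fs'))
conn = db.open()
root = conn.root()

index = root['wordid_to_docset'] = LOBTree()   # unsigned 64-bit keys -> values
docset = index[42] = LLTreeSet()               # each docset gets its own record(s)
docset.insert(12345)
transaction.commit()
db.close()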

Jim

-- 
Jim Fulton


Re: [ZODB-Dev] ZODB Ever-Increasing Memory Usage (even with cache-size-bytes)

2010-05-11 Thread Ryan Noon
Hi Jim,

I'm really sorry for the miscommunication, I thought I made that clear in my
last email:

I'm wrapping ZODB in a 'ZMap' class that just forwards all the dictionary
methods to the ZODB root and allows easy interchangeability with my old
sqlite OODB abstraction.

wordid_to_docset is a ZMap, which just wraps the ZODB
boilerplate/connection and forwards dictionary methods to the root.  If this
seems superfluous, it was just to maintain backwards compatibility with all
of the code I'd already written for the sqlite OODB I was using before I
switched to ZODB.  Whenever you see something like wordid_to_docset[id] it's
just doing self.root[id] behind the scenes in a __setitem__ call inside the
ZMap class, which I've pasted below.

The db is just storing longs mapped to array('L')'s with a few thousand
longs in em.  I'm going to try switching to the persistent data structure
that Laurence suggested (a pointer to relevant documentation would be really
useful), but I'm still sorta worried because in my experimentation with ZODB
so far I've never been able to observe it sticking to any cache limits, no
matter how often I tell it to garbage collect (even when storing very small
values that should give it adequate granularity...see my experiment at the
end of my last email).  If the memory reported to the OS by Python 2.6 is
the problem I'd understand, but memory usage goes up the second I start
adding new things (which indicates that Python is asking for more and not
actually freeing internally, no?).

If you feel there's something pathological about my memory access patterns
in this operation I can just do the actual inversion step in Hadoop and load
the output into ZODB for my application later, I was just hoping to keep all
of my data in OODB's the entire time.

Thanks again all of you for your collective time.  I really like ZODB so
far, and it bugs me that I'm likely screwing it up somewhere.

Cheers,
Ryan



class ZMap(object):

    def __init__(self, name=None, dbfile=None, cache_size_mb=512,
                 autocommit=True):
        self.name = name
        self.dbfile = dbfile
        self.autocommit = autocommit

        self.__hash__ = None  # can't hash this

        # first things first, figure out if we need to make up a name
        if self.name is None:
            self.name = make_up_name()
        if sep in self.name:
            if self.name[-1] == sep:
                self.name = self.name[:-1]
            self.name = self.name.split(sep)[-1]

        if self.dbfile is None:
            self.dbfile = self.name + '.zdb'

        self.storage = FileStorage(self.dbfile, pack_keep_old=False)
        self.cache_size = cache_size_mb * 1024 * 1024

        self.db = DB(self.storage, pool_size=1,
                     cache_size_bytes=self.cache_size,
                     historical_cache_size_bytes=self.cache_size,
                     database_name=self.name)
        self.connection = self.db.open()
        self.root = self.connection.root()

        print 'Initializing ZMap %s in file %s with %dmb cache. Current %d items' % (
            self.name, self.dbfile, cache_size_mb, len(self.root))

    # basic operators
    def __eq__(self, y):  # x == y
        return self.root.__eq__(y)
    def __ge__(self, y):  # x >= y
        return len(self) >= len(y)
    def __gt__(self, y):  # x > y
        return len(self) > len(y)
    def __le__(self, y):  # x <= y
        return not self.__gt__(y)
    def __lt__(self, y):  # x < y
        return not self.__ge__(y)
    def __len__(self):  # len(x)
        return len(self.root)

    # dictionary stuff
    def __getitem__(self, key):  # x[key]
        return self.root[key]

    def __setitem__(self, key, value):  # x[key] = value
        self.root[key] = value
        self.__commit_check()  # write back if necessary

    def __delitem__(self, key):  # del x[key]
        del self.root[key]

    def get(self, key, default=None):  # x[key] if key in x, else default
        return self.root.get(key, default)

    def has_key(self, key):  # True if x has key, else False
        return self.root.has_key(key)

    def items(self):  # list of key/val pairs
        return self.root.items()

    def keys(self):
        return self.root.keys()

    def pop(self, key, default=None):
        return self.root.pop(key, default)

    def popitem(self):  # remove and return an arbitrary key/val pair
        return self.root.popitem()

    def setdefault(self, key, default=None):
        # D.setdefault(k[,d]) -> D.get(k,d), also set D[k]=d if k not in D
        return self.root.setdefault(key, default)

    def values(self):
        return self.root.values()

    def copy(self):  # copy it? dubiously necessary at the moment
        NOT_IMPLEMENTED('copy')

    # iteration
    def __iter__(self):  # iter(x)
        return self.root.iterkeys()

    def iteritems(self):  # iterator over items, this can be hellaoptimized
        return self.root.iteritems()

    def itervalues(self):
        return self.root.itervalues()

    def iterkeys(self):
        return self.root.iterkeys()


# practical 

Re: [ZODB-Dev] ZODB Ever-Increasing Memory Usage (even with cache-size-bytes)

2010-05-11 Thread Laurence Rowe
I think this means that you are storing all of your data in a single
persistent object, the database root PersistentMapping. You need to
break up your data into persistent objects (instances of objects that
inherit from persistent.Persistent) for the ZODB to have a chance of
performing memory mapping. You want to do something like:

import transaction
from ZODB import FileStorage, DB
from BTrees.LOBTree import BTree, TreeSet
storage = FileStorage.FileStorage('/tmp/test-filestorage.fs')
db = DB(storage)
conn = db.open()
root = conn.root()
transaction.begin()
index = root['index'] = BTree()
values = index[1] = TreeSet()
values.add(42)
transaction.commit()

You should probably read:
http://www.zodb.org/documentation/guide/modules.html#btrees-package.
Since that was written, L variants of the BTree types have been
introduced for storing 64-bit integers. I'm using an LOBTree because
that maps 64-bit integers to Python objects. For values I'm using an
LOTreeSet, though you could also use an LLTreeSet (which has larger
buckets).

Laurence

On 12 May 2010 00:37, Ryan Noon rmn...@gmail.com wrote:
 Hi Jim,
 I'm really sorry for the miscommunication, I thought I made that clear in my
 last email:
 I'm wrapping ZODB in a 'ZMap' class that just forwards all the dictionary
 methods to the ZODB root and allows easy interchangeability with my old
 sqlite OODB abstraction.
 wordid_to_docset is a ZMap, which just wraps the ZODB
 boilerplate/connection and forwards dictionary methods to the root.  If this
 seems superfluous, it was just to maintain backwards compatibility with all
 of the code I'd already written for the sqlite OODB I was using before I
 switched to ZODB.  Whenever you see something like wordid_to_docset[id] it's
 just doing self.root[id] behind the scenes in a __setitem__ call inside the
 ZMap class, which I've pasted below.
 The db is just storing longs mapped to array('L')'s with a few thousand
 longs in em.  I'm going to try switching to the persistent data structure
 that Laurence suggested (a pointer to relevant documentation would be really
 useful), but I'm still sorta worried because in my experimentation with ZODB
 so far I've never been able to observe it sticking to any cache limits, no
 matter how often I tell it to garbage collect (even when storing very small
 values that should give it adequate granularity...see my experiment at the
 end of my last email).  If the memory reported to the OS by Python 2.6 is
 the problem I'd understand, but memory usage goes up the second I start
 adding new things (which indicates that Python is asking for more and not
 actually freeing internally, no?).
 If you feel there's something pathological about my memory access patterns
 in this operation I can just do the actual inversion step in Hadoop and load
 the output into ZODB for my application later, I was just hoping to keep all
 of my data in OODB's the entire time.
 Thanks again all of you for your collective time.  I really like ZODB so
 far, and it bugs me that I'm likely screwing it up somewhere.
 Cheers,
 Ryan


 class ZMap(object):

     def __init__(self, name=None, dbfile=None, cache_size_mb=512,
 autocommit=True):
         self.name = name
         self.dbfile = dbfile
         self.autocommit = autocommit

         self.__hash__ = None #can't hash this

         #first things first, figure out if we need to make up a name
         if self.name == None:
             self.name = make_up_name()
         if sep in self.name:
             if self.name[-1] == sep:
                 self.name = self.name[:-1]
             self.name = self.name.split(sep)[-1]


         if self.dbfile == None:
             self.dbfile = self.name + '.zdb'

         self.storage = FileStorage(self.dbfile, pack_keep_old=False)
         self.cache_size = cache_size_mb * 1024 * 1024

         self.db = DB(self.storage, pool_size=1,
 cache_size_bytes=self.cache_size,
 historical_cache_size_bytes=self.cache_size, database_name=self.name)
         self.connection = self.db.open()
         self.root = self.connection.root()

         print 'Initializing ZMap %s in file %s with %dmb cache. Current
 %d items' % (self.name, self.dbfile, cache_size_mb, len(self.root))

     # basic operators
     def __eq__(self, y): # x == y
         return self.root.__eq__(y)
      def __ge__(self, y): # x >= y
          return len(self) >= len(y)
      def __gt__(self, y): # x > y
          return len(self) > len(y)
      def __le__(self, y): # x <= y
          return not self.__gt__(y)
      def __lt__(self, y): # x < y
          return not self.__ge__(y)
     def __len__(self): # len(x)
         return len(self.root)


     # dictionary stuff
     def __getitem__(self, key): # x[key]
         return self.root[key]
     def __setitem__(self, key, value): # x[key] = value
         self.root[key] = value
         self.__commit_check() # write back if necessary

     def __delitem__(self, key): # del x[key]
         del self.root[key]

     def get(self, key, 

Re: [ZODB-Dev] ZODB Ever-Increasing Memory Usage (even with cache-size-bytes)

2010-05-11 Thread Ryan Noon
Thanks Laurence, this looks really helpful.  The simplicity of ZODB's
concept and the joy of using it apparently hide some of the complexity
necessary to use it efficiently.  I'll check this out when I circle back to
data stuff tomorrow.

Have a great morning/day/evening!
-Ryan

On Tue, May 11, 2010 at 5:44 PM, Laurence Rowe l...@lrowe.co.uk wrote:

 I think this means that you are storing all of your data in a single
 persistent object, the database root PersistentMapping. You need to
 break up your data into persistent objects (instances of objects that
 inherit from persistent.Persistent) for the ZODB to have a chance of
 performing memory mapping. You want to do something like:

 import transaction
 from ZODB import FileStorage, DB
 from BTrees.LOBTree import BTree, TreeSet
 storage = FileStorage.FileStorage('/tmp/test-filestorage.fs')
 db = DB(storage)
 conn = db.open()
 root = conn.root()
 transaction.begin()
 index = root['index'] = BTree()
 values = index[1] = TreeSet()
 values.add(42)
 transaction.commit()

 You should probably read:
 http://www.zodb.org/documentation/guide/modules.html#btrees-package.
 Since that was written an L variants of the BTree types have been
 introduced for storing 64bit integers. I'm using an LOBTree because
 that maps 64bit integers to python objects. For values I'm using an
 LOTreeSet, though you could also use an LLTreeSet (which has larger
 buckets).

 Laurence

 On 12 May 2010 00:37, Ryan Noon rmn...@gmail.com wrote:
  Hi Jim,
  I'm really sorry for the miscommunication, I thought I made that clear in
 my
  last email:
  I'm wrapping ZODB in a 'ZMap' class that just forwards all the
 dictionary
  methods to the ZODB root and allows easy interchangeability with my old
  sqlite OODB abstraction.
  wordid_to_docset is a ZMap, which just wraps the ZODB
  boilerplate/connection and forwards dictionary methods to the root.  If
 this
  seems superfluous, it was just to maintain backwards compatibility with
 all
  of the code I'd already written for the sqlite OODB I was using before I
  switched to ZODB.  Whenever you see something like wordid_to_docset[id]
 it's
  just doing self.root[id] behind the scenes in a __setitem__ call inside
 the
  ZMap class, which I've pasted below.
  The db is just storing longs mapped to array('L')'s with a few thousand
  longs in em.  I'm going to try switching to the persistent data structure
  that Laurence suggested (a pointer to relevant documentation would be
 really
  useful), but I'm still sorta worried because in my experimentation with
 ZODB
  so far I've never been able to observe it sticking to any cache limits,
 no
  matter how often I tell it to garbage collect (even when storing very
 small
  values that should give it adequate granularity...see my experiment at
 the
  end of my last email).  If the memory reported to the OS by Python 2.6 is
  the problem I'd understand, but memory usage goes up the second I start
  adding new things (which indicates that Python is asking for more and not
  actually freeing internally, no?).
  If you feel there's something pathological about my memory access
 patterns
  in this operation I can just do the actual inversion step in Hadoop and
 load
  the output into ZODB for my application later, I was just hoping to keep
 all
  of my data in OODB's the entire time.
  Thanks again all of you for your collective time.  I really like ZODB so
  far, and it bugs me that I'm likely screwing it up somewhere.
  Cheers,
  Ryan
 
 
  class ZMap(object):
 
  def __init__(self, name=None, dbfile=None, cache_size_mb=512,
  autocommit=True):
  self.name = name
  self.dbfile = dbfile
  self.autocommit = autocommit
 
  self.__hash__ = None #can't hash this
 
  #first things first, figure out if we need to make up a name
  if self.name == None:
  self.name = make_up_name()
  if sep in self.name:
  if self.name[-1] == sep:
  self.name = self.name[:-1]
  self.name = self.name.split(sep)[-1]
 
 
  if self.dbfile == None:
  self.dbfile = self.name + '.zdb'
 
  self.storage = FileStorage(self.dbfile, pack_keep_old=False)
  self.cache_size = cache_size_mb * 1024 * 1024
 
  self.db = DB(self.storage, pool_size=1,
  cache_size_bytes=self.cache_size,
  historical_cache_size_bytes=self.cache_size, database_name=self.name)
  self.connection = self.db.open()
  self.root = self.connection.root()
 
  print 'Initializing ZMap %s in file %s with %dmb cache.
 Current
  %d items' % (self.name, self.dbfile, cache_size_mb, len(self.root))
 
  # basic operators
  def __eq__(self, y): # x == y
  return self.root.__eq__(y)
   def __ge__(self, y): # x >= y
   return len(self) >= len(y)
   def __gt__(self, y): # x > y
   return len(self) > len(y)
   def __le__(self, y): # x <= y
   return not self.__gt__(y)
  

Re: [ZODB-Dev] ZODB Ever-Increasing Memory Usage (even with cache-size-bytes)

2010-05-10 Thread Ryan Noon
Thanks for your quick reply!

So, the best place to call those would be during my commit break (whenever I
decide to take it? [which would be less often if I could be sure of no
crashing]).  Are there any other problems with the way I was using ZODB in
my code?  I really like it, but I recognize that it's a lot more complicated
than my old system.

Cheers,
Ryan

On Mon, May 10, 2010 at 12:48 PM, Alan Runyan runy...@gmail.com wrote:

  The DB on the choked process is perfectly good up to the last commit when
 it
  choked, and I've even tried extremely small values of cache_size_bytes
 and
  cache_size, just to see if I can get it to stop allocating memory and
  nothing seems to work.  I've also used string values ('128mb') for
  cache-size-bytes, etc.

 On the connection object there are two methods you want to use:
  - cacheMinimize
  This is more of a heavy hand which attempts to deactivate *all*
  non-modified objects from the cache.

  - cacheGC
  This will clean up the internal cache via the cache-size-bytes parameter.

 If you are not calling these in your code (I do not believe they are
 called in trnx.commit), then they are probably not being called.

 cheers
 alan




-- 
Ryan Noon
Stanford Computer Science
BS '09, MS '10


Re: [ZODB-Dev] ZODB Ever-Increasing Memory Usage (even with cache-size-bytes)

2010-05-10 Thread Alan Runyan
 So, the best place to call those would be during my commit break (whenever I
 decide to take it? [which would be less often if I could be sure of no
 crashing]).  Are there any other problems with the way I was using ZODB in
 my code?  I really like it, but I recognize that it's a lot more complicated
 than my old system.

Correct.  Pick an appropriate place where you are finished with a batch
of objects and they are possibly no longer referenced; then you can call
cacheMinimize.  You can also call cacheGC to reduce memory usage.
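
Something like this, roughly (a sketch; documents, process and conn stand
in for your own loop and open connection):

import transaction

BATCH = 25000
for n, doc in enumerate(documents, 1):
    process(doc)                  # create/modify persistent objects
    if n % BATCH == 0:
        transaction.commit()      # cache GC also runs at this boundary
        conn.cacheMinimize()      # then drop unmodified objects from the cache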

Your code looks straightforward.  I do not see it doing anything strange.

It's more complicated than your old system? Ugh.  That sucks.  ZODB should
be LESS complicated than sqlite plus custom-rolled caching.  If not, then you
may be doing something wrong or ZODB is not living up to its promise *wink*

cheers
alan


Re: [ZODB-Dev] ZODB Ever-Increasing Memory Usage (even with cache-size-bytes)

2010-05-10 Thread Jim Fulton
On Mon, May 10, 2010 at 3:27 PM, Ryan Noon rmn...@gmail.com wrote:
 Hi everyone,
 I recently switched over some of my home-rolled sqlite backed object
 databases into ZODB based on what I'd read and some cool performance numbers
 I'd seen.  I'm really happy with the entire system so far except for one
 really irritating problem: memory usage.
 I'm doing a rather intensive operation where I'm inverting a mapping of the
 form (docid => [wordid]) for about 3 million documents (for about 8 million
 unique words).  I thought about doing it on hadoop, but it's a one time
 thing and it'd be nice if I didn't have to load the data back into an object
 database for my application at the end anyway.
 Anyhoo, in the process of this operation (which performs much faster than my
 sqlite+python cache solution) memory usage never really drops.  I'm
 currently doing a commit every 25k documents.   The python process just
 gobbles up RAM, though.  I made it through 750k documents before my 8GB
 Ubuntu 10.04 server choked and killed the process (at about 80 percent mem
 usage).  (The same thing happens on Windows and OSX, btw).
 I figure either there's a really tremendous bug in ZODB (unlikely given its
 age and venerability) or I'm really doing it wrong.  Here's my code:

         self.storage = FileStorage(self.dbfile, pack_keep_old=False)
         cache_size = 512 * 1024 * 1024

         self.db = DB(self.storage, pool_size=1, cache_size_bytes=cache_size,
 historical_cache_size_bytes=cache_size, database_name=self.name)
         self.connection = self.db.open()
         self.root = self.connection.root()

 and the actual insertions...
             set_default = wordid_to_docset.root.setdefault #i can be kinda
 pathological with loop operations
             array_append = array.append
             for docid, wordset in docid_to_wordset.iteritems(): #one of my
 older sqlite oodb's, not maintaining a cache...just iterating (small
 constant mem usage)
                 for wordid in wordset:
                     docset = set_default(wordid, array('L'))
                     array_append(docset, docid)

                 n_docs_traversed += 1
                 if n_docs_traversed % 1000 == 1:
                     status_tick()
                 if n_docs_traversed % 25000 == 1:
                     self.do_commit() #just commits the oodb by calling
 transaction.commit()
 The DB on the choked process is perfectly good up to the last commit when it
 choked, and I've even tried extremely small values of cache_size_bytes and
 cache_size, just to see if I can get it to stop allocating memory and
 nothing seems to work.  I've also used string values ('128mb') for
 cache-size-bytes, etc.

 Can somebody help me out?

The first thing to understand is that options like cache-size and
cache-size-bytes are suggestions, not limits. :)  In particular, they
are only enforced:

- at transaction boundaries,

- when an application creates a savepoint,

- or when an application invokes garbage collection explicitly via the
  cacheGC or cacheMinimize methods.

Note that objects that have been modified but not committed won't be
freed even if the suggestions are exceeded.

The reason that ZODB never frees objects on its own is that doing so
could lead to surprising changes to object state and subtle
bugs. Consider:

    def append(self, item):
        self._data.append(item)  # self._data is just a plain Python list
        # At this point, ZODB doesn't know that self has changed.
        # If ZODB was willing to free an object whenever it wanted to,
        # self could be freed here, losing the change to self._data.
        self._length += 1
        # Now self is marked as changed, but too late if self was
        # freed above.

Also note that memory allocated by Python is generally not returned to
the OS when freed.

Calling cacheGC at transaction boundaries won't buy you anything.
It's already called then. :)

In your script, I'd recommend calling cacheGC after processing each
document:

   root._p_jar.cacheGC()

This will keep the cache full, which will hopefully help performance
without letting it grow far out of bounds.
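
Adapting the loop you quoted above, that would look roughly like this
(a sketch, keeping your names):

for docid, wordset in docid_to_wordset.iteritems():
    for wordid in wordset:
        docset = set_default(wordid, array('L'))
        array_append(docset, docid)
    wordid_to_docset.root._p_jar.cacheGC()   # after each document, trim the cache
    n_docs_traversed += 1
    if n_docs_traversed % 25000 == 1:
        self.do_commit()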

Jim

--
Jim Fulton


Re: [ZODB-Dev] ZODB Ever-Increasing Memory Usage (even with cache-size-bytes)

2010-05-10 Thread Jim Fulton
On Mon, May 10, 2010 at 4:58 PM, Jim Fulton j...@zope.com wrote:
...
 The first thing to understand is that options like cache-size and
 cache-size bytes are suggestions, not limits. :)  In particular, they
 are only enforced:

 - at transaction boundaries,

 - when an application creates a savepoint,

 - or when an application invokes garbage collection explicitly via the
  cacheGC or cacheMinimize methods.

 Note that objects that have been monified but not committed won't be
 freed even if the suggestions are exceeded.

 The reason that ZODB never frees objects on it's own is that doing so
 could lead to surprising changes to object state and subtle
 bugs. Consider:

    def append(self, item):
        self._data.append(item) # self._data is just a Python dict
        # At this point, ZODB doesn't know that self has changed.
        # If ZODB was willing to free an object whenever it wanted to,
        # self could be freed here, losing the change to self._data.
        self._length += 1
        # Now self is marked as changed, but too late if self was
        # freed above.

I meant to add that it might be interesting to error or warn when a cache
gets much larger than a suggestion.

Jim

-- 
Jim Fulton


Re: [ZODB-Dev] ZODB Ever-Increasing Memory Usage (even with cache-size-bytes)

2010-05-10 Thread Tres Seaver

Jim Fulton wrote:
 On Mon, May 10, 2010 at 3:27 PM, Ryan Noon rmn...@gmail.com wrote:

snip

 Here's my code:

 self.storage = FileStorage(self.dbfile, pack_keep_old=False)
 cache_size = 512 * 1024 * 1024

 self.db = DB(self.storage, pool_size=1, cache_size_bytes=cache_size,
 historical_cache_size_bytes=cache_size, database_name=self.name)
 self.connection = self.db.open()
 self.root = self.connection.root()

 and the actual insertions...
 set_default = wordid_to_docset.root.setdefault #i can be kinda
 pathological with loop operations
 array_append = array.append
 for docid, wordset in docid_to_wordset.iteritems(): #one of my
 older sqlite oodb's, not maintaining a cache...just iterating (small
 constant mem usage)
 for wordid in wordset:
 docset = set_default(wordid, array('L'))

Note that you are creating the array willy-nilly in the inner loop here.
 I would nearly always write that as::

   docset = wordid_to_docset.root.get(wordid)
   if docset is None:
       docset = array('L')
       wordid_to_docset.root[wordid] = docset

 array_append(docset, docid)

Why are you using an unbound method here?  The following would be
clearer, and almost certainly not noticeably slower:

   docset.append(docid)

 n_docs_traversed += 1
 if n_docs_traversed % 1000 == 1:
 status_tick()
 if n_docs_traversed % 25000 == 1:
 self.do_commit() #just commits the oodb by calling
 transaction.commit()

Don't forget the final commit. ;)  Also, I don't know what the 'array'
type is here, but if it doesn't manage its own persistence, then you
have a bug here:  mutating a non-persistent sub-object doesn't
automatically cause the persistent container to register as dirty with
the transaction, which means you may lose changes after the object is
evicted from the RAM cache, or at shutdown.
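
A sketch of one way to make the change explicit (re-assign the value, or
flag the container by hand), reusing the names from the snippet above:

from array import array

docset = wordid_to_docset.root.get(wordid)
if docset is None:
    docset = array('L')
docset.append(docid)
wordid_to_docset.root[wordid] = docset   # re-assignment marks the mapping dirty
# ...or, after mutating in place:
# wordid_to_docset.root._p_changed = True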

snip

 Also note that memory allocated by Python is generally not returned to
 the OS when freed.

Python's own internal heap management has gotten noticeably better about
returning reclaimed chunks to the OS in 2.6.


Tres.
-- 
Tres Seaver  +1 540-429-0999  tsea...@palladion.com
Palladion Software   Excellence by Design   http://palladion.com



Re: [ZODB-Dev] ZODB Ever-Increasing Memory Usage (even with cache-size-bytes)

2010-05-10 Thread Jim Fulton
On Mon, May 10, 2010 at 5:20 PM, Tres Seaver tsea...@palladion.com wrote:
...
 Also note that memory allocated by Python is generally not returned to
 the OS when freed.

 Python's own internal heap management has gotten noticeably better about
 returning reclaimed chunks to the OS in 2.6.

Yeah, I've heard that. I tried to verify this, but have since
repressed the result. ;)

Jim

-- 
Jim Fulton


Re: [ZODB-Dev] ZODB Ever-Increasing Memory Usage (even with cache-size-bytes)

2010-05-10 Thread Ryan Noon
First off, thanks everybody.  I'm implementing and testing the suggestions
now.  When I said ZODB was more complicated than my solution I meant that
the system was abstracting a lot more from me than my old code (because I
wrote it and knew exactly how to make the cache enforce its limits!).

 The first thing to understand is that options like cache-size and
  cache-size bytes are suggestions, not limits. :)  In particular, they
  are only enforced:
 
  - at transaction boundaries,


If it's already being called at transaction boundaries, how come memory usage
doesn't go back down to the quota after the commit (which happens only every
25k documents)?

With regards to returning memory to the OS, I don't really care if it
reports less, but it really seems like it's overallocating if the OS kills
it on an 8GB machine with a 512mb quota.

Tres:

With your first point:

Yeah, I wrote that late last night and I just realized it's getting
evaluated stupidly on the setdefault call.  I was trying to be cute with
Python dict methods that I hadn't used before.  Stupid me.

With regards to your second point:

I read the loop optimization wiki page over at python.org too many times and
I get itchy whenever there's method lookup inside of a loop. I need to
remember I'm dealing with a database here and IO is gonna be the bottleneck
anyway.

With regards to your third point:

I actually ran into the same change notification problem when I was rolling
my own OODB and I assumed ZODB had done something tricky because my changes
were showing up upon reopening the db even when I'd done the append and not
told ZODB about the change. I'll fix it to make that more explicit...I think
the magical effects I'd seen were related to my problem with too much damn
caching.  The array type is from the stdlib array module that I'm just
appending my IDs to as longs.  I figured it'd be more compact and would
serialize faster.


Btw, the final commit is outside the loop. (not shown). =)

Cheers,
Ryan

-- 
Ryan Noon
Stanford Computer Science
BS '09, MS '10


Re: [ZODB-Dev] ZODB Ever-Increasing Memory Usage (even with cache-size-bytes)

2010-05-10 Thread Jim Fulton
On Mon, May 10, 2010 at 5:39 PM, Ryan Noon rmn...@gmail.com wrote:
 First off, thanks everybody.  I'm implementing and testing the suggestions
 now.  When I said ZODB was more complicated than my solution I meant that
 the system was abstracting a lot more from me than my old code (because I
 wrote it and new exactly how to make the cache enforce its limits!).

  The first thing to understand is that options like cache-size and
  cache-size bytes are suggestions, not limits. :)  In particular, they
  are only enforced:
 
  - at transaction boundaries,

 If it's already being called at transaction boundaries how come memory usage
 doesn't go back down to the quota after the commit (which is only every 25k
 documents?).

Because Python generally doesn't return memory back to the OS. :)

It's also possible you have a problem with one of your data
structures.  For example if you have an array that grows effectively
without bound, the array will have to be in memory, no matter how big
it is.  Also, if the persistent object holding the array isn't seen as
changed, because you're appending to the array, then the size of the
array won't be reflected in the cache size. (The size of objects in
the cache is estimated from their pickle sizes.)

I assume you're using ZODB 3.9.5 or later. If not, there's a bug in
handling new objects that prevents cache suggestions from working
properly.

If you don't need list semantics, and set semantics will do, you might
consider using a BTrees.LLBTree.TreeSet, which provides compact,
scalable persistent sets.  (If your word ids can be signed, you could
use the IIBTree variety, which is more compact.) Given that the variable
name is wordset, I assume you're dealing with sets. :)
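
A sketch of what the inner loop might look like with TreeSets (reusing the
names from your code; wordid_to_docset here would hold LLTreeSets instead
of arrays):

from BTrees.LLBTree import LLTreeSet

for docid, wordset in docid_to_wordset.iteritems():
    for wordid in wordset:
        docset = wordid_to_docset.get(wordid)
        if docset is None:
            docset = wordid_to_docset[wordid] = LLTreeSet()
        docset.insert(docid)   # the TreeSet is persistent; no re-assignment needed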

What is wordid_to_docset? You don't show its creation.

Jim

--
Jim Fulton


Re: [ZODB-Dev] ZODB Ever-Increasing Memory Usage (even with cache-size-bytes)

2010-05-10 Thread Ryan Noon
Hi all,

I've incorporated everybody's advice, but I still can't get memory to obey
cache-size-bytes.  I'm using the new 3.10 from pypi (but the same behavior
happens on the server where I was using 3.10 from the new lucid apt repos).

I'm going through a mapping that takes one long integer docid to a
collection of long integers (a wordset) and trying to invert it into a
mapping from each wordid in those wordsets to a set of the original
docids (a docset).

I've even tried calling cacheMinimize after every single docset append, but
reported memory to the OS never goes down and the process continues to
allocate like crazy.

I'm wrapping ZODB in a ZMap class that just forwards all the dictionary
methods to the ZODB root and allows easy interchangeability with my old
sqlite OODB abstraction.

Here's the latest version of my code, (minorly instrumented...see below):

try:
    max_docset_size = 0
    for docid, wordset in docid_to_wordset.iteritems():
        for wordid in wordset:
            if wordid_to_docset.has_key(wordid):
                docset = wordid_to_docset[wordid]
            else:
                docset = array('L')
            docset.append(docid)
            if len(docset) > max_docset_size:
                max_docset_size = len(docset)
                print 'Max docset is now %d (owned by wordid %d)' % (max_docset_size, wordid)
            wordid_to_docset[wordid] = docset
            wordid_to_docset.garbage_collect()
            wordid_to_docset.connection.cacheMinimize()

        n_docs_traversed += 1

        if n_docs_traversed % 100 == 1:
            status_tick()
        if n_docs_traversed % 5 == 1:
            self.do_commit()

    self.do_commit()
except KeyboardInterrupt, ex:
    self.log_write('Caught keyboard interrupt, committing...')
    self.do_commit()

I'm keeping track of the greatest docset (which would be the largest
possible thing not able to be paged out) and it's only 10,152 longs (at 8
bytes each, according to the array module's documentation) at the point 75
seconds into the operation, when the process has allocated 224 MB (with a
cache_size_bytes of 64*1024*1024).


On a lark I just made an empty ZMap in the interpreter and filled it with 1M
unique strings.  It took up something like 190mb.  I committed it and mem
usage went up to 420mb.  I then ran cacheMinimize (memory stayed at 420mb).
 Then I inserted another 1M entries (strings keyed on ints) and mem usage
went up to 820mb.  Then I committed and memory usage dropped to ~400mb and
went back up to 833mb.  Then I ran cacheMinimize again and memory usage
stayed there.  Does this example (totally decoupled from any other
operations by me) make sense to experienced ZODB people?  I have really no
functional mental model of ZODB's memory usage patterns.  I love using it,
but I really want to find some way to get its allocations under control.
 I'm currently running this on a Macbook Pro, but it seems to be behaving
the same way on Windows and Linux.

I really appreciate all of the help so far, and if there're any other pieces
of my code that might help please let me know.

Cheers,
Ryan

On Mon, May 10, 2010 at 3:18 PM, Jim Fulton j...@zope.com wrote:

 On Mon, May 10, 2010 at 5:39 PM, Ryan Noon rmn...@gmail.com wrote:
  First off, thanks everybody.  I'm implementing and testing the
 suggestions
  now.  When I said ZODB was more complicated than my solution I meant that
  the system was abstracting a lot more from me than my old code (because I
  wrote it and new exactly how to make the cache enforce its limits!).
 
   The first thing to understand is that options like cache-size and
   cache-size bytes are suggestions, not limits. :)  In particular, they
   are only enforced:
  
   - at transaction boundaries,
 
  If it's already being called at transaction boundaries how come memory
 usage
  doesn't go back down to the quota after the commit (which is only every
 25k
  documents?).

 Because Python generally doesn't return memory back to the OS. :)

 It's also possible you have a problem with one of your data
 structures.  For example if you have an array that grows effectively
 without bound, the array will have to be in memory, no matter how big
 it is.  Also, if the persistent object holding the array isn't seen as
 changed, because you're appending to the array, then the size of the
 array won't be reflected in the cache size. (The size of objects in
 the cache is estimated from their pickle sizes.)

 I assume you're using ZODB 3.9.5 or later. If not, there's a bug in
 handling new objects that prevents cache suggestions from working
 properly.

 If you don't need list semantics, and set semantics will do, you might
 consider using an BTrees.LLBtree.TreeSet, which provides compact
 scalable persistent sets.  (If 

Re: [ZODB-Dev] ZODB Ever-Increasing Memory Usage (even with cache-size-bytes)

2010-05-10 Thread Ryan Noon
P.S. About the data structures:

wordset is a freshly unpickled python set from my old sqlite oodb thingy.

The new docsets I'm keeping are 'L' arrays from the stdlib array module.
 I'm up for using ZODB's builtin persistent data structures if it makes a
lot of sense to do so, but it sorta breaks my abstraction a bit and I feel
like the memory issues I'm having are somewhat independent of the container
data structures (as I'm having the same issue just with fixed size strings).

Thanks!
-Ryan

On Mon, May 10, 2010 at 5:16 PM, Ryan Noon rmn...@gmail.com wrote:

 Hi all,

 I've incorporated everybody's advice, but I still can't get memory to obey
 cache-size-bytes.  I'm using the new 3.10 from pypi (but the same behavior
 happens on the server where I was using 3.10 from the new lucid apt repos).

 I'm going through a mapping where we take one long integer docid and map
 it to a collection of long integers (wordset) and trying to invert it into
 a mapping for each 'wordid in those wordsets to a set of the original
 docids (docset).

 I've even tried calling cacheMinimize after every single docset append, but
 reported memory to the OS never goes down and the process continues to
 allocate like crazy.

 I'm wrapping ZODB in a ZMap class that just forwards all the dictionary
 methods to the ZODB root and allows easy interchangeability with my old
 sqlite OODB abstraction.

 Here's the latest version of my code, (minorly instrumented...see below):

 try:
 max_docset_size = 0
 for docid, wordset in docid_to_wordset.iteritems():
 for wordid in wordset:
 if wordid_to_docset.has_key(wordid):
 docset = wordid_to_docset[wordid]
 else:
  docset = array('L')
 docset.append(docid)
 if len(docset) > max_docset_size:
 max_docset_size = len(docset)
 print 'Max docset is now %d (owned by wordid %d)' %
 (max_docset_size, wordid)
 wordid_to_docset[wordid] = docset
 wordid_to_docset.garbage_collect()
 wordid_to_docset.connection.cacheMinimize()

 n_docs_traversed += 1


 if n_docs_traversed % 100 == 1:
 status_tick()
 if n_docs_traversed % 5 == 1:
 self.do_commit()

 self.do_commit()
 except KeyboardInterrupt, ex:
 self.log_write('Caught keyboard interrupt, committing...')
 self.do_commit()

 I'm keeping track of the greatest docset (which would be the largest
 possible thing not able to be paged out) and its only 10,152 longs (at 8
 bytes each according to the array module's documentation) at the point 75
 seconds into the operation when the process has allocated 224 MB (on a
 cache_size_bytes of 64*1024*1024).


 On a lark I just made an empty ZMap in the interpreter and filled it with
 1M unique strings.  It took up something like 190mb.  I committed it and mem
 usage went up to 420mb.  I then ran cacheMinimize (memory stayed at 420mb).
  Then I inserted another 1M entries (strings keyed on ints) and mem usage
 went up to 820mb.  Then I committed and memory usage dropped to ~400mb and
 went back up to 833mb.  Then I ran cacheMinimize again and memory usage
 stayed there.  Does this example (totally decoupled from any other
 operations by me) make sense to experienced ZODB people?  I have really no
 functional mental model of ZODB's memory usage patterns.  I love using it,
 but I really want to find some way to get its allocations under control.
  I'm currently running this on a Macbook Pro, but it seems to be behaving
 the same way on Windows and Linux.

 I really appreciate all of the help so far, and if there're any other
 pieces of my code that might help please let me know.

 Cheers,
 Ryan

 On Mon, May 10, 2010 at 3:18 PM, Jim Fulton j...@zope.com wrote:

 On Mon, May 10, 2010 at 5:39 PM, Ryan Noon rmn...@gmail.com wrote:
  First off, thanks everybody.  I'm implementing and testing the
 suggestions
  now.  When I said ZODB was more complicated than my solution I meant
 that
  the system was abstracting a lot more from me than my old code (because
 I
  wrote it and new exactly how to make the cache enforce its limits!).
 
   The first thing to understand is that options like cache-size and
   cache-size bytes are suggestions, not limits. :)  In particular, they
   are only enforced:
  
   - at transaction boundaries,
 
  If it's already being called at transaction boundaries how come memory
 usage
  doesn't go back down to the quota after the commit (which is only every
 25k
  documents?).

 Because Python generally doesn't return memory back to the OS. :)

 It's also possible you have a problem with one of your data
 structures.  For example if you have an array that grows effectively
 without bound, the 

Re: [ZODB-Dev] ZODB Ever-Increasing Memory Usage (even with cache-size-bytes)

2010-05-10 Thread Laurence Rowe
I think that moving to an LLTreeSet for the docset will significantly
reduce your memory usage. Non-persistent objects are stored as part of
their parent persistent object's record. Each LOBTree object bucket
contains up to 60 (key, value) pairs. When the values are
non-persistent objects they are stored as part of the bucket object's
record, and so accessing any key of a bucket in a transaction brings
up to 60 docsets into memory. I would not be surprised if your program
forces most of your data into memory each batch - as most words are in
most documents.

At the very least you should move to an LLSet (essentially a single
BTree bucket). An LLTreeSet has the additional advantage of being
scalable to many values, and if under load from multiple clients you
are far less likely to see conflicts.
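
Roughly (a sketch; both types live in BTrees.LLBTree):

from BTrees.LLBTree import LLSet, LLTreeSet

small_docset = LLSet()      # a single bucket-like record; fine for small sets
large_docset = LLTreeSet()  # spreads entries over many buckets; scales better
small_docset.insert(1)
large_docset.insert(1)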

Laurence

On 11 May 2010 01:20, Ryan Noon rmn...@gmail.com wrote:
 P.S. About the data structures:
 wordset is a freshly unpickled python set from my old sqlite oodb thingy.
 The new docsets I'm keeping are 'L' arrays from the stdlib array module.
  I'm up for using ZODB's builtin persistent data structures if it makes a
 lot of sense to do so, but it sorta breaks my abstraction a bit and I feel
 like the memory issues I'm having are somewhat independent of the container
 data structures (as I'm having the same issue just with fixed size strings).
 Thanks!
 -Ryan

 On Mon, May 10, 2010 at 5:16 PM, Ryan Noon rmn...@gmail.com wrote:

 Hi all,
 I've incorporated everybody's advice, but I still can't get memory to obey
 cache-size-bytes.  I'm using the new 3.10 from pypi (but the same behavior
 happens on the server where I was using 3.10 from the new lucid apt repos).
 I'm going through a mapping where we take one long integer docid and map
 it to a collection of long integers (wordset) and trying to invert it into
 a mapping for each 'wordid in those wordsets to a set of the original
 docids (docset).
 I've even tried calling cacheMinimize after every single docset append,
 but reported memory to the OS never goes down and the process continues to
 allocate like crazy.
 I'm wrapping ZODB in a ZMap class that just forwards all the dictionary
 methods to the ZODB root and allows easy interchangeability with my old
 sqlite OODB abstraction.
 Here's the latest version of my code, (minorly instrumented...see below):
         try:
             max_docset_size = 0
             for docid, wordset in docid_to_wordset.iteritems():
                 for wordid in wordset:
                     if wordid_to_docset.has_key(wordid):
                         docset = wordid_to_docset[wordid]
                     else:
                         docset = array('L')
                     docset.append(docid)
                     if len(docset) > max_docset_size:
                         max_docset_size = len(docset)
                         print 'Max docset is now %d (owned by wordid %d)'
 % (max_docset_size, wordid)
                     wordid_to_docset[wordid] = docset
                     wordid_to_docset.garbage_collect()
                     wordid_to_docset.connection.cacheMinimize()

                 n_docs_traversed += 1

                 if n_docs_traversed % 100 == 1:
                     status_tick()
                 if n_docs_traversed % 5 == 1:
                     self.do_commit()

             self.do_commit()
         except KeyboardInterrupt, ex:
             self.log_write('Caught keyboard interrupt, committing...')
             self.do_commit()
 I'm keeping track of the greatest docset (which would be the largest
 possible thing not able to be paged out) and its only 10,152 longs (at 8
 bytes each according to the array module's documentation) at the point 75
 seconds into the operation when the process has allocated 224 MB (on a
 cache_size_bytes of 64*1024*1024).

 On a lark I just made an empty ZMap in the interpreter and filled it with
 1M unique strings.  It took up something like 190mb.  I committed it and mem
 usage went up to 420mb.  I then ran cacheMinimize (memory stayed at 420mb).
  Then I inserted another 1M entries (strings keyed on ints) and mem usage
 went up to 820mb.  Then I committed and memory usage dropped to ~400mb and
 went back up to 833mb.  Then I ran cacheMinimize again and memory usage
 stayed there.  Does this example (totally decoupled from any other
 operations by me) make sense to experienced ZODB people?  I have really no
 functional mental model of ZODB's memory usage patterns.  I love using it,
 but I really want to find some way to get its allocations under control.
  I'm currently running this on a Macbook Pro, but it seems to be behaving
 the same way on Windows and Linux.
 I really appreciate all of the help so far, and if there're any other
 pieces of my code that might help please let me know.
 Cheers,
 Ryan
 On Mon, May 10, 2010 at 3:18 PM, Jim Fulton j...@zope.com wrote:

 On Mon, May 10, 2010 at 5:39 PM, Ryan Noon rmn...@gmail.com wrote:
  First off, thanks everybody.  I'm