[ZODB-Dev] Towards ZODB on Python 3

2013-03-08 Thread Marius Gedminas
(Resending because I used the wrong From address and the mail got stuck
in moderation.)


Some goals, in order of decreasing priority:

1. ZODB should work on Python 3

2. ZODB databases created on Python 2 should be loadable with ZODB on
   Python 3.

3. ZODB databases created on Python 3 should be loadable with ZODB on
   Python 2.


This will be kinda longish, so please settle down.


Now, ZODB is built on top of pickles.  And pickles in Python 2 know about
two kinds of strings: str and unicode.  But there are actually *three*
kinds of strings in Python-land:

  * bytes
  * unicode
  * native strings (same as bytes in Python 2, same as unicode in Python 3)

Unfortunately we cannot distinguish bytes from native strings in the
pickles produced on Python 2: both kinds are pickled as STRING, BINSTRING
or SHORT_BINSTRING opcodes.  If we assume they're native strings, we
can break pickles that contain binary data, in one of three possible ways:

  i.   assume 'ascii' and raise UnicodeDecodeError while loading

  ii.  assume 'latin-1' and silently give applications unicode objects
   where they expect strings

  iii. assume 'utf-8' and combine the disadvantages of both of the above
   methods: sometimes fail, sometimes return unicode where applications
   expect bytes
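All three strategies can be seen directly with Python 3's stdlib unpickler
on a hand-made Python 2 pickle (a sketch only; zodbpickle's options differ
slightly, but the stdlib's encoding parameter shows the same tradeoffs):

```python
import pickle

# The pickle Python 2 would produce for the 2-byte binary string '\x00\xff':
# SHORT_BINSTRING opcode ('U'), length 2, payload, STOP ('.').
py2_pickle = b'U\x02\x00\xff.'

# (i) assume 'ascii': loading raises UnicodeDecodeError on the 0xff byte
try:
    pickle.loads(py2_pickle, encoding='ascii')
except UnicodeDecodeError as e:
    print('ascii:', e)

# (ii) assume 'latin-1': loading silently yields a unicode string
print('latin-1:', repr(pickle.loads(py2_pickle, encoding='latin-1')))

# (iv, below) the stdlib also has encoding='bytes', which keeps the raw
# payload -- roughly what zodbpickle's option [2] does
print('bytes:', repr(pickle.loads(py2_pickle, encoding='bytes')))
```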

One very common example of binary data: persistent object references.

What if we break with the standard library pickle, ship our own
pickle [1], and load BINSTRINGs as bytes?

  iv.  assume bytes [2]

Then we break *every object instance* by putting byte strings into the
instance __dict__ on Python 3:

>>> obj.__dict__[b'attr'] = value
>>> obj.attr
Traceback (most recent call last):
  ...
AttributeError: ...

What if we try to detect which SHORT_BINSTRINGs are bytes and which ones
are native strings?

  v.   try to decode 'ascii', if that fails, return bytes [3]

Then we, again, get the disadvantage of approach (ii), only in a very
inconsistent manner: sometimes pickled binary data unpickles into
unicode.  Half of your OIDs are now u'\0\0\0\0\0\0\0\x7f', the other
half is b'\0\0\0\0\0\0\0\x80'.  ZODB itself can cope with that [4], but
will someone think of the childre^H^H^H^H^H applications?
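The inconsistency is easy to demonstrate with a toy version of that
fallback (a sketch of the idea behind [3], not zodbpickle's actual code):

```python
def load_binstring(raw):
    """Approach (v): decode as ASCII if possible, else return bytes."""
    try:
        return raw.decode('ascii')
    except UnicodeDecodeError:
        return raw

# Two adjacent OIDs, two different types after loading:
print(repr(load_binstring(b'\0\0\0\0\0\0\0\x7f')))  # str: all bytes < 0x80
print(repr(load_binstring(b'\0\0\0\0\0\0\0\x80')))  # bytes: 0x80 isn't ASCII
```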

What if we introduce a way for applications to specify whether they want
bytes or unicode?

  vi.  define an explicit schema of some kind for each Persistent subclass,
   e.g. _p_load_as_bytes = ('names', 'of', 'attributes'); advanced
   users can override __setstate__ and do type fixups in there
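A rough sketch of what (vi) could look like.  `_p_load_as_bytes` is an
invented name, `BlobHolder` a made-up class, and a real implementation
would subclass Persistent; this only illustrates the fixup mechanism:

```python
class BlobHolder:
    # hypothetical schema: these attributes must come back as bytes
    _p_load_as_bytes = ('data',)

    def __setstate__(self, state):
        for name in self._p_load_as_bytes:
            value = state.get(name)
            if isinstance(value, str):
                # undo the latin-1 decode a lenient unpickler applied
                state[name] = value.encode('latin-1')
        self.__dict__.update(state)

obj = BlobHolder()
obj.__setstate__({'data': '\x00\xff', 'title': 'ok'})
print(type(obj.data), type(obj.title))  # data is bytes again, title stays str
```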

I don't know.  I haven't had the time to think this through yet.  It
sounds like a huge amount of work for everyone.

  [1] https://github.com/zopefoundation/zodbpickle
  [2] zodbpickle.pickle.Unpickler(encoding='bytes')
  [3] zodbpickle.pickle.Unpickler(encoding='ascii', errors='bytes')
  [4] this is the status quo of the 'py3' branch in the ZODB repo

That's the situation with loading.  I've implemented approach (v) in the
ZODB py3 branch, but I'm by no means certain it is acceptable.  But
that's not all, there's more fun to be had on the dumping side too!


We want pickles created by ZODB to be

  a) reasonably short
  b) round-trippable (what you dump, you get back on load)
  c) compatible with Python 2
  d) noload()able [5]

  [5] i.e. we want to be able to do garbage collection without actually
  instantiating user-defined classes (think of a ZEO server that
  doesn't have the right modules in sys.path, or standalone zodbgc
  processing), which is why we added noload() back into zodbpickle.
  noload() must be able to crawl the pickles and get back OIDs from
  persistent references.
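The OID-crawling part of that can be sketched with stdlib pieces alone:
pickletools can walk the opcodes without importing or instantiating
anything.  This is illustrative only -- zodbpickle's noload() is the real
mechanism, `Stone` and the string OID are made up, and ZODB's persistent
references are actually richer structures than a bare string:

```python
import io
import pickle
import pickletools

class Stone:  # stand-in for a persistent object
    pass

class OIDPickler(pickle.Pickler):
    def persistent_id(self, obj):
        # hypothetical: hand out a string OID for anything persistent-ish
        return 'oid-42' if isinstance(obj, Stone) else None

buf = io.BytesIO()
OIDPickler(buf, protocol=0).dump({'ref': Stone()})

# Scan opcodes: PERSID carries the persistent reference inline, and no
# user-defined class is ever instantiated during the scan.
oids = [arg for op, arg, pos in pickletools.genops(buf.getvalue())
        if op.name == 'PERSID']
print(oids)  # ['oid-42']
```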

There are problems with each of these requirements, and the solutions
to those problems make other requirements impossible to satisfy.

  * Python 3 pickles bytestrings using a fancy REDUCE opcode, as a
    function call to codecs.encode(u'decoded bytestring', 'latin-1').
    This makes them large and breaks (a), and our noload() copied from
    the Python 2.x stdlib is unable to handle them, breaking (d). [8]

  * Why does Python 3 pickle bytestrings this way?  Because that's the
    only way to get round-trippability with Python 3's interpretation of
    BINSTRING opcodes as unicode, if you use pickle protocols 0, 1, or
    2.  Pickle protocol 3 has separate opcodes for all three kinds of
    strings (bytes, unicode, native -- remember?), but it's incompatible
    with Python 2, breaking requirement (c).

  * We could implement a custom pickler [6] and pickle bytestrings as
    SHORT_BINSTRING, fulfilling requirements (a), (c) and (d), but
    this breaks (b), i.e. round-tripping.

  [6] zodbpickle.pickle.Pickler(bytes_as_strings=True) [7]
  [7] this is the status quo of the 'py3' branch in the ZODB repo
  [8] OTOH we could implement special support for REDUCE of
  codecs.encode() in our noload() -- I almost got that working before
  Jim suggested a different approach, which is [6].
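The size blowup is easy to check with the stdlib pickler (a sketch only;
ZODB goes through zodbpickle, but the opcode behaviour shown here is the
stdlib's):

```python
import pickle

payload = b'\x00' * 8          # e.g. an 8-byte OID

# Protocol 2 (Python-2-loadable): bytes go through a codecs.encode REDUCE
p2 = pickle.dumps(payload, protocol=2)
# Protocol 3 (Python-3-only): bytes get their own SHORT_BINBYTES opcode
p3 = pickle.dumps(payload, protocol=3)

print(len(p2), len(p3))                    # the REDUCE form is much larger
print(b'_codecs' in p2, b'_codecs' in p3)  # True False
```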

At least there's some nice symmetry: no matter if you pickle your 

Re: [ZODB-Dev] Cache warm up time

2013-03-08 Thread Claudiu Saftoiu
I'd be curious to know what your results are, whichever path you decide to
take! Might help inform me as to what might help on my server...

One thing I haven't yet understood is: how come the ZEO server itself
doesn't have a cache?  It seems a logical place to put one, as the ZEO
server rarely gets restarted, at least for the use case of running both
the ZEO server and the clients on the same machine.

On Fri, Mar 8, 2013 at 1:46 AM, Roché Compaan ro...@upfrontsystems.co.za wrote:

 Thanks, there are definitely some settings relating to the persistent
 cache that I haven't tried before, simply because I've been avoiding
 them.

 I'd still be interested to know if one can leverage the RelStorage
 memcache code for a ZEO cache, so if Shane doesn't get around to it
 I'll have a stab at it myself. Loading objects from a persistent cache
 will still cause IO, so to me it seems it would be a big win to
 keep the cache in memory even across restarts.

 --
 Roché Compaan
 Upfront Systems   http://www.upfrontsystems.co.za




 On Thu, Mar 7, 2013 at 9:35 PM, Leonardo Rochael Almeida
 leoroch...@gmail.com wrote:
  This mail from Jim to this list a couple of years ago was chock-full
  of nice tips:
 
  https://mail.zope.org/pipermail/zodb-dev/2011-May/014180.html
 
  In particular:
 
  - Yes, use a persistent cache. Recent versions are reliable. Make it as
  large as reasonable (e.g. at most the size of your packed database, at
  least the size of the objects that you want to be around after a restart).
 
  - Consider using zc.zlibstorage to compress the data that's stored in
 ZODB
 
  - set drop-cache-rather-verify to true on the client (avoid long
  restart time where your client is revalidating the ZEO cache)
 
  - set invalidation-age on the server to at least an hour or two so
  that you deal with being disconnected from the storage server for a
  reasonable period of time without having to verify.
 
  Cheers,
 
  Leo
 
  On Thu, Mar 7, 2013 at 3:54 PM, Roché Compaan
  ro...@upfrontsystems.co.za wrote:
  We have a setup that is running just fine when the caches are warm but
  it takes several minutes after a restart before the cache warms up.
  As per usual, big catalog indexes seem to be the problem.
 
  I was wondering about two things. Firstly, in 2011 in this thread
  https://mail.zope.org/pipermail/zodb-dev/2011-October/014398.html
  about zeo.memcache, Shane said that he could adapt the caching code in
  RelStorage for ZEO. Shane do you still plan to do this? Do you think
  an instance can restart without having to reload most objects into the
  cache?
 
  Secondly, I was wondering to what extent using persistent caches can
  improve cache warm up time and if persistent caches are usable or not,
  given that at various times in the past, it was recommended that one
  try and avoid them.
 
  --
  Roché Compaan
  Upfront Systems   http://www.upfrontsystems.co.za
  ___
  For more information about ZODB, see http://zodb.org/
 
  ZODB-Dev mailing list  -  ZODB-Dev@zope.org
  https://mail.zope.org/mailman/listinfo/zodb-dev



Re: [ZODB-Dev] Cache warm up time

2013-03-08 Thread Leonardo Santagada
On Fri, Mar 8, 2013 at 2:17 PM, Claudiu Saftoiu csaft...@gmail.com wrote:

 Once I know the difference I'll probably be able to answer this myself,
 but I wonder why the ZEO server doesn't do the sort of caching that allow
 the client to operate so quickly on the indices once they are loaded.


IIRC ZEO doesn't just take bytes from the storage and put them on a
socket; it has a fairly heavy protocol for sending objects that adds
overhead per object, so lots of small objects (adding up to 400mb) take
a lot longer to send than a single 400mb blob.

-- 

Leonardo Santagada


Re: [ZODB-Dev] Cache warm up time

2013-03-08 Thread Claudiu Saftoiu
On Fri, Mar 8, 2013 at 12:31 PM, Leonardo Santagada santag...@gmail.com wrote:


 On Fri, Mar 8, 2013 at 2:17 PM, Claudiu Saftoiu csaft...@gmail.com wrote:

 Once I know the difference I'll probably be able to answer this myself,
 but I wonder why the ZEO server doesn't do the sort of caching that allow
 the client to operate so quickly on the indices once they are loaded.


 IIRC zeo not only takes bytes from the storage and put them on a socket,
 it has a kind of heavy protocol for sending objects that has overhead on
 each object, so lots of small objects (that are 400mb in size) take a lot
 more time than sending a 400mb blob.


Ah that would make perfect sense. So ZEO and catalog indices really don't
mix well at all.


Re: [ZODB-Dev] Cache warm up time

2013-03-08 Thread Mikko Ohtamaa
 It would be great if there was a way to advise ZODB in advance that
 certain objects would be required so it could fetch multiple object
 states in a single request to the storage server.

I saw a ZODB prefetching discussion a long time ago, but maybe the
authors themselves can weigh in here:

http://www.python.org/~jeremy/weblog/030418.html


-- 
Mikko Ohtamaa
http://opensourcehacker.com
http://twitter.com/moo9000


Re: [ZODB-Dev] Cache warm up time

2013-03-08 Thread Roché Compaan
A very simple alternative to prefetching would be to load the whole DB
into memory indiscriminately, if it is configured to do so. This way,
you can store your catalog in a separate db and request all of it from
the ZEO server and cache it straight away.

I'm still partial to a memcached cache that can survive a restart. The
first prize would be if it's possible to share the cache between zeo
clients.

--
Roché Compaan
Upfront Systems   http://www.upfrontsystems.co.za


On Fri, Mar 8, 2013 at 7:50 PM, Laurence Rowe l...@lrowe.co.uk wrote:
 On 8 March 2013 09:38, Claudiu Saftoiu csaft...@gmail.com wrote:
 On Fri, Mar 8, 2013 at 12:31 PM, Leonardo Santagada santag...@gmail.com
 wrote:


 On Fri, Mar 8, 2013 at 2:17 PM, Claudiu Saftoiu csaft...@gmail.com
 wrote:

 Once I know the difference I'll probably be able to answer this myself,
 but I wonder why the ZEO server doesn't do the sort of caching that allow
 the client to operate so quickly on the indices once they are loaded.


 IIRC zeo not only takes bytes from the storage and put them on a socket,
 it has a kind of heavy protocol for sending objects that has overhead on
 each object, so lots of small objects (that are 400mb in size) take a lot
 more time than sending a 400mb blob.


 Ah that would make perfect sense. So ZEO and catalog indices really don't
 mix well at all.

 The slowdown is largely because ZODB only loads objects one at a time.
 Loading a large catalogue means paying that latency (network +
 software) for each object; 400mb of catalogue data may well equate to
 tens of thousands of objects, and therefore tens of thousands of loads
 in series.  Once the data is loaded into the object cache you only need
 to fetch invalidated objects.

 It would be great if there was a way to advise ZODB in advance that
 certain objects would be required so it could fetch multiple object
 states in a single request to the storage server.
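The arithmetic behind that is simple; the numbers below are illustrative
assumptions, not measurements:

```python
# Serial loads: warm-up time grows linearly with object count, because
# each load pays a full round trip before the next one starts.
n_objects = 100_000      # assumed count for a few hundred mb of small objects
round_trip_ms = 0.5      # assumed per-load latency, network + software

serial_seconds = n_objects * round_trip_ms / 1000
print(serial_seconds)    # 50.0 -- minutes of warm-up from latency alone
```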

 Laurence


Re: [ZODB-Dev] RelStorage 1.5.1 and persistent 4.0.5+ Incompatible (patch)

2013-03-08 Thread Shane Hathaway

On 03/07/2013 10:48 AM, jason.mad...@nextthought.com wrote:


On Mar 7, 2013, at 11:35, Sean Upton sdup...@gmail.com wrote:


On Thu, Mar 7, 2013 at 7:31 AM,  jason.mad...@nextthought.com
wrote:

I only spotted two uses of this assumption in RelStorage: the
above-mentioned `_prepare_tid`, plus `pack`. The following simple
patch to change those places to use `raw` makes our own internal
tests (Python 2.7, MySQL) pass.


Why not fork https://github.com/zodb/relstorage and submit a pull
request?


Because I didn't realize that repository existed :) I will do so,
thanks.

On that note, though, the PyPI page still links to the SVN repository
at http://svn.zope.org/relstorage/trunk/ (which is also what comes up
in a Google search), and that repository still has all its contents;
it's missing the 'MOVED_TO_GITHUB' file that's commonly there when
the project has been moved (e.g., [1]). With a bit of searching I
found the announcement on this list that development had been
moved[2], but to a first glance it looks like SVN is still the place
to be. If the move is complete, maybe it would be good to replace the
SVN contents with the MOVED_TO_GITHUB pointer?


Thanks for the patch and suggestion.  I intend to handle RelStorage pull 
requests during/around PyCon next week. :-)


Shane



[ZODB-Dev] transaction: synchronizer newTransaction() behavior

2013-03-08 Thread Siddhartha Kasivajhula
Hi there,
I've been discussing this issue with Laurence Rowe on the pylons-dev
mailing list, and he suggested bringing it up here.

I'm writing a MongoDB data manager for the Python transaction package:
https://github.com/countvajhula/mongomorphism
I noticed that for a synchronizer, the beforeCompletion() and
afterCompletion() methods are always called once the synch has been
registered, but the newTransaction() method is only called when an explicit
call to transaction.begin() is made. Since it's possible for transactions
to be started without this explicit call, I was wondering if there was a
good reason why these two cases (explicitly vs implicitly begun
transactions) would be treated differently. That is, should the following
two cases not be equivalent, and therefore should the newTransaction()
method be called in both cases:

(1)
t = transaction.get()
t.join(my_dm)
..some changes to the data..
transaction.commit()

and:

(2)
transaction.begin()
t = transaction.get()
t.join(my_dm)
..some changes to the data..
transaction.commit()

In my mongo dm implementation, I am using the synchronizer to do some
initialization before each transaction gets underway, and am currently
requiring explicit calls to transaction.begin() at the start of each
transaction. Unfortunately, it appears that other third party libraries
using the transaction library may not be calling begin() explicitly, and in
particular my data manager doesn't work when used with pyramid_tm.

Another thing I noticed was that a synchronizer cannot be registered like
so:
transaction.manager.registerSynch(MySynch())
.. and can only be registered like this:
synch = MySynch()
transaction.manager.registerSynch(synch)

... which I'm told is due to MySynch() being stored in a WeakSet, which
means it gets garbage collected as soon as nothing else references it.
Currently this means that I'm retaining a reference to the synch as a
global that I never use.  It just seems a bit contrived, so I thought I'd
mention that as well, in case there's anything that can be done about it.
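The WeakSet behaviour is easy to reproduce without the transaction package
at all (a sketch; the immediate collection shown relies on CPython's
reference counting):

```python
import gc
import weakref

class MySynch:
    def newTransaction(self, txn):
        pass

registry = weakref.WeakSet()    # stand-in for the manager's synch storage

registry.add(MySynch())         # temporary: no strong reference survives
gc.collect()
print(len(registry))            # 0 -- the synch was garbage collected

synch = MySynch()               # keep a strong reference somewhere
registry.add(synch)
print(len(registry))            # 1 -- the synch stays registered
```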

Any thoughts?

Thanks!
-Sid