On 01:07 pm, wolfgang....@rohdewald.de wrote:
On Monday, 8 September 2014, 12:04:46, exar...@twistedmatrix.com wrote:
PB supports unicode perfectly well and has for many years. This is why
I asked which specific part of PB has a problem.

PB transfers names of methods and classes as bytes, not as unicode.

PB actually transfers *everything* as bytes. ;) Bytes are the only thing you can send over a socket. If you have anything else - integers, unicode, whatever - you have to encode them as bytes first.
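
To make the "encode as bytes first" point concrete, here is a minimal Python 3 sketch. The helper `to_wire` and the formats it picks (8-byte big-endian integers, UTF-8 text) are assumptions made up for illustration; this is not how banana encodes anything.

```python
import struct


def to_wire(value):
    """Turn a value into bytes so it could be written to a socket.

    Hypothetical helper: the formats chosen here are illustrative
    assumptions, not banana's actual encoding.
    """
    if isinstance(value, int):
        # Fixed-width signed 64-bit integer, network byte order.
        return struct.pack("!q", value)
    if isinstance(value, str):
        # Text has no byte representation until an encoding is chosen.
        return value.encode("utf-8")
    raise TypeError("no wire encoding for %r" % (type(value),))


encoded_int = to_wire(1)
encoded_text = to_wire("ä")
```

Whatever the format, the receiver has to know which encoding was used in order to reverse it.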

As far as I know, the PB protocol is not specified apart from the implementation in Twisted. Since that is the only implementation (I am intentionally disregarding TwistedJava), it must serve as the specification, I think.

What you meant above, I think, is that PB represents method names as bytes at the banana layer. That is, when you want to call a remote method, you indicate its name by supplying bytes to the banana encoding layer - not unicode (which is good because banana doesn't actually support unicode at all; that's a jelly feature).

This does indeed mean we don't simply want to start sending unicode to refer to methods by name - because we can't! At least, not unless we extend banana to support a new type which we probably don't want to do - that would be another incompatible protocol change and so not allowed (since it could break interoperability between different implementations of PB).

So, it is necessary to continue to represent method names using bytes.

This is fairly easily done. On Python 3, encode any unicode strings which represent method names (using a well-known encoding, probably UTF-8) when making the call and decode them in the same way when dispatching those calls.

This can almost be done at the application level:

   # Some Python 3 code
   def remote_ä(self):
       pass

   ref.callRemote(u"ä".encode("utf-8"))

except that Python 3 has actually changed to enforce the type of the second argument of getattr - if it is not a unicode string then a TypeError is raised - so it's not possible to make the decoding step happen (which one might otherwise have done using `__getattr__` or by adding the encoded name of the method to the class dictionary).
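
The getattr restriction is easy to see in isolation. This is a small Python 3 sketch with no Twisted involved; the class `Greeter` is a stand-in invented for the example.

```python
# Python 3 only: getattr() insists on a text (str) attribute name.
class Greeter:
    def remote_ä(self):
        return "called"


# The method name as it would arrive from the wire, as bytes.
wire_name = "remote_ä".encode("utf-8")

try:
    getattr(Greeter(), wire_name)
    lookup_with_bytes = "found"
except TypeError:
    # Raised before any __getattr__ hook gets a chance to run.
    lookup_with_bytes = "TypeError"

# Decoding first produces a name that getattr() accepts.
result = getattr(Greeter(), wire_name.decode("utf-8"))()
```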

So if it is going to be supported in Twisted's PB API then that support probably needs to be in Twisted's PB implementation. The same general idea applies, though. Just move the encoding into the implementation of `callRemote`:

   def callRemote(self, _name, ...):
       ... _name.encode("utf-8") ...

And add a corresponding decode to the other side (probably in `_recvMessage`).
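
As a rough sketch of that encode/decode pairing, outside of Twisted: the names `encode_method_name`, `dispatch` and `Echo` are hypothetical and only model the idea; the real change would live inside PB's own call and dispatch code.

```python
def encode_method_name(name):
    """Sender side: text method name -> bytes for the wire (UTF-8 assumed)."""
    return name.encode("utf-8")


def dispatch(target, wire_name, *args, **kwargs):
    """Receiver side: decode the wire bytes, then look up remote_<name>."""
    name = wire_name.decode("utf-8")
    method = getattr(target, "remote_" + name)
    return method(*args, **kwargs)


class Echo:
    # A non-ASCII method name, allowed on Python 3 by PEP 3131.
    def remote_ä(self, value):
        return value


wire_name = encode_method_name("ä")
answer = dispatch(Echo(), wire_name, 42)
```

The application only ever sees text names; bytes exist only between the two helpers.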

This would make the Python 3 PB API be "method names are unicode strings" which makes sense considering the decisions that were made for Python 3. Note that it does not change the wire protocol - method names are still bytes at the banana level. Or does it? These bytes were previously always an ASCII subset. Is expanding out of the ASCII range an incompatible change?

What could this break?

Let's say you have Python 2 talking to Python 2. It's already possible to construct a method call like this:

   class Foo(Referenceable):
       def remote_foo(self):
           pass
   setattr(Foo, u"remote_fooä".encode("utf-8"), Foo.remote_foo)

   ...

   ref.callRemote(u"fooä".encode("utf-8"))

This actually works. Python 2 doesn't much care about how you name your class attributes. PB doesn't care that the high bit is set in one or more of the bytes in the method name. It all just works.

So let's say you have Python 2 talking to Python 3 instead. In Python 3, you can't do that setattr() call (the language and runtime disallow it). But you can have `def remote_fooä(self)` instead. If PB on Python 3 decodes the method name before dispatching it (using UTF-8) then again things work.

And if you reverse the situation and PB on Python 3 encodes the method name before sending it, then Python 2 is still happy because it can operate on that UTF-8 encoded byte string.

Finally, if Python 3 talks to Python 3 then it also works because the sending side encodes and the receiving side decodes.

So we get to make a judgement call here, I think. Without a specification there's no objectively correct answer. The current implementation is actually perfectly compatible with non-ASCII bytes, even though the intent is clearly that you would never have those. Combined with the point that I made above, that there are no other PB implementations, I suspect it's fine to expand beyond ASCII here because it won't actually break any real-world programs.

The only case that I can think of that actually would be a problem is the case where someone is already sending non-ASCII, non-UTF-8 method names around. These might decode wrong or might fail to decode at all. I don't think this is likely enough to worry about - but maybe someone who is doing this will speak up and prove me wrong. ;)
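
Both failure modes can be shown with plain byte strings. In this sketch Latin-1 stands in for "some other encoding"; that choice is an assumption for the example, not something anyone has reported using.

```python
# Case 1: the bytes are not valid UTF-8, so decoding fails outright.
latin1_name = "\u00e4".encode("latin-1")  # "ä" in Latin-1 is b"\xe4"
failed = False
try:
    latin1_name.decode("utf-8")
except UnicodeDecodeError:
    failed = True

# Case 2: the bytes happen to be valid UTF-8 but mean something else.
# A Latin-1 sender meaning the two characters "Ã¤" emits b"\xc3\xa4",
# which a UTF-8 receiver silently reads as the single character "ä".
ambiguous = "\u00c3\u00a4".encode("latin-1")
misread = ambiguous.decode("utf-8")
```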

And to repeat myself a bit, none of this should change the Python 2 PB API. It should continue accepting bytes - because that's what it always accepted. Separately, we could introduce a new feature to support unicode on Python 2. This would be done in the usual way for introducing new features into any Twisted APIs (and there aren't reasonable backwards compatibility considerations here as far as I can tell). The point would be to make it a little more convenient for Python 2 applications to interact with other PB applications that have decided to use unicode method names.

Lastly, on another topic, I am subscribed to the mailing list - you don't have to cc me on your replies.

Jean-Paul

Which is logical since PY2 does not support unicode identifiers,
and bytes is already a native PY2 string. Unicode is only used
for content. It is not yet always clear to me what is content and
what is a formal string like method names or the *_atom strings
which must be bytes; this needs more testing.

I guess I should patch banana.py such that it dumps all it encodes
or decodes into one file, so I can compare output from PY2/PY3 tests.

I was assuming that suddenly transferring method names as unicode
would really be a break of wire protocol stability, or do you
think otherwise? If you think this is acceptable, I will check
if the existing twisted code can handle getting those as unicode
without source code changes. Not sure. Just tested this with
Python 2.6, and I am surprised that it works:

   >>> getattr(A, u'x')
   <unbound method A.x>


Supporting PEP 3131 would only introduce a backward incompatibility.

Of course you are right that this is not part of porting.

Right now I have a long list of small unsorted git commits. I will
have to do a lot of reshuffling and cleaning before I ask
you how to get them into the official codebase.
Not all of those commits are strictly porting, some just clean
the code, making the porting commits simpler.
A failing unit test or a minimal example (<http://sscce.org/>) would
communicate this most clearly, but perhaps you can just mention a
specific API and give an incomplete example of how it will fail when it
runs up against the changes defined by PEP 3131.

See my first mail in this thread: take test_pb.py, rename getSimple
to getSimpleä, and run the test.

--
Wolfgang

_______________________________________________
Twisted-Python mailing list
Twisted-Python@twistedmatrix.com
http://twistedmatrix.com/cgi-bin/mailman/listinfo/twisted-python
