Re: [Twisted-Python] PEP3131: non-ascii identifiers

exarkun Mon, 08 Sep 2014 07:29:30 -0700

On 01:07 pm, wolfgang....@rohdewald.de wrote:

On Monday, 8 September 2014, 12:04:46, exar...@twistedmatrix.com wrote:
PB supports unicode perfectly well and has for many years. This is why
I asked which specific part of PB has a problem.
PB transfers names of methods and classes as bytes, not as unicode.

PB actually transfers *everything* as bytes. ;) Bytes are the only thing you can send over a socket. If you have anything else - integers, unicode, whatever - you have to encode them as bytes first.
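To make that concrete, here is a minimal Python 3 sketch of turning an integer and a unicode string into bytes before they could be written to a socket. The 4-byte big-endian layout and the use of UTF-8 are assumptions chosen only for illustration; this is not how banana encodes these types.

```python
import struct

# An integer has to be packed into some agreed byte layout
# (4-byte big-endian here, chosen only for illustration).
number_bytes = struct.pack("!I", 1234)

# A unicode string has to be encoded with some agreed encoding.
text_bytes = "ä".encode("utf-8")

# Both results are bytes, which is the only thing a socket can send.
payload = number_bytes + text_bytes
```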

As far as I know, the PB protocol is not specified apart from the implementation in Twisted (and this being the only implementation (I am intentionally disregarding TwistedJava) it must serve as the specification, I think).

What you meant above, I think, is that PB represents method names as bytes at the banana layer. That is, when you want to call a remote method, you indicate its name by supplying bytes to the banana encoding layer - not unicode (which is good because banana doesn't actually support unicode at all, that's a jelly feature).

This does indeed mean we don't simply want to start sending unicode to refer to methods by name - because we can't! At least, not unless we extend banana to support a new type which we probably don't want to do - that would be another incompatible protocol change and so not allowed (since it could break interoperability between different implementations of PB).


So, it is necessary to continue to represent method names using bytes.

This is fairly easily done. On Python 3, encode any unicode strings which represent method names (using a well-known encoding, probably UTF-8) when making the call and decode them in the same way when dispatching those calls.


This can almost be done at the application level:

   # Some Python 3 code
   def remote_ä(self):
       pass

   ref.callRemote(u"ä".encode("utf-8"))

except that Python 3 has actually changed to enforce the type of the second argument of getattr - if it is not a unicode string then a TypeError is raised - so it's not possible to make the decoding step happen (which one might otherwise have done using `__getattr__` or by adding the encoded name of the method to the class dictionary).
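The getattr behaviour is easy to check on Python 3. This is a small sketch; the class `A` and its method `remote_x` are made up for the demonstration and are not part of PB.

```python
class A(object):
    def remote_x(self):
        return "called"


a = A()

# A unicode (str) attribute name works as usual.
assert getattr(a, "remote_x")() == "called"

# A bytes attribute name is rejected before any lookup hook
# such as __getattr__ gets a chance to run.
try:
    getattr(a, b"remote_x")
except TypeError:
    rejected = True
else:
    rejected = False
```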

So if it is going to be supported in Twisted's PB API then that support probably needs to be in Twisted's PB implementation. The same general idea applies, though. Just move the encoding into the implementation of `callRemote`:


   def callRemote(self, _name, ...):
       ... _name.encode("utf-8") ...

And add a corresponding decode to the other side (probably in `_recvMessage`).
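A toy model of that round trip on Python 3 might look like the following. None of this is Twisted code: `Greeter`, `encode_method_name` and `dispatch` are hypothetical names, and the `remote_` prefix simply mirrors the naming convention used in the examples in this message.

```python
class Greeter(object):
    # A non-ASCII method name, allowed on Python 3 by PEP 3131.
    def remote_ä(self):
        return "hello"


def encode_method_name(name):
    # Sending side: unicode method name -> bytes for the wire.
    return name.encode("utf-8")


def dispatch(obj, wire_name):
    # Receiving side: bytes from the wire -> unicode for getattr().
    name = wire_name.decode("utf-8")
    return getattr(obj, "remote_" + name)()


wire_name = encode_method_name("ä")
result = dispatch(Greeter(), wire_name)
```

The application only ever handles unicode method names, while what crosses the wire is still a plain byte string.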

This would make the Python 3 PB API be "method names are unicode strings" which makes sense considering the decisions that were made for Python 3. Note that it does not change the wire protocol - method names are still bytes at the banana level. Or does it? These bytes were previously always an ASCII subset. Is expanding out of the ASCII range an incompatible change?


What could this break?

Let's say you have Python 2 talking to Python 2. It's already possible to construct a method call like this:


   class Foo(Referenceable):
       def remote_foo(self):
           pass
   setattr(Foo, u"remote_fooä".encode("utf-8"), Foo.remote_foo)

   ...

   ref.callRemote(u"fooä".encode("utf-8"))

This actually works. Python 2 doesn't much care about how you name your class attributes. PB doesn't care that the high bit is set in one or more of the bytes in the method name. It all just works.

So let's say you have Python 2 talking to Python 3 instead. In Python 3, you can't do that setattr() call (the language and runtime disallow it). But you can have `def remote_fooä(self)` instead. If PB on Python 3 decodes the method name before dispatching it (using UTF-8) then again things work.

And if you reverse the situation and PB on Python 3 encodes the method name before sending it, then Python 2 is still happy because it can operate on that UTF-8 encoded byte string.

Finally, if Python 3 talks to Python 3 then it also works because the sending side encodes and the receiving side decodes.

So we get to make a judgement call here, I think. Without a specification there's no objectively correct answer. So, because the current implementation is actually perfectly compatible with non-ASCII bytes - even though the intent is clearly that you would never have those - combined with the point that I made above, that there are no other PB implementations, I suspect it's fine to expand beyond ASCII here because it won't actually break any real world programs.

The only case that I can think of that actually would be a problem is the case where someone is already sending non-ASCII, non-UTF-8 method names around. These might decode wrong or might fail to decode at all. I don't think this is likely enough to worry about - but maybe someone who is doing this will speak up and prove me wrong. ;)
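Both failure modes can be reproduced on Python 3 with Latin-1 as a stand-in for "some other encoding". The method names here are made up for the illustration.

```python
# "fooä" encoded as Latin-1 ends in a byte that is not valid UTF-8,
# so a receiver that assumes UTF-8 fails to decode it at all.
latin1_name = "foo\u00e4".encode("latin-1")

try:
    latin1_name.decode("utf-8")
except UnicodeDecodeError:
    failed_to_decode = True
else:
    failed_to_decode = False

# Other byte strings happen to be valid UTF-8 and silently decode
# to a different name than the one the sender had in mind.
sender_name = "foo\u00c3\u00a4"
wire_bytes = sender_name.encode("latin-1")
receiver_name = wire_bytes.decode("utf-8")
```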

And to repeat myself a bit, none of this should change the Python 2 PB API. It should continue accepting bytes - because that's what it always accepted. Separately, we could introduce a new feature to support unicode on Python 2. This would be done in the usual way for introducing new features into any Twisted APIs (and there aren't reasonable backwards compatibility considerations here as far as I can tell). The point would be to make it a little more convenient for Python 2 applications to interact with other PB applications that have decided to use unicode method names.

Lastly, on another topic, I am subscribed to the mailing list - you don't have to cc me on your replies.


Jean-Paul


Which is logical since PY2 does not support unicode identifiers,
and bytes is already a native PY2 string. Unicode is only used
for content. It is not yet always clear to me what is content and
what is a formal string like method names or the *_atom strings
which must be bytes; this needs more testing.

I guess I should patch banana.py such that it dumps all it encodes
or decodes into one file, so I can compare output from PY2/PY3 tests.

I was assuming that suddenly transferring method names as unicode
would really be a break of wire protocol stability, or do you
think otherwise? If you think this is acceptable, I will check
if the existing twisted code can handle getting those as unicode
without source code changes. Not sure. Just tested this with
Python2.6, and I am surprised that it works:

getattr(A,u'x')

<unbound method A.x>


Supporting PEP3131 would only introduce a backward-incompatibility.

Of course you are right that this is not part of porting.

Right now I have a long list of small unsorted git commits, I will
have to do a lot of reshuffling and cleaning before I will ask
you how to get it into the official codebase.
Not all of those commits are strictly porting, some just clean
the code, making the porting commits simpler.

A failing unit test or a minimal example (<http://sscce.org/>) would
communicate this most clearly, but perhaps you can just mention a
specific API and give an incomplete example of how it will fail when it
runs up against the changes defined by PEP 3131.


see my first mail in this thread: take test_pb.py, rename getSimple
to getSimpleä, run the test.

--
Wolfgang

_______________________________________________
Twisted-Python mailing list
Twisted-Python@twistedmatrix.com
http://twistedmatrix.com/cgi-bin/mailman/listinfo/twisted-python

