Re: [openstack-dev] [oslo] strutils: enhance safe_decode() and safe_encode()
On 21/05/14 11:32 -0400, Doug Hellmann wrote:
> On Thu, May 15, 2014 at 11:41 AM, Victor Stinner victor.stin...@enovance.com wrote:
>> Hi,
>>
>> The functions safe_decode() and safe_encode() have been ported to Python 3, and changed more than once. IMO we can still improve these functions to make them more reliable and easier to use.
>>
>> (1) My first concern is that these functions try to guess user expectations about encodings. They use sys.stdin.encoding or sys.getdefaultencoding() as the default encoding to decode, but this encoding depends on the locale encoding (stdin encoding), on stdin (is stdin a TTY? is stdin mocked?), and on the Python major version.
>>
>> IMO the default encoding should be UTF-8 because most OpenStack components expect this encoding.
>>
>> Or maybe users want to display data to the terminal, and so the locale encoding should be used? In this case, locale.getpreferredencoding() would be more reliable than sys.stdin.encoding.
>
> From what I can see, most uses of the module are in the client programs. If using locale to find a default encoding is the best approach, perhaps we should go ahead and make the change you propose.
>
> One place I see safe_decode() used in a questionable way is in Heat, in heat/engine/parser.py, where validation errors are being re-raised as StackValidationFailed (line 376 in my version). It's not clear why the message is processed the way it is, so I would want to understand the history before proposing a change there.

The original intent for these 2 functions was to provide a reliable way to encode/decode the input. As already mentioned in this thread, it's not good to assume what the best encoding for every case is, and I would also prefer to keep these functions generic, that is, not designed only for client libraries. We use this module in Glance as well, though unfortunately not as much as I'd like.

I would prefer the improved encoding guess to happen outside these functions, if it's meant for client libraries. For example, glanceclient could use `getpreferredencoding` and pass that to safe_(encode|decode).

Flavio

>> (2) My second concern is that safe_encode(bytes, incoming, encoding) transcodes the bytes string from incoming to encoding if these two encodings are different.
>>
>> When I port code to Python 3, I'm looking for a function to replace this common pattern:
>>
>>     if isinstance(data, six.text_type):
>>         data = data.encode(encoding)
>>
>> I don't want to modify data encoding if it is already a bytes string. So I would prefer to have:
>>
>>     def safe_encode(data, encoding='utf-8'):
>>         if isinstance(data, six.text_type):
>>             data = data.encode(encoding)
>>         return data
>>
>> Changing safe_encode() like this will break applications relying on the transcode feature (incoming -> encoding). If such usage exists, I suggest adding a new function (e.g. transcode?) with an API fitting this use case. For example, the incoming encoding would be mandatory.
>>
>> Are there really applications using the incoming parameter?
>
> The only place I see that parameter used in integrated projects is in the tests for the module. I didn't check the non-integrated projects. Given its symmetry with safe_decode(), I don't really see a problem with the current name. Something like the shortcut you propose is present in safe_encode(), so I'm not sure what benefit a new function brings?

+1

Flavio

P.S.: I'm working on graduating strutils from the incubator. I'm glad you brought this up. I'm almost done with the graduation thing.

--
@flaper87
Flavio Percoco

_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
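Flavio's suggestion (let the client choose the encoding and pass it in explicitly) could look roughly like the sketch below. This is only an illustration under stated assumptions: safe_decode() and safe_encode() here are simplified Python 3 stand-ins, not the oslo strutils code, and terminal_encoding() is a hypothetical helper name that does not come from the thread.

```python
import locale


def safe_decode(data, incoming='utf-8', errors='strict'):
    """Simplified stand-in (assumption): decode bytes, pass text through."""
    if isinstance(data, bytes):
        return data.decode(incoming, errors)
    return data


def safe_encode(data, encoding='utf-8', errors='strict'):
    """Simplified stand-in (assumption): encode text, pass bytes through."""
    if isinstance(data, str):
        return data.encode(encoding, errors)
    return data


def terminal_encoding():
    """Hypothetical client-side helper: ask the locale, fall back to UTF-8."""
    return locale.getpreferredencoding(False) or 'utf-8'


# The client decides which encoding applies; the generic helpers never guess.
enc = terminal_encoding()
raw = safe_encode(u'image-\u00e9', encoding=enc, errors='replace')
text = safe_decode(raw, incoming=enc, errors='replace')
```

With this split, a server-side caller such as Glance would simply keep the UTF-8 default, while a command-line client would pass whatever its terminal uses.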
Re: [openstack-dev] [oslo] strutils: enhance safe_decode() and safe_encode()
On Thu, May 15, 2014 at 11:41 AM, Victor Stinner victor.stin...@enovance.com wrote:
> Hi,
>
> The functions safe_decode() and safe_encode() have been ported to Python 3, and changed more than once. IMO we can still improve these functions to make them more reliable and easier to use.
>
> (1) My first concern is that these functions try to guess user expectations about encodings. They use sys.stdin.encoding or sys.getdefaultencoding() as the default encoding to decode, but this encoding depends on the locale encoding (stdin encoding), on stdin (is stdin a TTY? is stdin mocked?), and on the Python major version.
>
> IMO the default encoding should be UTF-8 because most OpenStack components expect this encoding.
>
> Or maybe users want to display data to the terminal, and so the locale encoding should be used? In this case, locale.getpreferredencoding() would be more reliable than sys.stdin.encoding.

From what I can see, most uses of the module are in the client programs. If using locale to find a default encoding is the best approach, perhaps we should go ahead and make the change you propose.

One place I see safe_decode() used in a questionable way is in Heat, in heat/engine/parser.py, where validation errors are being re-raised as StackValidationFailed (line 376 in my version). It's not clear why the message is processed the way it is, so I would want to understand the history before proposing a change there.

> (2) My second concern is that safe_encode(bytes, incoming, encoding) transcodes the bytes string from incoming to encoding if these two encodings are different.
>
> When I port code to Python 3, I'm looking for a function to replace this common pattern:
>
>     if isinstance(data, six.text_type):
>         data = data.encode(encoding)
>
> I don't want to modify data encoding if it is already a bytes string. So I would prefer to have:
>
>     def safe_encode(data, encoding='utf-8'):
>         if isinstance(data, six.text_type):
>             data = data.encode(encoding)
>         return data
>
> Changing safe_encode() like this will break applications relying on the transcode feature (incoming -> encoding). If such usage exists, I suggest adding a new function (e.g. transcode?) with an API fitting this use case. For example, the incoming encoding would be mandatory.
>
> Are there really applications using the incoming parameter?

The only place I see that parameter used in integrated projects is in the tests for the module. I didn't check the non-integrated projects. Given its symmetry with safe_decode(), I don't really see a problem with the current name. Something like the shortcut you propose is present in safe_encode(), so I'm not sure what benefit a new function brings?

Doug

> So, what do you think about that?
>
> Victor
Re: [openstack-dev] [oslo] strutils: enhance safe_decode() and safe_encode()
On 05/15/2014 11:41 AM, Victor Stinner wrote:
> Hi,
>
> The functions safe_decode() and safe_encode() have been ported to Python 3, and changed more than once. IMO we can still improve these functions to make them more reliable and easier to use.
>
> (1) My first concern is that these functions try to guess user expectations about encodings. They use sys.stdin.encoding or sys.getdefaultencoding() as the default encoding to decode, but this encoding depends on the locale encoding (stdin encoding), on stdin (is stdin a TTY? is stdin mocked?), and on the Python major version.
>
> IMO the default encoding should be UTF-8 because most OpenStack components expect this encoding.
>
> Or maybe users want to display data to the terminal, and so the locale encoding should be used? In this case, locale.getpreferredencoding() would be more reliable than sys.stdin.encoding.

The problem is you can't know the correct encoding to use until you know the encoding of the IO stream, therefore I don't think you can correctly write generic encode/decode functions. What if you're trying to send the output to multiple IO streams, potentially with different encodings? Think that's far-fetched? Nope, it's one of the nastiest and most common problems in Python 2. The default encoding differs depending on whether the IO target is a TTY or not. Therefore code that works fine when written to the terminal blows up with encoding errors when redirected to a file (because the TTY probably has UTF-8 and all other streams default to ASCII due to sys.getdefaultencoding()).

Another problem is that the Python 2 default encoding is ASCII but in Python 3 it's UTF-8 (IMHO the default encoding in Python 2 should have been UTF-8; the fact that it was set to ASCII is the cause of 99% of the encoding exceptions in Python 2).

Given that you don't know what the encoding of the IO stream is, I don't think you should base it on the locale or sys.stdin. Rather I think we should just agree everything is UTF-8. If that messes up someone's terminal output, I think it's fair to say that if you're running OpenStack you'll need to switch to UTF-8. Anything else requires way more knowledge than we have available in a generic function. Solving this so the encodings match for each and every IO stream is very complicated; note that Python 3 still punts on this.

--
John
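John's "just agree everything is UTF-8" position can be sketched as below. The idea is to stop relying on whatever encoding the interpreter picked for a stream and to wrap the binary stream explicitly. This is a Python 3 illustration, not code from the thread; utf8_writer() is a hypothetical helper name, and an in-memory buffer stands in for a redirected file.

```python
import io


def utf8_writer(binary_stream):
    """Hypothetical helper: text written here is always encoded as UTF-8.

    newline='\n' disables platform newline translation so the bytes are
    the same on every OS.
    """
    return io.TextIOWrapper(binary_stream, encoding='utf-8', newline='\n')


# An in-memory buffer stands in for a file or pipe the output is sent to.
buf = io.BytesIO()
out = utf8_writer(buf)
out.write(u'caf\u00e9\n')
out.flush()
data = buf.getvalue()
```

Because the encoding is fixed at the point where the stream is wrapped, the result no longer depends on whether the target is a TTY, which is the failure mode described above.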
Re: [openstack-dev] [oslo] strutils: enhance safe_decode() and safe_encode()
On Wed, May 21, 2014 at 12:30 PM, John Dennis jden...@redhat.com wrote:
> On 05/15/2014 11:41 AM, Victor Stinner wrote:
>> Hi,
>>
>> The functions safe_decode() and safe_encode() have been ported to Python 3, and changed more than once. IMO we can still improve these functions to make them more reliable and easier to use.
>>
>> (1) My first concern is that these functions try to guess user expectations about encodings. They use sys.stdin.encoding or sys.getdefaultencoding() as the default encoding to decode, but this encoding depends on the locale encoding (stdin encoding), on stdin (is stdin a TTY? is stdin mocked?), and on the Python major version.
>>
>> IMO the default encoding should be UTF-8 because most OpenStack components expect this encoding.
>>
>> Or maybe users want to display data to the terminal, and so the locale encoding should be used? In this case, locale.getpreferredencoding() would be more reliable than sys.stdin.encoding.
>
> The problem is you can't know the correct encoding to use until you know the encoding of the IO stream, therefore I don't think you can correctly write generic encode/decode functions. What if you're trying to send the output to multiple IO streams, potentially with different encodings? Think that's far-fetched? Nope, it's one of the nastiest and most common problems in Python 2. The default encoding differs depending on whether the IO target is a TTY or not. Therefore code that works fine when written to the terminal blows up with encoding errors when redirected to a file (because the TTY probably has UTF-8 and all other streams default to ASCII due to sys.getdefaultencoding()).
>
> Another problem is that the Python 2 default encoding is ASCII but in Python 3 it's UTF-8 (IMHO the default encoding in Python 2 should have been UTF-8; the fact that it was set to ASCII is the cause of 99% of the encoding exceptions in Python 2).
>
> Given that you don't know what the encoding of the IO stream is, I don't think you should base it on the locale or sys.stdin. Rather I think we should just agree everything is UTF-8. If that messes up someone's terminal output, I think it's fair to say that if you're running OpenStack you'll need to switch to UTF-8. Anything else requires way more knowledge than we have available in a generic function. Solving this so the encodings match for each and every IO stream is very complicated; note that Python 3 still punts on this.

Unfortunately we can't just agree to a single encoding in all cases. Lots of people use encodings other than UTF-8 for terminals, and that's where these functions are most frequently used.

Doug

> --
> John
[openstack-dev] [oslo] strutils: enhance safe_decode() and safe_encode()
Hi,

The functions safe_decode() and safe_encode() have been ported to Python 3, and changed more than once. IMO we can still improve these functions to make them more reliable and easier to use.

(1) My first concern is that these functions try to guess user expectations about encodings. They use sys.stdin.encoding or sys.getdefaultencoding() as the default encoding to decode, but this encoding depends on the locale encoding (stdin encoding), on stdin (is stdin a TTY? is stdin mocked?), and on the Python major version.

IMO the default encoding should be UTF-8 because most OpenStack components expect this encoding.

Or maybe users want to display data to the terminal, and so the locale encoding should be used? In this case, locale.getpreferredencoding() would be more reliable than sys.stdin.encoding.

(2) My second concern is that safe_encode(bytes, incoming, encoding) transcodes the bytes string from incoming to encoding if these two encodings are different.

When I port code to Python 3, I'm looking for a function to replace this common pattern:

    if isinstance(data, six.text_type):
        data = data.encode(encoding)

I don't want to modify data encoding if it is already a bytes string. So I would prefer to have:

    def safe_encode(data, encoding='utf-8'):
        if isinstance(data, six.text_type):
            data = data.encode(encoding)
        return data

Changing safe_encode() like this will break applications relying on the transcode feature (incoming -> encoding). If such usage exists, I suggest adding a new function (e.g. transcode?) with an API fitting this use case. For example, the incoming encoding would be mandatory.

Are there really applications using the incoming parameter?

So, what do you think about that?

Victor
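For readers following proposal (2), here is a minimal Python 3 sketch of the two functions it describes. It is an illustration only: str stands in for six.text_type so the example has no third-party dependency, and transcode() is a hypothetical shape for the helper the message merely floats (its exact signature is an assumption; only the mandatory incoming argument comes from the text).

```python
def safe_encode(data, encoding='utf-8'):
    """Encode text to bytes; bytes are returned untouched (no transcoding)."""
    # The proposal checks six.text_type; on Python 3 that is simply str.
    if isinstance(data, str):
        data = data.encode(encoding)
    return data


def transcode(data, incoming, encoding='utf-8'):
    """Hypothetical helper: re-encode bytes from `incoming` to `encoding`.

    `incoming` has no default, so the caller must state the source encoding.
    """
    if not isinstance(data, bytes):
        raise TypeError('transcode() expects bytes, got %s'
                        % type(data).__name__)
    return data.decode(incoming).encode(encoding)


utf8_from_text = safe_encode(u'caf\u00e9')           # text is encoded
latin1_kept = safe_encode(b'caf\xe9')                # bytes pass through
utf8_from_latin1 = transcode(b'caf\xe9', 'latin-1')  # explicit transcoding
```

Separating the two keeps safe_encode() free of guesses about the source encoding, while callers that really need transcoding have to say where the bytes came from.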