Re: [Python-Dev] Add transform() and untranform() methods
On 16 Nov 2013 10:47, Victor Stinner victor.stin...@gmail.com wrote:
> 2013/11/16 Nick Coghlan ncogh...@gmail.com:
>> To address Serhiy's security concerns with the compression codecs (which are technically independent of the question of restoring the aliases), I also plan to document how to systematically blacklist particular codecs in an application by setting attributes on the encodings module and/or appropriate entries in sys.modules.
>
> It would be simpler and safer to blacklist bytes=>bytes and str=>str codecs from bytes.decode() and str.encode() directly. Marc-Andre Lemburg proposed to add new attributes in CodecInfo to specify input and output types.

Yes, but that type compatibility introspection is a change for 3.5 at the earliest (although I commented on http://bugs.python.org/issue19619 with two alternate suggestions that I think would be reasonable to implement for 3.4).

Everything codec related that I am doing at the moment is about improving the state of 3.4 and source compatible 2/3 code. Proposals for further 3.5+ only improvements are relevant only in the sense that I don't want to lock us out from future improvements (which is why my main aim is to clarify the status quo, with the only functional changes related to restoring feature parity with Python 2 for non-Unicode codecs).

>> The only functional *change* I'd still like to make for 3.4 is to restore the shorthand aliases for the non-Unicode codecs (to ease the migration for folks coming from Python 2), but this thread has convinced me I likely need to write the PEP *before* doing that, and I still have to integrate ensurepip into pyvenv before the beta 1 deadline.
>>
>> So unless you and Victor are prepared to +1 the restoration of the codec aliases (closing issue 7475) in anticipation of that codecs infrastructure documentation PEP, the change to restore the aliases probably won't be in 3.4. (I *might* get the PEP written in time regardless, but I'm not betting on it at this point).
>
> Using the StackOverflow search engine, I found some posts where people ask for the hex codec on Python 3. There are two answers: use the binascii module or use codecs.encode(). So even if codecs.encode() was never documented, it looks like it is used. So I now agree that documenting it would not make the situation worse.

Aye, that was my conclusion (hence my proposal on issue 7475 back in April). Can I take that observation as a +1 for restoring the aliases as well? (That and more efficiently rejecting the non-Unicode codecs from str.encode, bytes.decode and bytearray.decode are the only aspects of this subject to the beta 1 deadline - we can be a bit more leisurely when it comes to working out the details of the docs updates)

> Adding transform()/untransform() methods to bytes and str is a non-trivial change and not everybody likes them. Anyway, it's too late for Python 3.4.
>
> In my opinion, the best option is to add new input_type/output_type attributes to CodecInfo right now, and modify the codecs so "abc".encode("hex") raises a LookupError (instead of a tricky error message with some evil low-level hacks on the traceback and the exception, which is my initial concern in this mail thread). It also fixes the security vulnerability.

The C level code for catching the input type errors only looks evil because:

- the C level equivalent of "except Exception as Y: raise X from Y" is just plain ugly in the first place
- the chaining includes a *lot* of checks of the original exception to ensure that no data is lost by raising a new instance of the same exception type and chaining
- it chains ValueError, AttributeError and any other currently stateless (aside from a str description) error the codec might throw, not just input type validation errors (it deliberately doesn't chain stateful errors as doing so might be backwards incompatible with existing error handling).

However, the ugliness of that code is the reason I'm intrigued by the possibility of traceback annotations as a potentially cleaner solution than trying to seamlessly wrap exceptions with a new one that adds more context information. While I think the gain in codec debuggability is worth it in this case, my concern over the complexity and the current limitations are the reason I didn't make it a public C API.

> To keep backward compatibility (even with custom codecs registered manually), if input_type/output_type is not defined, we should consider that the codec is a classical text encoding (encode str=>bytes, decode bytes=>str).

Without an already existing ByteSequence ABC, it isn't feasible to propose and implement this completely in the 3.4 time frame (since you would need such an ABC to express the input type accepted by our Unicode and binary codecs - the only one that wouldn't need it is rot_13, since that's str->str). However, the output types could be expressed solely as concrete types, and that's all we need for the blacklist (since we could replace the current
Re: [Python-Dev] Add transform() and untranform() methods
Why not use the str type for str and str subtypes, and the bytes type for bytes and bytes-like objects (bytearray, memoryview)? I don't think that we need an ABC here.

Victor
Re: [Python-Dev] Add transform() and untranform() methods
On 16 November 2013 20:45, Victor Stinner victor.stin...@gmail.com wrote:
> Why not use the str type for str and str subtypes, and the bytes type for bytes and bytes-like objects (bytearray, memoryview)? I don't think that we need an ABC here.

We'd only need an ABC if info was added for supported input types. However, that's not necessary since "encodes_to" and "decodes_to" are enough to identify Unicode encodings: "encodes_to in (None, bytes) and decodes_to in (None, str)", so we don't need to track input type support at all if the main question we want to answer is "is this a Unicode codec or not?".

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Add transform() and untranform() methods
On 16.11.2013 01:47, Victor Stinner wrote:
> Adding transform()/untransform() methods to bytes and str is a non-trivial change and not everybody likes them. Anyway, it's too late for Python 3.4.

Just to clarify: I still like the idea of adding those methods. I just don't see what this addition has to do with the codecs.encode()/.decode() functions.

--
Marc-Andre Lemburg
eGenix.com
Re: [Python-Dev] Add transform() and untranform() methods
On Sat, 16 Nov 2013 19:44:51 +1000 Nick Coghlan ncogh...@gmail.com wrote:
> Aye, that was my conclusion (hence my proposal on issue 7475 back in April). Can I take that observation as a +1 for restoring the aliases as well?

I see no harm in restoring the aliases personally, so +1 from me.

Regards

Antoine.
Re: [Python-Dev] Add transform() and untranform() methods
On 16 November 2013 21:38, Nick Coghlan ncogh...@gmail.com wrote:
> On 16 November 2013 20:45, Victor Stinner victor.stin...@gmail.com wrote:
>> Why not use the str type for str and str subtypes, and the bytes type for bytes and bytes-like objects (bytearray, memoryview)? I don't think that we need an ABC here.
>
> We'd only need an ABC if info was added for supported input types. However, that's not necessary since "encodes_to" and "decodes_to" are enough to identify Unicode encodings: "encodes_to in (None, bytes) and decodes_to in (None, str)", so we don't need to track input type support at all if the main question we want to answer is "is this a Unicode codec or not?".

I realised I misunderstood your proposal because of the field names you initially suggested. I've now proposed a variation with different field names (encodes_to instead of output_type and decodes_to instead of input_type) and a codecs.is_text_encoding query function (http://bugs.python.org/issue19619?#msg203037)

Cheers,
Nick.
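For readers who want to try the "is this a Unicode codec or not?" question today, here is a minimal sketch. It assumes the private CodecInfo._is_text_encoding flag that CPython 3.4 and later appear to carry; that flag is an implementation detail rather than the public codecs.is_text_encoding function proposed above, and the helper name is_text_encoding is only illustrative.

```python
import codecs


def is_text_encoding(name):
    """Best-effort check whether a codec name refers to a str <-> bytes encoding.

    Assumption: relies on the private CodecInfo._is_text_encoding flag
    (CPython 3.4+). Codecs that do not set it are treated as text encodings,
    matching the "classical text encoding" default discussed in the thread.
    """
    info = codecs.lookup(name)  # raises LookupError for unknown codec names
    return getattr(info, "_is_text_encoding", True)


text_like = is_text_encoding("utf-8")        # an ordinary Unicode encoding
binary_like = is_text_encoding("hex_codec")  # a bytes -> bytes transform
```

Because the flag is private, code relying on it should be prepared for it to change between releases.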
Re: [Python-Dev] Add transform() and untranform() methods
On 16 November 2013 21:49, M.-A. Lemburg m...@egenix.com wrote:
> On 16.11.2013 01:47, Victor Stinner wrote:
>> Adding transform()/untransform() methods to bytes and str is a non-trivial change and not everybody likes them. Anyway, it's too late for Python 3.4.
>
> Just to clarify: I still like the idea of adding those methods. I just don't see what this addition has to do with the codecs.encode()/.decode() functions.

Part of the interest here is in making Python 3 better compete with the ease of the following in Python 2:

    >>> "68656c6c6f".decode("hex")
    'hello'
    >>> "hello".encode("hex")
    '68656c6c6f'

Until recently, I (and others) thought the best Python 3 had to offer was:

    >>> import codecs
    >>> codecs.getencoder("hex")(b"hello")[0]
    b'68656c6c6f'
    >>> codecs.getdecoder("hex")(b"68656c6c6f")[0]
    b'hello'

In reality, though, Python 3 has always supported the following, it just wasn't documented so I (and others) didn't know it had actually been available as an alternative interface to the codecs machinery since Python 2.4:

    >>> from codecs import encode, decode
    >>> encode(b"hello", "hex")
    b'68656c6c6f'
    >>> decode(b"68656c6c6f", "hex")
    b'hello'

That's almost as clean as the Python 2 version, it just requires the initial import of the convenience functions from the codecs module. The fact it is supported in Python 2 means that 2/3 compatible codecs can also use it.

Accordingly, I now see ensuring that everyone has a common understanding of *what is already available* as an essential next step, and only then consider significant changes in the codecs mechanisms*. I know I learned a hell of a lot about the distinction between the type agnostic codec infrastructure and the Unicode text model over the past several months, and I think this thread shows clearly that there's still a lot of confusion over the matter, even amongst core developers.

That's a problem, and something we need to fix before giving further consideration to the transform/untransform idea.

*(Victor's proposal in issue 19619 is actually relatively modest, now that I understand it properly, and entails taking the existing output type checks and making it possible to do them in advance, without touching input type checks)

Cheers,
Nick.
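As a small illustration of the two spellings mentioned in this thread (the codecs convenience functions and the binascii module), the sketch below hex-encodes the same bytes both ways. It uses the long codec name "hex_codec" on the assumption that the shorthand "hex" alias may not be present on every Python 3 release.

```python
import binascii
import codecs

data = b"hello"

# Module-level convenience functions from the codecs module
hex_via_codecs = codecs.encode(data, "hex_codec")
round_trip = codecs.decode(hex_via_codecs, "hex_codec")

# The dedicated module offers the same transformation
hex_via_binascii = binascii.hexlify(data)
```

Both calls take a bytes-like object and return bytes, so neither goes through str.encode() or bytes.decode().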
Re: [Python-Dev] Add transform() and untranform() methods
On 15.11.2013 08:13, Nick Coghlan wrote:
> On 15 November 2013 11:10, Terry Reedy tjre...@udel.edu wrote:
>> On 11/14/2013 5:32 PM, Victor Stinner wrote:
>>> I don't like the functions codecs.encode() and codecs.decode() because the type of the result depends on the encoding (second parameter). We try to avoid this in Python.
>>
>> Such dependence is common with arithmetic.
>>
>>     >>> 1 + 2
>>     3
>>     >>> 1 + 2.0
>>     3.0
>>     >>> 1 + 2+0j
>>     (3+0j)
>>     >>> sum((1,2,3), 0)
>>     6
>>     >>> sum((1,2,3), 0.0)
>>     6.0
>>     >>> sum((1,2,3), 0.0+0j)
>>     (6+0j)
>>
>>     for f in (compile, eval, getattr, iter, max, min, next, open, pow, round, type, vars):
>>         type(f(*args)) # depends on the inputs
>>
>> That is a large fraction of the non-class builtin functions.
>
> *Type* dependence between inputs and outputs is common (and completely non-controversial). The codecs system is different, since the supported input and output types are *value* dependent, driven by the name of the codec.
>
> That's the part which makes the codec machinery interesting in general, since it combines a value driven lazy loading mechanism (based on the codec name) with the subsequent invocation of that mechanism: the default codec search algorithm goes hunting in the "encodings" package (or the alias dictionary), but you can register custom search algorithms and provide encodings any way you want. It does mean, however, that the most you can claim for the type signature of codecs.encode and codecs.decode is that they accept an object and return an object. Beyond that, it's completely driven by the value of the codec.

Indeed. You have to think of the codec registry as a mere lookup mechanism - very much like an import. The implementation of the imported module defines which types are supported and how the encode/decode steps work.

> In Python 2.x, the type constraints imposed by the str and unicode convenience methods is "basestring in, basestring out". As it happens, all of the standard library codecs abide by that restriction, so it was easy to interpret the codecs module itself as having the same "basestring in, basestring out" limitation, especially given the heavy focus on text encodings in the way it was documented. In practice, the codecs weren't that open ended - some of them only accepted 8 bit strings, some only accepted unicode, some accepted both (perhaps relying on implicit decoding to unicode).
>
> The migration to Python 3 made the contrast between the two far more stark however, hence the long and involved discussion on issue 7475, and the fact that the non-Unicode codecs are currently still missing their shorthand aliases.
>
> The proposal I posted to issue 7475 back in April (and, in the absence of any objections to the proposal, finally implemented over the past few weeks) was to take advantage of the fact that the codecs.encode and codecs.decode convenience functions exist (and have been covered by the regression test suite) as far back as Python 2.4. I did this merely by documenting the existence of the functions for Python 2.7, 3.3 and 3.4, changing the exception messages thrown for codec output type errors on the convenience methods to reference them, and by updating the Python 3.4 What's New document to explain the changes.
>
> This approach provides a Python 2/3 compatible solution for usage of non-Unicode encodings: users simply need to call the existing module level functions in the codecs module, rather than using the methods on specific builtin types. This approach also means that the binary codecs can be used with any bytes-like object (including memoryview and array.array), rather than being limited to types that implement a new method (like "transform"), and can also be used in Python 2/3 source compatible APIs (since the data driven nature of the problem makes 2to3 unusable as a solution, and that doesn't help single code base projects anyway).

Right, and that was the main point in making codecs flexible in this respect. There are many other types which can serve as input and output - in the stdlib and interpreter as well as in extension modules that implement their own types.

> From my point of view, this is now just a matter of better documenting the status quo, and nudging people in the right direction when it comes to using the appropriate API for non-Unicode codecs. Since we now realise these functions have existed since Python 2.4, it doesn't make sense to try to fundamentally change direction, but instead to work on making it better.
>
> A few things I noticed while implementing the recent updates:
>
> - as you noted in your other email, while MAL is on record as saying the codecs module is intended for arbitrary codecs, not just Unicode encodings, readers of the current docs can definitely be forgiven for not realising that. We really need to better separate the codecs module docs from the text model docs (two new sections in the language reference, one for the codecs machinery and one for the text model would likely be appropriate. The io
Re: [Python-Dev] Add transform() and untranform() methods
On Fri, 15 Nov 2013 09:03:37 +1000 Nick Coghlan ncogh...@gmail.com wrote:
>> And add transform() and untransform() methods to bytes and str types. In practice, it might be same codecs registry for all codecs just with a new attribute.
>
> This is completely the wrong approach. There's zero justification for adding new builtin methods for this use case - encoding and decoding are generic operations, they should use functions not methods.

I'm sorry, I disagree. The question is what use case it is solving, and there's zero benefit in writing codecs.encode(b, "zlib") compared to e.g. zlib.compress(b).

A transform() or untransform() method, however, allows for a much more convenient spelling, with easy cascading, e.g.:

    b.transform("zlib").transform("base64")

In other words, there's zero justification for codecs.encode() and codecs.decode(). The fact that the codecs machinery works on arbitrary object transformation is a pointless genericity, if it doesn't bring any additional convenience compared to the canonical functions in their respective modules.

> At this point, the only person that can get me to revert this clarification of MAL's original vision for the codecs module is Guido, since anything else completely fails to address the Python 3 adoption barrier posed by the current state of Python 3's binary codec support.

I'd like to challenge your assertion that your change addresses anything. It's not easier to change b.encode("zlib") into codecs.encode(b, "zlib"), than it is to change it into zlib.compress(b).

Regards,

Antoine.
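The cascading spelling above can be approximated with the existing module-level functions. The following is only a sketch: transform() and untransform() here are hypothetical helper functions written for illustration, not the proposed methods, and the long codec names "zlib_codec" and "base64_codec" are used on the assumption that the short aliases may be unavailable.

```python
import codecs
from functools import reduce


def transform(data, *codec_names):
    """Hypothetical helper: apply each named bytes-to-bytes codec in order."""
    # reduce calls codecs.encode(accumulated_data, codec_name) for each name
    return reduce(codecs.encode, codec_names, data)


def untransform(data, *codec_names):
    """Hypothetical helper: undo transform() by decoding in reverse order."""
    return reduce(codecs.decode, reversed(codec_names), data)


payload = b"spam " * 100
packed = transform(payload, "zlib_codec", "base64_codec")
restored = untransform(packed, "zlib_codec", "base64_codec")
```

Whether this reads better than calling zlib.compress() and base64.encodebytes() directly is exactly the point under dispute in this message.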
Re: [Python-Dev] Add transform() and untranform() methods
On Fri, Nov 15, 2013 at 05:13:34PM +1000, Nick Coghlan wrote:
> A few things I noticed while implementing the recent updates:
>
> - as you noted in your other email, while MAL is on record as saying the codecs module is intended for arbitrary codecs, not just Unicode encodings, readers of the current docs can definitely be forgiven for not realising that. We really need to better separate the codecs module docs from the text model docs (two new sections in the language reference, one for the codecs machinery and one for the text model would likely be appropriate. The io module docs and those for the builtin open function may also be affected)
> - a mechanism for annotating frames would help avoid the need for nasty hacks like the exception wrapping that aims to make codec failures easier to debug
> - if codecs exposed a way to separate the input type check from the invocation of the codec, we could redirect users to the module API for bad input types as well (e.g. calling "input str".encode("bz2"))
> - if we want something that doesn't need to be imported, then encode() and decode() builtins make more sense than new methods on str, bytes and bytearray objects (since builtins would support memoryview and array.array as well, and it avoids ambiguity regarding the direction of the operation)

Sounds good to me.

> - the codecs module should offer a way to register a new alias for an existing codec
> - the codecs module should offer a way to map a name to a CodecInfo object without registering a new search function

It would be really good to be able to query the available codecs. For example, many applications offer an "Encoding" menu, where you can specify the codec used for text. That's hard in Python, since you can't retrieve a list of known codecs.

--
Steven
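One partial workaround for the missing "list of known codecs" is to walk the alias table shipped in the encodings package. This is a sketch under stated assumptions: it only sees codecs reachable through encodings.aliases.aliases, so anything provided by a custom search function is invisible to it, and the helper name known_stdlib_codecs is made up for this example.

```python
import codecs
import encodings.aliases


def known_stdlib_codecs():
    """Map codec module names from the stdlib alias table to CodecInfo objects.

    Assumption: encodings.aliases.aliases is treated as the source of truth.
    Names that cannot be looked up on this platform are skipped.
    """
    found = {}
    for name in sorted(set(encodings.aliases.aliases.values())):
        try:
            found[name] = codecs.lookup(name)
        except LookupError:
            continue  # e.g. platform-specific codecs that are not available
    return found


available = known_stdlib_codecs()
```

An "Encoding" menu could then be built from the keys of the returned dictionary, with the caveat above about codecs registered elsewhere.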
Re: [Python-Dev] Add transform() and untranform() methods
15.11.13 12:02, Steven D'Aprano wrote:
> It would be really good to be able to query the available codecs. For example, many applications offer an "Encoding" menu, where you can specify the codec used for text. That's hard in Python, since you can't retrieve a list of known codecs.

And you can't determine which codec is a binary->text encoding.
Re: [Python-Dev] Add transform() and untranform() methods
On Fri, Nov 15, 2013 at 10:22:28AM +0100, Antoine Pitrou wrote:
> On Fri, 15 Nov 2013 09:03:37 +1000 Nick Coghlan ncogh...@gmail.com wrote:
>>> And add transform() and untransform() methods to bytes and str types. In practice, it might be same codecs registry for all codecs just with a new attribute.
>>
>> This is completely the wrong approach. There's zero justification for adding new builtin methods for this use case - encoding and decoding are generic operations, they should use functions not methods.
>
> I'm sorry, I disagree. The question is what use case it is solving, and there's zero benefit in writing codecs.encode("zlib") compared to e.g. zlib.compress().

One benefit is:

    import codecs
    codec = get_name_of_compression_codec()
    result = codecs.encode(data, codec)

versus:

    codec = get_name_of_compression_codec()
    if codec == "zlib":
        import zlib
        encoder = zlib.compress
    elif codec == "bz2":
        import bz2
        encoder = bz2.compress
    elif codec == "gzip":
        import gzip
        encoder = gzip.compress
    elif codec == "squash":
        import mySquashLib
        encoder = mySquashLib.squash
    elif ...:
        # and so on
    result = encoder(data)

> A transform() or untransform() method, however, allows for a much more convenient spelling, with easy cascading, e.g.:
>
>     b.transform("zlib").transform("base64")

Yes, that's quite nice. Although it need not be a method, a built-in function works for me too:

    # either of these:
    transform(transform(b, "zlib"), "base64")
    encode(encode(b, "zlib"), "base64")

If encoding/decoding is intended to be completely generic (even if 99% of the uses will be with strings and bytes), is there any reason to prefer built-in functions rather than methods on object?

--
Steven
Re: [Python-Dev] Add transform() and untranform() methods
On Fri, 15 Nov 2013 21:28:35 +1100 Steven D'Aprano st...@pearwood.info wrote:
> One benefit is:
>
>     import codecs
>     codec = get_name_of_compression_codec()
>     result = codecs.encode(data, codec)

That's a good point.

> If encoding/decoding is intended to be completely generic (even if 99% of the uses will be with strings and bytes), is there any reason to prefer built-in functions rather than methods on object?

Practicality beats purity. Personally, I've never used codecs on anything else than str and bytes objects.

Regards

Antoine.
Re: [Python-Dev] Add transform() and untranform() methods
15.11.13 12:28, Steven D'Aprano wrote:
> One benefit is:
>
>     import codecs
>     codec = get_name_of_compression_codec()
>     result = codecs.encode(data, codec)

And this is a security hole if you don't check the codec name before calling the codec. See the topic about exploiting zip bombs via the codecs machinery.

Also, you usually need more than just decompressing binary data by Python codec name. You need to map the external compression name to the internal Python codec name, you need to configure the decompressor object with specific options, and perhaps you need different buffering strategies for different compression algorithms. See for example the zipfile and tarfile sources.
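The "check the codec name before calling" step might look like the sketch below. The allow-list contents and the function name decode_checked are hypothetical; a real application would also want a cap on decompressed size, which this sketch does not attempt, so it does not by itself defend against zip bombs from an allowed codec.

```python
import codecs
import zlib

# Hypothetical allow-list: only these codec names may be chosen by external input
ALLOWED_CODECS = frozenset({"zlib_codec", "bz2_codec"})


def decode_checked(data, codec_name):
    """Decode data with codec_name only if the name is on the allow-list."""
    if codec_name not in ALLOWED_CODECS:
        raise ValueError("codec %r is not permitted here" % (codec_name,))
    return codecs.decode(data, codec_name)


compressed = zlib.compress(b"abc")
plain = decode_checked(compressed, "zlib_codec")
```

Checking the name first means an unexpected value such as "hex_codec" is rejected before any codec module is looked up or run.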
Re: [Python-Dev] Add transform() and untranform() methods
On 15 November 2013 20:33, Antoine Pitrou solip...@pitrou.net wrote:
> On Fri, 15 Nov 2013 21:28:35 +1100 Steven D'Aprano st...@pearwood.info wrote:
>> One benefit is:
>>
>>     import codecs
>>     codec = get_name_of_compression_codec()
>>     result = codecs.encode(data, codec)
>
> That's a good point.
>
>> If encoding/decoding is intended to be completely generic (even if 99% of the uses will be with strings and bytes), is there any reason to prefer built-in functions rather than methods on object?
>
> Practicality beats purity. Personally, I've never used codecs on anything else than str and bytes objects.

The reason I'm now putting some effort into better documenting the status quo for codec handling in Python 3 and filing off some of the rough edges (rather than proposing adding any new APIs to Python 3.x) is because the users I care about in this matter are web developers that already make use of the binary codecs and are adopting the single-source approach to handle supporting both Python 2 and Python 3. Armin Ronacher is the one who's been most vocal about the problem, but he's definitely not alone.

A new API for binary transforms is potentially an academically interesting concept, but it solves zero current real world problems. By contrast, being clear about the fact that codecs.encode and codecs.decode exist and are available as far back as Python 2.4 helps to eliminate a genuine barrier to Python 3 adoption for a subset of the community.

Cheers,
Nick.
Re: [Python-Dev] Add transform() and untranform() methods
On Fri, 15 Nov 2013 21:45:31 +1000 Nick Coghlan ncogh...@gmail.com wrote:
> The reason I'm now putting some effort into better documenting the status quo for codec handling in Python 3 and filing off some of the rough edges (rather than proposing adding any new APIs to Python 3.x) is because the users I care about in this matter are web developers that already make use of the binary codecs and are adopting the single-source approach to handle supporting both Python 2 and Python 3. Armin Ronacher is the one who's been most vocal about the problem, but he's definitely not alone.

zlib.compress(something) works on both Python 2 and Python 3, why do you need something else?

Regards

Antoine.
Re: [Python-Dev] Add transform() and untranform() methods
2013/11/15 Nick Coghlan ncogh...@gmail.com:
> The reason I'm now putting some effort into better documenting the status quo for codec handling in Python 3 and filing off some of the rough edges (rather than proposing adding any new APIs to Python 3.x) is because the users I care about in this matter are web developers that already make use of the binary codecs and are adopting the single-source approach to handle supporting both Python 2 and Python 3. Armin Ronacher is the one who's been most vocal about the problem, but he's definitely not alone.

Apart from Armin Ronacher, I have never seen anyone blocked when trying to port a project to Python 3 because of these bytes=>bytes and str=>str codecs.

I did a quick search on Google but I failed to find a question like "how can I write .encode("hex") or .encode("zlib") in Python 3?". It was just a quick search; it's likely that many developers hit this Python 3 regression, but I'm confident that developers are able to work around this regression themselves (e.g. by using the right Python module directly).

I have seen a lot of huge code bases ported to Python 3 without the need for these codecs. For example, Django, which is a web framework, has been ported to Python 3. I know that Armin Ronacher also works on web things (I don't know what exactly).

> A new API for binary transforms is potentially an academically interesting concept, but it solves zero current real world problems.

I would like to reply the same for these codecs: they are not solving any real world problem :-)

Victor
Re: [Python-Dev] Add transform() and untranform() methods
On 15 November 2013 12:07, Victor Stinner victor.stin...@gmail.com wrote: A new API for binary transforms is potentially an academically interesting concept, but it solves zero current real world problems. I would like to reply the same for these codecs: they are not solving any real world problem :-) As Nick is only documenting long-existing functions, I fail to see the issue here. If someone were to propose new methods, builtins, or module functions, then I could see a reason for debate. But surely simply documenting existing functions is not worth all this pushback? Paul ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Add transform() and untranform() methods
On 15.11.2013 12:45, Nick Coghlan wrote:

On 15 November 2013 20:33, Antoine Pitrou solip...@pitrou.net wrote:

On Fri, 15 Nov 2013 21:28:35 +1100 Steven D'Aprano st...@pearwood.info wrote:

One benefit is:

    import codecs
    codec = get_name_of_compression_codec()
    result = codecs.encode(data, codec)

That's a good point. If encoding/decoding is intended to be completely generic (even if 99% of the uses will be with strings and bytes), is there any reason to prefer built-in functions rather than methods on object?

Practicality beats purity. Personally, I've never used codecs on anything else than str and bytes objects.

The reason I'm now putting some effort into better documenting the status quo for codec handling in Python 3 and filing off some of the rough edges (rather than proposing adding any new APIs to Python 3.x) is because the users I care about in this matter are web developers that already make use of the binary codecs and are adopting the single-source approach to handle supporting both Python 2 and Python 3. Armin Ronacher is the one who's been most vocal about the problem, but he's definitely not alone.

You can add me to that list :-). The hex codec especially is very handy. Google returns a few thousand hits for that codec alone.

One detail that people often tend to forget is the extensibility of the codec system. It is easy to add new codecs to the system, e.g. to perform encoding, escaping, compression or other conversion operations, so the set of codecs in the stdlib is not the complete set of codecs used in the wild, and it's not intended to be.

As an example: we've written codecs for customers that perform special types of XML un/escaping.

--
Marc-Andre Lemburg
eGenix.com
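The extensibility described above can be sketched with the stdlib registration hook. This is only an illustration under stated assumptions: the codec name `xml_escape_demo` and the helper functions are invented for this example and have nothing to do with the customer codecs mentioned in the message; `codecs.register()`, `codecs.CodecInfo` and `xml.sax.saxutils.escape()`/`unescape()` are real stdlib APIs.

```python
import codecs
from xml.sax.saxutils import escape, unescape

CODEC_NAME = "xml_escape_demo"  # hypothetical name, chosen for this sketch


def _encode(text, errors="strict"):
    # Codec functions return (output, number_of_input_items_consumed).
    return escape(text), len(text)


def _decode(text, errors="strict"):
    return unescape(text), len(text)


def _search(name):
    # Search functions return a CodecInfo for names they own, else None.
    if name == CODEC_NAME:
        return codecs.CodecInfo(encode=_encode, decode=_decode, name=CODEC_NAME)
    return None


codecs.register(_search)

# A str-to-str codec is reachable through the module-level functions.
escaped = codecs.encode("a < b & c", CODEC_NAME)
restored = codecs.decode(escaped, CODEC_NAME)
```

Because this codec maps str to str, it is exactly the kind of codec that the `str.encode()` convenience method rejects and that `codecs.encode()` accepts.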
Re: [Python-Dev] Add transform() and untranform() methods
On Thu, Nov 14, 2013 at 7:32 PM, Victor Stinner victor.stin...@gmail.com wrote:

I would prefer to split the registry of codecs to have 3 registries:

- encoding (a better name can be found): encode str=>bytes, decode bytes=>str
- bytes: encode bytes=>bytes, decode bytes=>bytes
- str: encode str=>str, decode str=>str

And add transform() and untransform() methods to bytes and str types. In practice, it might be same codecs registry for all codecs just with a new attribute.

I like this idea very much. But to check that I understand correctly, let me be more explicit... you'll have (of course, always py3k-speaking):

- bytes.decode() -> str ... here you can only use unicode encodings
- no bytes.encode(), like today
- bytes.transform() -> bytes ... here you can only use things like zlib, rot13, etc
- str.encode() -> bytes ... here you can only use unicode encodings
- no str.decode(), like today
- str.transform() -> str ... here you can only use things like... like what?

When to use decode/encode was always a major pain point for people, so doing this extra separation and cleaning would bring more clarity to when to use what.

Thanks!

--
.Facundo
Blog: http://www.taniquetil.com.ar/plog/
PyAr: http://www.python.org/ar/
Twitter: @facundobatista
Re: [Python-Dev] Add transform() and untranform() methods
On 15 November 2013 22:24, Paul Moore p.f.mo...@gmail.com wrote:

On 15 November 2013 12:07, Victor Stinner victor.stin...@gmail.com wrote:

A new API for binary transforms is potentially an academically interesting concept, but it solves zero current real world problems.

I would like to reply the same for these codecs: they are not solving any real world problem :-)

As Nick is only documenting long-existing functions, I fail to see the issue here. If someone were to propose new methods, builtins, or module functions, then I could see a reason for debate. But surely simply documenting existing functions is not worth all this pushback?

There's a bit more to it than that (and that's why I started the other thread about the codec aliases before proceeding to the final step).

One of the changes Victor is concerned about is that when you use an incorrect codec in one of the Unicode-encoding-only convenience methods, the recent exception updates explicitly push users towards using those module level functions instead:

    >>> import codecs
    >>> "no good".encode("rot_13")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: 'rot_13' encoder returned 'str' instead of 'bytes'; use codecs.encode() to encode to arbitrary types
    >>> codecs.encode("just fine", "rot_13")
    'whfg svar'
    >>> b"no good".decode("quopri_codec")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: 'quopri_codec' decoder returned 'bytes' instead of 'str'; use codecs.decode() to decode to arbitrary types
    >>> codecs.decode(b"just fine", "quopri_codec")
    b'just fine'

My perspective is that, in current Python, that *is* the right thing for people to do, and any hypothetical new API proposed for Python 3.5 would do nothing to change what's right for Python 3.4 code (or Python 2/3 compatible code). I also find it bizarre that several of those arguing that this is too niche a feature to be worth refining are simultaneously in favour of a proposal to add new *methods on builtin types* for the same niche feature.
The other part is the fact that I updated the What's New document to highlight these tweaks: http://docs.python.org/dev/whatsnew/3.4.html#improvements-to-handling-of-non-unicode-codecs

As noted earlier in the thread, Armin Ronacher has been the most vocal of the users of this feature in Python 2 that lamented its absence in Python 3 (see, for example, http://lucumr.pocoo.org/2012/8/11/codec-confusion/), but I've also received plenty of subsequent feedback along the lines of "what he said!" (such as http://bugs.python.org/issue7475#msg187630).

Many of the proposed solutions from the people affected by the change haven't been usable (since they've often been based on a misunderstanding of why the method behaviour changed in Python 3 in the first place), but the pain they experience is genuine, and it can unnecessarily sour their whole experience of the transition. I consider documenting the existing module level functions and nudging users towards them when they try to use the affected codecs to be an expedient way to say "yes, this is still available if you really want to use it, but the required spelling is different".

However, the one thing I'm *not* going to do at this point is restore the shorthand aliases, so those opposing the lowering of this barrier to transition can take comfort in the fact they have succeeded in ensuring that the out-of-the-box experience for users of this feature migrating from Python 2 remains the unfriendly:

    >>> b"abcdef".decode("hex")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    LookupError: unknown encoding: hex

Rather than the more useful:

    >>> b"abcdef".decode("hex")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: 'hex' decoder returned 'bytes' instead of 'str'; use codecs.decode() to decode to arbitrary types

Which would then lead them to the working (and still Python 2 compatible) code:

    >>> codecs.decode(b"abcdef", "hex")
    b'\xab\xcd\xef'

Regards, Nick.
--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
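The spelling recommended in the message above can be written as a small script. This is a sketch, not a statement about any particular release: it uses the full codec name `hex_codec` rather than the `hex` alias under discussion, and it catches both exception types because which one the bytes method raises on Python 3 depends on the version (on Python 2 the method call simply succeeds).

```python
import codecs

# Bytes-to-bytes: turn hex digits into raw bytes, and back again.
raw = codecs.decode(b"abcdef", "hex_codec")
hex_again = codecs.encode(raw, "hex_codec")

# Str-to-str: rot_13 through the same module-level function.
rot = codecs.encode("just fine", "rot_13")

# The convenience method is restricted to text encodings on Python 3,
# so this call is expected to fail there.
try:
    b"abcdef".decode("hex_codec")
except (TypeError, LookupError) as exc:
    method_error = type(exc).__name__
else:
    method_error = None  # Python 2 behaviour: no error
```

The module-level calls are the part that stays source compatible across Python 2 and Python 3.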
Re: [Python-Dev] Add transform() and untranform() methods
On Fri, 15 Nov 2013 23:50:23 +1000 Nick Coghlan ncogh...@gmail.com wrote:

My perspective is that, in current Python, that *is* the right thing for people to do, and any hypothetical new API proposed for Python 3.5 would do nothing to change what's right for Python 3.4 code (or Python 2/3 compatible code). I also find it bizarre that several of those arguing that this is too niche a feature to be worth refining are simultaneously in favour of a proposal to add new *methods on builtin types* for the same niche feature.

I am not claiming it is a niche feature, I am claiming codecs.encode() and codecs.decode() don't solve the use case like the .transform() and .untransform() methods do.

(I do think it is a nice feature in Python 2, although I find myself using it mainly at the interpreter prompt, rather than in production code)

As noted earlier in the thread, Armin Ronacher has been the most vocal of the users of this feature in Python 2 that lamented its absence in Python 3 (see, for example, http://lucumr.pocoo.org/2012/8/11/codec-confusion/), but I've also received plenty of subsequent feedback along the lines of "what he said!" (such as http://bugs.python.org/issue7475#msg187630).

The way I read it, the positive feedback was about .transform() and .untransform(), not about recommending people switch to codecs.encode() and codecs.decode().

Rather than the more useful:

    >>> b"abcdef".decode("hex")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: 'hex' decoder returned 'bytes' instead of 'str'; use codecs.decode() to decode to arbitrary types

I think this may be confusing. TypeError seems to suggest that the parameter type sent by the user to the method is wrong, which is not the actual cause of the error.

Regards

Antoine.
Re: [Python-Dev] Add transform() and untranform() methods
On 16 November 2013 00:04, Antoine Pitrou solip...@pitrou.net wrote:

Rather than the more useful:

    >>> b"abcdef".decode("hex")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: 'hex' decoder returned 'bytes' instead of 'str'; use codecs.decode() to decode to arbitrary types

I think this may be confusing. TypeError seems to suggest that the parameter type sent by the user to the method is wrong, which is not the actual cause of the error.

The TypeError isn't new, only the part after the semi-colon telling them that codecs.decode() doesn't include the typecheck (because it isn't constrained by the text model).

Cheers, Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
Re: [Python-Dev] Add transform() and untranform() methods
Walter Dörwald writes:

On 15.11.2013 at 00:42, Serhiy Storchaka storch...@gmail.com wrote:

15.11.13 00:32, Victor Stinner wrote:

And add transform() and untransform() methods to bytes and str types. In practice, it might be same codecs registry for all codecs just with a new attribute.

If the transform() method will be added, I prefer to have only one transformation method and specify a direction by the transformation name (bzip2/unbzip2).

+1

-1

I can't support adding such methods (and that's why I ended up giving Nick's proposal for exposing codecs.encode and codecs.decode a +1). People think about these transformations as "en- or de-coding", not "transforming", most of the time. Even for a transformation that is an involution (e.g. rot13), people have a very clear idea of what's encoded and what's not, and they are going to prefer the names "encode" and "decode" for these (generic) operations in many cases.

E.g., I don't think "s.transform(decoder)" is an improvement over "decode(s, codec)" (but tastes vary).[1] It does mean that we need to add a redundant method, and I don't really see an advantage to it. The semantics seem slightly off to me, since the purpose of the operation is to create a new object, not transform the original in-place. (But of course str.encode and bytes.decode are precedents for those semantics.)

Footnotes:
[1] Arguments "decoder" and "codec" are identifiers, not metavariables.
Re: [Python-Dev] Add transform() and untranform() methods
On 11/14/2013 11:13 PM, Nick Coghlan wrote: The proposal I posted to issue 7475 back in April (and, in the absence of any objections to the proposal, finally implemented over the past few weeks) was to take advantage of the fact that the codecs.encode and codecs.decode convenience functions exist (and have been covered by the regression test suite) as far back as Python 2.4. I did this merely by documenting the existing of the functions for Python 2.7, 3.3 and 3.4, changing the exception messages thrown for codec output type errors on the convenience methods to reference them, and by updating the Python 3.4 What's New document to explain the changes. Thanks for doing this work, Nick! -- ~Ethan~ ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Add transform() and untranform() methods
On Sat, 16 Nov 2013 00:46:15 +1000 Nick Coghlan ncogh...@gmail.com wrote:

On 16 November 2013 00:04, Antoine Pitrou solip...@pitrou.net wrote:

Rather than the more useful:

    >>> b"abcdef".decode("hex")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: 'hex' decoder returned 'bytes' instead of 'str'; use codecs.decode() to decode to arbitrary types

I think this may be confusing. TypeError seems to suggest that the parameter type sent by the user to the method is wrong, which is not the actual cause of the error.

The TypeError isn't new,

Really? That's not what your message said.

Regards

Antoine.
Re: [Python-Dev] Add transform() and untranform() methods
On 15.11.2013 at 16:57, Stephen J. Turnbull step...@xemacs.org wrote:

Walter Dörwald writes:

On 15.11.2013 at 00:42, Serhiy Storchaka storch...@gmail.com wrote:

15.11.13 00:32, Victor Stinner wrote:

And add transform() and untransform() methods to bytes and str types. In practice, it might be same codecs registry for all codecs just with a new attribute.

If the transform() method will be added, I prefer to have only one transformation method and specify a direction by the transformation name (bzip2/unbzip2).

+1

-1

I can't support adding such methods (and that's why I ended up giving Nick's proposal for exposing codecs.encode and codecs.decode a +1).

My +1 was only for having the transformation be one-way under the condition that it is added at all.

People think about these transformations as "en- or de-coding", not "transforming", most of the time. Even for a transformation that is an involution (e.g. rot13), people have a very clear idea of what's encoded and what's not, and they are going to prefer the names "encode" and "decode" for these (generic) operations in many cases.

E.g., I don't think "s.transform(decoder)" is an improvement over "decode(s, codec)" (but tastes vary).[1] It does mean that we need to add a redundant method, and I don't really see an advantage to it.

Actually my preferred method would be codec.decode(s), codec being the module that implements the functionality. I don't think we need to invent another function registry.

The semantics seem slightly off to me, since the purpose of the operation is to create a new object, not transform the original in-place.

This would mean the method would have to be called transformed()?

(But of course str.encode and bytes.decode are precedents for those semantics.)

Footnotes:
[1] Arguments "decoder" and "codec" are identifiers, not metavariables.

Servus,
Walter
Re: [Python-Dev] Add transform() and untranform() methods
On 16 Nov 2013 02:36, Antoine Pitrou solip...@pitrou.net wrote:

On Sat, 16 Nov 2013 00:46:15 +1000 Nick Coghlan ncogh...@gmail.com wrote:

On 16 November 2013 00:04, Antoine Pitrou solip...@pitrou.net wrote:

Rather than the more useful:

    >>> b"abcdef".decode("hex")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: 'hex' decoder returned 'bytes' instead of 'str'; use codecs.decode() to decode to arbitrary types

I think this may be confusing. TypeError seems to suggest that the parameter type sent by the user to the method is wrong, which is not the actual cause of the error.

The TypeError isn't new,

Really? That's not what your message said.

The second example in my post included restoring the "hex" alias for "hex_codec" (its absence is the reason for the current "unknown encoding" error). The 3.2 and 3.3 error message for a restored alias would have been "TypeError: 'hex' decoder returned 'bytes' instead of 'str'", which I agree is confusing and uninformative - that's why I added the reference to the module level functions to the output type errors *before* proposing the restoration of the aliases.

So you can already use codecs.decode(s, 'hex_codec') in Python 3, you just won't get a useful error leading you there if you use the more common 'hex' alias instead.

To address Serhiy's security concerns with the compression codecs (which are technically independent of the question of restoring the aliases), I also plan to document how to systematically blacklist particular codecs in an application by setting attributes on the encodings module and/or appropriate entries in sys.modules.

Finally, I now plan to write a documentation PEP that suggests clearly splitting the codecs module docs into two layers: the type agnostic core infrastructure and the specific application of that infrastructure to the implementation of the text encoding model.

The only functional *change* I'd still like to make for 3.4 is to restore the shorthand aliases for the non-Unicode codecs (to ease the migration for folks coming from Python 2), but this thread has convinced me I likely need to write the PEP *before* doing that, and I still have to integrate ensurepip into pyvenv before the beta 1 deadline.

So unless you and Victor are prepared to +1 the restoration of the codec aliases (closing issue 7475) in anticipation of that codecs infrastructure documentation PEP, the change to restore the aliases probably won't be in 3.4. (I *might* get the PEP written in time regardless, but I'm not betting on it at this point).

Cheers, Nick.

Regards

Antoine.
Re: [Python-Dev] Add transform() and untranform() methods
2013/11/16 Nick Coghlan ncogh...@gmail.com:

To address Serhiy's security concerns with the compression codecs (which are technically independent of the question of restoring the aliases), I also plan to document how to systematically blacklist particular codecs in an application by setting attributes on the encodings module and/or appropriate entries in sys.modules.

It would be simpler and safer to blacklist bytes=>bytes and str=>str codecs from bytes.decode() and str.encode() directly. Marc Andre Lemburg proposed to add new attributes in CodecInfo to specify input and output types.

The only functional *change* I'd still like to make for 3.4 is to restore the shorthand aliases for the non-Unicode codecs (to ease the migration for folks coming from Python 2), but this thread has convinced me I likely need to write the PEP *before* doing that, and I still have to integrate ensurepip into pyvenv before the beta 1 deadline.

So unless you and Victor are prepared to +1 the restoration of the codec aliases (closing issue 7475) in anticipation of that codecs infrastructure documentation PEP, the change to restore the aliases probably won't be in 3.4. (I *might* get the PEP written in time regardless, but I'm not betting on it at this point).

Using the StackOverflow search engine, I found some posts where people ask for the hex codec on Python 3. There are two answers: use the binascii module or use codecs.encode(). So even if codecs.encode() was never documented, it looks like it is used. So I now agree that documenting it would not make the situation worse.

Adding transform()/untransform() methods to bytes and str is a non-trivial change and not everybody likes them. Anyway, it's too late for Python 3.4.

In my opinion, the best option is to add new input_type/output_type attributes to CodecInfo right now, and modify the codecs so "abc".encode("hex") raises a LookupError (instead of a tricky error message with some evil low-level hacks on the traceback and the exception, which is my initial concern in this mail thread). It also fixes the security vulnerability.

To keep backward compatibility (even with custom codecs registered manually), if input_type/output_type is not defined, we should consider that the codec is a classical text encoding (encode str=>bytes, decode bytes=>str).

The type of the codecs.encode() result is my least concern in this topic.

I created the following issue to implement my idea: http://bugs.python.org/issue19619

Victor
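To make the idea in the message above concrete, here is a rough sketch of a type check that runs before any codec work is done. It is not the implementation from issue 19619: the `ENCODE_TYPES` table and the `text_encode` helper are hypothetical names invented for this example, standing in for the proposed input_type/output_type attributes, and codecs missing from the table are treated as classical text encodings, as the message suggests.

```python
import codecs

# Hypothetical stand-in for per-codec input/output type metadata:
# codec name -> (type accepted by encode, type returned by encode).
ENCODE_TYPES = {
    "utf_8": (str, bytes),
    "hex_codec": (bytes, bytes),
    "rot_13": (str, str),
}


def text_encode(text, encoding):
    """Encode str to bytes, rejecting non-text codecs before running them."""
    # Unknown codecs default to the classical text model (str => bytes).
    in_type, out_type = ENCODE_TYPES.get(encoding, (str, bytes))
    if (in_type, out_type) != (str, bytes):
        raise LookupError("%r is not a text encoding" % encoding)
    return codecs.encode(text, encoding)


encoded = text_encode("abc", "utf_8")

try:
    text_encode("abc", "rot_13")
except LookupError as exc:
    rejected = str(exc)
else:
    rejected = None
```

Because the lookup table is consulted first, a rejected codec never processes the input, which is the property that matters for the denial-of-service concern raised about the compression codecs.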
Re: [Python-Dev] Add transform() and untranform() methods
Oh, I forgot to mention that I sent this email in reaction to this issue: http://bugs.python.org/issue19585

Modifying the critical PyFrameObject because the codecs API raises surprising errors doesn't sound correct. I prefer to fix how codecs are used rather than modify the PyFrameObject.

For more information, see issue #7475, which has a long history (4 years) and many messages. Martin von Loewis wrote "I would still be opposed to such a change, and I think it needs a PEP." and I still agree with him on this point. Because there are different opinions and no consensus, a PEP is required to explain why we took this decision and to list rejected alternatives.

http://bugs.python.org/issue7475

Victor

2013/11/14 Victor Stinner victor.stin...@gmail.com:

Hi,

I saw that Nick Coghlan documented codecs.encode() and codecs.decode(), and changed the exception raised when codecs like rot_13 are used on bytes.decode() and str.encode().

I don't like the functions codecs.encode() and codecs.decode() because the type of the result depends on the encoding (second parameter). We try to avoid this in Python.

I would prefer to split the registry of codecs to have 3 registries:

- encoding (a better name can be found): encode str=>bytes, decode bytes=>str
- bytes: encode bytes=>bytes, decode bytes=>bytes
- str: encode str=>str, decode str=>str

And add transform() and untransform() methods to bytes and str types. In practice, it might be same codecs registry for all codecs just with a new attribute.

Examples:

- utf8: encoding
- zlib: bytes
- rot13: str

The result type of bytes.transform/untransform would be bytes, and the result type of str.transform/untransform would be str.

I don't know which exception should be raised when a codec is used in the wrong method. LookupError? TypeError "codec xxx cannot be used with method xxx.xx"? Something else?

codecs.encode/decode() documentation should be removed. The functions should be kept, just in case someone uses them.

Victor
Re: [Python-Dev] Add transform() and untranform() methods
On 15 Nov 2013 08:34, Victor Stinner victor.stin...@gmail.com wrote: Hi, I saw that Nick Coghlan documented codecs.encode() and codecs.decode(), and changed the exception raised when codecs like rot_13 are used on bytes.decode() and str.encode(). I don't like the functions codecs.encode() and codecs.decode() because the type of the result depends on the encoding (second parameter). We try to avoid this in Python. The type signature of those functions is just object - object (Similar to the way the 2.x convenience methods were actually basestring - basestring). I would prefer to split the registry of codecs to have 3 registries: - encoding (a better name can found): encode str=bytes, decode bytes=str - bytes: encode bytes=bytes, decode bytes=bytes - str: encode str=str, decode str=str You have to get it out of your head that codecs are just about text and and binary data. They're not: they're arbitrary type transforms, and MAL deliberately wrote the module that way. And add transform() and untransform() methods to bytes and str types. In practice, it might be same codecs registry for all codecs just with a new attribute. This is completely the wrong approach. There's zero justification for adding new builtin methods for this use case - encoding and decoding are generic operations, they should use functions not methods. What could be useful is allowing CodecInfo objects to supply an expected input type and an expected output type (ABCs and instance check overrides make that quite flexible). Examples: - utf8: encoding - zlib: bytes - rot13: str The result type of bytes.transform/untransform would be bytes, and the result type of str.transform/untransform would be str. I don't know which exception should be raised when a codec is used in the wrong method. LookupError? TypeError codec xxx cannot be used with method xxx.xx? Something else? We already do this check in the existing convenience methods - it raises TypeError. 
codecs.encode/decode() documentation should be removed. The functions should be kept, just in case if someone uses them. No. They're part of the regression test suite, and have been since Python 2.4. They embody MAL's intended arbitrary type transform library approach. They provide a source compatible mechanism for using binary codecs in single code base Python 2/3 projects. At this point, the only person that can get me to revert this clarification of MAL's original vision for the codecs module is Guido, since anything else completely fails to address the Python 3 adoption barrier posed by the current state of Python 3's binary codec support. Note that the only behavioural changes in the commits so far were to exception handling - everything else was just docs. The next planned commit (to restore the binary codec aliases) *is* a behavioural change - that's why I posted to the list about it (it received only two responses, both +1) Cheers, Nick. Victor ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/ncoghlan%40gmail.com ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Add transform() and untranform() methods
On 15 Nov 2013 08:42, Victor Stinner victor.stin...@gmail.com wrote: Oh, I forgot to mention that I sent this email in reaction to this issue: http://bugs.python.org/issue19585 Modifying the critical PyFrameObject because the codecs API raises surprising errors doesn't sound correct. I prefer to fix how codecs are used, than modifying the PyFrameObject. For more information, see the issue #7475 which a long history (4 years) and many messages. Martin von Loewis wrote I would still be opposed to such a change, and I think it needs a PEP. and I still agree with him on this point. Because they are different opinions and no consensus, a PEP is required to explain why we took this decision and list rejected alternatives. http://bugs.python.org/issue7475 Martin wrote that before it was pointed out there were existing functions to handle the problem (I was asking for a PEP back then, too). I posted my plan for dealing with this months ago without receiving any complaints, and I'm annoyed you waited until I had actually followed through and implemented it to complain about it and ask for Python 3's binary codec support to stay broken instead :P (Starting a new thread instead of replying to the one where I specifically asked about taking the next step does nothing to improve my mood) Regards, Nick. Victor 2013/11/14 Victor Stinner victor.stin...@gmail.com: Hi, I saw that Nick Coghlan documented codecs.encode() and codecs.decode(), and changed the exception raised when codecs like rot_13 are used on bytes.decode() and str.encode(). I don't like the functions codecs.encode() and codecs.decode() because the type of the result depends on the encoding (second parameter). We try to avoid this in Python. 
I would prefer to split the registry of codecs to have 3 registries: - encoding (a better name can found): encode str=bytes, decode bytes=str - bytes: encode bytes=bytes, decode bytes=bytes - str: encode str=str, decode str=str And add transform() and untransform() methods to bytes and str types. In practice, it might be same codecs registry for all codecs just with a new attribute. Examples: - utf8: encoding - zlib: bytes - rot13: str The result type of bytes.transform/untransform would be bytes, and the result type of str.transform/untransform would be str. I don't know which exception should be raised when a codec is used in the wrong method. LookupError? TypeError codec xxx cannot be used with method xxx.xx? Something else? codecs.encode/decode() documentation should be removed. The functions should be kept, just in case if someone uses them. Victor ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/ncoghlan%40gmail.com ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Add transform() and untranform() methods
15.11.13 01:03, Nick Coghlan wrote:

We already do this check in the existing convenience methods - it raises TypeError.

The problem with this check is that it happens *after* encoding/decoding. This opens the door to DoS (see my last message).
Re: [Python-Dev] Add transform() and untranform() methods
On 15 Nov 2013 09:11, Nick Coghlan ncogh...@gmail.com wrote:

On 15 Nov 2013 08:42, Victor Stinner victor.stin...@gmail.com wrote:

Oh, I forgot to mention that I sent this email in reaction to this issue: http://bugs.python.org/issue19585

Modifying the critical PyFrameObject because the codecs API raises surprising errors doesn't sound correct. I prefer to fix how codecs are used rather than modify the PyFrameObject.

For more information, see issue #7475, which has a long history (4 years) and many messages. Martin von Loewis wrote "I would still be opposed to such a change, and I think it needs a PEP." and I still agree with him on this point. Because there are different opinions and no consensus, a PEP is required to explain why we took this decision and to list rejected alternatives. http://bugs.python.org/issue7475

Martin wrote that before it was pointed out there were existing functions to handle the problem (I was asking for a PEP back then, too).

I posted my plan for dealing with this months ago without receiving any complaints, and I'm annoyed you waited until I had actually followed through and implemented it to complain about it and ask for Python 3's binary codec support to stay broken instead :P

Something I *would* be entirely happy to do is write a retroactive PEP after beta 1 is out the door, explaining the history of this issue in a more coherent form than the comment history on issue 7475 and the many child issues it spawned. This would also provide a better launching point for other enhancements in Python 3.5 (frame annotations to remove the need for the exception chaining hack, and better input validation mechanisms for codecs that allow the convenience methods to check that case explicitly rather than relying on the exception chaining).

Cheers, Nick.

(Starting a new thread instead of replying to the one where I specifically asked about taking the next step does nothing to improve my mood)

Regards, Nick.
Re: [Python-Dev] Add transform() and untranform() methods
15.11.13 00:32, Victor Stinner wrote:

And add transform() and untransform() methods to bytes and str types. In practice, it might be the same codecs registry for all codecs, just with a new attribute.

If a transform() method is added, I would prefer to have only one transformation method and to specify the direction by the transformation name (bzip2/unbzip2).
Re: [Python-Dev] Add transform() and untranform() methods
On 11/14/2013 5:32 PM, Victor Stinner wrote:

I don't like the functions codecs.encode() and codecs.decode() because the type of the result depends on the encoding (second parameter). We try to avoid this in Python.

Such dependence is common with arithmetic.

>>> 1 + 2
3
>>> 1 + 2.0
3.0
>>> 1 + 2+0j
(3+0j)

>>> sum((1,2,3), 0)
6
>>> sum((1,2,3), 0.0)
6.0
>>> sum((1,2,3), 0.0+0j)
(6+0j)

for f in (compile, eval, getattr, iter, max, min, next, open, pow, round, type, vars):
    type(f(*args))  # depends on the inputs

That is a large fraction of the non-class builtin functions.

-- Terry Jan Reedy
Re: [Python-Dev] Add transform() and untranform() methods
On 11/14/2013 6:03 PM, Nick Coghlan wrote:

You have to get it out of your head that codecs are just about text and binary data.

99+% of the current codec module doc leads one to that impression. The fact that codecs are expected to have a file reader and writer, and that the default 'strict' error handler is specified in 2 out of the 3 mostly redundant lists as raising a UnicodeError, reinforces the impression.

They're not: they're arbitrary type transforms, and MAL deliberately wrote the module that way.

Generic functions are quite pythonic. However, I am not sure how much benefit there is to registering an arbitrary pair of bijective functions.

This is completely the wrong approach. There's zero justification for adding new builtin methods for this use case - encoding and decoding are generic operations, they should use functions not methods.

Making 2/3 code easier is certainly a good reason for the codecs approach.

The next planned commit (to restore the binary codec aliases) *is* a behavioural change - that's why I posted to the list about it (it received only two responses, both +1)

If I understand correctly, I am mildly +1, but did not respond, thinking that 2 to 0 was sufficient response for you to continue ;-).

-- Terry Jan Reedy
Re: [Python-Dev] Add transform() and untranform() methods
On 15.11.2013 at 00:42, Serhiy Storchaka storch...@gmail.com wrote:

15.11.13 00:32, Victor Stinner wrote:

And add transform() and untransform() methods to bytes and str types. In practice, it might be the same codecs registry for all codecs, just with a new attribute.

If a transform() method is added, I would prefer to have only one transformation method and to specify the direction by the transformation name (bzip2/unbzip2).

+1

Some of the transformations might not be reversible (s.transform("lower")? ;))

And the transform function probably doesn't need any error handling machinery.

What about the stream/iterator/incremental parts of the codec API?

Servus, Walter
Re: [Python-Dev] Add transform() and untranform() methods
On 15 November 2013 11:10, Terry Reedy tjre...@udel.edu wrote:

On 11/14/2013 5:32 PM, Victor Stinner wrote:

I don't like the functions codecs.encode() and codecs.decode() because the type of the result depends on the encoding (second parameter). We try to avoid this in Python.

Such dependence is common with arithmetic.

>>> 1 + 2
3
>>> 1 + 2.0
3.0
>>> 1 + 2+0j
(3+0j)

>>> sum((1,2,3), 0)
6
>>> sum((1,2,3), 0.0)
6.0
>>> sum((1,2,3), 0.0+0j)
(6+0j)

for f in (compile, eval, getattr, iter, max, min, next, open, pow, round, type, vars):
    type(f(*args))  # depends on the inputs

That is a large fraction of the non-class builtin functions.

*Type* dependence between inputs and outputs is common (and completely non-controversial). The codecs system is different, since the supported input and output types are *value* dependent, driven by the name of the codec.

That's the part which makes the codec machinery interesting in general, since it combines a value driven lazy loading mechanism (based on the codec name) with the subsequent invocation of that mechanism: the default codec search algorithm goes hunting in the encodings package (or the alias dictionary), but you can register custom search algorithms and provide encodings any way you want. It does mean, however, that the most you can claim for the type signature of codecs.encode and codecs.decode is that they accept an object and return an object. Beyond that, it's completely driven by the value of the codec.

In Python 2.x, the type constraint imposed by the str and unicode convenience methods is "basestring in, basestring out". As it happens, all of the standard library codecs abide by that restriction, so it was easy to interpret the codecs module itself as having the same "basestring in, basestring out" limitation, especially given the heavy focus on text encodings in the way it was documented.
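To make the value-driven behaviour described above concrete, here is a small sketch, assuming Python 3.4 or later. The codec name `x_reverse_demo` and the functions `_reverse` and `demo_search` are made up for this example; `codecs.register()` and `codecs.CodecInfo` are the existing registration hooks. Note that registering a search function affects the whole process.

```python
import codecs

# One function, one call shape: the result type follows the codec *name*.
text_to_bytes = codecs.encode("abc", "utf_8")         # str in, bytes out
text_to_text = codecs.encode("abc", "rot_13")         # str in, str out
bytes_to_bytes = codecs.encode(b"abc", "hex_codec")   # bytes in, bytes out


def _reverse(obj, errors="strict"):
    # Codec functions return (result, amount_of_input_consumed).
    return obj[::-1], len(obj)


def demo_search(name):
    # Hypothetical custom search function: it only knows one codec and
    # returns None for everything else so that lookup continues.
    if name == "x_reverse_demo":
        return codecs.CodecInfo(encode=_reverse, decode=_reverse, name=name)
    return None


codecs.register(demo_search)

# The machinery itself imposes no type restriction: object in, object out.
reversed_list = codecs.encode([1, 2, 3], "x_reverse_demo")
```

The last call shows why "accepts an object, returns an object" is the strongest signature that can be claimed for the module-level functions: a registered codec may operate on any type, here a list.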
In practice, the codecs weren't that open ended - some of them only accepted 8 bit strings, some only accepted unicode, some accepted both (perhaps relying on implicit decoding to unicode). The migration to Python 3 made the contrast between the two far more stark, however, hence the long and involved discussion on issue 7475, and the fact that the non-Unicode codecs are currently still missing their shorthand aliases.

The proposal I posted to issue 7475 back in April (and, in the absence of any objections to the proposal, finally implemented over the past few weeks) was to take advantage of the fact that the codecs.encode and codecs.decode convenience functions exist (and have been covered by the regression test suite) as far back as Python 2.4. I did this merely by documenting the existence of the functions for Python 2.7, 3.3 and 3.4, changing the exception messages thrown for codec output type errors on the convenience methods to reference them, and by updating the Python 3.4 What's New document to explain the changes.

This approach provides a Python 2/3 compatible solution for usage of non-Unicode encodings: users simply need to call the existing module level functions in the codecs module, rather than using the methods on specific builtin types. This approach also means that the binary codecs can be used with any bytes-like object (including memoryview and array.array), rather than being limited to types that implement a new method (like transform), and can also be used in Python 2/3 source compatible APIs (since the data driven nature of the problem makes 2to3 unusable as a solution, and that doesn't help single code base projects anyway).

From my point of view, this is now just a matter of better documenting the status quo, and nudging people in the right direction when it comes to using the appropriate API for non-Unicode codecs.
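A short illustration of the approach described above, assuming Python 3.4 or later and using the full codec name `hex_codec` rather than a shorthand alias. The variable names are illustrative only.

```python
import array
import codecs

payload = b"spam and eggs"

# The module-level functions handle non-Unicode codecs directly.
hexed = codecs.encode(payload, "hex_codec")
round_trip = codecs.decode(hexed, "hex_codec")

# Any bytes-like object works, not just bytes.
from_view = codecs.encode(memoryview(payload), "hex_codec")
from_array = codecs.encode(array.array("B", [1, 2, 255]), "hex_codec")

# The convenience methods are reserved for text encodings; for other
# codecs they raise LookupError with a message pointing at the codecs
# module functions.
method_error = None
try:
    payload.decode("hex_codec")
except LookupError as exc:
    method_error = str(exc)
```

Because the functions live in the codecs module rather than on a specific type, no new method has to be added to memoryview, array.array or any third-party buffer type for them to take part.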
Since we now realise these functions have existed since Python 2.4, it doesn't make sense to try to fundamentally change direction, but instead to work on making it better. A few things I noticed while implementing the recent updates:
- as you noted in your other email, while MAL is on record as saying the codecs module is intended for arbitrary codecs, not just Unicode encodings, readers of the current docs can definitely be forgiven for not realising that. We really need to better separate the codecs module docs from the text model docs (two new sections in the language reference, one for the codecs machinery and one for the text model, would likely be appropriate. The io module docs and those for the builtin open function may also be affected)
- a mechanism for annotating frames would help avoid the need for nasty hacks like the exception wrapping that aims to make codec failures easier to debug
- if codecs exposed a way to separate the input type check from the invocation of the codec, we could redirect users to the module API for bad input types as well (e.g. calling "input str".encode("bz2"))
- if we want something that doesn't need to be imported, then encode() and decode() builtins make more sense than new methods on str, bytes