Re: [Python-Dev] Add transform() and untranform() methods

2013-11-16 Thread Nick Coghlan
On 16 Nov 2013 10:47, Victor Stinner victor.stin...@gmail.com wrote:

 2013/11/16 Nick Coghlan ncogh...@gmail.com:
  To address Serhiy's security concerns with the compression codecs (which are
  technically independent of the question of restoring the aliases), I also
  plan to document how to systematically blacklist particular codecs in an
  application by setting attributes on the encodings module and/or appropriate
  entries in sys.modules.

 It would be simpler and safer to blacklist bytes->bytes and str->str
 codecs from bytes.decode() and str.encode() directly. Marc Andre
 Lemburg proposed to add new attributes in CodecInfo to specify input
 and output types.

Yes, but that type compatibility introspection is a change for 3.5 at
the earliest (although I commented on
http://bugs.python.org/issue19619 with two alternate suggestions that
I think would be reasonable to implement for 3.4).

Everything codec related that I am doing at the moment is about
improving the state of 3.4 and source compatible 2/3 code. Proposals
for further 3.5+ only improvements are relevant only in the sense that
I don't want to lock us out from future improvements (which is why my
main aim is to clarify the status quo, with the only functional
changes related to restoring feature parity with Python 2 for
non-Unicode codecs).

  The only functional *change* I'd still like to make for 3.4 is to restore
  the shorthand aliases for the non-Unicode codecs (to ease the migration for
  folks coming from Python 2), but this thread has convinced me I likely need
  to write the PEP *before* doing that, and I still have to integrate
  ensurepip into pyvenv before the beta 1 deadline.
 
  So unless you and Victor are prepared to +1 the restoration of the codec
  aliases (closing issue 7475) in anticipation of that codecs infrastructure
  documentation PEP, the change to restore the aliases probably won't be in
  3.4. (I *might* get the PEP written in time regardless, but I'm not betting
  on it at this point).

 Using the StackOverflow search engine, I found some posts where people
 ask for a hex codec on Python 3. There are two answers: use the binascii
 module or use codecs.encode(). So even if codecs.encode() was never
 documented, it looks like it is used. So I now agree that documenting
 it would not make the situation worse.

Aye, that was my conclusion (hence my proposal on issue 7475 back in April).

Can I take that observation as a +1 for restoring the aliases as well?
(That and more efficiently rejecting the non-Unicode codecs from
str.encode, bytes.decode and bytearray.decode are the only aspects of
this subject to the beta 1 deadline - we can be a bit more leisurely
when it comes to working out the details of the docs updates)

 Adding transform()/untransform() methods to bytes and str is a
 non-trivial change and not everybody likes them. Anyway, it's too late for
 Python 3.4.

 In my opinion, the best option is to add new input_type/output_type
 attributes to CodecInfo right now, and modify the codecs so that
 "abc".encode("hex") raises a LookupError (instead of a tricky error
 message with some evil low-level hacks on the traceback and the
 exception, which was my initial concern in this mail thread). It also
 fixes the security vulnerability.

The C level code for catching the input type errors only looks evil because:

- the C level equivalent of "except Exception as Y: raise X from Y"
is just plain ugly in the first place
- the chaining includes a *lot* of checks of the original exception to
ensure that no data is lost by raising a new instance of the same
exception type and chaining
- it chains ValueError, AttributeError and any other currently
stateless (aside from a str description) error the codec might throw,
not just input type validation errors (it deliberately doesn't chain
stateful errors as doing so might be backwards incompatible with
existing error handling).
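
For readers who have not seen the C code, here is a rough pure-Python sketch of the wrapping described in the list above. It is not the actual CPython implementation: the helper name call_with_codec_context, the message format and the exact "stateless" test (a single str argument and no instance attributes) are assumptions made for illustration.

```python
def call_with_codec_context(func, data, codec_name, operation="encoding"):
    """Call func(data); on a simple failure, re-raise with codec context."""
    try:
        return func(data)
    except Exception as exc:
        # Assumption: "stateless" means a single str message and no extra
        # instance attributes, so nothing is lost by re-creating the error.
        stateless = (
            len(exc.args) == 1
            and isinstance(exc.args[0], str)
            and not getattr(exc, "__dict__", None)
        )
        if not stateless:
            # Stateful errors (e.g. UnicodeEncodeError, which carries
            # position info) are deliberately left untouched.
            raise
        message = "{} with '{}' codec failed ({}: {})".format(
            operation, codec_name, type(exc).__name__, exc
        )
        # Same exception type, extra context, original kept as __cause__.
        raise type(exc)(message) from exc
```

Existing "except ValueError" handlers keep working because the new exception has the same type as the original, which remains reachable through __cause__.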

However, the ugliness of that code is the reason I'm intrigued by the
possibility of traceback annotations as a potentially cleaner solution
than trying to seamlessly wrap exceptions with a new one that adds
more context information. While I think the gain in codec
debuggability is worth it in this case, my concern over the complexity
and the current limitations are the reason I didn't make it a public C
API.

 To keep backward compatibility (even with custom codecs registered
 manually), if input_type/output_type is not defined, we should
 consider that the codec is a classical text encoding (encode
 str->bytes, decode bytes->str).

Without an already existing ByteSequence ABC, it isn't feasible to
propose and implement this completely in the 3.4 time frame (since you
would need such an ABC to express the input type accepted by our
Unicode and binary codecs - the only one that wouldn't need it is
rot_13, since that's str-str).

However, the output types could be expressed solely as concrete types,
and that's all we need for the blacklist (since we could replace the
current 

Re: [Python-Dev] Add transform() and untranform() methods

2013-11-16 Thread Victor Stinner
Why not use the str type for str and str subtypes, and the bytes type for
bytes and bytes-like objects (bytearray, memoryview)? I don't think that we
need an ABC here.

Victor

Re: [Python-Dev] Add transform() and untranform() methods

2013-11-16 Thread Nick Coghlan
On 16 November 2013 20:45, Victor Stinner victor.stin...@gmail.com wrote:
 Why not use the str type for str and str subtypes, and the bytes type for
 bytes and bytes-like objects (bytearray, memoryview)? I don't think that
 we need an ABC here.

We'd only need an ABC if info was added for supported input types.
However, that's not necessary since "encodes_to" and "decodes_to" are
enough to identify Unicode encodings: "encodes_to in (None, bytes) and
decodes_to in (None, str)", so we don't need to track input type
support at all if the main question we want to answer is "is this a
Unicode codec or not?".
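
As a minimal sketch of that check (not an existing API): the CodecTypes tuple below is a hypothetical stand-in for a CodecInfo carrying the proposed encodes_to and decodes_to fields, with None meaning the codec did not declare a type.

```python
from collections import namedtuple

# Hypothetical stand-in for a CodecInfo with the proposed extra fields.
# None means "not declared", which is treated as a classic text encoding
# so that existing third-party codecs keep working.
CodecTypes = namedtuple("CodecTypes", "name encodes_to decodes_to")


def is_text_encoding(info):
    """Return True if the codec looks like a str->bytes text encoding."""
    return info.encodes_to in (None, bytes) and info.decodes_to in (None, str)


utf8 = CodecTypes("utf-8", bytes, str)       # declared text encoding
legacy = CodecTypes("custom", None, None)    # undeclared: assumed text
hex_codec = CodecTypes("hex", bytes, bytes)  # binary transform
rot13 = CodecTypes("rot_13", str, str)       # text transform

for info in (utf8, legacy, hex_codec, rot13):
    print(info.name, is_text_encoding(info))
```

Only output types are inspected, which is why no ByteSequence ABC is needed for this question.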

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Add transform() and untranform() methods

2013-11-16 Thread M.-A. Lemburg
On 16.11.2013 01:47, Victor Stinner wrote:
 Adding transform()/untransform() method to bytes and str is a non
 trivial change and not everybody likes them. Anyway, it's too late for
 Python 3.4.

Just to clarify: I still like the idea of adding those methods.

I just don't see what this addition has to do with the codecs.encode()/
.decode() functions.

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] Add transform() and untranform() methods

2013-11-16 Thread Antoine Pitrou
On Sat, 16 Nov 2013 19:44:51 +1000
Nick Coghlan ncogh...@gmail.com wrote:
 
 Aye, that was my conclusion (hence my proposal on issue 7475 back in April).
 
 Can I take that observation as a +1 for restoring the aliases as well?

I see no harm in restoring the aliases personally, so +1 from me.

Regards

Antoine.


Re: [Python-Dev] Add transform() and untranform() methods

2013-11-16 Thread Nick Coghlan
On 16 November 2013 21:38, Nick Coghlan ncogh...@gmail.com wrote:
 On 16 November 2013 20:45, Victor Stinner victor.stin...@gmail.com wrote:
 Why not use the str type for str and str subtypes, and the bytes type for
 bytes and bytes-like objects (bytearray, memoryview)? I don't think that
 we need an ABC here.

 We'd only need an ABC if info was added for supported input types.
 However, that's not necessary since "encodes_to" and "decodes_to" are
 enough to identify Unicode encodings: "encodes_to in (None, bytes) and
 decodes_to in (None, str)", so we don't need to track input type
 support at all if the main question we want to answer is "is this a
 Unicode codec or not?".

I realised I misunderstood your proposal because of the field names
you initially suggested. I've now proposed a variation with different
field names (encodes_to instead of output_type and decodes_to instead
of input_type) and a codecs.is_text_encoding query function
(http://bugs.python.org/issue19619?#msg203037)

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] Add transform() and untranform() methods

2013-11-16 Thread Nick Coghlan
On 16 November 2013 21:49, M.-A. Lemburg m...@egenix.com wrote:
 On 16.11.2013 01:47, Victor Stinner wrote:
 Adding transform()/untransform() method to bytes and str is a non
 trivial change and not everybody likes them. Anyway, it's too late for
 Python 3.4.

 Just to clarify: I still like the idea of adding those methods.

 I just don't see what this addition has to do with the codecs.encode()/
 .decode() functions.

Part of the interest here is in making Python 3 better compete with
the ease of the following in Python 2:

    >>> "68656c6c6f".decode("hex")
    'hello'
    >>> "hello".encode("hex")
    '68656c6c6f'

Until recently, I (and others) thought the best Python 3 had to offer was:

    >>> import codecs
    >>> codecs.getencoder("hex")(b"hello")[0]
    b'68656c6c6f'
    >>> codecs.getdecoder("hex")(b"68656c6c6f")[0]
    b'hello'

In reality, though, Python 3 has always supported the following, it
just wasn't documented so I (and others) didn't know it had actually
been available as an alternative interface to the codecs machinery
since Python 2.4:

    >>> from codecs import encode, decode
    >>> encode(b"hello", "hex")
    b'68656c6c6f'
    >>> decode(b"68656c6c6f", "hex")
    b'hello'

That's almost as clean as the Python 2 version, it just requires the
initial import of the convenience functions from the codecs module.
The fact it is supported in Python 2 means that 2/3 compatible codecs
can also use it.

Accordingly, I now see ensuring that everyone has a common
understanding of *what is already available* as an essential next
step, and only then consider significant changes in the codecs
mechanisms*. I know I learned a hell of a lot about the distinction
between the type agnostic codec infrastructure and the Unicode text
model over the past several months, and I think this thread shows
clearly that there's still a lot of confusion over the matter, even
amongst core developers. That's a problem, and something we need to
fix before giving further consideration to the transform/untransform
idea.

*(Victor's proposal in issue 19619 is actually relatively modest, now
that I understand it properly, and entails taking the existing output
type checks and making it possible to do them in advance, without
touching input type checks)

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] Add transform() and untranform() methods

2013-11-15 Thread M.-A. Lemburg
On 15.11.2013 08:13, Nick Coghlan wrote:
 On 15 November 2013 11:10, Terry Reedy tjre...@udel.edu wrote:
 On 11/14/2013 5:32 PM, Victor Stinner wrote:

 I don't like the functions codecs.encode() and codecs.decode() because
 the type of the result depends on the encoding (second parameter). We
 try to avoid this in Python.


 Such dependence is common with arithmetic.

 >>> 1 + 2
 3
 >>> 1 + 2.0
 3.0
 >>> 1 + 2+0j
 (3+0j)

 >>> sum((1,2,3), 0)
 6
 >>> sum((1,2,3), 0.0)
 6.0
 >>> sum((1,2,3), 0.0+0j)
 (6+0j)

 for f in (compile, eval, getattr, iter, max, min, next, open, pow, round,
 type, vars):
   type(f(*args)) # depends on the inputs
 That is a large fraction of the non-class builtin functions.
 
 *Type* dependence between inputs and outputs is common (and completely
 non-controversial). The codecs system is different, since the
 supported input and output types are *value* dependent, driven by the
 name of the codec.
 
 That's the part which makes the codec machinery interesting in
 general, since it combines a value driven lazy loading mechanism
 (based on the codec name) with the subsequent invocation of that
 mechanism: the default codec search algorithm goes hunting in the
 encodings package (or the alias dictionary), but you can register
 custom search algorithms and provide encodings any way you want. It
 does mean, however, that the most you can claim for the type signature
 of codecs.encode and codecs.decode is that they accept an object and
 return an object. Beyond that, it's completely driven by the value of
 the codec.
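
A small illustration of that value dependence, assuming Python 3 and the full codec module names ("hex_codec", "rot_13"), which do not depend on the shorthand aliases being restored:

```python
import codecs

# Same function, three different type signatures, selected purely by the
# *value* of the second argument.
text_to_bytes = codecs.encode("abc", "utf-8")        # str -> bytes
bytes_to_bytes = codecs.encode(b"abc", "hex_codec")  # bytes -> bytes
text_to_text = codecs.encode("abc", "rot_13")        # str -> str

for result in (text_to_bytes, bytes_to_bytes, text_to_text):
    print(type(result).__name__, result)
```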

Indeed. You have to think of the codec registry as a mere
lookup mechanism - very much like an import. The implementation
of the imported module defines which types are supported and
how the encode/decode steps work.
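
To make the "lookup mechanism" point concrete, here is a sketch of registering a custom search function. The codec name "x_reverse" and the helper functions are invented for illustration; codecs.register and codecs.CodecInfo are the real registry hooks.

```python
import codecs


def _reverse(data, errors="strict"):
    # Stateless codec functions return (output, length of input consumed).
    return bytes(data)[::-1], len(data)


def _search(name):
    # The registry passes in a normalised (lower-cased) codec name and
    # expects a CodecInfo, or None if this function does not know the name.
    if name == "x_reverse":
        return codecs.CodecInfo(encode=_reverse, decode=_reverse, name="x_reverse")
    return None


codecs.register(_search)

print(codecs.encode(b"abc", "x_reverse"))
print(codecs.decode(b"cba", "x_reverse"))
```

Nothing in the registry constrains the types involved; that is entirely up to the functions the search function hands back.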

 In Python 2.x, the type constraint imposed by the str and unicode
 convenience methods is "basestring in, basestring out". As it happens,
 all of the standard library codecs abide by that restriction, so it
 was easy to interpret the codecs module itself as having the same
 basestring in, basestring out limitation, especially given the heavy
 focus on text encodings in the way it was documented. In practice, the
 codecs weren't that open ended - some of them only accepted 8 bit
 strings, some only accepted unicode, some accepted both (perhaps
 relying on implicit decoding to unicode),
 
 The migration to Python 3 made the contrast between the two far more
 stark however, hence the long and involved discussion on issue 7475,
 and the fact that the non-Unicode codecs are currently still missing
 their shorthand aliases.
 
 The proposal I posted to issue 7475 back in April (and, in the absence
 of any objections to the proposal, finally implemented over the past
 few weeks) was to take advantage of the fact that the codecs.encode
 and codecs.decode convenience functions exist (and have been covered
 by the regression test suite) as far back as Python 2.4. I did this
 merely by documenting the existence of the functions for Python 2.7,
 3.3 and 3.4, changing the exception messages thrown for codec output
 type errors on the convenience methods to reference them, and by
 updating the Python 3.4 What's New document to explain the changes.
 
 This approach provides a Python 2/3 compatible solution for usage of
 non-Unicode encodings: users simply need to call the existing module
 level functions in the codecs module, rather than using the methods on
 specific builtin types. This approach also means that the binary
 codecs can be used with any bytes-like object (including memoryview
 and array.array), rather than being limited to types that implement a
 new method (like transform), and can also be used in Python 2/3
 source compatible APIs (since the data driven nature of the problem
 makes 2to3 unusable as a solution, and that doesn't help single code
 base projects anyway).

Right, and that was the main point in making codecs flexible
in this respect. There are many other types which can serve
as input and output - in the stdlib and interpreter as well as
in extension modules that implement their own types.
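
A quick sketch of that flexibility on Python 3, using the standard library hex codec under its full name "hex_codec". The list of inputs is only an example of bytes-like types:

```python
import array
import codecs

raw = b"hello"

# The module-level function accepts any object supporting the buffer
# protocol, not just types that grow a dedicated method.
inputs = [raw, bytearray(raw), memoryview(raw), array.array("B", raw)]
results = [codecs.encode(obj, "hex_codec") for obj in inputs]

for obj, result in zip(inputs, results):
    print(type(obj).__name__, result)
```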

 From my point of view, this is now just a matter of better documenting
 the status quo, and nudging people in the right direction when it
 comes to using the appropriate API for non-Unicode codecs. Since we
 now realise these functions have existed since Python 2.4, it doesn't
 make sense to try to fundamentally change direction, but instead to
 work on making it better.
 
 A few things I noticed while implementing the recent updates:
 
 - as you noted in your other email, while MAL is on record as saying
 the codecs module is intended for arbitrary codecs, not just Unicode
 encodings, readers of the current docs can definitely be forgiven for
 not realising that. We really need to better separate the codecs
 module docs from the text model docs (two new sections in the language
 reference, one for the codecs machinery and one for the text model
 would likely be appropriate. The io 

Re: [Python-Dev] Add transform() and untranform() methods

2013-11-15 Thread Antoine Pitrou
On Fri, 15 Nov 2013 09:03:37 +1000
Nick Coghlan ncogh...@gmail.com wrote:
 
  And add transform() and untransform() methods to bytes and str types.
  In practice, it might be same codecs registry for all codecs just with
  a new attribute.
 
 This is completely the wrong approach. There's zero justification for
 adding new builtin methods for this use case - encoding and decoding are
 generic operations, they should use functions not methods.

I'm sorry, I disagree. The question is what use case it is solving, and
there's zero benefit in writing codecs.encode("zlib") compared to e.g.
zlib.compress().

A transform() or untransform() method, however, allows for a much more
convenient spelling, with easy cascading, e.g.:

b.transform("zlib").transform("base64")

In other words, there's zero justification for codecs.encode() and
codecs.decode(). The fact that the codecs machinery works on arbitrary
object transformation is a pointless genericity, if it doesn't bring
any additional convenience compared to the canonical functions in their
respective modules.

 At this point, the only person that can get me to revert this clarification
 of MAL's original vision for the codecs module is Guido, since anything
 else completely fails to address the Python 3 adoption barrier posed by the
 current state of Python 3's binary codec support.

I'd like to challenge your assertion that your change addresses
anything.

It's not easier to change b.encode("zlib") into codecs.encode(b,
"zlib") than it is to change it into zlib.compress(b).

Regards,

Antoine.




Re: [Python-Dev] Add transform() and untranform() methods

2013-11-15 Thread Steven D'Aprano
On Fri, Nov 15, 2013 at 05:13:34PM +1000, Nick Coghlan wrote:

 A few things I noticed while implementing the recent updates:
 
 - as you noted in your other email, while MAL is on record as saying
 the codecs module is intended for arbitrary codecs, not just Unicode
 encodings, readers of the current docs can definitely be forgiven for
 not realising that. We really need to better separate the codecs
 module docs from the text model docs (two new sections in the language
 reference, one for the codecs machinery and one for the text model
 would likely be appropriate. The io module docs and those for the
 builtin open function may also be affected)
 - a mechanism for annotating frames would help avoid the need for
 nasty hacks like the exception wrapping that aims to make codec
 failures easier to debug
 - if codecs exposed a way to separate the input type check from the
 invocation of the codec, we could redirect users to the module API for
 bad input types as well (e.g. calling "input str".encode("bz2"))

 - if we want something that doesn't need to be imported, then encode()
 and decode() builtins make more sense than new methods on str, bytes
 and bytearray objects (since builtins would support memoryview and
 array.array as well, and it avoids ambiguity regarding the direction
 of the operation)

Sounds good to me.

 - the codecs module should offer a way to register a new alias for an
 existing codec
 - the codecs module should offer a way to map a name to a CodecInfo
 object without registering a new search function

It would be really good to be able to query the available codecs. For 
example, many applications offer an Encoding menu, where you can 
specify the codec used for text. That's hard in Python, since you 
can't retrieve a list of known codecs.
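
The closest workaround I know of is to walk the modules of the standard encodings package, roughly as sketched below. The helper name list_stdlib_codecs is made up, and this only sees codecs shipped as encodings submodules: anything provided through a custom search function stays invisible, which is exactly the gap described above.

```python
import codecs
import encodings
import pkgutil


def list_stdlib_codecs():
    """Return names of encodings submodules that resolve to a codec."""
    names = []
    for _finder, name, _ispkg in pkgutil.iter_modules(encodings.__path__):
        try:
            # Skips helper modules (e.g. "aliases") and platform-specific
            # codecs that are unavailable on this system.
            codecs.lookup(name)
        except LookupError:
            continue
        names.append(name)
    return sorted(names)


available = list_stdlib_codecs()
print(len(available), "codecs found, e.g.", available[:5])
```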


-- 
Steven


Re: [Python-Dev] Add transform() and untranform() methods

2013-11-15 Thread Serhiy Storchaka

15.11.13 12:02, Steven D'Aprano wrote:

It would be really good to be able to query the available codecs. For
example, many applications offer an Encoding menu, where you can
specify the codec used for text. That's hard in Python, since you
can't retrieve a list of known codecs.


And you can't determine which codec is binary-text encoding.




Re: [Python-Dev] Add transform() and untranform() methods

2013-11-15 Thread Steven D'Aprano
On Fri, Nov 15, 2013 at 10:22:28AM +0100, Antoine Pitrou wrote:
 On Fri, 15 Nov 2013 09:03:37 +1000 Nick Coghlan ncogh...@gmail.com wrote:
  
   And add transform() and untransform() methods to bytes and str types.
   In practice, it might be same codecs registry for all codecs just with
   a new attribute.
  
  This is completely the wrong approach. There's zero justification for
  adding new builtin methods for this use case - encoding and decoding are
  generic operations, they should use functions not methods.
 
 I'm sorry, I disagree. The question is what use case it is solving, and
 there's zero benefit in writing codecs.encode("zlib") compared to e.g.
 zlib.compress().

One benefit is:

import codecs
codec = get_name_of_compression_codec()
result = codecs.encode(data, codec)


versus:


codec = get_name_of_compression_codec()
if codec == "zlib":
    import zlib
    encoder = zlib.compress
elif codec == "bz2":
    import bz2
    encoder = bz2.compress
elif codec == "gzip":
    import gzip
    encoder = gzip.compress
elif codec == "squash":
    import mySquashLib
    encoder = mySquashLib.squash
elif ...:
    # and so on
result = encoder(data)



 A transform() or untransform() method, however, allows for a much more
 convenient spelling, with easy cascading, e.g.:
 
 b.transform("zlib").transform("base64")

Yes, that's quite nice. Although it need not be a method, a built-in 
function works for me too:

# either of these:
transform(transform(b, "zlib"), "base64")
encode(encode(b, "zlib"), "base64")
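
With what exists today, the function spelling can already be written on top of codecs.encode and codecs.decode. A minimal sketch follows; the helper names encode_chain and decode_chain are invented, and the full codec names are used so that it does not depend on the shorthand aliases:

```python
import codecs
from functools import reduce


def encode_chain(data, *codec_names):
    """Apply each codec in order: encode(encode(data, a), b), ..."""
    return reduce(codecs.encode, codec_names, data)


def decode_chain(data, *codec_names):
    """Undo encode_chain by decoding in the reverse order."""
    return reduce(codecs.decode, reversed(codec_names), data)


original = b"x" * 1000
packed = encode_chain(original, "zlib_codec", "base64_codec")
restored = decode_chain(packed, "zlib_codec", "base64_codec")
print(len(original), len(packed), restored == original)
```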


If encoding/decoding is intended to be completely generic (even if 99% 
of the uses will be with strings and bytes), is there any reason to 
prefer built-in functions rather than methods on object?


-- 
Steven


Re: [Python-Dev] Add transform() and untranform() methods

2013-11-15 Thread Antoine Pitrou
On Fri, 15 Nov 2013 21:28:35 +1100
Steven D'Aprano st...@pearwood.info wrote:
 
 One benefit is:
 
 import codecs
 codec = get_name_of_compression_codec()
 result = codecs.encode(data, codec)

That's a good point.

 If encoding/decoding is intended to be completely generic (even if 99% 
 of the uses will be with strings and bytes), is there any reason to 
 prefer built-in functions rather than methods on object?

Practicality beats purity. Personally, I've never used codecs on
anything other than str and bytes objects.

Regards

Antoine.




Re: [Python-Dev] Add transform() and untranform() methods

2013-11-15 Thread Serhiy Storchaka

15.11.13 12:28, Steven D'Aprano wrote:

One benefit is:

import codecs
codec = get_name_of_compression_codec()
result = codecs.encode(data, codec)


And this is a security hole if you don't check the codec name before
calling the codec. See the topic about exploiting zip bombs via the codecs machinery.
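
A sketch of the kind of check meant here, assuming an application-defined allow-list. ALLOWED_CODECS and checked_decode are hypothetical names, and this only restricts which codecs can be reached; it does nothing to bound the size of the decompressed output.

```python
import codecs

# Hypothetical allow-list: only these codec names may be selected by
# externally supplied data.
ALLOWED_CODECS = frozenset({"zlib_codec", "bz2_codec"})


def checked_decode(data, codec_name):
    """Decode data only if codec_name is explicitly permitted."""
    name = codec_name.lower()
    if name not in ALLOWED_CODECS:
        raise LookupError("codec not permitted: {!r}".format(codec_name))
    return codecs.decode(data, name)


packed = codecs.encode(b"payload", "zlib_codec")
print(checked_decode(packed, "zlib_codec"))
```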


Also, you usually need more than just decompressing binary data by Python
codec name. You need to map the external compression name to the internal
Python codec name, you need to configure the decompressor object with
specific options, and perhaps you need different buffering strategies for
different compression algorithms. See for example the zipfile and tarfile sources.





Re: [Python-Dev] Add transform() and untranform() methods

2013-11-15 Thread Nick Coghlan
On 15 November 2013 20:33, Antoine Pitrou solip...@pitrou.net wrote:
 On Fri, 15 Nov 2013 21:28:35 +1100
 Steven D'Aprano st...@pearwood.info wrote:

 One benefit is:

 import codecs
 codec = get_name_of_compression_codec()
 result = codecs.encode(data, codec)

 That's a good point.

 If encoding/decoding is intended to be completely generic (even if 99%
 of the uses will be with strings and bytes), is there any reason to
 prefer built-in functions rather than methods on object?

 Practicality beats purity. Personally, I've never used codecs on
 anything other than str and bytes objects.

The reason I'm now putting some effort into better documenting the
status quo for codec handling in Python 3 and filing off some of the
rough edges (rather than proposing adding any new APIs to Python 3.x)
is because the users I care about in this matter are web developers
that already make use of the binary codecs and are adopting the
single-source approach to handle supporting both Python 2 and Python
3. Armin Ronacher is the one who's been most vocal about the problem,
but he's definitely not alone.

A new API for binary transforms is potentially an academically
interesting concept, but it solves zero current real world problems.
By contrast, being clear about the fact that codecs.encode and
codecs.decode exist and are available as far back as Python 2.4 helps
to eliminate a genuine barrier to Python 3 adoption for a subset of
the community.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] Add transform() and untranform() methods

2013-11-15 Thread Antoine Pitrou
On Fri, 15 Nov 2013 21:45:31 +1000
Nick Coghlan ncogh...@gmail.com wrote:
 
 The reason I'm now putting some effort into better documenting the
 status quo for codec handling in Python 3 and filing off some of the
 rough edges (rather than proposing adding any new APIs to Python 3.x)
 is because the users I care about in this matter are web developers
 that already make use of the binary codecs and are adopting the
 single-source approach to handle supporting both Python 2 and Python
 3. Armin Ronacher is the one who's been most vocal about the problem,
 but he's definitely not alone.

zlib.compress(something) works on both Python 2 and Python 3, why do
you need something else?

Regards

Antoine.


Re: [Python-Dev] Add transform() and untranform() methods

2013-11-15 Thread Victor Stinner
2013/11/15 Nick Coghlan ncogh...@gmail.com:
 The reason I'm now putting some effort into better documenting the
 status quo for codec handling in Python 3 and filing off some of the
 rough edges (rather than proposing adding any new APIs to Python 3.x)
 is because the users I care about in this matter are web developers
 that already make use of the binary codecs and are adopting the
 single-source approach to handle supporting both Python 2 and Python
 3. Armin Ronacher is the one who's been most vocal about the problem,
 but he's definitely not alone.

Apart from Armin Ronacher, I have never seen anyone blocked when trying
to port a project to Python 3 because of these bytes=>bytes and str=>str
codecs. I did a quick search on Google but failed to find a question like
"how can I write .encode('hex') or .encode('zlib') in Python 3?". It
was just a quick search; it's likely that many developers hit this
Python 3 regression, but I'm confident that developers are able to
work around it themselves (e.g. by using the right Python module
directly).

I have seen many large code bases ported to Python 3 without needing
these codecs. For example, Django, which is a web framework, has been
ported to Python 3. I know that Armin Ronacher also works on web
things (I don't know what exactly).
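
For comparison, a small sketch of the "use the right module directly"
workaround next to the codec-based spelling. It assumes only binascii,
zlib and codecs from the standard library; the variable names are
illustrative.

```python
import binascii
import codecs
import zlib

data = b"hello world"

# Direct module calls: the workaround described above.
hex_direct = binascii.hexlify(data)
zlib_direct = zlib.compress(data)

# The same transforms spelled through the codec machinery.
hex_via_codec = codecs.encode(data, "hex_codec")
zlib_via_codec = codecs.encode(data, "zlib_codec")

# Either side can undo the other's output.
roundtrip_a = zlib.decompress(zlib_via_codec)
roundtrip_b = codecs.decode(zlib_direct, "zlib_codec")
```

Both spellings run unchanged on Python 2 and Python 3; the difference is
only whether the transform is chosen by import or by name.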

 A new API for binary transforms is potentially an academically
 interesting concept, but it solves zero current real world problems.

I would say the same about these codecs: they are not solving
any real world problem :-)

Victor


Re: [Python-Dev] Add transform() and untranform() methods

2013-11-15 Thread Paul Moore
On 15 November 2013 12:07, Victor Stinner victor.stin...@gmail.com wrote:
 A new API for binary transforms is potentially an academically
 interesting concept, but it solves zero current real world problems.

 I would like to reply the same for these codecs: they are not solving
 any real world problem :-)

As Nick is only documenting long-existing functions, I fail to see the
issue here.

If someone were to propose new methods, builtins, or module functions,
then I could see a reason for debate. But surely simply documenting
existing functions is not worth all this pushback?

Paul


Re: [Python-Dev] Add transform() and untranform() methods

2013-11-15 Thread M.-A. Lemburg
On 15.11.2013 12:45, Nick Coghlan wrote:
 On 15 November 2013 20:33, Antoine Pitrou solip...@pitrou.net wrote:
 On Fri, 15 Nov 2013 21:28:35 +1100
 Steven D'Aprano st...@pearwood.info wrote:

 One benefit is:

 import codecs
 codec = get_name_of_compression_codec()
 result = codecs.encode(data, codec)

 That's a good point.

 If encoding/decoding is intended to be completely generic (even if 99%
 of the uses will be with strings and bytes), is there any reason to
 prefer built-in functions rather than methods on object?

 Practicality beats purity. Personally, I've never used codecs on
 anything else than str and bytes objects.
 
 The reason I'm now putting some effort into better documenting the
 status quo for codec handling in Python 3 and filing off some of the
 rough edges (rather than proposing adding any new APIs to Python 3.x)
 is because the users I care about in this matter are web developers
 that already make use of the binary codecs and are adopting the
 single-source approach to handle supporting both Python 2 and Python
 3. Armin Ronacher is the one who's been most vocal about the problem,
 but he's definitely not alone.

You can add me to that list :-). Esp. the hex codec is very handy.
Google returns a few thousand hits for that codec alone.

One detail that people often tend to forget is the extensibility
of the codec system. It is easily possible to add new codecs
to the system to e.g. perform encoding, escaping, compression or
other conversion operations, so the set of codecs in the stdlib
is not the complete set of codecs used in the wild - and it's
not intended to be.

As an example: we've written codecs for customers that perform
special types of XML un/escaping.
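
To illustrate the extensibility point, here is a hedged sketch of a
custom str-to-str codec registered through codecs.register. The codec
name "xml_escape" and the helper functions are hypothetical; this is not
the customer code mentioned above, just the general shape.

```python
import codecs
from xml.sax.saxutils import escape, unescape


def _xml_encode(text, errors="strict"):
    # Codec functions return (output, number of input items consumed).
    return escape(text), len(text)


def _xml_decode(text, errors="strict"):
    return unescape(text), len(text)


def _search(name):
    # The registry calls each search function with a normalized name.
    if name == "xml_escape":
        return codecs.CodecInfo(_xml_encode, _xml_decode, name="xml_escape")
    return None


codecs.register(_search)

escaped = codecs.encode("a < b & c", "xml_escape")
restored = codecs.decode(escaped, "xml_escape")
```

Once registered, the codec is reachable through codecs.encode and
codecs.decode like any stdlib codec.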

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] Add transform() and untranform() methods

2013-11-15 Thread Facundo Batista
On Thu, Nov 14, 2013 at 7:32 PM, Victor Stinner
victor.stin...@gmail.com wrote:

 I would prefer to split the registry of codecs to have 3 registries:

 - encoding (a better name can be found): encode str=>bytes, decode bytes=>str
 - bytes: encode bytes=>bytes, decode bytes=>bytes
 - str:  encode str=>str, decode str=>str

 And add transform() and untransform() methods to bytes and str types.
 In practice, it might be same codecs registry for all codecs just with
 a new attribute.

I like this idea very much.

But to check that I understand correctly, let me be more explicit...
you'll have (always py3k-speaking, of course):

- bytes.decode() -> str ... here you can only use unicode encodings
- no bytes.encode(), like today
- bytes.transform() -> bytes ... here you can only use things like
zlib, rot13, etc

- str.encode() -> bytes ... here you can only use unicode encodings
- no str.decode(), like today
- str.transform() -> str ... here you can only use things like... like what?

When to use decode/encode was always a major pain point for people, so
doing this extra separation and cleaning would bring more clarity to
when to use what.

Thanks!

--
.Facundo

Blog: http://www.taniquetil.com.ar/plog/
PyAr: http://www.python.org/ar/
Twitter: @facundobatista


Re: [Python-Dev] Add transform() and untranform() methods

2013-11-15 Thread Nick Coghlan
On 15 November 2013 22:24, Paul Moore p.f.mo...@gmail.com wrote:
 On 15 November 2013 12:07, Victor Stinner victor.stin...@gmail.com wrote:
 A new API for binary transforms is potentially an academically
 interesting concept, but it solves zero current real world problems.

 I would like to reply the same for these codecs: they are not solving
 any real world problem :-)

 As Nick is only documenting long-existing functions, I fail to see the
 issue here.

 If someone were to propose new methods, builtins, or module functions,
 then I could see a reason for debate. But surely simply documenting
 existing functions is not worth all this pushback?

There's a bit more to it than that (and that's why I started the other
thread about the codec aliases before proceeding to the final step).

One of the changes Victor is concerned about is that when you use an
incorrect codec in one of the Unicode-encoding-only convenience
methods, the recent exception updates explicitly push users towards
using those module level functions instead:

>>> import codecs
>>> "no good".encode("rot_13")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'rot_13' encoder returned 'str' instead of 'bytes'; use
codecs.encode() to encode to arbitrary types
>>> codecs.encode("just fine", "rot_13")
'whfg svar'

>>> b"no good".decode("quopri_codec")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'quopri_codec' decoder returned 'bytes' instead of 'str';
use codecs.decode() to decode to arbitrary types
>>> codecs.decode(b"just fine", "quopri_codec")
b'just fine'

My perspective is that, in current Python, that *is* the right thing
for people to do, and any hypothetical new API proposed for Python 3.5
would do nothing to change what's right for Python 3.4 code (or Python
2/3 compatible code). I also find it bizarre that several of those
arguing that this is too niche a feature to be worth refining are
simultaneously in favour of a proposal to add new *methods on builtin
types* for the same niche feature.
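
A hedged sketch of what following that nudge can look like in code. The
helper name is hypothetical. The session above shows a TypeError; later
releases report a LookupError for the same mistake, so the sketch
catches both.

```python
import codecs


def encode_text_or_transform(text, codec):
    """Try the str.encode() convenience method, then fall back."""
    try:
        return text.encode(codec)
    except (TypeError, LookupError):
        # str.encode() refused the codec: use the module-level
        # function, which has no output type check. An unknown codec
        # name still raises LookupError from here.
        return codecs.encode(text, codec)


rotated = encode_text_or_transform("just fine", "rot_13")
encoded = encode_text_or_transform("just fine", "utf-8")
```

In real code it is usually clearer to call codecs.encode directly when
the codec is known not to be a text encoding; the fallback is shown only
to mirror the error message's advice.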

The other part is the fact that I updated the What's New document to
highlight these tweaks:
http://docs.python.org/dev/whatsnew/3.4.html#improvements-to-handling-of-non-unicode-codecs

As noted earlier in the thread, Armin Ronacher has been the most vocal
of the users of this feature in Python 2 that lamented its absence in
Python 3 (see, for example,
http://lucumr.pocoo.org/2012/8/11/codec-confusion/), but I've also
received plenty of subsequent feedback along the lines of "what he
said!" (such as http://bugs.python.org/issue7475#msg187630).

Many of the proposed solutions from the people affected by the change
haven't been usable (since they've often been based on a
misunderstanding of why the method behaviour changed in Python 3 in
the first place), but the pain they experience is genuine, and it can
unnecessarily sour their whole experience of the transition. I
consider documenting the existing module level functions and nudging
users towards them when they try to use the affected codecs to be an
expedient way to say "yes, this is still available if you really want
to use it, but the required spelling is different".

However, the one thing I'm *not* going to do at this point is restore
the shorthand aliases, so those opposing the lowering of this barrier
to transition can take comfort in the fact they have succeeded in
ensuring that the out-of-the-box experience for users of this feature
migrating from Python 2 remains the unfriendly:

>>> b"abcdef".decode("hex")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
LookupError: unknown encoding: hex

Rather than the more useful:

>>> b"abcdef".decode("hex")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'hex' decoder returned 'bytes' instead of 'str'; use
codecs.decode() to decode to arbitrary types

Which would then lead them to the working (and still Python 2 compatible) code:

>>> codecs.decode(b"abcdef", "hex")
b'\xab\xcd\xef'

Regards,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] Add transform() and untranform() methods

2013-11-15 Thread Antoine Pitrou
On Fri, 15 Nov 2013 23:50:23 +1000
Nick Coghlan ncogh...@gmail.com wrote:
 
 My perspective is that, in current Python, that *is* the right thing
 for people to do, and any hypothetical new API proposed for Python 3.5
 would do nothing to change what's right for Python 3.4 code (or Python
 2/3 compatible code). I also find it bizarre that several of those
 arguing that this is too niche a feature to be worth refining are
 simultaneously in favour of a proposal to add new *methods on builtin
 types* for the same niche feature.

I am not claiming it is a niche feature, I am claiming codecs.encode()
and codecs.decode() don't solve the use case like the .transform()
and .untransform() methods do.

(I do think it is a nice feature in Python 2, although I find myself
using it mainly at the interpreter prompt, rather than in production
code)

 As noted earlier in the thread, Armin Ronacher has been the most vocal
 of the users of this feature in Python 2 that lamented its absence in
 Python 3 (see, for example,
 http://lucumr.pocoo.org/2012/8/11/codec-confusion/), but I've also
 received plenty of subsequent feedback along the lines of "what he
 said!" (such as http://bugs.python.org/issue7475#msg187630).

The way I read it, the positive feedback was about .transform()
and .untransform(), not about recommending people switch to
codecs.encode() and codecs.decode().

 Rather than the more useful:
 
 >>> b"abcdef".decode("hex")
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
 TypeError: 'hex' decoder returned 'bytes' instead of 'str'; use
 codecs.decode() to decode to arbitrary types

I think this may be confusing.  TypeError seems to suggest that the
parameter type sent by the user to the method is wrong, which is not
the actual cause of the error.

Regards

Antoine.


Re: [Python-Dev] Add transform() and untranform() methods

2013-11-15 Thread Nick Coghlan
On 16 November 2013 00:04, Antoine Pitrou solip...@pitrou.net wrote:
 Rather than the more useful:

 >>> b"abcdef".decode("hex")
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
 TypeError: 'hex' decoder returned 'bytes' instead of 'str'; use
 codecs.decode() to decode to arbitrary types

 I think this may be confusing.  TypeError seems to suggest that the
 parameter type sent by the user to the method is wrong, which is not
 the actual cause of the error.

The TypeError isn't new, only the part after the semi-colon telling
them that codecs.decode() doesn't include the typecheck (because it
isn't constrained by the text model).

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] Add transform() and untranform() methods

2013-11-15 Thread Stephen J. Turnbull
Walter Dörwald writes:
  On 15.11.2013 at 00:42, Serhiy Storchaka storch...@gmail.com wrote:
   
   15.11.13 00:32, Victor Stinner wrote:
   And add transform() and untransform() methods to bytes and str types.
   In practice, it might be same codecs registry for all codecs just with
   a new attribute.
   
   If the transform() method will be added, I prefer to have only
   one transformation method and specify a direction by the
   transformation name (bzip2/unbzip2).
  
  +1

-1

I can't support adding such methods (and that's why I ended up giving
Nick's proposal for exposing codecs.encode and codecs.decode a +1).
People think about these transformations as en- or de-coding, not
transforming, most of the time.  Even for a transformation that is
an involution (eg, rot13), people have a very clear idea of what's
encoded and what's not, and they are going to prefer the names
"encode" and "decode" for these (generic) operations in many cases.

Eg, I don't think s.transform(decoder) is an improvement over
decode(s, codec) (but tastes vary).[1]  It does mean that we need
to add a redundant method, and I don't really see an advantage to it.
The semantics seem slightly off to me, since the purpose of the
operation is to create a new object, not transform the original
in-place.  (But of course str.encode and bytes.decode are precedents
for those semantics.)


Footnotes: 
[1]  Arguments decoder and codec are identifiers, not metavariables.



Re: [Python-Dev] Add transform() and untranform() methods

2013-11-15 Thread Ethan Furman

On 11/14/2013 11:13 PM, Nick Coghlan wrote:


The proposal I posted to issue 7475 back in April (and, in the absence
of any objections to the proposal, finally implemented over the past
few weeks) was to take advantage of the fact that the codecs.encode
and codecs.decode convenience functions exist (and have been covered
by the regression test suite) as far back as Python 2.4. I did this
merely by documenting the existence of the functions for Python 2.7,
3.3 and 3.4, changing the exception messages thrown for codec output
type errors on the convenience methods to reference them, and by
updating the Python 3.4 What's New document to explain the changes.


Thanks for doing this work, Nick!

--
~Ethan~


Re: [Python-Dev] Add transform() and untranform() methods

2013-11-15 Thread Antoine Pitrou
On Sat, 16 Nov 2013 00:46:15 +1000
Nick Coghlan ncogh...@gmail.com wrote:
 On 16 November 2013 00:04, Antoine Pitrou solip...@pitrou.net wrote:
  Rather than the more useful:
 
  >>> b"abcdef".decode("hex")
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  TypeError: 'hex' decoder returned 'bytes' instead of 'str'; use
  codecs.decode() to decode to arbitrary types
 
  I think this may be confusing.  TypeError seems to suggest that the
  parameter type sent by the user to the method is wrong, which is not
  the actual cause of the error.
 
 The TypeError isn't new,

Really? That's not what your message said.

Regards

Antoine.


Re: [Python-Dev] Add transform() and untranform() methods

2013-11-15 Thread Walter Dörwald
On 15.11.2013 at 16:57, Stephen J. Turnbull step...@xemacs.org wrote:
 
 Walter Dörwald writes:
 On 15.11.2013 at 00:42, Serhiy Storchaka storch...@gmail.com wrote:
 
 15.11.13 00:32, Victor Stinner wrote:
 And add transform() and untransform() methods to bytes and str types.
 In practice, it might be same codecs registry for all codecs just with
 a new attribute.
 
 If the transform() method will be added, I prefer to have only
 one transformation method and specify a direction by the
 transformation name (bzip2/unbzip2).
 
 +1
 
 -1
 
 I can't support adding such methods (and that's why I ended up giving
 Nick's proposal for exposing codecs.encode and codecs.decode a +1).

My +1 was only for having the transformation be one-way under the condition 
that it is added at all.

 People think about these transformations as en- or de-coding, not
 transforming, most of the time.  Even for a transformation that is
 an involution (eg, rot13), people have a very clear idea of what's
 encoded and what's not, and they are going to prefer the names
 "encode" and "decode" for these (generic) operations in many cases.
 
 Eg, I don't think s.transform(decoder) is an improvement over
 decode(s, codec) (but tastes vary).[1]  It does mean that we need
 to add a redundant method, and I don't really see an advantage to it.

Actually my preferred spelling would be codec.decode(s), with codec being
the module that implements the functionality.

I don't think we need to invent another function registry.

 The semantics seem slightly off to me, since the purpose of the
 operation is to create a new object, not transform the original
 in-place.

This would mean the method would have to be called transformed()?

  (But of course str.encode and bytes.decode are precedents
 for those semantics.)
 
 
 Footnotes: 
 [1]  Arguments decoder and codec are identifiers, not metavariables.

Servus,
   Walter



Re: [Python-Dev] Add transform() and untranform() methods

2013-11-15 Thread Nick Coghlan
On 16 Nov 2013 02:36, Antoine Pitrou solip...@pitrou.net wrote:

 On Sat, 16 Nov 2013 00:46:15 +1000
 Nick Coghlan ncogh...@gmail.com wrote:
  On 16 November 2013 00:04, Antoine Pitrou solip...@pitrou.net wrote:
   Rather than the more useful:
  
   >>> b"abcdef".decode("hex")
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
   TypeError: 'hex' decoder returned 'bytes' instead of 'str'; use
   codecs.decode() to decode to arbitrary types
  
   I think this may be confusing.  TypeError seems to suggest that the
   parameter type sent by the user to the method is wrong, which is not
   the actual cause of the error.
 
  The TypeError isn't new,

 Really? That's not what your message said.

The second example in my post included restoring the "hex" alias for
"hex_codec" (its absence is the reason for the current "unknown encoding"
error). The 3.2 and 3.3 error message for a restored alias would have been
"TypeError: 'hex' decoder returned 'bytes' instead of 'str'", which I agree
is confusing and uninformative - that's why I added the reference to the
module level functions to the output type errors *before* proposing the
restoration of the aliases.

So you can already use codecs.decode(s, 'hex_codec') in Python 3, you
just won't get a useful error leading you there if you use the more common
'hex' alias instead.

To address Serhiy's security concerns with the compression codecs (which
are technically independent of the question of restoring the aliases), I
also plan to document how to systematically blacklist particular codecs in
an application by setting attributes on the encodings module and/or
appropriate entries in sys.modules.
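
A hedged sketch of the sys.modules half of that idea, not the documented
recipe itself. It assumes the stdlib search function finds codecs by
importing encodings.<name>, so a None entry in sys.modules makes the
import fail and the lookup report an unknown encoding. The helper name is
hypothetical, and the trick only works if the codec has not already been
looked up (and cached) earlier in the process.

```python
import codecs
import sys


def block_codec_module(module_name):
    """Stop the encodings package from importing one codec module."""
    # A None entry makes "import encodings.<module_name>" raise
    # ImportError, which the codec search treats as "not found".
    sys.modules["encodings." + module_name] = None


block_codec_module("bz2_codec")

try:
    codecs.encode(b"data", "bz2_codec")
    blocked = False
except LookupError:
    blocked = True
```

Aliases that resolve to the same module are covered as well, since they
end up importing the same blocked name.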

Finally, I now plan to write a documentation PEP that suggests clearly
splitting the codecs module docs into two layers: the type agnostic core
infrastructure and the specific application of that infrastructure to the
implementation of the text encoding model.

The only functional *change* I'd still like to make for 3.4 is to restore
the shorthand aliases for the non-Unicode codecs (to ease the migration for
folks coming from Python 2), but this thread has convinced me I likely need
to write the PEP *before* doing that, and I still have to integrate
ensurepip into pyvenv before the beta 1 deadline.

So unless you and Victor are prepared to +1 the restoration of the codec
aliases (closing issue 7475) in anticipation of that codecs infrastructure
documentation PEP, the change to restore the aliases probably won't be in
3.4. (I *might* get the PEP written in time regardless, but I'm not betting
on it at this point).

Cheers,
Nick.


 Regards

 Antoine.


Re: [Python-Dev] Add transform() and untranform() methods

2013-11-15 Thread Victor Stinner
2013/11/16 Nick Coghlan ncogh...@gmail.com:
 To address Serhiy's security concerns with the compression codecs (which are
 technically independent of the question of restoring the aliases), I also
 plan to document how to systematically blacklist particular codecs in an
 application by setting attributes on the encodings module and/or appropriate
 entries in sys.modules.

It would be simpler and safer to blacklist bytes=>bytes and str=>str
codecs from bytes.decode() and str.encode() directly. Marc-Andre
Lemburg proposed adding new attributes to CodecInfo to specify input
and output types.

 The only functional *change* I'd still like to make for 3.4 is to restore
 the shorthand aliases for the non-Unicode codecs (to ease the migration for
 folks coming from Python 2), but this thread has convinced me I likely need
 to write the PEP *before* doing that, and I still have to integrate
 ensurepip into pyvenv before the beta 1 deadline.

 So unless you and Victor are prepared to +1 the restoration of the codec
 aliases (closing issue 7475) in anticipation of that codecs infrastructure
 documentation PEP, the change to restore the aliases probably won't be in
 3.4. (I *might* get the PEP written in time regardless, but I'm not betting
 on it at this point).

Using the StackOverflow search engine, I found some posts where people
ask for the hex codec on Python 3. There are two answers: use the binascii
module or use codecs.encode(). So even if codecs.encode() was never
documented, it looks like it is used. So I now agree that documenting
it would not make the situation worse.

Adding transform()/untransform() methods to bytes and str is a
non-trivial change and not everybody likes them. Anyway, it's too late
for Python 3.4.

In my opinion, the best option is to add new input_type/output_type
attributes to CodecInfo right now, and modify the codecs so that
"abc".encode("hex") raises a LookupError (instead of a tricky error
message built with some evil low-level hacks on the traceback and the
exception, which was my initial concern in this mail thread). It also
fixes the security vulnerability.

To keep backward compatibility (even with custom codecs registered
manually), if input_type/output_type is not defined, we should
consider that the codec is a classical text encoding (encode
str=>bytes, decode bytes=>str).
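
A rough sketch of that rule, under stated assumptions: the type table
below is hypothetical and lives outside CodecInfo, which carries no
input_type/output_type attributes in this sketch. Codecs with no entry
are treated as classical text encodings.

```python
import codecs

# Hypothetical declarations of (input type, output type) for encode().
CODEC_TYPES = {
    "rot_13": (str, str),
    "hex_codec": (bytes, bytes),
    "zlib_codec": (bytes, bytes),
}
TEXT_ENCODING = (str, bytes)  # default when nothing is declared


def is_text_encoding(name):
    """Return True unless the codec declares non-text types."""
    key = name.lower().replace("-", "_")
    return CODEC_TYPES.get(key, TEXT_ENCODING) == TEXT_ENCODING


def strict_str_encode(text, encoding):
    """Like str.encode(), but reject non-text codecs with LookupError."""
    if not is_text_encoding(encoding):
        raise LookupError("%r is not a text encoding" % encoding)
    return codecs.encode(text, encoding)
```

Here strict_str_encode("abc", "utf-8") goes through, while
strict_str_encode("abc", "rot_13") fails before the codec ever runs.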

The type of codecs.encode() result is my least concern in this topic.

I created the following issue to implement my idea:
http://bugs.python.org/issue19619

Victor


Re: [Python-Dev] Add transform() and untranform() methods

2013-11-14 Thread Victor Stinner
Oh, I forgot to mention that I sent this email in reaction to this issue:

http://bugs.python.org/issue19585

Modifying the critical PyFrameObject because the codecs API raises
surprising errors doesn't sound correct. I prefer to fix how codecs
are used, than modifying the PyFrameObject.

For more information, see issue #7475, which has a long history (4
years) and many messages. Martin von Loewis wrote "I would still be
opposed to such a change, and I think it needs a PEP." and I still
agree with him on this point. Because there are different opinions and
no consensus, a PEP is required to explain why we took this decision
and list rejected alternatives.

http://bugs.python.org/issue7475

Victor

2013/11/14 Victor Stinner victor.stin...@gmail.com:
 Hi,

 I saw that Nick Coghlan documented codecs.encode() and
 codecs.decode(), and changed the exception raised when codecs like
 rot_13 are used on bytes.decode() and str.encode().

 I don't like the functions codecs.encode() and codecs.decode() because
 the type of the result depends on the encoding (second parameter). We
 try to avoid this in Python.

 I would prefer to split the registry of codecs to have 3 registries:

 - encoding (a better name can be found): encode str=>bytes, decode bytes=>str
 - bytes: encode bytes=>bytes, decode bytes=>bytes
 - str:  encode str=>str, decode str=>str

 And add transform() and untransform() methods to bytes and str types.
 In practice, it might be same codecs registry for all codecs just with
 a new attribute.

 Examples:

 - utf8: encoding
 - zlib: bytes
 - rot13: str

 The result type of bytes.transform/untransform would be bytes, and the
 result type of str.transform/untransform would be str.

 I don't know which exception should be raised when a codec is used in
 the wrong method. LookupError? TypeError "codec xxx cannot be used
 with method xxx.xx"? Something else?

 codecs.encode/decode() documentation should be removed. The functions
 should be kept, just in case someone uses them.

 Victor


Re: [Python-Dev] Add transform() and untranform() methods

2013-11-14 Thread Nick Coghlan
On 15 Nov 2013 08:34, Victor Stinner victor.stin...@gmail.com wrote:

 Hi,

 I saw that Nick Coghlan documented codecs.encode() and
 codecs.decode(), and changed the exception raised when codecs like
 rot_13 are used on bytes.decode() and str.encode().

 I don't like the functions codecs.encode() and codecs.decode() because
 the type of the result depends on the encoding (second parameter). We
 try to avoid this in Python.

The type signature of those functions is just object -> object (similar to
the way the 2.x convenience methods were actually basestring -> basestring).

 I would prefer to split the registry of codecs to have 3 registries:

 - encoding (a better name can be found): encode str=>bytes, decode bytes=>str
 - bytes: encode bytes=>bytes, decode bytes=>bytes
 - str:  encode str=>str, decode str=>str


You have to get it out of your head that codecs are just about text and
binary data. They're not: they're arbitrary type transforms, and MAL
deliberately wrote the module that way.

 And add transform() and untransform() methods to bytes and str types.
 In practice, it might be same codecs registry for all codecs just with
 a new attribute.

This is completely the wrong approach. There's zero justification for
adding new builtin methods for this use case - encoding and decoding are
generic operations, they should use functions not methods.

What could be useful is allowing CodecInfo objects to supply an expected
input type and an expected output type (ABCs and instance check
overrides make that quite flexible).


 Examples:

 - utf8: encoding
 - zlib: bytes
 - rot13: str

 The result type of bytes.transform/untransform would be bytes, and the
 result type of str.transform/untransform would be str.

 I don't know which exception should be raised when a codec is used in
 the wrong method. LookupError? TypeError "codec xxx cannot be used
 with method xxx.xx"? Something else?

We already do this check in the existing convenience methods - it raises
TypeError.


 codecs.encode/decode() documentation should be removed. The functions
 should be kept, just in case someone uses them.

No. They're part of the regression test suite, and have been since Python
2.4. They embody MAL's intended arbitrary type transform library
approach. They provide a source compatible mechanism for using binary
codecs in single code base Python 2/3 projects.

At this point, the only person that can get me to revert this clarification
of MAL's original vision for the codecs module is Guido, since anything
else completely fails to address the Python 3 adoption barrier posed by the
current state of Python 3's binary codec support.

Note that the only behavioural changes in the commits so far were to
exception handling - everything else was just docs.

The next planned commit (to restore the binary codec aliases) *is* a
behavioural change - that's why I posted to the list about it (it received
only two responses, both +1)

Cheers,
Nick.


 Victor
 ___
 Python-Dev mailing list
 Python-Dev@python.org
 https://mail.python.org/mailman/listinfo/python-dev
 Unsubscribe:
https://mail.python.org/mailman/options/python-dev/ncoghlan%40gmail.com
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Add transform() and untranform() methods

2013-11-14 Thread Nick Coghlan
On 15 Nov 2013 08:42, Victor Stinner victor.stin...@gmail.com wrote:

 Oh, I forgot to mention that I sent this email in reaction to this issue:

 http://bugs.python.org/issue19585

 Modifying the critical PyFrameObject because the codecs API raises
 surprising errors doesn't sound correct. I prefer to fix how codecs
 are used, than modifying the PyFrameObject.

 For more information, see issue #7475, which has a long history (4
 years) and many messages. Martin von Loewis wrote "I would still be
 opposed to such a change, and I think it needs a PEP." and I still
 agree with him on this point. Because there are different opinions and
 no consensus, a PEP is required to explain why we took this decision
 and list rejected alternatives.

 http://bugs.python.org/issue7475

Martin wrote that before it was pointed out there were existing functions
to handle the problem (I was asking for a PEP back then, too).

I posted my plan for dealing with this months ago without receiving any
complaints, and I'm annoyed you waited until I had actually followed
through and implemented it to complain about it and ask for Python 3's
binary codec support to stay broken instead :P

(Starting a new thread instead of replying to the one where I specifically
asked about taking the next step does nothing to improve my mood)

Regards,
Nick.


 Victor

 2013/11/14 Victor Stinner victor.stin...@gmail.com:
  Hi,
 
  I saw that Nick Coghlan documented codecs.encode() and
  codecs.decode(), and changed the exception raised when codecs like
  rot_13 are used on bytes.decode() and str.encode().
 
  I don't like the functions codecs.encode() and codecs.decode() because
  the type of the result depends on the encoding (second parameter). We
  try to avoid this in Python.
 
  I would prefer to split the registry of codecs to have 3 registries:
 
  - encoding (a better name can be found): encode str=>bytes, decode
bytes=>str
  - bytes: encode bytes=>bytes, decode bytes=>bytes
  - str:  encode str=>str, decode str=>str
 
  And add transform() and untransform() methods to bytes and str types.
  In practice, it might be same codecs registry for all codecs just with
  a new attribute.
 
  Examples:
 
  - utf8: encoding
  - zlib: bytes
  - rot13: str
 
  The result type of bytes.transform/untransform would be bytes, and the
  result type of str.transform/untransform would be str.
 
  I don't know which exception should be raised when a codec is used in
  the wrong method. LookupError? TypeError "codec xxx cannot be used
  with method xxx.xx"? Something else?
 
  codecs.encode/decode() documentation should be removed. The functions
  should be kept, just in case if someone uses them.
 
  Victor


Re: [Python-Dev] Add transform() and untranform() methods

2013-11-14 Thread Serhiy Storchaka

15.11.13 01:03, Nick Coghlan wrote:

We already do this check in the existing convenience methods - it raises
TypeError.


The problem with this check is that it happens *after* 
encoding/decoding. This opens the door to DoS (see my last message).





Re: [Python-Dev] Add transform() and untranform() methods

2013-11-14 Thread Nick Coghlan
On 15 Nov 2013 09:11, Nick Coghlan ncogh...@gmail.com wrote:


 On 15 Nov 2013 08:42, Victor Stinner victor.stin...@gmail.com wrote:
 
  Oh, I forgot to mention that I sent this email in reaction to this
issue:
 
  http://bugs.python.org/issue19585
 
  Modifying the critical PyFrameObject because the codecs API raises
  surprising errors doesn't sound correct. I prefer to fix how codecs
  are used, than modifying the PyFrameObject.
 
  For more information, see issue #7475, which has a long history (4
  years) and many messages. Martin von Loewis wrote "I would still be
  opposed to such a change, and I think it needs a PEP." and I still
  agree with him on this point. Because there are different opinions and
  no consensus, a PEP is required to explain why we took this decision
  and list rejected alternatives.
 
  http://bugs.python.org/issue7475

 Martin wrote that before it was pointed out there were existing functions
to handle the problem (I was asking for a PEP back then, too).

 I posted my plan for dealing with this months ago without receiving any
complaints, and I'm annoyed you waited until I had actually followed
through and implemented it to complain about it and ask for Python 3's
binary codec support to stay broken instead :P

Something I *would* be entirely happy to do is write a retroactive PEP
after beta 1 is out the door, explaining the history of this issue in a
more coherent form than the comment history on issue 7475 and the many
child issues it spawned.

This would also provide a better launching point for other enhancements in
Python 3.5 (frame annotations to remove the need for the exception chaining
hack and better input validation mechanisms for codecs that allow the
convenience methods to check that case explicitly rather than relying on
the exception chaining).

Cheers,
Nick.


 (Starting a new thread instead of replying to the one where I
specifically asked about taking the next step does nothing to improve my
mood)

 Regards,
 Nick.

 
  Victor
 
  2013/11/14 Victor Stinner victor.stin...@gmail.com:
   Hi,
  
   I saw that Nick Coghlan documented codecs.encode() and
   codecs.decode(), and changed the exception raised when codecs like
   rot_13 are used on bytes.decode() and str.encode().
  
   I don't like the functions codecs.encode() and codecs.decode() because
   the type of the result depends on the encoding (second parameter). We
   try to avoid this in Python.
  
   I would prefer to split the registry of codecs to have 3 registries:
  
   - encoding (a better name can be found): encode str=>bytes, decode
bytes=>str
   - bytes: encode bytes=>bytes, decode bytes=>bytes
   - str:  encode str=>str, decode str=>str
  
   And add transform() and untransform() methods to bytes and str types.
   In practice, it might be same codecs registry for all codecs just with
   a new attribute.
  
   Examples:
  
   - utf8: encoding
   - zlib: bytes
   - rot13: str
  
   The result type of bytes.transform/untransform would be bytes, and the
   result type of str.transform/untransform would be str.
  
   I don't know which exception should be raised when a codec is used in
   the wrong method. LookupError? TypeError "codec xxx cannot be used
   with method xxx.xx"? Something else?
  
   codecs.encode/decode() documentation should be removed. The functions
   should be kept, just in case if someone uses them.
  
   Victor


Re: [Python-Dev] Add transform() and untranform() methods

2013-11-14 Thread Serhiy Storchaka

15.11.13 00:32, Victor Stinner wrote:

And add transform() and untransform() methods to bytes and str types.
In practice, it might be same codecs registry for all codecs just with
a new attribute.


If a transform() method is added, I would prefer to have only one 
transformation method and specify the direction by the transformation name 
(bzip2/unbzip2).




Re: [Python-Dev] Add transform() and untranform() methods

2013-11-14 Thread Terry Reedy

On 11/14/2013 5:32 PM, Victor Stinner wrote:


I don't like the functions codecs.encode() and codecs.decode() because
the type of the result depends on the encoding (second parameter). We
try to avoid this in Python.


Such dependence is common with arithmetic.

>>> 1 + 2
3
>>> 1 + 2.0
3.0
>>> 1 + 2+0j
(3+0j)

>>> sum((1,2,3), 0)
6
>>> sum((1,2,3), 0.0)
6.0
>>> sum((1,2,3), 0.0+0j)
(6+0j)

for f in (compile, eval, getattr, iter, max, min, next, open, pow,
          round, type, vars):
    type(f(*args)) # depends on the inputs

That is a large fraction of the non-class builtin functions.

--
Terry Jan Reedy



Re: [Python-Dev] Add transform() and untranform() methods

2013-11-14 Thread Terry Reedy

On 11/14/2013 6:03 PM, Nick Coghlan wrote:


You have to get it out of your head that codecs are just about text and
binary data.


99+% of the current codec module doc leads one to that impression. The 
fact that codecs are expected to have a file reader and writer and that 
the default 'strict' error handler is specified in 2 out of the 3 mostly 
redundant lists as raising a UnicodeError reinforces the impression.



They're not: they're arbitrary type transforms, and MAL
deliberately wrote the module that way.


Generic functions are quite pythonic. However, I am not sure how much 
benefit there is to registering an arbitrary pair of bijective functions



This is completely the wrong approach. There's zero justification for
adding new builtin methods for this use case - encoding and decoding are
generic operations, they should use functions not methods.


Making 2/3 code easier is certainly a good reason for the codecs approach.


The next planned commit (to restore the binary codec aliases) *is* a
behavioural change - that's why I posted to the list about it (it
received only two responses, both +1)


If I understand correctly, I am mildly +1, but did not respond, thinking 
that 2 to 0 was sufficient response for you to continue ;-).


--
Terry Jan Reedy



Re: [Python-Dev] Add transform() and untranform() methods

2013-11-14 Thread Walter Dörwald
On 15.11.2013 at 00:42, Serhiy Storchaka storch...@gmail.com wrote:
 
 15.11.13 00:32, Victor Stinner wrote:
 And add transform() and untransform() methods to bytes and str types.
 In practice, it might be same codecs registry for all codecs just with
 a new attribute.
 
 If a transform() method is added, I would prefer to have only one 
 transformation method and specify the direction by the transformation name 
 (bzip2/unbzip2).

+1

Some of the transformations might not be reversible (s.transform("lower")? ;))

And the transform function probably doesn't need any error handling machinery.

What about the stream/iterator/incremental parts of the codec API?

Servus,
   Walter



Re: [Python-Dev] Add transform() and untranform() methods

2013-11-14 Thread Nick Coghlan
On 15 November 2013 11:10, Terry Reedy tjre...@udel.edu wrote:
 On 11/14/2013 5:32 PM, Victor Stinner wrote:

 I don't like the functions codecs.encode() and codecs.decode() because
 the type of the result depends on the encoding (second parameter). We
 try to avoid this in Python.


 Such dependence is common with arithmetic.

 >>> 1 + 2
 3
 >>> 1 + 2.0
 3.0
 >>> 1 + 2+0j
 (3+0j)

 >>> sum((1,2,3), 0)
 6
 >>> sum((1,2,3), 0.0)
 6.0
 >>> sum((1,2,3), 0.0+0j)
 (6+0j)

 for f in (compile, eval, getattr, iter, max, min, next, open, pow, round,
 type, vars):
   type(f(*args)) # depends on the inputs
 That is a large fraction of the non-class builtin functions.

*Type* dependence between inputs and outputs is common (and completely
non-controversial). The codecs system is different, since the
supported input and output types are *value* dependent, driven by the
name of the codec.

That's the part which makes the codec machinery interesting in
general, since it combines a value driven lazy loading mechanism
(based on the codec name) with the subsequent invocation of that
mechanism: the default codec search algorithm goes hunting in the
encodings package (or the alias dictionary), but you can register
custom search algorithms and provide encodings any way you want. It
does mean, however, that the most you can claim for the type signature
of codecs.encode and codecs.decode is that they accept an object and
return an object. Beyond that, it's completely driven by the value of
the codec.
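
To make the lookup mechanism concrete, here is a small sketch of a custom search function. The codec name "demo_reverse" and the helper names are invented for this example; codecs.register and codecs.CodecInfo are the real entry points.

```python
import codecs


def _reverse(obj, errors="strict"):
    # Codec functions return (output, amount of input consumed).
    return obj[::-1], len(obj)


def demo_search(name):
    """Search function: return a CodecInfo for our name, None otherwise."""
    if name == "demo_reverse":  # hypothetical codec name
        return codecs.CodecInfo(_reverse, _reverse, name="demo_reverse")
    return None


codecs.register(demo_search)

# The same codec name handles whatever object type the codec supports.
print(codecs.encode("abc", "demo_reverse"))
print(codecs.encode(b"abc", "demo_reverse"))
print(codecs.encode([1, 2, 3], "demo_reverse"))
```

Nothing in codecs.encode itself constrains the input or output type here; what is accepted is decided entirely by the codec that the name resolves to.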

In Python 2.x, the type constraint imposed by the str and unicode
convenience methods is basestring in, basestring out. As it happens,
all of the standard library codecs abide by that restriction, so it
was easy to interpret the codecs module itself as having the same
basestring in, basestring out limitation, especially given the heavy
focus on text encodings in the way it was documented. In practice, the
codecs weren't that open ended - some of them only accepted 8 bit
strings, some only accepted unicode, some accepted both (perhaps
relying on implicit decoding to unicode).

The migration to Python 3 made the contrast between the two far more
stark however, hence the long and involved discussion on issue 7475,
and the fact that the non-Unicode codecs are currently still missing
their shorthand aliases.

The proposal I posted to issue 7475 back in April (and, in the absence
of any objections to the proposal, finally implemented over the past
few weeks) was to take advantage of the fact that the codecs.encode
and codecs.decode convenience functions exist (and have been covered
by the regression test suite) as far back as Python 2.4. I did this
merely by documenting the existence of the functions for Python 2.7,
3.3 and 3.4, changing the exception messages thrown for codec output
type errors on the convenience methods to reference them, and by
updating the Python 3.4 What's New document to explain the changes.

This approach provides a Python 2/3 compatible solution for usage of
non-Unicode encodings: users simply need to call the existing module
level functions in the codecs module, rather than using the methods on
specific builtin types. This approach also means that the binary
codecs can be used with any bytes-like object (including memoryview
and array.array), rather than being limited to types that implement a
new method (like transform), and can also be used in Python 2/3
source compatible APIs (since the data driven nature of the problem
makes 2to3 unusable as a solution, and that doesn't help single code
base projects anyway).
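
As a quick sketch of the bytes-like point, the same module-level call can be fed several buffer types. This assumes Python 3 and the "hex_codec" codec; the variable names are just for the example.

```python
import array
import codecs

raw = b"hi"

# Four different bytes-like objects wrapping the same two bytes.
inputs = [raw, bytearray(raw), memoryview(raw), array.array("B", raw)]

# One function call covers all of them; no per-type method is needed.
results = [codecs.encode(obj, "hex_codec") for obj in inputs]
```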

From my point of view, this is now just a matter of better documenting
the status quo, and nudging people in the right direction when it
comes to using the appropriate API for non-Unicode codecs. Since we
now realise these functions have existed since Python 2.4, it doesn't
make sense to try to fundamentally change direction, but instead to
work on making it better.

A few things I noticed while implementing the recent updates:

- as you noted in your other email, while MAL is on record as saying
the codecs module is intended for arbitrary codecs, not just Unicode
encodings, readers of the current docs can definitely be forgiven for
not realising that. We really need to better separate the codecs
module docs from the text model docs (two new sections in the language
reference, one for the codecs machinery and one for the text model
would likely be appropriate. The io module docs and those for the
builtin open function may also be affected)
- a mechanism for annotating frames would help avoid the need for
nasty hacks like the exception wrapping that aims to make codec
failures easier to debug
- if codecs exposed a way to separate the input type check from the
invocation of the codec, we could redirect users to the module API for
bad input types as well (e.g. calling "input str".encode("bz2"))
- if we want something that doesn't need to be imported, then encode()
and decode() builtins make more sense than new methods on str, bytes