date:20130722

Re: Multirelease effort: Moving to Python 3

2013-07-22 Thread Nick Coghlan

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 07/22/2013 03:25 PM, Toshio Kuratomi wrote:
> On Mon, Jul 22, 2013 at 10:15:31AM +1000, Nick Coghlan wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>> 
>> On 07/20/2013 06:11 AM, Toshio Kuratomi wrote:
>>> pythonic is a very vague statement and I wouldn't consider most
>>> of your list to be examples of those.  Yes, python3 may be a
>>> *better* language (and I would include most of your list as
>>> "features of python3 that python2 does not have") but a more
>>> pythonic language... that's not something that you can readily
>>> measure.  For instance, I can make the case that python3's
>>> unicode handling is less pythonic than python2 as it violates
>>> three rules of the zen of python:
>>> 
>>> Explicit is better than implicit.
>>> Errors should never pass silently.
>>> In the face of ambiguity, refuse the temptation to guess.
>>> 
>>> (To be fair, python2 violated some of these rules in its
>>> unicode handling as well, although "errors should never pass
>>> silently" would probably take some work to convince most people
>>> :-)
>> 
>> The *only* reason Python 3 allows any Unicode errors to pass
>> silently is because Python 2 tolerated broken system
>> configurations (like non-UTF-8 filesystem metadata on nominally
>> UTF-8 systems) by treating them as opaque 8-bit strings that were
>> retrieved from OS interfaces and then passed back unmodified (see
>> PEP 383 for details). If Python 3 didn't work on those systems,
>> people would blame Python 3, not the already broken system
>> configuration ("But Python 2 works, why is Python 3 so broken?").
>> os.listdir() -> open() is the canonical example of the kind of
>> "round trip" activity that we felt we needed to support even for
>> systems with improperly encoded metadata (including file names).
>> 
> 
> Actually, surrogateescape is a *great* improvement over the
> previous python3 behaviour of silently dropping data that it did
> not understand.

That behaviour only existed in 3.1. It's one of the reasons nobody
really used 3.1 for anything ;)

> If python3 could just finally fix outputting text with
> surrogateescaped bytes then it would finally clean up the last
> portion of this and I would be able to stop pointing out the
> various ways that python3's unicode handling is just as broken as
> python2's -- just in different ways. :-)

Attempting to encode data containing surrogate escapes without setting
errors="surrogateescape" is a sign that tainted data has escaped
somewhere. So it's late notification of an error, but still indicative
of an error somewhere. We'll never silence it by default.
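
To make the "late notification" concrete, here is a minimal sketch using explicit codecs rather than any OS interface. The byte string is an invented example of an undecodable file name, not something taken from a real system.

```python
# Bytes that are not valid UTF-8, e.g. a file name from a misconfigured system
# (illustrative value only).
raw = b"abc\xff"

# Decoding with surrogateescape smuggles the undecodable byte 0xFF through
# as the lone surrogate U+DCFF instead of raising.
text = raw.decode("utf-8", errors="surrogateescape")

# A strict encode refuses the escaped data: this is the late error.
try:
    text.encode("utf-8")
except UnicodeEncodeError as exc:
    strict_error = exc
else:
    strict_error = None

# Passing the same handler on the way out restores the original bytes.
round_tripped = text.encode("utf-8", errors="surrogateescape")
```

The strict encode fails at the position of the escaped byte, which is the signal that data decoded under an assumption has reached a boundary that was not told about that assumption.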


>> Tainting would involve having the surrogateescape codec set an 
>> attribute on a string recording the encoding assumption if it had
>> to embed any surrogates in the Private Use Area, as well as a
>> keyword only "taint" argument to decode operations (e.g. to force
>> tainting when using "latin-1" as a universal text codec). Various
>> string operations would then be modified to use the following
>> rules:
>> 
>> * Both input strings untainted? Output is untainted.
>> * One input tainted, one untainted? Output is tainted with the
>>   same assumption as the tainted input.
>> * Both inputs tainted with the same assumption? Output is also
>>   tainted with that assumption.
>> 
> This sounds like it might be nice.  The one thing I'm a little
> unsure about is that it sounds like code is going to have to handle
> this explicitly. Judging from the way all but a select few people
> handle Text vs encoded bytes right now, that seems like it won't
> achieve very much.  OTOH, I could see this as being an additional
> bit of information that's entirely optional whether people use it.
> I think that could be helpful in some cases of debugging.  (OTOH,
> often when encoding vs text issues arise it's because the coder and
> program have no way to know the correct encoding.  When that
> happens, the extra information might not be that useful for the
> majority of cases anyway.)
> 
>> * Inputs tainted with different assumptions? Immediate
>> ValueError complaining about the taint mismatch
>> 
>> String encoding would be updated to trigger a ValueError when
>> asked to encode tainted strings to an encoding other than the
>> tainted one.
>> 
> 
> I'm a little leery of these.  The reason is that after using both
> python2 and the early versions of python3 I became a firm believer
> that the problem with python2's unicode handling wasn't that it
> threw exceptions, rather the problem was that the same bit of code
> was too prone to passing through certain data without error and
> throwing an error with other data. Programmers who tested their
> code with only ascii data or only data encoded in their locale's
> encoding, or only when their locale was a utf-8 encoding were
> unable to replicate or understand the errors that their users got
> when they ran them in the crazy real-world environments that users
> inevitably

Re: Multirelease effort: Moving to Python 3

2013-07-22 Thread Nick Coghlan

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 07/23/2013 12:42 AM, Toshio Kuratomi wrote:
> On Mon, Jul 22, 2013 at 05:15:50PM +1000, Nick Coghlan wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>> 
>> On 07/22/2013 03:25 PM, Toshio Kuratomi wrote:
>> 
>>> If python3 could just finally fix outputting text with 
>>> surrogateescaped bytes then it would finally clean up the last 
>>> portion of this and I would be able to stop pointing out the 
>>> various ways that python3's unicode handling is just as broken
>>> as python2's -- just in different ways. :-)
>> 
>> Attempting to encode data containing surrogate escapes without
>> setting errors="surrogateescape" is a sign that tainted data has
>> escaped somewhere. So it's late notification of an error, but
>> still indicative of an error somewhere. We'll never silence it by
>> default.
>> 
> That's a bit simplified from what python3's direction on this is
> unless Victor Stinner's work is only intended to be temporary.
> 
> $ export LC_ALL=en_US.utf-8
> $ mkdir abc$'\xff'
> $ python3.3
> >>> import os
> >>> se_dirlisting = os.listdir('.')
> >>> # surrogateescape in a text string:
> >>> repr(se_dirlisting[0])
> "'abc\\udcff'"
> >>> # This doesn't traceback and it has to encode se_dirlisting
> >>> # when passing it out of python:
> >>> os.listdir(se_dirlisting[0])
> []
> >>> # Works with other modules as well:
> >>> import subprocess
> >>> subprocess.call(['ls', se_dirlisting[0]])
> 0

For some APIs we set surrogateescape on the user's behalf. It's still
getting set, though :)

> AFAIK, the justification is that the surrogateescape'd strings are
> both coming from and going to the OS.  They're crossing outside of
> the line that python3 draws around itself and there's an implicit
> encoding and decoding there.  This seems fine to me as a strategy.
> The problems are just that there are places where python3 doesn't
> yet use surrogateescape when crossing this boundary.

Strictly speaking, there are a bunch of interfaces that we declare as
operating based on "os.fsencode" and "os.fsdecode". The fact we're not
especially clear on which encoding/decoding strategy we use for
particular APIs is the docs gap Armin was talking about in
http://lucumr.pocoo.org/2013/7/2/the-updated-guide-to-unicode/

The following (if I recall correctly) all use os.fsencode/decode:

  os.environ
  os.listdir
  sys.argv
  os.exec* functions
  subprocess environ
  subprocess arguments

We *should* use os.fsdecode to ensure file name attributes are always
unicode, but currently do not (I just filed a bug for that:
http://bugs.python.org/issue18534).
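
As a rough illustration of the os.fsdecode / os.fsencode pair that these interfaces are described as using, the sketch below round-trips an invented byte string. It assumes a POSIX system, where the filesystem codec is applied with the surrogateescape handler; the exact text produced depends on the configured filesystem encoding, so only the round trip is relied on here.

```python
import os

# Illustrative bytes that are not valid UTF-8 (not taken from a real system).
raw_name = b"abc\xff"

# Bytes -> str using the filesystem encoding and its error handler.
text_name = os.fsdecode(raw_name)

# str -> bytes using the same encoding and handler, so the original
# bytes come back unchanged.
restored = os.fsencode(text_name)
```

This is the same round trip that lets an os.listdir() result be handed straight back to open() or subprocess, even when the name could not be decoded cleanly.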

> The one I was specifically thinking of when I wrote this was the
> print() function:
> 
> >>> print(se_dirlisting[0])
> Traceback (most recent call last):
>   File "<stdin>", line 3, in <module>
> UnicodeEncodeError: 'utf-8' codec can't encode character
> '\udcff' in position 3: surrogates not allowed

Unfortunately, the standard streams *don't* currently use the same
scheme as we assume for other operating system APIs. The reason we
don't is because we're reluctant to assume that all data received over
those streams will be in an ASCII or UTF-8 compatible encoding (e.g.
our feedback from Japan is that there are still plenty of systems
there using non-ASCII compatible encodings).
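
The difference can be imitated without touching the real standard streams. The sketch below wraps an in-memory buffer in a UTF-8 text stream, once with the default strict handler and once with surrogateescape; write_name is a hypothetical helper for this illustration, not an existing API.

```python
import io

# A str carrying an escaped byte, as os.listdir() might return on a
# system with badly encoded file names (illustrative value only).
name = b"abc\xff".decode("utf-8", errors="surrogateescape")


def write_name(errors):
    """Write `name` through a UTF-8 text stream with the given error handler
    and return the bytes that reached the underlying buffer."""
    buffer = io.BytesIO()
    stream = io.TextIOWrapper(buffer, encoding="utf-8", errors=errors)
    stream.write(name)
    stream.flush()
    return buffer.getvalue()


# With surrogateescape the original byte is written back out.
escaped_output = write_name("surrogateescape")

# With the strict handler the write is rejected, as print() does in the
# session quoted above.
try:
    write_name("strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

Code that knowingly wants the round-trip behaviour on output can build such a wrapper around the binary stream itself, which keeps the decision explicit rather than a default.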

> (When I mentioned this at pycon you brought up: 
> http://bugs.python.org/issue15216 which looked promising but seems
> to have stalled ;-)

Alas, there's currently no champion to drive it. I'm interested, but
don't have enough time (improving the Unicode handling is third on the
list of "big problems in Python" that I currently care about, after
packaging and the initialisation code). Several of the other core devs
are sufficiently dubious of the notion of allowing mojibake to be
created silently that they're conflicted on offering the feature at
all, and thus not motivated to work on it :(

Since absolutely nobody in the world cares enough about the CPython
upstream to pay *anyone* to work on it full time (not even the Linux
distros or the members of the OpenStack foundation), this situation is
unlikely to change any time soon :(

>>>> * Inputs tainted with different assumptions? Immediate
>>>> ValueError complaining about the taint mismatch
>>>>
>>>> String encoding would be updated to trigger a ValueError
>>>> when asked to encode tainted strings to an encoding other
>>>> than the tainted one.
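
The tainting rules proposed in this thread (untainted inputs stay untainted, a single taint propagates, mismatched taints raise ValueError, and encoding to a different codec raises ValueError) could be sketched roughly as follows. TaintedStr and _canonical are hypothetical names invented for this illustration; nothing like this exists in CPython, and a real design would live on str itself.

```python
import codecs
from typing import Optional


def _canonical(encoding: str) -> str:
    """Normalise codec aliases ("UTF8", "utf-8") to a single name."""
    return codecs.lookup(encoding).name


class TaintedStr:
    """Hypothetical wrapper: text plus the encoding it was assumed to be in."""

    def __init__(self, text: str, taint: Optional[str] = None):
        self.text = text
        # None means untainted; otherwise the canonical codec name.
        self.taint = _canonical(taint) if taint is not None else None

    def __add__(self, other: "TaintedStr") -> "TaintedStr":
        # Different assumptions on both sides: fail immediately.
        if self.taint and other.taint and self.taint != other.taint:
            raise ValueError(
                f"taint mismatch: {self.taint!r} vs {other.taint!r}")
        # Otherwise the result carries whichever taint is present, if any.
        return TaintedStr(self.text + other.text, self.taint or other.taint)

    def encode(self, encoding: str) -> bytes:
        if self.taint is None:
            return self.text.encode(encoding)
        # Tainted text may only be encoded back to the assumed encoding.
        if _canonical(encoding) != self.taint:
            raise ValueError(
                f"cannot encode {self.taint!r}-tainted text as {encoding!r}")
        return self.text.encode(encoding, errors="surrogateescape")


clean = TaintedStr("dir/")
name = TaintedStr(b"abc\xff".decode("utf-8", "surrogateescape"), taint="utf-8")
joined = clean + name
```

Here joined inherits the "utf-8" taint from name, so joined.encode("UTF-8") returns the original bytes while joined.encode("latin-1") raises ValueError.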

>>> 
>>> I'm a little leery of these.  The reason is that after using
>>> both python2 and the early versions of python3 I became a firm
>>> believer that the problem with python2's unicode handling
>>> wasn't that it threw exceptions, rather the problem was that
>>> the same bit of code was too prone to passing through certain
>>> data without error and throwing an error with other data.
>>> Programmers who tested their code with only ascii data or only
>>> data encoded in their locale's encoding, or only when their
>>> locale was a utf-8 encoding were unable to replicate or
>>> understand the er

Re: Multirelease effort: Moving to Python 3

2013-07-22 Thread Bohuslav Kabrda

----- Original Message -----
> On Thu, Jul 18, 2013 at 11:24:22AM -0400, Bohuslav Kabrda wrote:
> > 3) Making all livecd packages depend on Python 3 by default (and
> > eventually getting rid of Python 2 from livecd) - this will also require
> > switching from Yum to DNF as a default, that is supposed to support Python
> 
> I have a concern about bloating @core and by extension the cloud image.
> Right now, python is about 5% of the total on-disk usage. I'd hate to see
> that go to 10%. Therefore, I'd like to see a goal of making the transition
> for usage in the base cloud image go entirely from python2 to python3 in
> one release cycle.
> 
> (Roughly, that's @core + cloud-init, which isn't currently on your list.)
> 

Doing everything in one shot sounds reasonable from this POV, I'll try to put 
this into my plan (I'll go through cloud-init the same way I went through 
livecd and post my findings here).

> 
> --
> Matthew Miller  ☁☁☁  Fedora Cloud Architect  ☁☁☁  

-- 
Regards,
Bohuslav "Slavek" Kabrda.
___
python-devel mailing list
[email protected]
https://admin.fedoraproject.org/mailman/listinfo/python-devel
