Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
-On [20090430 07:18], Martin v. Löwis (mar...@v.loewis.de) wrote: Suppose I create a new directory, and run the following script in 3.x:

>>> open("x", "w").close()
>>> open(b"\xff", "w").close()
>>> os.listdir(".")
['x']

That is actually a regression in 3.x:

Python 2.6.1 (r261:67515, Mar 8 2009, 11:36:21)
>>> import os
>>> open("x", "w").close()
>>> open(b"\xff", "w").close()
>>> os.listdir(".")
['x', '\xff']

[Apologies if that was completely clear through the entire discussion, but I've lost track at a given point.] -- Jeroen Ruigrok van der Werven asmodai(-at-)in-nomine.org / asmodai イェルーン ラウフロック ヴァン デル ウェルヴェン http://www.in-nomine.org/ | http://www.rangaku.org/ | GPG: 2EAC625B Heart is the engine of your body, but Mind is the engine of Life... ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On Wed, Apr 29, 2009 at 23:03, Terry Reedy tjre...@udel.edu wrote: Thomas Breuel wrote: Sure. However, that requires you to provide meaningful, reproducible counter-examples, rather than a stenographic formulation that might hint at some problem you apparently see (which I believe is just not there). Well, here's another one: PEP 383 would disallow UTF-8 encodings of half surrogates. By my reading, the current Unicode 5.1 definition of 'UTF-8' disallows that. If we use conformance to Unicode 5.1 as the basis for our discussion, then PEP 383 is off the table anyway. I'm all for strict Unicode compliance. But apparently, the Python community doesn't care. CESU-8 is described in Unicode Technical Report #26, so it at least has some official recognition. More importantly, it's also widely used. So, my question: what are the implications of PEP 383 for CESU-8 encodings on Python? My meta-point is: there are probably many more such issues hidden away and it is a really bad idea to rush something like PEP 383 out. Unicode is hard anyway, and tinkering with its semantics requires a lot of thought. Tom
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
On Thu, Apr 30, 2009 at 05:40, Curt Hagenlocher c...@hagenlocher.org wrote: IronPython will inherit whatever behavior Mono has implemented. The Microsoft CLR defines the native string type as UTF-16 and all of the managed APIs for things like file names and environmental variables operate on UTF-16 strings -- there simply are no byte string APIs. Yes. Now think about the implications. This means that adopting PEP 383 will make IronPython and Jython running on UNIX intrinsically incompatible with CPython running on UNIX, and there's no way to fix that. Tom
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
Jeroen Ruigrok van der Werven wrote: -On [20090430 07:18], Martin v. Löwis (mar...@v.loewis.de) wrote: Suppose I create a new directory, and run the following script in 3.x:

>>> open("x", "w").close()
>>> open(b"\xff", "w").close()
>>> os.listdir(".")
['x']

That is actually a regression in 3.x:

Correct - and precisely the issue that this PEP wants to address. For comparison, do os.listdir(u"."), though:

>>> os.listdir(u".")
[u'x', '\xff']

Regards, Martin
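For readers following the thread with a current interpreter: the round-trip mechanism PEP 383 proposes here eventually shipped in Python 3.1 as the `surrogateescape` error handler (the PEP's drafts call it "python-escape"). A minimal sketch of the behavior under discussion:

```python
# Sketch of the PEP's round-trip, using the surrogateescape error
# handler -- the name under which the PEP's mechanism shipped in 3.1.

raw = b"\xff"  # a file-name byte that is not valid UTF-8

# Decoding maps the undecodable byte 0xFF to the half surrogate U+DCFF...
name = raw.decode("utf-8", "surrogateescape")
assert name == "\udcff"

# ...and encoding maps it back, so os.* calls can reach the real file.
assert name.encode("utf-8", "surrogateescape") == raw
```

With this handler installed as the file-system error handler, `os.listdir(".")` no longer swallows the `b"\xff"` entry shown above.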
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
Thomas Breuel wrote: On Thu, Apr 30, 2009 at 05:40, Curt Hagenlocher c...@hagenlocher.org wrote: IronPython will inherit whatever behavior Mono has implemented. The Microsoft CLR defines the native string type as UTF-16 and all of the managed APIs for things like file names and environmental variables operate on UTF-16 strings -- there simply are no byte string APIs. Yes. Now think about the implications. This means that adopting PEP 383 will make IronPython and Jython running on UNIX intrinsically incompatible with CPython running on UNIX, and there's no way to fix that. *Not* adopting the PEP will also make CPython and IronPython incompatible, and there's no way to fix that. Regards, Martin
[Python-Dev] what Windows and Linux really do Re: PEP 383 (again)
Given the stated rationale of PEP 383, I was wondering what Windows actually does. So, I created some ISO8859-15 and ISO8859-8 encoded file names on a device, plugged them into my Windows Vista machine, and fired up Python 3.0. First, os.listdir("f:") returns a list of strings for those file names... but those unicode strings are illegal. You can't even print them without getting an error from Python. In fact, you also can't print strings containing the proposed half-surrogate encodings either: in both cases, the output encoder rejects them with a UnicodeEncodeError. (If not even Python, with its generally lenient attitude, can print those things, some other libraries probably will fail, too.) What about round tripping? So, if you take a malformed file name from an external device (say, because it was actually encoded iso8859-15 or East Asian) and write it to an NTFS directory, it seems to write malformed UTF-16 file names. In essence, Windows doesn't really use unicode, it just implements 16-bit raw character strings, just like UNIX historically implements raw 8-bit character strings. Then I tried the same thing on my Ubuntu 9.04 machine. It turns out that, unlike Windows, Linux seems to be moving to consistent use of valid UTF-8. If you plug in an external device and nothing else is known about it, it gets mounted with the utf8 option and the kernel actually seems to enforce UTF-8 encoding. I think this calls into question the rationale behind PEP 383, and we should first look into what the roadmap for UNIX/Linux and UTF-8 actually is. UNIX may have consistent unicode support (via UTF-8) before Windows. As I was saying, I think PEP 383 needs a lot more thought and research... Tom
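The printing failure Tom describes is reproducible without any external device: a lone surrogate is not a valid Unicode scalar value, so any strict encoder (such as the one a terminal's output stream may use) rejects it. A small sketch:

```python
# Lone surrogates are not valid scalar values, so a strict encoder
# refuses them -- which is why printing such names can fail.
s = "\udcff"  # the kind of half surrogate PEP 383 would produce

try:
    s.encode("utf-8")  # strict mode, as an output encoder might use
except UnicodeEncodeError as e:
    print("rejected:", e.reason)
```

Whether `print(s)` itself fails depends on the terminal's encoding and error handler, but the strict-encode rejection above is unconditional.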
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
Yes. Now think about the implications. This means that adopting PEP 383 will make IronPython and Jython running on UNIX intrinsically incompatible with CPython running on UNIX, and there's no way to fix that. *Not* adopting the PEP will also make CPython and IronPython incompatible, and there's no way to fix that. CPython and IronPython are incompatible. And they will stay incompatible if the PEP is adopted. They would become compatible if CPython adopted Mono and/or Java semantics. Since both have had to deal with this, have you looked at what they actually do before proposing PEP 383? What did you find? Why did you choose an incompatible approach for PEP 383? Tom
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/29/2009 8:46 PM, came the following characters from the keyboard of Terry Reedy: Glenn Linderman wrote: On approximately 4/29/2009 1:28 PM, came the following characters from So where is the ambiguity here? None. But not everyone can read all the Python source code to try to understand it; they expect the documentation to help them avoid that. Because the documentation is lacking in this area, it makes your concisely stated PEP rather hard to understand. If you think a section of the doc is grossly inadequate, and there is no existing issue on the tracker, feel free to add one. Thanks for clarifying the Windows behavior, here. A little more clarification in the PEP could have avoided lots of discussion. It would seem that a PEP, proposed to modify a poorly documented (and therefore likely poorly understood) area, should be educational about the status quo, as well as presenting the suggested change. Where the PEP proposes to change, it should start with the status quo. But Martin's somewhat reasonable position is that since he is not proposing to change behavior on Windows, it is not his responsibility to document what he is not proposing to change more adequately. This means, of course, that any observed change on Windows would then be a bug, or at least a break of the promise. On the other hand, I can see that this is enough related to what he is proposing to change that better doc would help. Yes; the very fact that the PEP discusses Windows, speaks about cross-platform code, and doesn't explicitly state that no Windows functionality will change, is confusing. An example of how to initialize things within a sample cross-platform application might help, especially if that initialization only happens if the platform is POSIX, or is commented to the effect that it has no effect on Windows, but makes POSIX happy. Or maybe it is all buried within the initialization of Python itself, and is not exposed to the application at all. 
I still haven't figured that out, but was not (and am still not) as concerned about that as ensuring that the overall algorithms are functional and useful and user-friendly. Showing it might have been helpful in making it clear that no Windows functionality would change, however. A statement that additional features are being added to allow cross-platform programs to deal with non-decodable bytes obtained from POSIX APIs using the same code that already works on Windows, would have made things much clearer. The present Abstract does, in fact, talk only about POSIX, but later statements about Windows muddy the water. Rationale paragraph 3 explicitly talks about cross-platform programs needing to work one way on Windows and another way on POSIX to deal with all the cases. It calls that a proposal, which I guess it is for command line and environment, but it is already implemented in both bytes and str forms for file names... so that further muddies the water. It is, of course, easier to point out deficiencies in a document than to write a better document; however, it is incumbent upon the PEP author to write a PEP that is good enough to get approved, and that means making it understandable enough that people are in favor... or to respond to the plethora of comments until people are in favor. I'm not sure which one is more time-consuming. I've reached the point, based on PEP and comment responses, where I now believe that the PEP is a solution to the problem it is trying to solve, and doesn't create ambiguities in the naming. I don't believe it is the best solution. The basic problem is the overuse of fake characters... normalizing them for display results in large data loss -- many characters would be translated to the same replacement characters. 
Solutions exist that would allow the use of fewer different fake characters in the strings, while still having a fake character as the escape character, to preserve the invariant that all the strings manipulated by python-escape from the PEP were, and become, strings containing fake characters (from a strict Unicode perspective), which is a nice invariant*. There even exist solutions that would use only one fake character (repeatedly if necessary), and all other characters generated would be displayable characters. This would ease the burden on the program in displaying the strings, and also on the user that might view the resulting mojibake in trying to differentiate one such string from another. Those are outlined in various emails in this thread, although some include my misconception that strings obtained via Unicode-enabled OS APIs would also need to be encoded and altered. If there is any interest in using a more readable encoding, I'd be glad to rework them to remove those misconceptions. * It would be nice to point out that invariant in the PEP, also. -- Glenn -- http://nevcal.com/ ===
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
On approximately 4/29/2009 10:17 PM, came the following characters from the keyboard of Martin v. Löwis: I don't understand the proposal and issues. I see a lot of people claiming that they do, and then spending all their time either talking past each other, or disagreeing. If everyone who claims they understand the issues actually does, why is it so hard to reach a consensus? Because the problem is difficult, and any solution has trade-offs. People disagree on which trade-offs are worse than others. I'd like to see some real examples of how things can break in the current system Suppose I create a new directory, and run the following script in 3.x:

>>> open("x", "w").close()
>>> open(b"\xff", "w").close()
>>> os.listdir(".")
['x']

but...

>>> os.listdir(b".")
[b'x', b'\xff']

If I quit Python, I can now do

mar...@mira:~/work/3k/t$ ls
?  x
mar...@mira:~/work/3k/t$ ls -b
\377  x

As you can see, there are two files in the current directory, but only one of them is reported by os.listdir. The same happens to command line arguments and environment variables: Python might swallow some of them. There is presently no solution for command line and environment variables, I guess... which adds some amount of urgency to the implementation of _something_, even if not this PEP. and I'd like any potential solution to be made available as a third-party package before it goes into the standard library (if possible). Unfortunately, at least for my solution, this isn't possible. I need to change the implementation of the existing file IO APIs. Other than initializing them to use UTF-8b instead of UTF-8, and to use the new python-escape handler? I'm sure if I read the code for that, I'd be able to figure out the answer... I don't find any documented way of adding an encoding/decoding handler to the file IO encoding technique, though which lends credence to your statement, but then that could also be an oversight on my part. 
One could envision a staged implementation: the addition of the ability to add encoding/decoding handlers to the file IO encoding/decoding process, and the external selection of your new python-escape handler during application startup. That way, the hooks would be in the file system to allow your solution to be used, but not require that it be used; competing solutions using similar technology could be implemented and evaluated. -- Glenn -- http://nevcal.com/ === A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
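At the codec level, the registration hook Glenn envisions does exist: `codecs.register_error` installs a named error handler that any decode call can select. A minimal sketch of a competing handler built this way (the handler name `"demo-escape"` and the byte-to-surrogate mapping below are illustrative, not the PEP's actual implementation):

```python
import codecs

def demo_escape(exc):
    # Illustrative handler: map each undecodable byte b to U+DC00+b,
    # the same shape as the PEP's proposed "python-escape" mapping.
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        return "".join(chr(0xDC00 + b) for b in bad), exc.end
    raise exc

codecs.register_error("demo-escape", demo_escape)

# An undecodable byte is smuggled through as a half surrogate:
assert b"a\xffb".decode("utf-8", "demo-escape") == "a\udcffb"
```

What the hook does not provide, and what Martin says requires interpreter changes, is a way to make the file-system APIs themselves use such a handler: that wiring is the part that cannot ship as a third-party package.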
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/29/2009 7:50 PM, came the following characters from the keyboard of Aahz: On Thu, Apr 30, 2009, Cameron Simpson wrote: The lengthy discussion mostly revolves around:

- Glenn points out that strings that came _not_ from listdir, and that are _not_ well-formed unicode (== have bare surrogates in them) but that were intended for use as filenames will conflict with the PEP's scheme - programs must know that these strings came from outside and must be translated into the PEP's funny-encoding before use in the os.* functions. Previous to the PEP they would get used directly and encode differently after the PEP, thus producing different POSIX filenames. Breakage.
- Glenn would like the encoding to use Unicode scalar values only, using a rare-in-filenames character. That would avoid the issue with 'outside' strings that contain surrogates. To my mind it just moves the punning from rare illegal strings to merely uncommon but legal characters.
- Some parties think it would be better to not return strings from os.listdir but a subclass of string (or at least a duck-type of string) that knows where it came from and is also handily recognisable as not-really-a-string for purposes of deciding whether it is PEP-funny-encoded by direct inspection.

Assuming people agree that this is an accurate summary, it should be incorporated into the PEP. I'll agree that once other misconceptions were explained away, that the remaining issues are those Cameron summarized. Thanks for the summary! Point two could be modified because I've changed my opinion; I like the invariant Cameron first (I think) explicitly stated about the PEP as it stands, and that I just reworded in another message, that the strings that are altered by the PEP in either direction are in the subset of strings that contain fake (from a strict Unicode viewpoint) characters. 
I still think an encoding that uses mostly real characters that have assigned glyphs would be better than the encoding in the PEP; but would now suggest that an escape character be a fake character. I'll note here that while the PEP encoding causes illegal bytes to be translated to one fake character, the 3-byte sequence that looks like the range of fake characters would also be translated to a sequence of 3 fake characters. This is 512 combinations that must be translated, and understood by the user (or at least by the programmer). The escape sequence approach requires changing only 257 combinations, and each altered combination would result in exactly 2 characters. Hence, this seems simpler to understand, and to manually encode and decode for debugging purposes. -- Glenn -- http://nevcal.com/ === A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
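The two cases Glenn distinguishes can be checked concretely against the handler as it eventually shipped (`surrogateescape`): a single illegal byte becomes one fake character, while the 3-byte sequence that spells a half surrogate in UTF-8 is itself invalid UTF-8 and is escaped byte by byte, yielding three:

```python
# One illegal byte -> one half surrogate:
assert b"\xff".decode("utf-8", "surrogateescape") == "\udcff"

# The 3-byte UTF-8 spelling of U+DC80 (ED B2 80) is invalid UTF-8,
# so each byte is escaped separately -> three half surrogates:
assert (b"\xed\xb2\x80".decode("utf-8", "surrogateescape")
        == "\udced\udcb2\udc80")
```

So a byte string that already "looks like" the escape range cannot collide with a genuinely escaped byte; the two decode to different strings, which is the round-trip property the PEP relies on.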
Re: [Python-Dev] what Windows and Linux really do Re: PEP 383 (again)
Thomas Breuel wrote: Given the stated rationale of PEP 383, I was wondering what Windows actually does. So, I created some ISO8859-15 and ISO8859-8 encoded file names on a device, plugged them into my Windows Vista machine, and fired up Python 3.0. How did you do that, and what were the specific names that you had chosen? How does explorer display the file names? First, os.listdir("f:") returns a list of strings for those file names... but those unicode strings are illegal. What was the exact result that you got? You can't even print them without getting an error from Python. This is unrelated to the PEP. Try to run the same code in IDLE, or use the ascii() function. What about round tripping? So, if you take a malformed file name from an external device (say, because it was actually encoded iso8859-15 or East Asian) and write it to an NTFS directory, it seems to write malformed UTF-16 file names. In essence, Windows doesn't really use unicode, it just implements 16bit raw character strings, just like UNIX historically implements raw 8bit character strings. I think you misinterpreted what you saw. To find out what way you misinterpreted it, we would have to know what it is that you saw. I think this calls into question the rationale behind PEP 383, and we should first look into what the roadmap for UNIX/Linux and UTF-8 actually is. UNIX may have consistent unicode support (via UTF-8) before Windows. If so, PEP 383 won't hurt. If you never get decode errors for file names, you can just ignore PEP 383. It's only for those of us who do get decode errors. Regards, Martin
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
CPython and IronPython are incompatible. And they will stay incompatible if the PEP is adopted. They would become compatible if CPython adopted Mono and/or Java semantics. Which one should it adopt? Mono semantics, or Java semantics? Since both have had to deal with this, have you looked at what they actually do before proposing PEP 383? What did you find? See http://mail.python.org/pipermail/python-3000/2007-September/010450.html Why did you choose an incompatible approach for PEP 383? Because in Python, we want to be able to access all files on disk. Neither Java nor Mono are capable of doing that. Regards, Martin
[Python-Dev] PEP 383 and GUI libraries
I checked how GUI libraries deal with half surrogates. In pygtk, a warning gets issued to the console /tmp/helloworld.py:71: PangoWarning: Invalid UTF-8 string passed to pango_layout_set_text() self.window.show() and then the widget contains three crossed boxes. wxpython (in its wxgtk version) behaves the same way. PyQt displays a single square box. Regards, Martin
Re: [Python-Dev] PEP 383 and GUI libraries
On approximately 4/30/2009 1:48 AM, came the following characters from the keyboard of Martin v. Löwis: I checked how GUI libraries deal with half surrogates. In pygtk, a warning gets issued to the console /tmp/helloworld.py:71: PangoWarning: Invalid UTF-8 string passed to pango_layout_set_text() self.window.show() and then the widget contains three crossed boxes. wxpython (in its wxgtk version) behaves the same way. PyQt displays a single square box. Interesting. Did you use a name with other characters? Were they displayed? Both before and after the surrogates? Did you use one or three half surrogates, to produce the three crossed boxes? Did you use one or three half surrogates, to produce the single square box? -- Glenn -- http://nevcal.com/ === A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Assuming people agree that this is an accurate summary, it should be incorporated into the PEP. Done! Regards, Martin
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
I think it has to be excluded from mapping in order to not introduce security issues. I think you are right. I have now excluded ASCII bytes from being mapped, effectively not supporting any encodings that are not ASCII compatible. Does that sound ok? Regards, Martin
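The restriction Martin describes here is visible in the handler as shipped: with an ASCII-compatible codec, bytes below 0x80 decode normally and only high bytes become half surrogates. A sketch of why this closes the security hole:

```python
# Bytes < 0x80 pass through untouched; only bytes >= 0x80 are escaped.
name = b"report\xff.txt".decode("ascii", "surrogateescape")
assert name == "report\udcff.txt"

# In particular, an ASCII "/" can never be hidden inside an escape,
# so a smuggled path separator cannot appear after re-encoding --
# the security property under discussion.
assert "/" not in b"\xff".decode("ascii", "surrogateescape")
```

The cost, as Martin notes, is that encodings that are not ASCII-compatible (where a high byte might be part of an ASCII character's representation) fall outside the scheme.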
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
Since both have had to deal with this, have you looked at what they actually do before proposing PEP 383? What did you find? See http://mail.python.org/pipermail/python-3000/2007-September/010450.html Thanks, that's very useful. Why did you choose an incompatible approach for PEP 383? Because in Python, we want to be able to access all files on disk. Neither Java nor Mono are capable of doing that. OK, so what's wrong with os.listdir() and similar functions returning a unicode string for strings that correctly encode/decode, and with byte strings for strings that are not valid unicode? The file I/O functions already seem to deal with byte strings correctly, you never get byte strings on platforms that are fully unicode, and they are well supported. Tom
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
OK, so what's wrong with os.listdir() and similar functions returning a unicode string for strings that correctly encode/decode, and with byte strings for strings that are not valid unicode? See http://bugs.python.org/issue3187 in particular msg71655 Regards, Martin
Re: [Python-Dev] what Windows and Linux really do Re: PEP 383 (again)
Thomas Breuel tmbdev at gmail.com writes: So, I created some ISO8859-15 and ISO8859-8 encoded file names on a device, plugged them into my Windows Vista machine, and fired up Python 3.0. First, os.listdir("f:") returns a list of strings for those file names... but those unicode strings are illegal. Sorry, when you report such experiments, is it too much to ask for a cut and paste of your Python session? You are being unhelpful with such unsubstantiated statements, and your mails are taking a lot of valuable bandwidth. Antoine.
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
On Thu, Apr 30, 2009 at 12:32, Martin v. Löwis mar...@v.loewis.de wrote: OK, so what's wrong with os.listdir() and similar functions returning a unicode string for strings that correctly encode/decode, and with byte strings for strings that are not valid unicode? See http://bugs.python.org/issue3187 in particular msg71655 Why didn't you point to that discussion from PEP 383? And why didn't you point to Kowalczyk's message on encodings in Mono, Java, etc. from the PEP? You could have saved us all a lot of time. Under the set of constraints that Guido imposes, plus the requirement that round-trip works for illegal encodings, there is no other solution than PEP 383. That doesn't make PEP 383 right--I still think it's a bad decision--but it makes it pointless to discuss it any further. Tom
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
2009/4/30 Martin v. Löwis mar...@v.loewis.de: OK, so what's wrong with os.listdir() and similar functions returning a unicode string for strings that correctly encode/decode, and with byte strings for strings that are not valid unicode? See http://bugs.python.org/issue3187 in particular msg71655 Can I suggest that a pointer to this issue be added to the PEP? It certainly seems like a lot of the discussion of options available is captured there. And the fact that Guido's views are noted there is also useful (as he hasn't been contributing to this thread). 2009/4/30 Thomas Breuel tmb...@gmail.com: Since both have had to deal with this, have you looked at what they actually do before proposing PEP 383? What did you find? See http://mail.python.org/pipermail/python-3000/2007-September/010450.html Thanks, that's very useful. This reference could probably be usefully added to the PEP as well. Paul.
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
On 08:25 am, mar...@v.loewis.de wrote: Why did you choose an incompatible approach for PEP 383? Because in Python, we want to be able to access all files on disk. Neither Java nor Mono are capable of doing that. Java is not capable of doing that. Mono, as I keep pointing out, is. It uses NULLs to escape invalid UNIX filenames. Please see: http://go-mono.com/docs/index.aspx?link=T%3AMono.Unix.UnixEncoding The upshot to all this is that Mono.Unix and Mono.Unix.Native can list, access, and open all files on your filesystem, regardless of encoding.
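The shape of the NUL-escaping approach being credited to Mono can be illustrated with a toy codec. This is a sketch only: the details of Mono's actual UnixEncoding differ, and the `decode_filename`/`encode_filename` names and the "NUL + raw byte" escape form are inventions for illustration. The point it demonstrates is the alternative trade-off: every decoded name is made of valid Unicode scalar values (no surrogates), at the price of NUL acting as a reserved escape character.

```python
ESC = "\x00"  # NUL as the escape character; POSIX filenames cannot
              # contain NUL, so it is free to reserve.

def decode_filename(raw: bytes) -> str:
    """Decode UTF-8, escaping each undecodable byte as ESC + chr(byte)."""
    out = []
    i = 0
    while i < len(raw):
        # Try the longest decodable prefix (UTF-8 chars are <= 4 bytes).
        for j in range(min(len(raw), i + 4), i, -1):
            try:
                out.append(raw[i:j].decode("utf-8"))
                i = j
                break
            except UnicodeDecodeError:
                continue
        else:
            out.append(ESC + chr(raw[i]))  # escape the bad byte
            i += 1
    return "".join(out)

def encode_filename(name: str) -> bytes:
    """Invert decode_filename, recovering the original bytes."""
    out = bytearray()
    i = 0
    while i < len(name):
        if name[i] == ESC and i + 1 < len(name):
            out.append(ord(name[i + 1]))  # escaped raw byte
            i += 2
        else:
            out += name[i].encode("utf-8")
            i += 1
    return bytes(out)

raw = b"ab\xffc"                    # not valid UTF-8
name = decode_filename(raw)         # "ab\x00\xffc" -- all scalar values
assert encode_filename(name) == raw # round-trips
```

Unlike the PEP's half surrogates, such strings print and pass through strict encoders; the punning moves instead to strings that happen to contain the escape character.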
Re: [Python-Dev] what Windows and Linux really do Re: PEP 383 (again)
There are several different ways I tried it. The easiest was to mount a vfat file system with various encodings on Linux and use the Python byte interface to write file names, then plug that flash drive into Windows. So can you share precisely what you have done, to allow others to reproduce it? I think you misinterpreted what you saw. To find out what way you misinterpreted it, we would have to know what it is that you saw. I didn't interpret it much at all. I'm just saying that the PEP 383 assumption that these problems can't occur on Windows isn't true. What are these problems, and where does PEP 383 say they can't occur on Windows? What could Python do differently on Windows? I can plug in a flash drive with malformed strings, and somewhere between the disk and Python, something maps those strings onto unicode in some way, and it's done in a way that's different from PEP 383. Of course it is. The Windows FAT driver has chosen some mapping for the file names to Unicode, and most likely not the encoding that you meant it to use. There is now no way for a Win32 application to find out how the file name is actually represented on disk, short of implementing the FAT file system itself. So what Python does is the best possible solution already - report the file names as-is, with no interpretation. My point remains that I think PEP 383 shouldn't be rushed through, and one should look more carefully first at what the Windows kernel does in these situations, and what Mono and Java do. These questions really have been studied on this list for the last eight years, over and over again. It's not being rushed. Regards, Martin
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
Java is not capable of doing that. Mono, as I keep pointing out, is. It uses NULLs to escape invalid UNIX filenames. Please see: http://go-mono.com/docs/index.aspx?link=T%3AMono.Unix.UnixEncoding The upshot to all this is that Mono.Unix and Mono.Unix.Native can list, access, and open all files on your filesystem, regardless of encoding. OK, so why not adopt the Mono solution in CPython? It seems to produce valid unicode strings, removing at least one issue with PEP 383. It also means that IronPython and CPython actually would be compatible. Tom
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
On Thu, 30 Apr 2009 at 11:26, gl...@divmod.com wrote: On 08:25 am, mar...@v.loewis.de wrote: Why did you choose an incompatible approach for PEP 383? Because in Python, we want to be able to access all files on disk. Neither Java nor Mono are capable of doing that. Java is not capable of doing that. Mono, as I keep pointing out, is. It uses NULLs to escape invalid UNIX filenames. Please see: http://go-mono.com/docs/index.aspx?link=T%3AMono.Unix.UnixEncoding The upshot to all this is that Mono.Unix and Mono.Unix.Native can list, access, and open all files on your filesystem, regardless of encoding. And then it goes on to say: You won't be able to pass non-Unicode filenames as command-line arguments.(*) Not only that, but you can't reliably use such files with System.IO (whatever that is, but it sounds pretty basic). This support is only available within the Mono.Unix and Mono.Unix.Native namespaces. Now, I don't know what that means (never having touched Mono), but it doesn't sound like it simplifies cross-platform support, which is what PEP 383 is aiming for. So it doesn't sound like Mono has solved the problem that Martin is trying to solve, even if it is possible to put Unix specific code into your Mono app to deal with byte filenames on disk from within your GUI. FWIW I'm +1 on seeing PEP 383 in 3.1, if Martin can manage the patch in time. --David (*) I'd argue that in an important sense that makes Martin's statement about Mono being unable to access all files on disk a true statement; but, then, I freely admit that I have a bias against GUI programs in general :)
Re: [Python-Dev] what Windows and Linux really do Re: PEP 383 (again)
On Thu, Apr 30, 2009 at 10:21, Martin v. Löwis mar...@v.loewis.de wrote: Thomas Breuel wrote: Given the stated rationale of PEP 383, I was wondering what Windows actually does. So, I created some ISO8859-15 and ISO8859-8 encoded file names on a device, plugged them into my Windows Vista machine, and fired up Python 3.0. How did you do that, and what were the specific names that you had chosen? There are several different ways I tried it. The easiest was to mount a vfat file system with various encodings on Linux and use the Python byte interface to write file names, then plug that flash drive into Windows. I think you misinterpreted what you saw. To find out what way you misinterpreted it, we would have to know what it is that you saw. I didn't interpret it much at all. I'm just saying that the PEP 383 assumption that these problems can't occur on Windows isn't true. I can plug in a flash drive with malformed strings, and somewhere between the disk and Python, something maps those strings onto unicode in some way, and it's done in a way that's different from PEP 383. Mono and Java must have their own solutions that are different from PEP 383. My point remains that I think PEP 383 shouldn't be rushed through, and one should look more carefully first at what the Windows kernel does in these situations, and what Mono and Java do. Tom
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
Why didn't you point to that discussion from the PEP 383? And why didn't you point to Kowalczyk's message on encodings in Mono, Java, etc. from the PEP? Because I assumed that readers of the PEP would know (and I'm sure many of them do - this has been *really* discussed over and over again). Under the set of constraints that Guido imposes, plus the requirement that round-trip works for illegal encodings, there is no other solution than PEP 383. Well, there actually is an alternative: expose byte-oriented interfaces in parallel with the string-oriented ones. In the rationale, the PEP explains why I consider this the worse choice. Regards, Martin
Re: [Python-Dev] PEP 383 and GUI libraries
Did you use a name with other characters? Were they displayed? Both before and after the surrogates? Yes, yes, and yes (IOW, I put the surrogate in the middle). Did you use one or three half surrogates, to produce the three crossed boxes? Only one, and it produced three boxes - probably one for each UTF-8 byte that pango considered invalid. Did you use one or three half surrogates, to produce the single square box? Again, only one. Apparently, PyQt passes the Python Unicode string to Qt in a character-by-character representation, rather than going through UTF-8. Regards, Martin
[Python-Dev] PEP 382 update
Guido found out that I had misunderstood the existing pkg mechanism: If a zope package is imported, and it uses pkgutil.extend_path, then it won't glob for files ending in .pkg, but instead searches the path for files named zope.pkg. IOW, this is unsuitable as a foundation of PEP 382. I have now changed the PEP to call the files .pth, more in line with how top-level .pth files work, and added a statement that the import feature of .pth files is not provided for package .pth files (use __init__.py instead). Regards, Martin
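The existing pkgutil.extend_path mechanism that PEP 382 builds on can be demonstrated directly. The sketch below (package name `pkg383demo` and the temp-dir layout are invented for this illustration) creates two "portions" of one package on different sys.path roots; the portion that is imported first calls extend_path in its __init__.py, which scans sys.path for other directories of the same name and appends them to __path__:

```python
import os
import sys
import tempfile

# Two hypothetical sys.path roots, each contributing a portion of the
# same package "pkg383demo" (all names invented for this sketch).
root_a, root_b = tempfile.mkdtemp(), tempfile.mkdtemp()
for root, body in (
    (root_a, "from pkgutil import extend_path\n"
             "__path__ = extend_path(__path__, __name__)\n"),
    (root_b, ""),
):
    os.makedirs(os.path.join(root, "pkg383demo"))
    with open(os.path.join(root, "pkg383demo", "__init__.py"), "w") as f:
        f.write(body)

sys.path[:0] = [root_a, root_b]
import pkg383demo

# extend_path found the second portion, so __path__ now spans both roots.
print(pkg383demo.__path__)
```

This is the str-based directory scan the message refers to; the *.pkg-file behavior Guido pointed out (searching for a file literally named zope.pkg) is a separate branch of the same function.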
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
And then it goes on to say: You won't be able to pass non-Unicode filenames as command-line arguments.(*) Not only that, but you can't reliably use such files with System.IO (whatever that is, but it sounds pretty basic). This support is only available within the Mono.Unix and Mono.Unix.Native namespaces. Now, I don't know what that means (never having touched Mono), but it doesn't sound like it simplifies cross-platform support, which is what PEP 383 is aiming for. The problem there isn't how the characters are quoted, but that they are quoted at all, and that the ECMA and Microsoft libraries don't understand this quoting convention. Since command line parsing is handled through ECMA, you happen not to be able to get at those files (that's fixable, but why bother). The analogous problem exists with Martin's proposal in Python: if you pass a unicode string from Python to some library through a unicode API and that library attempts to open the file, it will fail because it doesn't use the proposed Python utf-8b decoder. There just is no way to fix that, no matter which quoting convention you use. In contrast to PEP 383, quoting with U+0000 at least results in valid unicode strings in Python. And command line arguments (and environment variables etc.) would work in Python because in Python, those should also use the new encoding for invalid UTF-8 inputs. Tom
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
Because in Python, we want to be able to access all files on disk. Neither Java nor Mono are capable of doing that. Java is not capable of doing that. Mono, as I keep pointing out, is. It uses NULLs to escape invalid UNIX filenames. Please see: http://go-mono.com/docs/index.aspx?link=T%3AMono.Unix.UnixEncoding The upshot to all this is that Mono.Unix and Mono.Unix.Native can list, access, and open all files on your filesystem, regardless of encoding. I think this is misleading. With Mono 2.0.1, I get

** (/tmp/a.exe:30553): WARNING **: FindNextFile: Bad encoding for '/home/martin/work/3k/t/\xff' Consider using MONO_EXTERNAL_ENCODINGS

when running the program

using System.IO;
class X {
    public static void Main(string[] args) {
        DirectoryInfo di = new DirectoryInfo(".");
        foreach (FileInfo fi in di.GetFiles())
            System.Console.WriteLine("Next:" + fi.Name);
    }
}

On the other hand, when I write

using Mono.Unix;
class X {
    public static void Main(string[] args) {
        UnixDirectoryInfo di = new UnixDirectoryInfo(".");
        foreach (UnixFileSystemInfo fi in di.GetFileSystemEntries())
            System.Console.WriteLine("Next:" + fi.Name);
    }
}

I do indeed get all files listed (and can also find out the other stat results). Of course, the resulting application will be Mono-specific (it links with Mono.Posix), and will not work on Microsoft .NET anymore. IOW, IronPython likely won't use this API. Python, of course, already has the equivalent of that: os.listdir, with a bytes parameter, will give you access to all files. If you wanted to closely emulate the Mono API, you could set the file system encoding to the Mono-lookalike codec. Regards, Martin
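The bytes-parameter behavior Martin mentions can be sketched as follows. This assumes a POSIX filesystem that accepts arbitrary non-NUL bytes in names (e.g. ext4 on Linux; macOS HFS+ would reject the b'\xff' name):

```python
import os
import tempfile

# Create two files in a temp dir, one with a name that is not valid
# UTF-8 (assumes a Linux-style filesystem that permits raw bytes).
d = tempfile.mkdtemp().encode()
open(os.path.join(d, b"\xff"), "w").close()
open(os.path.join(d, b"plain"), "w").close()

# bytes in, bytes out: every file is visible, no codec involved.
print(sorted(os.listdir(d)))
```

This is the "byte-oriented interface in parallel" that already exists; PEP 383 is about making the str-returning form of listdir equally complete.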
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
OK, so why not adopt the Mono solution in CPython? It seems to produce valid unicode strings, removing at least one issue with PEP 383. It also means that IronPython and CPython actually would be compatible. See my other message. The Mono solution may not be what you expect it to be. Regards, Martin
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
The upshot to all this is that Mono.Unix and Mono.Unix.Native can list, access, and open all files on your filesystem, regardless of encoding. I think this is misleading. With Mono 2.0.1, I get This has nothing to do with how Mono quotes. The reason for this is that Mono quotes at all and that the Mono developers decided not to change System.IO to understand UNIX quoting. If Mono used PEP 383 quoting, this would fail the same way. And analogous failures will exist with PEP 383 in Python, because there will be more and more libraries with unicode interfaces that then use their own internal decoder (which doesn't understand utf8b) to get a UNIX file name. Tom
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
This has nothing to do with how Mono quotes. The reason for this is that Mono quotes at all and that the Mono developers decided not to change System.IO to understand UNIX quoting. If Mono used PEP 383 quoting, this would fail the same way. And analogous failures will exist with PEP 383 in Python, because there will be more and more libraries with unicode interfaces that then use their own internal decoder (which doesn't understand utf8b) to get a UNIX file name. What's an analogous failure? Or, rather, why would a failure analogous to the one I got when using System.IO.DirectoryInfo ever exist in Python? Regards, Martin
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
[top-posting for once to preserve full quoting] Glenn, Could you please reduce your suggestions into sample text for the PEP? We seem to be now at the stage where nobody is objecting to the PEP, so the focus should be on making the PEP clearer. If you still want to create an alternative PEP implementation, please provide step-by-step walkthroughs, preferably in a new thread -- if you did previously provide that, it's gotten lost in the flood of messages. On Thu, Apr 30, 2009, Glenn Linderman wrote: On approximately 4/29/2009 8:46 PM, came the following characters from the keyboard of Terry Reedy: Glenn Linderman wrote: On approximately 4/29/2009 1:28 PM, came the following characters from So where is the ambiguity here? None. But not everyone can read all the Python source code to try to understand it; they expect the documentation to help them avoid that. Because the documentation is lacking in this area, it makes your concisely stated PEP rather hard to understand. If you think a section of the doc is grossly inadequate, and there is no existing issue on the tracker, feel free to add one. Thanks for clarifying the Windows behavior, here. A little more clarification in the PEP could have avoided lots of discussion. It would seem that a PEP, proposed to modify a poorly documented (and therefore likely poorly understood) area, should be educational about the status quo, as well as presenting the suggested change. Where the PEP proposes to change, it should start with the status quo. But Martin's somewhat reasonable position is that since he is not proposing to change behavior on Windows, it is not his responsibility to document what he is not proposing to change more adequately. This means, of course, that any observed change on Windows would then be a bug, or at least a break of the promise. On the other hand, I can see that this is enough related to what he is proposing to change that better doc would help. 
Yes; the very fact that the PEP discusses Windows, speaks about cross-platform code, and doesn't explicitly state that no Windows functionality will change, is confusing. An example of how to initialize things within a sample cross-platform application might help, especially if that initialization only happens if the platform is POSIX, or is commented to the effect that it has no effect on Windows, but makes POSIX happy. Or maybe it is all buried within the initialization of Python itself, and is not exposed to the application at all. I still haven't figured that out, but was not (and am still not) as concerned about that as ensuring that the overall algorithms are functional and useful and user-friendly. Showing it might have been helpful in making it clear that no Windows functionality would change, however. A statement that additional features are being added to allow cross-platform programs deal with non-decodable bytes obtained from POSIX APIs using the same code that already works on Windows, would have made things much clearer. The present Abstract does, in fact, talk only about POSIX, but later statements about Windows muddy the water. Rationale paragraph 3, explicitly talks about cross-platform programs needing to work one way on Windows and another way on POSIX to deal with all the cases. It calls that a proposal, which I guess it is for command line and environment, but it is already implemented in both bytes and str forms for file names... so that further muddies the water. It is, of course, easier to point out deficiencies in a document than to write a better document; however, it is incumbent upon the PEP author to write a PEP that is good enough to get approved, and that means making it understandable enough that people are in favor... or to respond to the plethora of comments until people are in favor. I'm not sure which one is more time-consuming. 
I've reached the point, based on PEP and comment responses, where I now believe that the PEP is a solution to the problem it is trying to solve, and doesn't create ambiguities in the naming. I don't believe it is the best solution. The basic problem is the overuse of fake characters... normalizing them for display results in large data loss -- many characters would be translated to the same replacement characters. Solutions exist that would allow the use of fewer different fake characters in the strings, while still having a fake character as the escape character, to preserve the invariant that all the strings manipulated by python-escape from the PEP were, and become, strings containing fake characters (from a strict Unicode perspective), which is a nice invariant*. There even exist solutions that would use only one fake character (repeatedly if necessary), and all other characters generated would be displayable characters. This would ease
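One hypothetical reading of Glenn's "single fake escape character" idea can be sketched in a few lines. This is not part of PEP 383 and not Glenn's concrete proposal; it just illustrates the shape of the alternative: one lone surrogate serves as the escape, and each undecodable byte is rendered as two displayable hex digits after it (a literal ESC in the input is a corner case this sketch does not handle):

```python
# Hypothetical single-escape-character scheme; an illustration only.
ESC = "\udc00"   # the one "fake" (lone surrogate) character used

def hexescape_decode(raw: bytes) -> str:
    s = raw.decode("utf-8", "surrogateescape")
    return "".join(
        ESC + format(ord(c) & 0xFF, "02x")   # bad byte -> ESC + hex digits
        if 0xDC80 <= ord(c) <= 0xDCFF else c
        for c in s
    )

def hexescape_encode(text: str) -> bytes:
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] == ESC:                   # escape + two hex digits
            out.append(int(text[i + 1:i + 3], 16))
            i += 3
        else:
            out += text[i].encode("utf-8")
            i += 1
    return bytes(out)

raw = b"log-\xfe\xff.txt"
assert hexescape_encode(hexescape_decode(raw)) == raw   # round trip holds
```

Displayed naively, such a string shows at most one unrenderable glyph per bad byte, followed by readable hex, which is the display property Glenn is arguing for.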
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
Martin v. Löwis wrote: OK, so why not adopt the Mono solution in CPython? It seems to produce valid unicode strings, removing at least one issue with PEP 383. It also means that IronPython and CPython actually would be compatible. See my other message. The Mono solution may not be what you expect it to be. Have we considered discussing the problem with the developers and users of the other languages to reach a common solution? ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] what Windows and Linux really do Re: PEP 383 (again)
You can't even print them without getting an error from Python. In fact, you also can't print strings containing the proposed half-surrogate encodings either: in both cases, the output encoder rejects them with a UnicodeEncodeError. (If not even Python, with its generally lenient attitude, can print those things, some other libraries probably will fail, too.) I think you may be confusing two completely separate things; it's a long-known issue that the Windows console is simply not a Unicode-aware display device naturally. You have to manually set the codepage (by typing 'chcp 65001' -- that's utf8) *and* manually make sure you have a unicode-enabled font chosen for it (which for console fonts is extremely limited to none, and last I looked the default font didn't support unicode) before you can even try to successfully print valid unicode. The default codepage is 437 (for me at least; I think it depends on which language of Windows you're using) which is ASCII-/ish/. You have to do your test in an environment which actually supports displaying unicode at all, or it's meaningless. Personally and for all the use cases I have to deal with at work, I would /love/ to see this PEP succeed. Being able to query a list of files in a directory and get them -all-, display them all to a user (which necessitates it being converted to unicode one way or the other. I don't care if certain characters don't display: as long as any arbitrary file will always end up looking like a distinct series of readable and unreadable glyphs so the user can select it clearly), and then perform operations on any selected file regardless of whatever nonsense may be going on underneath with confused users and encodings... in a cross-platform way, would be a tremendous boon to future py3k porting efforts. I ramble. If there's inconsistent encodings used by users on a posix system so that they can only make sense of half of what the names really are... that's for other programs to deal with.
I just want to be able to access the files they tell me they want. For anyone who is doing something low-level, they can use the bytes API. --Stephen
Re: [Python-Dev] PEP 383 and GUI libraries
FWIW, I'm in agreement with this PEP (i.e. its status is now Accepted). Martin, you can update the PEP and start the implementation. On Thu, Apr 30, 2009 at 2:12 AM, Martin v. Löwis mar...@v.loewis.de wrote: Did you use a name with other characters? Were they displayed? Both before and after the surrogates? Yes, yes, and yes (IOW, I put the surrogate in the middle). Did you use one or three half surrogates, to produce the three crossed boxes? Only one, and it produced three boxes - probably one for each UTF-8 byte that pango considered invalid. Did you use one or three half surrogates, to produce the single square box? Again, only one. Apparently, PyQt passes the Python Unicode string to Qt in a character-by-character representation, rather than going through UTF-8. -- --Guido van Rossum (home page: http://www.python.org/~guido/)
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
What's an analogous failure? Or, rather, why would a failure analogous to the one I got when using System.IO.DirectoryInfo ever exist in Python? Mono.Unix uses an encoder and a decoder that knows about special quoting rules. System.IO uses a different encoder and decoder because it's a reimplementation of a Microsoft library and the Mono developers chose not to implement Mono.Unix quoting rules in it. There is nothing technical preventing System.IO from using the Mono.Unix codec, it's just that the developers didn't want to change the behavior of an ECMA and Microsoft library. The analogous phenomenon will exist in Python with PEP 383. Let's say I have a C library with wide character interfaces and I pass it a unicode string from Python.(*) That C library now turns that unicode string into UTF-8 for writing to disk using its internal UTF-8 converter. The result is that the file can be opened using Python's open, but it can't be opened using the other library. There simply is no way you can guarantee that all libraries turn unicode strings into pathnames using utf-8b. I'm not arguing about whether that's good or bad anymore, since it's obvious that the only proposal acceptable to Guido uses some form of non-standard encoding / quoting. I'm simply pointing out that the failure you observed with System.IO has nothing to do with which quoting convention you choose, but results from the fact that the developers of System.IO are not using the same encoder/decoder as Mono.Unix (in that case, by choice). So, I don't see any reason to prefer your half surrogate quoting to the Mono U+0000-based quoting. Both seem to achieve the same goal with respect to round tripping file names, displaying them, etc., but Mono quoting actually results in valid unicode strings. It works because NUL is the one character that's not legal in a UNIX path name. So, why do you prefer half surrogate coding to U+0000 quoting? Tom (*) There's actually a second, subtle issue.
PEP 383 intends utf-8b only to be used for file names. But that means that I might have to bind the first argument to TIFFOpen with utf-8b conversion, while I might have to bind other arguments with utf-8 conversion.
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
On 02:42 pm, tmb...@gmail.com wrote: So, why do you prefer half surrogate coding to U+0000 quoting? I have also been eagerly waiting for an answer to this question. I am afraid I have lost it somewhere in the storm of this thread :). Martin, if you're going to stick with the half-surrogate trick, would you mind adding a section to the PEP on alternate encoding strategies, explaining why the NULL method was not selected?
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
2009/4/30 Thomas Breuel tmb...@gmail.com: The analogous phenomenon will exist in Python with PEP 383. Let's say I have a C library with wide character interfaces and I pass it a unicode string from Python.(*) [...] (*) There's actually a second, subtle issue. PEP 383 intends utf-8b only to be used for file names. But that means that I might have to bind the first argument to TIFFOpen with utf-8b conversion, while I might have to bind other arguments with utf-8 conversion. The footnote seems to imply that you have a concrete case rather than a hypothetical one. The discussion would be much easier if you would supply the concrete details. Then other participants in the discussion could offer concrete suggestions on how your issue could be addressed. Of course, there are 2 provisos here: 1. Maybe you don't care any more, having accepted that the PEP is going to be implemented. That's fine, but there's also no point continuing to argue your case in that event. 2. Maybe you aren't going to accept suggestions that don't conform to your idea of how things should be done. In which case, your reasoning is circular, and you're wasting people's time. Sorry, that sounds grumpy. But I get a headache at the best of times trying to understand Unicode issues, and theoretical, vague, descriptions of problems just make my headache worse... I suggest the discussion should be dropped now, as the PEP has been accepted. Paul.
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
What's an analogous failure? Or, rather, why would a failure analogous to the one I got when using System.IO.DirectoryInfo ever exist in Python? Mono.Unix uses an encoder and a decoder that knows about special quoting rules. System.IO uses a different encoder and decoder because it's a reimplementation of a Microsoft library and the Mono developers chose not to implement Mono.Unix quoting rules in it. There is nothing technical preventing System.IO from using the Mono.Unix codec, it's just that the developers didn't want to change the behavior of an ECMA and Microsoft library. The analogous phenomenon will exist in Python with PEP 383. Let's say I have a C library with wide character interfaces and I pass it a unicode string from Python.(*) That C library now turns that unicode string into UTF-8 for writing to disk using its internal UTF-8 converter. What specific library do you have in mind? Would it always use UTF-8? If so, it will fail in many other ways, as well - if the locale charset is different from UTF-8. I fail to see the analogy. In Python, the standard library works, and the extension fails; in Mono, it's actually vice versa, and not at all analogous. So, I don't see any reason to prefer your half surrogate quoting to the Mono U+0000-based quoting. Both seem to achieve the same goal with respect to round tripping file names, displaying them, etc., but Mono quoting actually results in valid unicode strings. It works because NUL is the one character that's not legal in a UNIX path name. So, why do you prefer half surrogate coding to U+0000 quoting? If I pass a string with an embedded U+0000 to gtk, gtk will truncate the string, and stop rendering it at this character. This is worse than what it does for invalid UTF-8 sequences. Chances are fairly high that other C libraries will fail in the same way, in particular if they expect char* (which is very common in C).
So I prefer the half surrogate because its failure mode is better th (*) There's actually a second, subtle issue. PEP 383 intends utf-8b only to be used for file names. But that means that I might have to bind the first argument to TIFFOpen with utf-8b conversion, while I might have to bind other arguments with utf-8 conversion. I couldn't find a Python wrapper for libtiff. If a wrapper was written, it would indeed have to use the file system encoding for the file name parameters. However, it would have to do that even without PEP 383, since the file name should be encoded in the locale's encoding, not in UTF-8, anyway. Regards, Martin
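Martin's point about char*-based C code stopping at a NUL can be observed from Python via ctypes. A small sketch, assuming a Unix-like system where the C library can be located:

```python
import ctypes
import ctypes.util

# Assumes a Unix-like system; find_library locates the C library.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

name = "report\x00ff"          # a NUL-marked name, Mono-style
raw = name.encode("utf-8", "surrogateescape")

# strlen stops at the embedded NUL: the C side never sees the rest.
n = libc.strlen(raw)
print(n)   # 6 -- only "report" is visible through a char* interface
```

This is exactly the failure mode Martin describes for gtk: any library that takes char* treats the escape character as a terminator.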
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
On Thu, Apr 30, 2009 at 09:42, Thomas Breuel tmb...@gmail.com wrote: So, I don't see any reason to prefer your half surrogate quoting to the Mono U+0000-based quoting. Both seem to achieve the same goal with respect to round tripping file names, displaying them, etc., but Mono quoting actually results in valid unicode strings. It works because NUL is the one character that's not legal in a UNIX path name. This seems to summarize only half of the problem. Mono's U+0000 quoting creates a string which is an invalid filename; PEP 383's creates one which is an unsanctioned collection of code units. Neither can be passed directly to the posix filesystem in question. I favor PEP 383 because its Unicode strings can be usefully passed to most APIs that would display it usefully. Mono's U+0000 probably truncates most strings. And since such non-valid Unicode strings can occur on the Windows filesystem, I don't find their use in PEP 383 to be a flaw. -- Michael Urman
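The two halves of Michael's comparison can be demonstrated concretely (the file name "report" here is invented for the illustration):

```python
import os

# Mono-style marker: a well-formed Unicode string, but the embedded
# NUL means no filesystem call will ever accept it as a name.
try:
    os.stat("report\x00ff")
except ValueError:
    print("NUL-marked name rejected before it even reaches the kernel")

# PEP 383-style marker: a lone surrogate is an unsanctioned code unit,
# so the strict codec refuses it -- but surrogateescape round-trips it
# back to the original byte, which is what os.* does internally.
name = "report\udcff"
try:
    name.encode("utf-8")
except UnicodeEncodeError:
    print("lone surrogate is not strict UTF-8")
print(name.encode("utf-8", "surrogateescape"))   # b'report\xff'
```

So neither marker is "valid" all the way down; the designs differ in where the invalidity surfaces.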
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
Martin, if you're going to stick with the half-surrogate trick, would you mind adding a section to the PEP on alternate encoding strategies, explaining why the NULL method was not selected? In the PEP process, it isn't my job to criticize competing proposals. Instead, proponents of competing proposals should write alternative PEPs, which then get criticized on their own. As the PEP author, I would have to collect the objections to the PEP in the PEP, which I did; I'm not convinced that I would have to also collect all alternative proposals that people come up with in the PEP (except when they are in fact amendments that I accept). I hope I had made it clear that I don't try to shoot down alternative proposals, but have rather asked people making alternative proposals to write their own PEPs. At some point (when the amount of alternative proposals grew unreasonably), I stopped responding to each and every alternative proposal that this should be proposed in a separate PEP. Wrt. escaping with U+0000: I personally disliked it because I considered it difficult to implement. In particular, on encoding: how do you arrange the encoder not to encode the NUL character in the encoding, as it would surely be a valid character? The surrogate approach works much better here, as it will automatically invoke the error handler. With further testing, I found that in practice, the proposal also suffers from the problem that the character would be taken as a terminating character by APIs - I found that to be a real problem in gtk, and have added that to the PEP. Regards, Martin
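Martin's implementation argument can be seen in a few lines with the error handler that PEP 383 led to (shipped as "surrogateescape" in Python 3.1): the half surrogate is itself unencodable by the strict codec, so re-encoding automatically invokes the error handler, whereas a NUL is a perfectly ordinary character that a codec would pass through silently:

```python
raw = b"gr\xfc.txt"                     # Latin-1 'grü.txt', not valid UTF-8
s = raw.decode("utf-8", "surrogateescape")
assert s == "gr\udcfc.txt"              # bad byte -> lone half surrogate

# Encoding trips over the surrogate, so the handler restores the byte:
assert s.encode("utf-8", "surrogateescape") == raw   # lossless round trip

# A NUL, by contrast, encodes without ever calling an error handler,
# which is why a NUL-escape scheme needs a custom codec end to end:
assert "\x00".encode("utf-8") == b"\x00"
```

The codec machinery does the bookkeeping for free in the surrogate design; the NUL design would have to special-case the escape character in every encoder and decoder.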
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
On 03:35 pm, mar...@v.loewis.de wrote: So, why do you prefer half surrogate coding to U+0000 quoting? If I pass a string with an embedded U+0000 to gtk, gtk will truncate the string, and stop rendering it at this character. This is worse than what it does for invalid UTF-8 sequences. Chances are fairly high that other C libraries will fail in the same way, in particular if they expect char* (which is very common in C). Hmm. I believe the intended failure mode here, for PyGTK at least, is actually this: TypeError: GtkLabel.set_text() argument 1 must be string without null bytes, not unicode APIs in PyGTK which accept NULLs and silently truncate are probably broken. Although perhaps I've just made your point even more strongly; one because the behavior is inconsistent, and two because it sometimes raises an exception if a NULL is present, and apparently the goal here is to prevent exceptions from being raised anywhere in the process. For this idiom to be of any use to GTK programs, gtk.FileChooser.get_filename() will probably need to be changed, since (in py2) it currently returns a str, not unicode. The PEP should say something about how GUI libraries should handle file choosers, so that they'll be consistent and compatible with the standard library. Perhaps only that file choosers need to take this PEP into account, and the rest is obvious. Or maybe the right thing for GTK to do would be to continue to use bytes on POSIX and convert to text on Windows, since open(), listdir() et. al. will continue to accept bytes for filenames? So I prefer the half surrogate because its failure mode is better th Heh heh heh.
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
On 04:07 pm, mar...@v.loewis.de wrote: Martin, if you're going to stick with the half-surrogate trick, would you mind adding a section to the PEP on alternate encoding strategies, explaining why the NULL method was not selected? In the PEP process, it isn't my job to criticize competing proposals. Instead, proponents of competing proposals should write alternative PEPs, which then get criticized on their own. As the PEP author, I would have to collect the objections to the PEP in the PEP, which I did; I'm not convinced that I would have to also collect all alternative proposals that people come up with in the PEP (except when they are in fact amendments that I accept). Fair enough. I have probably misunderstood the process. I dimly recalled reading some PEPs which addressed alternate approaches in this way and I thought it was part of the process. Anyway, congratulations on getting the PEP accepted, good luck with the implementation. Thanks for addressing my question.
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
If I pass a string with an embedded U+0000 to gtk, gtk will truncate the string, and stop rendering it at this character. This is worse than what it does for invalid UTF-8 sequences. Chances are fairly high that other C libraries will fail in the same way, in particular if they expect char* (which is very common in C). Hmm. I believe the intended failure mode here, for PyGTK at least, is actually this: TypeError: GtkLabel.set_text() argument 1 must be string without null bytes, not unicode It may depend on the widget also, I tried it with wxMessageDialog (I only had the wx example available, and am using wxgtk). APIs in PyGTK which accept NULLs and silently truncate are probably broken. Although perhaps I've just made your point even more strongly; one because the behavior is inconsistent, and two because it sometimes raises an exception if a NULL is present, and apparently the goal here is to prevent exceptions from being raised anywhere in the process. Indeed so. For this idiom to be of any use to GTK programs, gtk.FileChooser.get_filename() will probably need to be changed, since (in py2) it currently returns a str, not unicode. Perhaps - the entire PEP is about Python 3 only. I don't know whether PyGTK already works with 3.x. The PEP should say something about how GUI libraries should handle file choosers, so that they'll be consistent and compatible with the standard library. Perhaps only that file choosers need to take this PEP into account, and the rest is obvious. Or maybe the right thing for GTK to do would be to continue to use bytes on POSIX and convert to text on Windows, since open(), listdir() et al. will continue to accept bytes for filenames? In Python 3, the file chooser should definitely return strings, and it would be good if they were PEP 383 compliant. So I prefer the half surrogate because its failure mode is better th Heh heh heh. 
And it wasn't even intentional :-) Martin
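The half-surrogate scheme under debate can be sketched in modern Python 3, where the error handler discussed here eventually shipped under the name "surrogateescape" (the thread calls the codec "utf-8b"; the handler name and this demo are not from the 2009 codebase under discussion):

```python
# PEP 383 round-trip: an undecodable byte is escaped as a lone half
# surrogate in U+DC80..U+DCFF, and the encoder maps it back to the
# original byte, so os.listdir() results survive a listdir/open cycle.
raw = b"caf\xff"                               # not valid UTF-8
text = raw.decode("utf-8", "surrogateescape")
assert text == "caf\udcff"                     # 0xFF escaped as U+DCFF
back = text.encode("utf-8", "surrogateescape")
assert back == raw                             # original bytes restored exactly
```

Note that the escaped result is exactly the kind of non-conforming string (a lone surrogate) that makes C libraries such as GTK a concern in this thread.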
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Cameron Simpson writes: On 29Apr2009 22:14, Stephen J. Turnbull step...@xemacs.org wrote: | Baptiste Carvello writes: | By contrast, if the new utf-8b codec would *supercede* the old one, | \udcxx would always mean raw bytes (at least on UCS-4 builds, where | surrogates are unused). Thus ambiguity could be avoided. | | Unfortunately, that's false. [Because Python strings are | intended to be used as containers for widechars which are to be | interpreted as Unicode when that makes sense, but there's no | restriction against nonsense code points, including in UCS-4 | Python.] [...] Wouldn't you then be bypassing the implicit encoding anyway, at least to some extent, and thus not trip over the PEP? Sure. I'm not really arguing the PEP here; the point is that under the current definition of Python strings, ambiguity is unavoidable. The best we can ask for is fewer exceptions, and an attempt to reduce ambiguity to a bare minimum in the code paths that we open up when we make a definition that allows a formerly erroneous computation to succeed. Martin is well aware of this, the PEP is clear enough about that (to me, but I'm a mail and multilingual editor internals kinda guy *wink*). I'd rather have more validation of strings, but *shrug* Martin's doing the work. OTOH, the Unicode fans need to understand that past policy of Python is not to validate; Python is intended to provide all the tools needed to write validating apps, but it isn't one itself. Martin's PEP is quite narrow in that sense. All it is about is an invertible encoding of broken encodings. It does have the downside that it guarantees that Python itself can produce non-conforming strings, but that's not the end of the world, and an app can keep track of them or even refuse them by setting the error handler, if it wants to.
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
On 2009.04.30 18:21:03 +0200, Martin v. Löwis wrote: Perhaps - the entire PEP is about Python 3 only. I don't know whether PyGTK already works with 3.x. It does not. There is a bug in the Gnome tracker for it, and I believe some work has been done to start porting PyGObject, but it appears that a full PyGTK on Python 3 is a ways off. -- David Ripton drip...@ripton.net
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
One further question: should the encoder accept a string like u'\uDCC2\uDC80'? That would encode to b'\xC2\x80', which, when decoded, would give u'\x80'. Does the PEP only guarantee that strings decoded from the filesystem are reversible, but not check what might be de novo strings?
[Python-Dev] #!/usr/bin/env python -- python3 where applicable
Jared Grubb wrote: Ok, so if I understand, the situation is: * python points to 2.x version * python3 points to 3.x version * need to be able to run certain 3k scripts from cmdline (since we're talking about shebangs) using Python3k even though python points to 2.x So, if I got the situation right, then do these same scripts understand that PYTHONPATH and PYTHONHOME and all the others are also probably pointing to 2.x code? Would it make sense to introduce PYTHON2PATH and PYTHON3PATH (or even PYTHON27PATH and PYTHON32PATH) et al? Or is this an area where we just figure that whoever moved the file locations around for distribution can hardcode things properly? -jJ
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
MRAB wrote: One further question: should the encoder accept a string like u'\uDCC2\uDC80'? That would encode to b'\xC2\x80' Indeed so. which, when decoded, would give u'\x80'. Assuming the encoding is UTF-8, yes. Does the PEP only guarantee that strings decoded from the filesystem are reversible, but not check what might be de novo strings? Exactly so. Regards, Martin
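MRAB's example can be checked directly in modern Python 3, using the error handler that the final implementation named "surrogateescape" (the thread calls the codec "utf-8b"; the handler name postdates this exchange):

```python
# A "de novo" string of escaped-byte surrogates whose mapped bytes
# happen to spell a valid UTF-8 sequence.  As Martin confirms, the
# round-trip guarantee only covers strings decoded from the filesystem.
s = "\udcc2\udc80"
b = s.encode("utf-8", "surrogateescape")
assert b == b"\xc2\x80"                 # U+DCC2 -> 0xC2, U+DC80 -> 0x80
decoded = b.decode("utf-8", "surrogateescape")
assert decoded == "\x80"                # C2 80 is valid UTF-8 for U+0080
assert decoded != s                     # reversibility is lost for this string
```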
Re: [Python-Dev] PEP 383 and GUI libraries
On 30-Apr-09, at 7:39 AM, Guido van Rossum wrote: FWIW, I'm in agreement with this PEP (i.e. its status is now Accepted). Martin, you can update the PEP and start the implementation. +1 Kudos to Martin for seeing this through with (imo) considerable patience and dignity. -Mike
Re: [Python-Dev] Proposed: add support for UNC paths to all functions in ntpath
Counting the votes for http://bugs.python.org/issue5799 : +1 from Mark Hammond (via private mail) +1 from Paul Moore (via the tracker) +1 from Tim Golden (in Python-ideas, though what he literally said was I'm up for it) +1 from Michael Foord +1 from Eric Smith There have been no other votes. Is that enough consensus for it to go in? If so, are there any core developers who could help me get it in before the 3.1 feature freeze? The patch should be in good shape; it has unit tests and updated documentation. /larry/
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Ronald Oussoren ronaldousso...@mac.com (RO) wrote: RO For what it's worth, the OSX APIs seem to behave as follows: RO * If you create a file with a non-UTF8 name on a HFS+ filesystem the RO system automatically encodes the name. RO That is, open(chr(255), 'w') will silently create a file named '%FF' RO instead of the name you'd expect on a unix system. Not for me (I am using Python 2.6.2). f = open(chr(255), 'w') Traceback (most recent call last): File "<stdin>", line 1, in <module> IOError: [Errno 22] invalid mode ('w') or filename: '\xff' I once got a tar file from a Linux system which contained a file with a non-ASCII, ISO-8859-1 encoded filename. The tar file refused to be unpacked on a HFS+ filesystem. -- Piet van Oostrum p...@cs.uu.nl URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4] Private email: p...@vanoostrum.org
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On 30 Apr 2009, at 05:52, Martin v. Löwis wrote: How do I get a printable unicode version of these path strings if they contain non-unicode data? Define printable. One way would be to use a regular expression, replacing all codes in a certain range with a question mark. What I mean by printable is that the string must be valid unicode that I can print to a UTF-8 console or place as text in a UTF-8 web page. I think your PEP gives me a string that will not encode to valid UTF-8 that the outside-of-Python world likes. Did I get this point wrong? I'm guessing that an app has to understand that filenames come in two forms, unicode and bytes, if it's not UTF-8 data. Why not simply return a string if it's valid UTF-8 and otherwise return bytes? That would have been an alternative solution, and the one that 2.x uses for listdir. People didn't like it. In our application we are running fedora with the assumption that the filenames are UTF-8. When Windows systems FTP files to our system the files are in CP-1251(?) and not valid UTF-8. What we have to do is detect these non-UTF-8 filenames and get the users to rename them. Having an algorithm that says if it's a string, no problem; if it's bytes, deal with the exceptions seems simple. How do I do this detection with the PEP proposal? Do I end up using the byte interface and doing the utf-8 decode myself? Barry
Re: [Python-Dev] Proposed: add support for UNC paths to all functions in ntpath
Larry Hastings wrote: Counting the votes for http://bugs.python.org/issue5799 : +1 from Mark Hammond (via private mail) +1 from Paul Moore (via the tracker) +1 from Tim Golden (in Python-ideas, though what he literally said was I'm up for it) +1 from Michael Foord +1 from Eric Smith There have been no other votes. Is that enough consensus for it to go in? If so, are there any core developers who could help me get it in before the 3.1 feature freeze? The patch should be in good shape; it has unit tests and updated documentation. +1 from me.
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
In article m2ocueq6mm@cs.uu.nl, Piet van Oostrum p...@cs.uu.nl wrote: Ronald Oussoren ronaldousso...@mac.com (RO) wrote: RO For what it's worth, the OSX APIs seem to behave as follows: RO * If you create a file with a non-UTF8 name on a HFS+ filesystem the RO system automatically encodes the name. RO That is, open(chr(255), 'w') will silently create a file named '%FF' RO instead of the name you'd expect on a unix system. Not for me (I am using Python 2.6.2). f = open(chr(255), 'w') Traceback (most recent call last): File "<stdin>", line 1, in <module> IOError: [Errno 22] invalid mode ('w') or filename: '\xff' What version of OSX are you using? On Tiger 10.4.11 I see the failure you see but on Leopard 10.5.6 the behavior Ronald reports. -- Ned Deily, n...@acm.org
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
How do I get a printable unicode version of these path strings if they contain non-unicode data? Define printable. One way would be to use a regular expression, replacing all codes in a certain range with a question mark. What I mean by printable is that the string must be valid unicode that I can print to a UTF-8 console or place as text in a UTF-8 web page. I think your PEP gives me a string that will not encode to valid UTF-8 that the outside-of-Python world likes. Did I get this point wrong? You are right. However, if your *only* requirement is that it should be printable, then this is fairly underspecified. One way to get a printable string would be this function def printable_string(unprintable): return This will always return a printable version of the input string... In our application we are running fedora with the assumption that the filenames are UTF-8. When Windows systems FTP files to our system the files are in CP-1251(?) and not valid UTF-8. That would be a bug in your FTP server, no? If you want all file names to be UTF-8, then your FTP server should arrange for that. Having an algorithm that says if it's a string, no problem; if it's bytes, deal with the exceptions seems simple. How do I do this detection with the PEP proposal? Do I end up using the byte interface and doing the utf-8 decode myself? No, you should encode using the strict error handler, with the locale encoding. If the file name encodes successfully, it's correct, otherwise, it's broken. Regards, Martin
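Martin's detection advice can be sketched as a small helper (the function name is ours, not from the thread, and the default encoding stands in for the locale encoding he mentions): strict re-encoding of a decoded file name fails exactly when the name contained escaped, undecodable bytes.

```python
# Sketch of "encode using the strict error handler" as a validity check.
# Names decoded under PEP 383 carry undecodable bytes as surrogates in
# U+DC80..U+DCFF, which a strict encoder rejects.
def is_clean(name, encoding="utf-8"):
    """Return True if *name* contains no PEP 383 escaped bytes."""
    try:
        name.encode(encoding, "strict")
        return True
    except UnicodeEncodeError:
        return False
```

For example, `is_clean("readme.txt")` is True, while a name decoded from broken bytes, such as `"caf\udcff"`, fails the check.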
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Barry Scott wrote: On 30 Apr 2009, at 05:52, Martin v. Löwis wrote: How do I get a printable unicode version of these path strings if they contain non-unicode data? Define printable. One way would be to use a regular expression, replacing all codes in a certain range with a question mark. What I mean by printable is that the string must be valid unicode that I can print to a UTF-8 console or place as text in a UTF-8 web page. I think your PEP gives me a string that will not encode to valid UTF-8 that the outside-of-Python world likes. Did I get this point wrong? I'm guessing that an app has to understand that filenames come in two forms, unicode and bytes, if it's not UTF-8 data. Why not simply return a string if it's valid UTF-8 and otherwise return bytes? That would have been an alternative solution, and the one that 2.x uses for listdir. People didn't like it. In our application we are running fedora with the assumption that the filenames are UTF-8. When Windows systems FTP files to our system the files are in CP-1251(?) and not valid UTF-8. What we have to do is detect these non-UTF-8 filenames and get the users to rename them. Having an algorithm that says if it's a string, no problem; if it's bytes, deal with the exceptions seems simple. How do I do this detection with the PEP proposal? Do I end up using the byte interface and doing the utf-8 decode myself? What do you do currently? The PEP just offers a way of reading all filenames as Unicode, if that's what you want. So what if the strings can't be encoded to normal UTF-8! The filenames aren't valid UTF-8 anyway! :-)
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On Apr 30, 2009, at 5:42 AM, Martin v. Löwis wrote: I think you are right. I have now excluded ASCII bytes from being mapped, effectively not supporting any encodings that are not ASCII compatible. Does that sound ok? Yes. The practical upshot of this is that users who brokenly use ja_JP.SJIS as their locale (which, note, first requires editing some files in /var/lib/locales manually to enable its use..) may still have python not work with invalid-in-shift-jis filenames. Since that locale is widely recognized as a bad idea to use, and is not supported by any distros, it certainly doesn't bother me that it isn't 100% supported in python. It seems like the most common reason why people want to use SJIS is to make old pre-unicode apps work right in WINE -- in which case it doesn't actually affect unix python at all. I'd personally be fine with python just declaring that the filesystem encoding will *always* be utf-8b and ignore the locale... but I expect some other people might complain about that. Of course, application authors can decide to do that themselves by calling sys.setfilesystemencoding('utf-8b') at the start of their program. James
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Not for me (I am using Python 2.6.2). f = open(chr(255), 'w') Traceback (most recent call last): File "<stdin>", line 1, in <module> IOError: [Errno 22] invalid mode ('w') or filename: '\xff' You can get the same error on Linux: $ python Python 2.6.2 (release26-maint, Apr 19 2009, 01:56:41) [GCC 4.3.3] on linux2 Type "help", "copyright", "credits" or "license" for more information. f=open(chr(255),'w') Traceback (most recent call last): File "<stdin>", line 1, in <module> IOError: [Errno 22] invalid mode ('w') or filename: '\xff' (Some file system drivers do not enforce valid utf8 yet, but I suspect they will in the future.) Tom
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On 30 Apr 2009, at 21:06, Martin v. Löwis wrote: How do I get a printable unicode version of these path strings if they contain non-unicode data? Define printable. One way would be to use a regular expression, replacing all codes in a certain range with a question mark. What I mean by printable is that the string must be valid unicode that I can print to a UTF-8 console or place as text in a UTF-8 web page. I think your PEP gives me a string that will not encode to valid UTF-8 that the outside-of-Python world likes. Did I get this point wrong? You are right. However, if your *only* requirement is that it should be printable, then this is fairly underspecified. One way to get a printable string would be this function def printable_string(unprintable): return Ha ha! Indeed this works, but I would have to try to turn enough of the string into a reasonable hint at the name of the file so the user has some chance of knowing what is being reported. This will always return a printable version of the input string... In our application we are running fedora with the assumption that the filenames are UTF-8. When Windows systems FTP files to our system the files are in CP-1251(?) and not valid UTF-8. That would be a bug in your FTP server, no? If you want all file names to be UTF-8, then your FTP server should arrange for that. Not a bug, it's the lack of a feature. We use ProFTPd, which has just implemented what is required. I forget the exact details - they are at work - when the ftp client asks for the FEAT of the ftp server, the server can say use UTF-8. Supporting that in the server was apparently non-trivial. Having an algorithm that says if it's a string, no problem; if it's bytes, deal with the exceptions seems simple. How do I do this detection with the PEP proposal? Do I end up using the byte interface and doing the utf-8 decode myself? No, you should encode using the strict error handler, with the locale encoding. 
If the file name encodes successfully, it's correct, otherwise, it's broken. O.k. I understand. Barry
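The "reasonable hint" Barry asks for can be sketched by replacing only the PEP 383 escape range with question marks, rather than returning a constant as in Martin's joke function (this helper and its name are illustrative, not from the thread):

```python
# Replace each escaped byte (a lone surrogate in U+DC80..U+DCFF) with
# '?', keeping every cleanly decoded character, so most of the file
# name survives into the printable version.
def printable_string(name):
    return "".join("?" if "\udc80" <= ch <= "\udcff" else ch for ch in name)
```

For example, a name decoded from Latin-1 bytes, such as `"r\udce9sum\udce9.txt"`, becomes `"r?sum?.txt"`, which is safe to send to a UTF-8 console or web page.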
[Python-Dev] 3.1 beta deferred
Hi everyone! In the interest of letting Martin implement PEP 383 for 3.1, I am deferring the release of the 3.1 beta until next Wednesday, May 6th. Thank you, Benjamin
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
James Y Knight wrote: On Apr 30, 2009, at 5:42 AM, Martin v. Löwis wrote: I think you are right. I have now excluded ASCII bytes from being mapped, effectively not supporting any encodings that are not ASCII compatible. Does that sound ok? Yes. The practical upshot of this is that users who brokenly use ja_JP.SJIS as their locale (which, note, first requires editing some files in /var/lib/locales manually to enable its use..) may still have python not work with invalid-in-shift-jis filenames. Since that locale is widely recognized as a bad idea to use, and is not supported by any distros, it certainly doesn't bother me that it isn't 100% supported in python. It seems like the most common reason why people want to use SJIS is to make old pre-unicode apps work right in WINE -- in which case it doesn't actually affect unix python at all. I'd personally be fine with python just declaring that the filesystem encoding will *always* be utf-8b and ignore the locale...but I expect some other people might complain about that. Of course, application authors can decide to do that themselves by calling sys.setfilesystemencoding('utf-8b') at the start of their program. It seems to me that the 3.1+ doc set (or wiki) could be usefully extended with a How-to on working with filenames. I am not sure that everything useful fits anywhere in particular in the ref manuals.
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Thomas Breuel wrote: Not for me (I am using Python 2.6.2). f = open(chr(255), 'w') Traceback (most recent call last): File "<stdin>", line 1, in <module> IOError: [Errno 22] invalid mode ('w') or filename: '\xff' You can get the same error on Linux: $ python Python 2.6.2 (release26-maint, Apr 19 2009, 01:56:41) [GCC 4.3.3] on linux2 Type "help", "copyright", "credits" or "license" for more information. f=open(chr(255),'w') Traceback (most recent call last): File "<stdin>", line 1, in <module> IOError: [Errno 22] invalid mode ('w') or filename: '\xff' (Some file system drivers do not enforce valid utf8 yet, but I suspect they will in the future.) Do you suspect that from discussing the issue with kernel developers or reading a thread on lkml? If not, then your suspicion seems to be pretty groundless. The fact that VFAT enforces an encoding does not lend itself to your argument for two reasons: 1) VFAT is not a Unix filesystem. It's a filesystem that's compatible with Windows/DOS. If Windows and DOS have filesystem encodings, then it makes sense for that driver to enforce that as well. Filesystems intended to be used natively on Linux/Unix do not necessarily make this design decision. 2) The encoding is specified when mounting the filesystem. This means that you can still mix encodings in a number of ways. If you mount with an encoding that has full byte coverage, for instance, each user can put filenames from different encodings on there. If you mount with utf8 on a system which uses euc-jp as the default encoding, you can have full paths that contain a mix of utf-8 and euc-jp. Etc. -Toshio
Re: [Python-Dev] 3.1 beta deferred
Benjamin Peterson wrote: Hi everyone! In the interest of letting Martin implement PEP 383 for 3.1, I am deferring the release of the 3.1 beta until next Wednesday, May 6th. That might also give time for Larry Hastings' UNC path patch. (and anything else essentially ready ;-)
Re: [Python-Dev] Proposed: a new function-based C API for declaring Python types
On Tue, Apr 28, 2009 at 8:03 PM, Larry Hastings la...@hastings.org wrote: EXECUTIVE SUMMARY I've written a patch against py3k trunk creating a new function-based API for creating extension types in C. This allows PyTypeObject to become a (mostly) private structure. THE PROBLEM Here's how you create an extension type using the current API. * First, find some code that already has a working type declaration. Copy and paste their fifty-line PyTypeObject declaration, then hack it up until it looks like what you need. * Next--hey! There *is* no next, you're done. You can immediately create an object using your type and pass it into the Python interpreter and it would work fine. You are encouraged to call PyType_Ready(), but this isn't required and it's often skipped. This approach causes two problems. 1) The Python interpreter *must support* and *cannot change* the PyTypeObject structure, forever. Any meaningful change to the structure will break every extension. This has many consequences: a) Fields that are no longer used must be left in place, forever, as ignored placeholders if need be. Py3k cleaned up a lot of these, but it's already picked up a new one (tp_compare is now tp_reserved). b) Internal implementation details of the type system must be public. c) The interpreter can't even use a different structure internally, because extensions are free to pass in objects using PyTypeObjects the interpreter has never seen before. 2) As a programming interface this lacks a certain gentility. It clearly *works*, but it requires programmers to copy and paste with a large structure mostly containing NULLs, which they must pick carefully through to change just a few fields. THE SOLUTION My patch creates a new function-based extension type definition API. 
You create a type by calling PyType_New(), then call various accessor functions on the type (PyType_SetString and the like), and when your type has been completely populated you must call PyType_Activate() to enable it for use. With this API available, extension authors no longer need to directly see the innards of the PyTypeObject structure. Well, most of the fields anyway. There are a few shortcut macros in CPython that need to continue working for performance reasons, so the tp_flags and tp_dealloc fields need to remain publicly visible. One feature worth mentioning is that the API is type-safe. Many such APIs would have had one generic PyType_SetPointer, taking an identifier for the field and a void * for its value, but this would have lost type safety. Another approach would have been to have one accessor per field (PyType_SetAddFunction), but this would have exploded the number of functions in the API. My API splits the difference: each distinct *type* has its own set of accessors (PyType_GetSSizeT) which takes an identifier specifying which field you wish to get or set. SIDE-EFFECTS OF THE API The major change resulting from this API: all PyTypeObjects must now be *pointers* rather than static instances. For example, the external declaration of PyType_Type itself changes from this: PyAPI_DATA(PyTypeObject) PyType_Type; to this: PyAPI_DATA(PyTypeObject *) PyType_Type; This gives rise to the first headache caused by the API: type casts on type objects. It took me a day and a half to realize that this, from Modules/_weakref.c: PyModule_AddObject(m, ref, (PyObject *) &_PyWeakref_RefType); really needed to be this: PyModule_AddObject(m, ref, (PyObject *) _PyWeakref_RefType); Hopefully I've already found most of these in CPython itself, but this sort of code surely lurks in extensions yet to be touched. 
(Pro-tip: if you're working with this patch, and you see a crash, and gdb shows you something like this at the top of the stack: #0 0x081056d8 in visit_decref (op=0x8247aa0, data=0x0) at Modules/gcmodule.c:323 323 if (PyObject_IS_GC(op)) { your problem is an errant &, likely on a type object you're passing in to the interpreter. Think--what did you touch recently? Or debug it by salting your code with calls to collect(NUM_GENERATIONS-1).) Another irksome side-effect of the API: because of tp_flags and tp_dealloc, I now have two declarations of PyTypeObject. There's the externally-visible one in Include/object.h, which lets external parties see tp_dealloc and tp_flags. Then there's the internal one in Objects/typeprivate.h which is the real structure. Since declaring a type twice is a no-no, the external one is gated on #ifndef PY_TYPEPRIVATE If you're a normal Python extension programmer, you'd include Python.h as normal: #include "Python.h" Python implementation files that need to see the real PyTypeObject structure now look like this: #define
Re: [Python-Dev] Proposed: add support for UNC paths to all functions in ntpath
Larry Hastings wrote: Counting the votes for http://bugs.python.org/issue5799 : +1 from Mark Hammond (via private mail) +1 from Paul Moore (via the tracker) +1 from Tim Golden (in Python-ideas, though what he literally said was I'm up for it) +1 from Michael Foord +1 from Eric Smith There have been no other votes. Is that enough consensus for it to go in? If so, are there any core developers who could help me get it in before the 3.1 feature freeze? The patch should be in good shape; it has unit tests and updated documentation. I've taken the liberty of explicitly CCing Martin just in case he missed the thread with all the noise regarding PEP383. If there are no objections from Martin or anyone else here, please feel free to assign it to me (and mail if I haven't taken action by the day before the beta freeze...) Cheers, Mark
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On Fri, 1 May 2009 06:55:48 am Thomas Breuel wrote: You can get the same error on Linux: $ python Python 2.6.2 (release26-maint, Apr 19 2009, 01:56:41) [GCC 4.3.3] on linux2 Type "help", "copyright", "credits" or "license" for more information. f=open(chr(255),'w') Traceback (most recent call last): File "<stdin>", line 1, in <module> IOError: [Errno 22] invalid mode ('w') or filename: '\xff' Works for me under Fedora using ext3 as the file system. $ python2.6 Python 2.6.1 (r261:67515, Dec 24 2008, 00:33:13) [GCC 4.1.2 20070502 (Red Hat 4.1.2-12)] on linux2 Type "help", "copyright", "credits" or "license" for more information. f=open(chr(255),'w') f.close() import os os.remove(chr(255)) Given that chr(255) is a valid filename on my file system, I would consider it a bug if Python couldn't deal with a file with that name. -- Steven D'Aprano
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On 30 Apr, 2009, at 21:33, Piet van Oostrum wrote:
> > Ronald Oussoren <ronaldousso...@mac.com> (RO) wrote:
>
> RO> For what it's worth, the OSX API's seem to behave as follows:
> RO> * If you create a file with a non-UTF8 name on a HFS+ filesystem,
> RO>   the system automatically encodes the name.  That is,
> RO>   open(chr(255), 'w') will silently create a file named '%FF'
> RO>   instead of the name you'd expect on a unix system.
>
> Not for me (I am using Python 2.6.2).
>
> >>> f = open(chr(255), 'w')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> IOError: [Errno 22] invalid mode ('w') or filename: '\xff'

That's odd.  Which version of OSX do you use?

ron...@rivendell-2[0]$ sw_vers
ProductName:    Mac OS X
ProductVersion: 10.5.6
BuildVersion:   9G55

[~/testdir] ron...@rivendell-2[0]$ /usr/bin/python
Python 2.5.1 (r251:54863, Jan 13 2009, 10:26:13)
[GCC 4.0.1 (Apple Inc. build 5465)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.listdir('.')
[]
>>> open(chr(255), 'w').write('x')
>>> os.listdir('.')
['%FF']

And likewise with python 2.6.1+ (after cleaning the directory):

[~/testdir] ron...@rivendell-2[0]$ python2.6
Python 2.6.1+ (release26-maint:70603, Mar 26 2009, 08:38:03)
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.listdir('.')
[]
>>> open(chr(255), 'w').write('x')
>>> os.listdir('.')
['%FF']

I once got a tar file from a Linux system which contained a file with a
non-ASCII, ISO-8859-1 encoded filename.  The tar file refused to be
unpacked on a HFS+ filesystem.

-- 
Piet van Oostrum <p...@cs.uu.nl>
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: p...@vanoostrum.org
Re: [Python-Dev] PEP 383 and GUI libraries
Folks:

My use case (Tahoe-LAFS [1]) requires that I am *able* to read arbitrary
binary names from the filesystem and store them so that I can regenerate
the same byte string later, but it also requires that I *know* whether
what I got was a valid string in the expected encoding (which might be
utf-8) or whether it was not and I need to fall back to storing the
bytes.

So far, it looks like PEP 383 doesn't provide both of these
requirements, so I am going to have to continue working around the
Python API even after PEP 383.  In fact, it might actually increase the
amount of working-around that I have to do.

If I understand correctly, .decode(encoding, 'strict') will not be
changed by PEP 383.  A new error handler is added, so .decode('utf-8',
'python-escape') performs the utf-8b decoding.  Am I right so far?

Therefore if I have a string of bytes, I can attempt to decode it with
'strict', and if that fails I can set the flag showing that it was not a
valid byte string in the expected encoding, and then I can invoke
.decode('utf-8', 'python-escape') on it.  So far, so good.  (Note that I
never want to do .decode(expected_encoding, 'python-escape') -- if it
wasn't a valid bytestring in the expected_encoding, then I want to
decode it with utf-8b, regardless of what the expected encoding was.)

Anyway, I can use it like this:

class FName:
    def __init__(self, name, failed_decode=False):
        self.name = name
        self.failed_decode = failed_decode

def fs_to_unicode(bytes):
    try:
        return FName(bytes.decode(sys.getfilesystemencoding(),
                                  'strict'))
    except UnicodeDecodeError:
        return FName(bytes.decode('utf-8', 'python-escape'),
                     failed_decode=True)

And what about unicode-oriented APIs such as os.listdir()?  Uh-oh, the
PEP says that on systems with locale 'utf-8', it will automatically be
changed to 'utf-8b'.  This means I can't reliably find out whether the
entries in the directory *were* named with valid encodings in utf-8?
That's not acceptable for my use case.
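The strict-then-fallback scheme described above can be sketched in terms of the error handler as it eventually shipped in Python 3, where it is named 'surrogateescape' rather than the 'python-escape' of this draft discussion:

```python
import sys

class FName:
    """A decoded filename plus a flag recording whether strict
    decoding in the expected encoding failed."""
    def __init__(self, name, failed_decode=False):
        self.name = name
        self.failed_decode = failed_decode

def fs_to_unicode(raw):
    try:
        # First attempt: strict decode in the expected encoding.
        return FName(raw.decode(sys.getfilesystemencoding(), 'strict'))
    except UnicodeDecodeError:
        # Not valid in the expected encoding: decode losslessly with
        # utf-8b so the original bytes can be regenerated later.
        return FName(raw.decode('utf-8', 'surrogateescape'),
                     failed_decode=True)

assert fs_to_unicode(b'plain').failed_decode is False
assert fs_to_unicode(b'\xff').failed_decode is True
# The undecodable byte survives as a lone low surrogate:
assert fs_to_unicode(b'\xff').name == '\udcff'
```

The failed_decode flag is exactly the information the author complains the unicode-oriented os.listdir() discards: once utf-8b has been applied silently, callers can no longer tell a cleanly decoded name from an escaped one without re-checking.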
I would have to refrain from using the unicode-oriented os.listdir() on
POSIX, and instead do something like this:

if platform.system() in ('Windows', 'Darwin'):
    def listdir(d):
        return [FName(n) for n in os.listdir(d)]
elif platform.system() in ('Linux', 'SunOs'):
    def listdir(d):
        bytesd = d.encode(sys.getfilesystemencoding())
        return [fs_to_unicode(n) for n in os.listdir(bytesd)]
else:
    raise NotImplementedError("Please classify platform.system() == %s "
                              "as either unicode-safe or unicode-unsafe."
                              % platform.system())

In fact, if 'utf-8' gets automatically converted to 'utf-8b' when
*decoding* as well as encoding, then I would have to change my
fs_to_unicode() function to check for that and make sure to use strict
utf-8 in the first attempt:

def fs_to_unicode(bytes):
    fse = sys.getfilesystemencoding()
    if fse == 'utf-8b':
        fse = 'utf-8'
    try:
        return FName(bytes.decode(fse, 'strict'))
    except UnicodeDecodeError:
        return FName(bytes.decode('utf-8', 'python-escape'),
                     failed_decode=True)

Would it be possible for Python unicode objects to have a flag
indicating whether the 'python-escape' error handler was used?  That
would serve the same purpose as my failed_decode flag above, and would
basically allow me to use the Python APIs directly and make all this
work-around code disappear.  Failing that, I can't see any way to use
os.listdir() in its unicode-oriented mode to satisfy Tahoe's
requirements.

If you take the above code and then add the fact that you want to use
the failed_decode flag when *encoding* the d argument to os.listdir(),
then you get this code: [2].
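The POSIX branch of the work-around above depends on the bytes-oriented form of os.listdir(), which exists in Python 3 as described: pass bytes, get bytes back, with no decoding applied. A minimal demonstration (the temp directory and file name are illustrative):

```python
import os
import tempfile

# When os.listdir() is given a bytes path, it returns the raw bytes
# names, leaving any decode-or-fallback decision to the caller.
d = tempfile.mkdtemp()
open(os.path.join(d, 'plain'), 'w').close()

names = os.listdir(os.fsencode(d))
assert all(isinstance(n, bytes) for n in names)
assert b'plain' in names
```

This is why the author can route Linux and Solaris through fs_to_unicode(): the bytes API hands over exactly what the kernel reported, so the "did strict decoding succeed?" question can still be asked per entry.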
Oh, I just realized that I *could* use the PEP 383 os.listdir(), like
this:

def listdir(d):
    fse = sys.getfilesystemencoding()
    if fse == 'utf-8b':
        fse = 'utf-8'
    ns = []
    for fn in os.listdir(d):
        bytes = fn.encode(fse, 'python-escape')
        try:
            ns.append(FName(bytes.decode(fse, 'strict')))
        except UnicodeDecodeError:
            ns.append(FName(bytes.decode('utf-8', 'python-escape'),
                            failed_decode=True))
    return ns

(And I guess I could define listdir() like this only on the
non-unicode-safe platforms, as above.)  However, that strikes me as even
more horrible than the previous listdir() work-around, in part because
it means decoding, re-encoding, and re-decoding every name, so I think I
would stick with the previous version.

Oh, one more note: for Tahoe's purposes you can, in all of the code
above, replace .decode('utf-8', 'python-escape') with
.decode('windows-1252') and it works just as well.  While utf-8b seems
like a really cool hack, and it would produce more legible results if
utf-8-encoded strings were partially corrupted, I guess I should just
use 'windows-1252', which is already implemented in Python 2 (as well as
in all other software in the world).  I guess this means that PEP 383,
which I
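The decode/re-encode/re-decode dance being weighed here rests on the key property of utf-8b: undecodable bytes become lone low surrogates on decode, and encoding with the same handler restores the exact original bytes. With the handler as it finally shipped ('surrogateescape'), the round trip looks like this:

```python
# A name that is mostly valid UTF-8 with two stray bytes at the end.
raw = b'valid \xc3\xa9 then bad \xff\xfe'

# Decoding with utf-8b never fails; bad bytes map into U+DC80..U+DCFF.
s = raw.decode('utf-8', 'surrogateescape')
assert '\udcff' in s and '\udcfe' in s

# Encoding with the same handler is lossless: we get the bytes back.
assert s.encode('utf-8', 'surrogateescape') == raw
```

This lossless round trip is what the proposed listdir() exploits when it calls fn.encode(fse, 'python-escape') to recover the on-disk bytes from an already-decoded name; the cost the author objects to is performing it for every directory entry.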