Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-05-17 Thread Piet van Oostrum
 Ned Deily n...@acm.org (ND) wrote:

ND In article m2ocueq6mm@cs.uu.nl, Piet van Oostrum p...@cs.uu.nl 
ND wrote:
  Ronald Oussoren ronaldousso...@mac.com (RO) wrote:
 RO For what it's worth, the OSX API's seem to behave as follows:
 RO * If you create a file with an non-UTF8 name on a HFS+ filesystem the
 RO system automaticly encodes the name.
 
 RO That is,  open(chr(255), 'w') will silently create a file named '%FF'
 RO instead of the name you'd expect on a unix system.
 
 Not for me (I am using Python 2.6.2).
 
 >>> f = open(chr(255), 'w')
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
 IOError: [Errno 22] invalid mode ('w') or filename: '\xff'

ND What version of OSX are you using?  On Tiger 10.4.11 I see the failure 
ND you see but on Leopard 10.5.6 the behavior Ronald reports.

Yes, I am using Tiger (10.4.11). Interesting that it has changed on Leopard.
-- 
Piet van Oostrum p...@cs.uu.nl
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: p...@vanoostrum.org


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-05-01 Thread Stephen J. Turnbull
James Y Knight writes:

  in python. It seems like the most common reason why people want to use  
  SJIS is to make old pre-unicode apps work right in WINE -- in which  
  case it doesn't actually affect unix python at all.

Mounting external drives, especially USB memory sticks which tend to
be FAT-initialized by the manufacturers, is another common case.

But I don't understand why PEP 383 needs to care at all.


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Thomas Breuel
On Wed, Apr 29, 2009 at 23:03, Terry Reedy tjre...@udel.edu wrote:

 Thomas Breuel wrote:


Sure. However, that requires you to provide meaningful, reproducible
counter-examples, rather than a stenographic formulation that might
hint some problem you apparently see (which I believe is just not
there).


 Well, here's another one: PEP 383 would disallow UTF-8 encodings of half
 surrogates.


 By my reading, the current Unicode 5.1 definition of 'UTF-8' disallows
 that.


If we use conformance to Unicode 5.1 as the basis for our discussion, then
PEP 383 is off the table anyway.  I'm all for strict Unicode compliance.
But apparently, the Python community doesn't care.

CESU-8 is described in Unicode Technical Report #26, so it at least has some
official recognition.  More importantly, it's also widely used.  So, my
question: what are the implications of PEP 383 for CESU-8 encodings on
Python?
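For concreteness, here is a sketch of what CESU-8-style coding of half surrogates looks like (an illustration assuming Python 3.1+, where the strict UTF-8 codec rejects lone surrogates and 'surrogatepass' is needed to reproduce the permissive 2.x behaviour described above):

>>> hi, lo = '\ud801', '\udc00'                  # UTF-16 surrogate pair for U+10400
>>> (hi + lo).encode('utf-8', 'surrogatepass')   # CESU-8 style: 3 bytes per surrogate
b'\xed\xa0\x81\xed\xb0\x80'
>>> '\U00010400'.encode('utf-8')                 # real UTF-8: one 4-byte sequence
b'\xf0\x90\x90\x80'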

My meta-point is: there are probably many more such issues hidden away and
it is a really bad idea to rush something like PEP 383 out.  Unicode is hard
anyway, and tinkering with its semantics requires a lot of thought.

Tom


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Glenn Linderman
On approximately 4/29/2009 8:46 PM, came the following characters from 
the keyboard of Terry Reedy:

Glenn Linderman wrote:
On approximately 4/29/2009 1:28 PM, came the following characters from 



So where is the ambiguity here?


None.  But not everyone can read all the Python source code to try to 
understand it; they expect the documentation to help them avoid that. 
Because the documentation is lacking in this area, it makes your 
concisely stated PEP rather hard to understand.


If you think a section of the doc is grossly inadequate, and there is no 
existing issue on the tracker, feel free to add one.


Thanks for clarifying the Windows behavior, here.  A little more 
clarification in the PEP could have avoided lots of discussion.  It 
would seem that a PEP, proposed to modify a poorly documented (and 
therefore likely poorly understood) area, should be educational about 
the status quo, as well as presenting the suggested change.


Where the PEP proposes to change, it should start with the status quo. 
But Martin's somewhat reasonable position is that since he is not 
proposing to change behavior on Windows, it is not his responsibility to 
document what he is not proposing to change more adequately.  This 
means, of course, that any observed change on Windows would then be a 
bug, or at least a break of the promise.  On the other hand, I can see 
that this is enough related to what he is proposing to change that 
better doc would help.



Yes; the very fact that the PEP discusses Windows, speaks about 
cross-platform code, and doesn't explicitly state that no Windows 
functionality will change, is confusing.


An example of how to initialize things within a sample cross-platform 
application might help, especially if that initialization only happens 
if the platform is POSIX, or is commented to the effect that it has no 
effect on Windows, but makes POSIX happy.  Or maybe it is all buried 
within the initialization of Python itself, and is not exposed to the 
application at all.  I still haven't figured that out, but was not (and 
am still not) as concerned about that as ensuring that the overall 
algorithms are functional and useful and user-friendly.  Showing it 
might have been helpful in making it clear that no Windows functionality 
would change, however.


A statement that additional features are being added to allow 
cross-platform programs to deal with non-decodable bytes obtained from 
POSIX APIs, using the same code that already works on Windows, would have 
made things much clearer.  The present Abstract does, in fact, talk only 
about POSIX, but later statements about Windows muddy the water.


Rationale paragraph 3, explicitly talks about cross-platform programs 
needing to work one way on Windows and another way on POSIX to deal with 
all the cases.  It calls that a proposal, which I guess it is for 
command line and environment, but it is already implemented in both 
bytes and str forms for file names... so that further muddies the water.


It is, of course, easier to point out deficiencies in a document than to 
write a better document; however, it is incumbent upon the PEP author to 
write a PEP that is good enough to get approved, and that means making 
it understandable enough that people are in favor... or to respond to 
the plethora of comments until people are in favor.  I'm not sure which 
one is more time-consuming.


I've reached the point, based on PEP and comment responses, where I now 
believe that the PEP is a solution to the problem it is trying to solve, 
and doesn't create ambiguities in the naming.  I don't believe it is the 
best solution.


The basic problem is the overuse of fake characters... normalizing them 
for display results in large data loss -- many characters would be 
translated to the same replacement characters.
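A hedged illustration of that data loss (assuming Python 3.1+, where the PEP's escape handler shipped under the name 'surrogateescape'):

>>> a = b'caf\xe9'.decode('utf-8', 'surrogateescape')   # Latin-1 e-acute byte
>>> b = b'caf\xff'.decode('utf-8', 'surrogateescape')
>>> a.encode('ascii', 'replace'), b.encode('ascii', 'replace')
(b'caf?', b'caf?')

Two different on-disk names collapse to the same display form.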


Solutions exist that would allow the use of fewer different fake 
characters in the strings, while still having a fake character as the 
escape character, to preserve the invariant that all the strings 
manipulated by python-escape from the PEP were, and become, strings 
containing fake characters (from a strict Unicode perspective), which is 
a nice invariant*.  There even exist solutions that would use only one 
fake character (repeatedly if necessary), and all other characters 
generated would be displayable characters.  This would ease the burden 
on the program in displaying the strings, and also on the user that 
might view the resulting mojibake in trying to differentiate one such 
string from another.  Those are outlined in various emails in this 
thread, although some include my misconception that strings obtained via 
 Unicode-enabled OS APIs would also need to be encoded and altered.  If 
there is any interest in using a more readable encoding, I'd be glad to 
rework them to remove those misconceptions.


* It would be nice to point out that invariant in the PEP, also.


--
Glenn -- http://nevcal.com/
===

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Glenn Linderman
On approximately 4/29/2009 7:50 PM, came the following characters from 
the keyboard of Aahz:

On Thu, Apr 30, 2009, Cameron Simpson wrote:

The lengthy discussion mostly revolves around:

  - Glenn points out that strings that came _not_ from listdir, and that are
_not_ well-formed unicode (== have bare surrogates in them) but that
were intended for use as filenames will conflict with the PEP's scheme -
programs must know that these strings came from outside and must be
translated into the PEP's funny-encoding before use in the os.*
functions. Previous to the PEP they would get used directly and
encode differently after the PEP, thus producing different POSIX
filenames. Breakage.

  - Glenn would like the encoding to use Unicode scalar values only,
using a rare-in-filenames character.
That would avoid the issue with 'outside' strings that contain
surrogates. To my mind it just moves the punning from rare illegal
strings to merely uncommon but legal characters.

  - Some parties think it would be better to not return strings from
os.listdir but a subclass of string (or at least a duck-type of
string) that knows where it came from and is also handily
recognisable as not-really-a-string for purposes of deciding
whether is it PEP-funny-encoded by direct inspection.


Assuming people agree that this is an accurate summary, it should be
incorporated into the PEP.


I'll agree that, once other misconceptions were explained away, the 
remaining issues are those Cameron summarized.  Thanks for the summary!


Point two could be modified because I've changed my opinion; I like the 
invariant Cameron first (I think) explicitly stated about the PEP as it 
stands, and that I just reworded in another message, that the strings 
that are altered by the PEP in either direction are in the subset of 
strings that contain fake (from a strict Unicode viewpoint) characters. 
 I still think an encoding that uses mostly real characters that have 
assigned glyphs would be better than the encoding in the PEP; but would 
now suggest that an escape character be a fake character.


I'll note here that while the PEP encoding causes illegal bytes to be 
translated to one fake character, the 3-byte sequence that looks like 
the range of fake characters would also be translated to a sequence of 3 
fake characters.  This is 512 combinations that must be translated, and 
understood by the user (or at least by the programmer).  The escape 
sequence approach requires changing only 257 combinations, and each 
altered combination would result in exactly 2 characters.  Hence, this 
seems simpler to understand, and to manually encode and decode for 
debugging purposes.
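A hedged sketch of the two cases (again assuming the released 'surrogateescape' handler for the PEP's utf-8b behaviour):

>>> b'\xff'.decode('utf-8', 'surrogateescape')          # one invalid byte -> one fake character
'\udcff'
>>> b'\xed\xb0\x90'.decode('utf-8', 'surrogateescape')  # 3-byte "UTF-8" form of U+DC10 -> 3 fake characters
'\udced\udcb0\udc90'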


--
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Martin v. Löwis
 Assuming people agree that this is an accurate summary, it should be
 incorporated into the PEP.

Done!

Regards,
Martin


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Martin v. Löwis
 I think it has to be excluded from mapping in order to not introduce
 security issues.

I think you are right. I have now excluded ASCII bytes from being
mapped, effectively not supporting any encodings that are not ASCII
compatible. Does that sound ok?
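A hedged sketch of the effect (as released in Python 3.1, where the handler is named 'surrogateescape'): only bytes >= 0x80 are mapped into the U+DC80..U+DCFF range, so an ASCII byte can never be smuggled in or out through the escape mechanism:

>>> b'a\x80b'.decode('ascii', 'surrogateescape')
'a\udc80b'
>>> 'a\udc80b'.encode('ascii', 'surrogateescape')   # round-trips exactly
b'a\x80b'

A de novo character such as '\udc41', which would claim to stand for the
ASCII byte 0x41, is rejected with a UnicodeEncodeError rather than being
silently turned into 'A'.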

Regards,
Martin


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Aahz
[top-posting for once to preserve full quoting]

Glenn,

Could you please reduce your suggestions into sample text for the PEP?
We seem to be now at the stage where nobody is objecting to the PEP, so
the focus should be on making the PEP clearer.

If you still want to create an alternative PEP implementation, please
provide step-by-step walkthroughs, preferably in a new thread -- if you
did previously provide that, it's gotten lost in the flood of messages.

On Thu, Apr 30, 2009, Glenn Linderman wrote:
 On approximately 4/29/2009 8:46 PM, came the following characters from  
 the keyboard of Terry Reedy:
 Glenn Linderman wrote:
 On approximately 4/29/2009 1:28 PM, came the following characters 
 from 

 So where is the ambiguity here?

 None.  But not everyone can read all the Python source code to try to 
 understand it; they expect the documentation to help them avoid that. 
 Because the documentation is lacking in this area, it makes your  
 concisely stated PEP rather hard to understand.

 If you think a section of the doc is grossly inadequate, and there is 
 no existing issue on the tracker, feel free to add one.

 Thanks for clarifying the Windows behavior, here.  A little more  
 clarification in the PEP could have avoided lots of discussion.  It  
 would seem that a PEP, proposed to modify a poorly documented (and  
 therefore likely poorly understood) area, should be educational about 
 the status quo, as well as presenting the suggested change.

 Where the PEP proposes to change, it should start with the status quo.  
 But Martin's somewhat reasonable position is that since he is not  
 proposing to change behavior on Windows, it is not his responsibility 
 to document what he is not proposing to change more adequately.  This  
 means, of course, that any observed change on Windows would then be a  
 bug, or at least a break of the promise.  On the other hand, I can see  
 that this is enough related to what he is proposing to change that  
 better doc would help.


 Yes; the very fact that the PEP discusses Windows, speaks about  
 cross-platform code, and doesn't explicitly state that no Windows  
 functionality will change, is confusing.

 An example of how to initialize things within a sample cross-platform  
 application might help, especially if that initialization only happens  
 if the platform is POSIX, or is commented to the effect that it has no  
 effect on Windows, but makes POSIX happy.  Or maybe it is all buried  
 within the initialization of Python itself, and is not exposed to the  
 application at all.  I still haven't figured that out, but was not (and  
 am still not) as concerned about that as ensuring that the overall  
 algorithms are functional and useful and user-friendly.  Showing it  
 might have been helpful in making it clear that no Windows functionality  
 would change, however.

 A statement that additional features are being added to allow  
 cross-platform programs deal with non-decodable bytes obtained from  
 POSIX APIs using the same code that already works on Windows, would have  
 made things much clearer.  The present Abstract does, in fact, talk only  
 about POSIX, but later statements about Windows muddy the water.

 Rationale paragraph 3, explicitly talks about cross-platform programs  
 needing to work one way on Windows and another way on POSIX to deal with  
 all the cases.  It calls that a proposal, which I guess it is for  
 command line and environment, but it is already implemented in both  
 bytes and str forms for file names... so that further muddies the water.

 It is, of course, easier to point out deficiencies in a document than to  
 write a better document; however, it is incumbent upon the PEP author to  
 write a PEP that is good enough to get approved, and that means making  
 it understandable enough that people are in favor... or to respond to  
 the plethora of comments until people are in favor.  I'm not sure which  
 one is more time-consuming.

 I've reached the point, based on PEP and comment responses, where I now  
 believe that the PEP is a solution to the problem it is trying to solve,  
 and doesn't create ambiguities in the naming.  I don't believe it is the  
 best solution.

 The basic problem is the overuse of fake characters... normalizing them  
 for display results is large data loss -- many characters would be  
 translated to the same replacement characters.

 Solutions exist that would allow the use of fewer different fake  
 characters in the strings, while still having a fake character as the  
 escape character, to preserve the invariant that all the strings  
 manipulated by python-escape from the PEP were, and become, strings  
 containing fake characters (from a strict Unicode perspective), which is  
 a nice invariant*.  There even exist solutions that would use only one  
 fake character (repeatedly if necessary), and all other characters  
 generated would be displayable characters.  This would ease 

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Stephen J. Turnbull
Cameron Simpson writes:
  On 29Apr2009 22:14, Stephen J. Turnbull step...@xemacs.org wrote:
  | Baptiste Carvello writes:
  |   By contrast, if the new utf-8b codec would *supercede* the old one,
  |   \udcxx would always mean raw bytes (at least on UCS-4 builds, where
  |   surrogates are unused). Thus ambiguity could be avoided.
  | 
  | Unfortunately, that's false.  [Because Python strings are
  | intended to be used as containers for widechars which are to be
  | interpreted as Unicode when that makes sense, but there's no
  | restriction against nonsense code points, including in UCS-4
  | Python.]

[...]

  Wouldn't you then be bypassing the implicit encoding anyway, at least to
  some extent, and thus not trip over the PEP?

Sure.  I'm not really arguing the PEP here; the point is that under
the current definition of Python strings, ambiguity is unavoidable.
The best we can ask for is fewer exceptions, and an attempt to reduce
ambiguity to a bare minimum in the code paths that we open up when we
make a definition that allows a formerly erroneous computation to
succeed.

Martin is well aware of this, and the PEP is clear enough about that (to
me, but I'm a mail and multilingual editor internals kinda guy <wink>).
I'd rather have more validation of strings, but *shrug* Martin's doing
the work.

OTOH, the Unicode fans need to understand that past policy of Python
is not to validate; Python is intended to provide all the tools needed
to write validating apps, but it isn't one itself.  Martin's PEP is
quite narrow in that sense.  All it is about is an invertible encoding
of broken encodings.  It does have the downside that it guarantees
that Python itself can produce non-conforming strings, but that's not
the end of the world, and an app can keep track of them or even refuse
them by setting the error handler, if it wants to.


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread MRAB

One further question: should the encoder accept a string like
u'\uDCC2\uDC80'? That would encode to b'\xC2\x80', which, when decoded,
would give u'\x80'. Does the PEP only guarantee that strings decoded
from the filesystem are reversible, but not check what might be de novo
strings?


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Martin v. Löwis
MRAB wrote:
 One further question: should the encoder accept a string like
 u'\uDCC2\uDC80'? That would encode to b'\xC2\x80'

Indeed so.

 which, when decoded, would give u'\x80'.

Assuming the encoding is UTF-8, yes.

 Does the PEP only guarantee that strings decoded
 from the filesystem are reversible, but not check what might be de novo
 strings?

Exactly so.
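In other words (a sketch assuming Python 3.1+, where the handler is named 'surrogateescape'):

>>> s = '\udcc2\udc80'                    # de novo string, not from the filesystem
>>> s.encode('utf-8', 'surrogateescape')
b'\xc2\x80'
>>> b'\xc2\x80'.decode('utf-8', 'surrogateescape')   # valid UTF-8, so no escaping on the way back
'\x80'

Only strings produced by decoding the filesystem are guaranteed to round-trip,
because the decoder emits the escape characters solely for byte sequences that
are invalid in the chosen encoding.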

Regards,
Martin


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Piet van Oostrum
 Ronald Oussoren ronaldousso...@mac.com (RO) wrote:

RO For what it's worth, the OSX API's seem to behave as follows:
RO * If you create a file with a non-UTF-8 name on a HFS+ filesystem the
RO system automatically encodes the name.

RO That is,  open(chr(255), 'w') will silently create a file named '%FF'
RO instead of the name you'd expect on a unix system.

Not for me (I am using Python 2.6.2).

>>> f = open(chr(255), 'w')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: [Errno 22] invalid mode ('w') or filename: '\xff'

I once got a tar file from a Linux system which contained a file with a
non-ASCII, ISO-8859-1 encoded filename. The tar file refused to be
unpacked on a HFS+ filesystem.
-- 
Piet van Oostrum p...@cs.uu.nl
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: p...@vanoostrum.org


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Barry Scott


On 30 Apr 2009, at 05:52, Martin v. Löwis wrote:


How do I get a printable unicode version of these path strings if they
contain non-unicode data?


Define printable. One way would be to use a regular expression,
replacing all codes in a certain range with a question mark.


What I mean by printable is that the string must be valid unicode
that I can print to a UTF-8 console or place as text in a UTF-8
web page.

I think your PEP gives me a string that will not encode to
valid UTF-8, which is what the world outside of Python likes. Did I get
this point wrong?





I'm guessing that an app has to understand that filenames come in two
forms, unicode and bytes, if it's not utf-8 data. Why not simply return
a string if it's valid utf-8 and otherwise return bytes?


That would have been an alternative solution, and the one that 2.x uses
for listdir. People didn't like it.


In our application we are running fedora with the assumption that the
filenames are UTF-8. When Windows systems FTP files to our system
the files are in CP-1251(?) and not valid UTF-8.

What we have to do is detect these non-UTF-8 filenames and get the
users to rename them.

Having an algorithm that says "if it's a string, no problem; if it's
bytes, deal with the exceptions" seems simple.

How do I do this detection with the PEP proposal?
Do I end up using the byte interface and doing the utf-8 decode
myself?

Barry



Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Ned Deily
In article m2ocueq6mm@cs.uu.nl, Piet van Oostrum p...@cs.uu.nl 
wrote:
  Ronald Oussoren ronaldousso...@mac.com (RO) wrote:
 RO For what it's worth, the OSX API's seem to behave as follows:
 RO * If you create a file with a non-UTF-8 name on a HFS+ filesystem the
 RO system automatically encodes the name.
 
 RO That is,  open(chr(255), 'w') will silently create a file named '%FF'
 RO instead of the name you'd expect on a unix system.
 
 Not for me (I am using Python 2.6.2).
 
 >>> f = open(chr(255), 'w')
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
 IOError: [Errno 22] invalid mode ('w') or filename: '\xff'

What version of OSX are you using?  On Tiger 10.4.11 I see the failure 
you see but on Leopard 10.5.6 the behavior Ronald reports.

-- 
 Ned Deily,
 n...@acm.org



Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Martin v. Löwis
 How do get a printable unicode version of these path strings if they
 contain none unicode data?

 Define printable. One way would be to use a regular expression,
 replacing all codes in a certain range with a question mark.
 
 What I mean by printable is that the string must be valid unicode
 that I can print to a UTF-8 console or place as text in a UTF-8
 web page.
 
 I think your PEP gives me a string that will not encode to
 valid UTF-8 that the outside of python world likes. Did I get this
 point wrong?

You are right. However, if your *only* requirement is that it should
be printable, then this is fairly underspecified. One way to get
a printable string would be this function

def printable_string(unprintable):
  return ""

This will always return a printable version of the input string...

 In our application we are running fedora with the assumption that the
 filenames are UTF-8. When Windows systems FTP files to our system
 the files are in CP-1251(?) and not valid UTF-8.

That would be a bug in your FTP server, no? If you want all file names
to be UTF-8, then your FTP server should arrange for that.

 Having an algorithm that says if its a string no problem, if its
 a byte deal with the exceptions seems simple.
 
 How do I do this detection with the PEP proposal?
 Do I end up using the byte interface and doing the utf-8 decode
 myself?

No, you should encode using the strict error handler, with the
locale encoding. If the file name encodes successfully, it's correct,
otherwise, it's broken.
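A minimal sketch of that check (a hypothetical helper, assuming Python 3.1+ and that the locale/filesystem encoding on the Fedora box is UTF-8):

import os
import sys

def is_well_encoded(filename):
    # True if the name round-trips cleanly in the filesystem encoding,
    # i.e. it contains no escaped bytes from the PEP's handler.
    try:
        filename.encode(sys.getfilesystemencoding(), 'strict')
        return True
    except UnicodeEncodeError:
        return False

broken = [name for name in os.listdir('.') if not is_well_encoded(name)]
for name in broken:
    print('please rename:', ascii(name))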

Regards,
Martin


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread MRAB

Barry Scott wrote:


On 30 Apr 2009, at 05:52, Martin v. Löwis wrote:


How do get a printable unicode version of these path strings if they
contain none unicode data?


Define printable. One way would be to use a regular expression,
replacing all codes in a certain range with a question mark.


What I mean by printable is that the string must be valid unicode
that I can print to a UTF-8 console or place as text in a UTF-8
web page.

I think your PEP gives me a string that will not encode to
valid UTF-8 that the outside of python world likes. Did I get this
point wrong?





I'm guessing that an app has to understand that filenames come in two 
forms

unicode and bytes if its not utf-8 data. Why not simply return string if
its valid utf-8 otherwise return bytes?


That would have been an alternative solution, and the one that 2.x uses
for listdir. People didn't like it.


In our application we are running fedora with the assumption that the
filenames are UTF-8. When Windows systems FTP files to our system
the files are in CP-1251(?) and not valid UTF-8.

What we have to do is detect these non UTF-8 filename and get the
users to rename them.

Having an algorithm that says if its a string no problem, if its
a byte deal with the exceptions seems simple.

How do I do this detection with the PEP proposal?
Do I end up using the byte interface and doing the utf-8 decode
myself?


What do you do currently?

The PEP just offers a way of reading all filenames as Unicode, if that's
what you want. So what if the strings can't be encoded to normal UTF-8!
The filenames aren't valid UTF-8 anyway! :-)


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread James Y Knight

On Apr 30, 2009, at 5:42 AM, Martin v. Löwis wrote:

I think you are right. I have now excluded ASCII bytes from being
mapped, effectively not supporting any encodings that are not ASCII
compatible. Does that sound ok?


Yes. The practical upshot of this is that users who brokenly use  
ja_JP.SJIS as their locale (which, note, first requires editing some  
files in /var/lib/locales manually to enable its use..) may still have  
python not work with invalid-in-shift-jis filenames. Since that locale  
is widely recognized as a bad idea to use, and is not supported by any  
distros, it certainly doesn't bother me that it isn't 100% supported  
in python. It seems like the most common reason why people want to use  
SJIS is to make old pre-unicode apps work right in WINE -- in which  
case it doesn't actually affect unix python at all.


I'd personally be fine with python just declaring that the filesystem- 
encoding will *always* be utf-8b and ignore the locale...but I expect  
some other people might complain about that. Of course, application  
authors can decide to do that themselves by calling  
sys.setfilesystemencoding('utf-8b') at the start of their program.


James


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Thomas Breuel

 Not for me (I am using Python 2.6.2).

 >>> f = open(chr(255), 'w')
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
 IOError: [Errno 22] invalid mode ('w') or filename: '\xff'


You can get the same error on Linux:

$ python
Python 2.6.2 (release26-maint, Apr 19 2009, 01:56:41)
[GCC 4.3.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> f=open(chr(255),'w')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: [Errno 22] invalid mode ('w') or filename: '\xff'


(Some file system drivers do not enforce valid utf8 yet, but I suspect they
will in the future.)

Tom


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Barry Scott


On 30 Apr 2009, at 21:06, Martin v. Löwis wrote:

How do I get a printable unicode version of these path strings if they
contain non-unicode data?


Define printable. One way would be to use a regular expression,
replacing all codes in a certain range with a question mark.


What I mean by printable is that the string must be valid unicode
that I can print to a UTF-8 console or place as text in a UTF-8
web page.

I think your PEP gives me a string that will not encode to
valid UTF-8 that the outside of python world likes. Did I get this
point wrong?


You are right. However, if your *only* requirement is that it should
be printable, then this is fairly underspecified. One way to get
a printable string would be this function

def printable_string(unprintable):
  return ""


Ha ha! Indeed this works, but I would have to try to turn enough of the
string into a reasonable hint at the name of the file so the user has
some chance of knowing what is being reported.
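A hedged sketch of one way to get such a hint (a hypothetical helper, not from the PEP): render each escaped byte as %XX, URL-style, so the user still sees most of the real name:

def display_name(name):
    out = []
    for ch in name:
        if 0xdc80 <= ord(ch) <= 0xdcff:      # escape character produced by the PEP's handler
            out.append('%%%02X' % (ord(ch) - 0xdc00))
        else:
            out.append(ch)
    return ''.join(out)

# display_name('caf\udce9') -> 'caf%E9' : printable, and still recognisable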




This will always return a printable version of the input string...


In our application we are running fedora with the assumption that the
filenames are UTF-8. When Windows systems FTP files to our system
the files are in CP-1251(?) and not valid UTF-8.


That would be a bug in your FTP server, no? If you want all file names
to be UTF-8, then your FTP server should arrange for that.


Not a bug, it's the lack of a feature. We use ProFTPd, which has just
implemented what is required. I forget the exact details - they are at
work - but when the ftp client asks for the FEAT of the ftp server, the
server can say to use UTF-8. Supporting that in the server was
apparently non-trivial.






Having an algorithm that says if its a string no problem, if its
a byte deal with the exceptions seems simple.

How do I do this detection with the PEP proposal?
Do I end up using the byte interface and doing the utf-8 decode
myself?


No, you should encode using the strict error handler, with the
locale encoding. If the file name encodes successfully, it's correct,
otherwise, it's broken.


O.k. I understand.

Barry



Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Terry Reedy

James Y Knight wrote:

On Apr 30, 2009, at 5:42 AM, Martin v. Löwis wrote:

I think you are right. I have now excluded ASCII bytes from being
mapped, effectively not supporting any encodings that are not ASCII
compatible. Does that sound ok?


Yes. The practical upshot of this is that users who brokenly use 
ja_JP.SJIS as their locale (which, note, first requires editing some 
files in /var/lib/locales manually to enable its use..) may still have 
python not work with invalid-in-shift-jis filenames. Since that locale 
is widely recognized as a bad idea to use, and is not supported by any 
distros, it certainly doesn't bother me that it isn't 100% supported in 
python. It seems like the most common reason why people want to use SJIS 
is to make old pre-unicode apps work right in WINE -- in which case it 
doesn't actually affect unix python at all.


I'd personally be fine with python just declaring that the 
filesystem-encoding will *always* be utf-8b and ignore the locale...but 
I expect some other people might complain about that. Of course, 
application authors can decide to do that themselves by calling 
sys.setfilesystemencoding('utf-8b') at the start of their program.


It seems to me that the 3.1+ doc set (or wiki) could be usefully 
extended with a How-to on working with filenames.  I am not sure that 
everything useful fits anywhere in particular in the ref manuals.




Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Toshio Kuratomi
Thomas Breuel wrote:
 Not for me (I am using Python 2.6.2).
 
 >>> f = open(chr(255), 'w')
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
 IOError: [Errno 22] invalid mode ('w') or filename: '\xff'
 
 
 You can get the same error on Linux:
 
 $ python
 Python 2.6.2 (release26-maint, Apr 19 2009, 01:56:41)
 [GCC 4.3.3] on linux2
 Type "help", "copyright", "credits" or "license" for more information.
 >>> f=open(chr(255),'w')
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
 IOError: [Errno 22] invalid mode ('w') or filename: '\xff'

 
 (Some file system drivers do not enforce valid utf8 yet, but I suspect
 they will in the future.)
 
Do you suspect that from discussing the issue with kernel developers or
reading a thread on lkml?  If not, then your suspicion seems to be
pretty groundless.

The fact that VFAT enforces an encoding does not lend itself to your
argument for two reasons:

1) VFAT is not a Unix filesystem.  It's a filesystem that's compatible
with Windows/DOS.  If Windows and DOS have filesystem encodings, then it
makes sense for that driver to enforce that as well.  Filesystems
intended to be used natively on Linux/Unix do not necessarily make this
design decision.

2) The encoding is specified when mounting the filesystem.  This means
that you can still mix encodings in a number of ways.  If you mount with
an encoding that has full byte coverage, for instance, each user can put
filenames from different encodings on there.  If you mount with utf8 on
a system which uses euc-jp as the default encoding, you can have full
paths that contain a mix of utf-8 and euc-jp.  Etc.

-Toshio





Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Steven D'Aprano
On Fri, 1 May 2009 06:55:48 am Thomas Breuel wrote:

 You can get the same error on Linux:

 $ python
 Python 2.6.2 (release26-maint, Apr 19 2009, 01:56:41)
 [GCC 4.3.3] on linux2
 Type "help", "copyright", "credits" or "license" for more information.
 >>> f=open(chr(255),'w')
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
 IOError: [Errno 22] invalid mode ('w') or filename: '\xff'

Works for me under Fedora using ext3 as the file system.

$ python2.6
Python 2.6.1 (r261:67515, Dec 24 2008, 00:33:13)
[GCC 4.1.2 20070502 (Red Hat 4.1.2-12)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> f=open(chr(255),'w')
>>> f.close()
>>> import os
>>> os.remove(chr(255))
>>>

Given that chr(255) is a valid filename on my file system, I would 
consider it a bug if Python couldn't deal with a file with that name.



-- 
Steven D'Aprano


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Ronald Oussoren


On 30 Apr, 2009, at 21:33, Piet van Oostrum wrote:


Ronald Oussoren ronaldousso...@mac.com (RO) wrote:



RO For what it's worth, the OSX API's seem to behave as follows:
RO * If you create a file with a non-UTF-8 name on a HFS+ filesystem the
RO system automatically encodes the name.

RO That is,  open(chr(255), 'w') will silently create a file named '%FF'
RO instead of the name you'd expect on a unix system.


Not for me (I am using Python 2.6.2).


>>> f = open(chr(255), 'w')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: [Errno 22] invalid mode ('w') or filename: '\xff'




That's odd. Which version of OSX do you use?

ron...@rivendell-2[0]$ sw_vers
ProductName:Mac OS X
ProductVersion: 10.5.6
BuildVersion:   9G55

[~/testdir]
ron...@rivendell-2[0]$ /usr/bin/python
Python 2.5.1 (r251:54863, Jan 13 2009, 10:26:13)
[GCC 4.0.1 (Apple Inc. build 5465)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.listdir('.')
[]
>>> open(chr(255), 'w').write('x')
>>> os.listdir('.')
['%FF']


And likewise with python 2.6.1+ (after cleaning the directory):

[~/testdir]
ron...@rivendell-2[0]$ python2.6
Python 2.6.1+ (release26-maint:70603, Mar 26 2009, 08:38:03)
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.listdir('.')
[]
>>> open(chr(255), 'w').write('x')
>>> os.listdir('.')
['%FF']





I once got a tar file from a Linux system which contained a file with a
non-ASCII, ISO-8859-1 encoded filename. The tar file refused to be
unpacked on a HFS+ filesystem.
--
Piet van Oostrum p...@cs.uu.nl
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: p...@vanoostrum.org






Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Martin v. Löwis
 The Python UTF-8 codec will happily encode half-surrogates; people argue
 that it is a bug that it does so, however, it would help in this
 specific case.
 
 Can we use this encoding scheme for writing into files as well?  We've
 turned the filename with undecodable bytes into a string with half
 surrogates.  Putting that string into a file has to turn them into bytes
 at some level.  Can we use the python-escape error handler to achieve
 that somehow?

Sure: if you are aware that what you write to the stream is actually
a file name, you should encode it with the file system encoding, and
the python-escape handler. However, it's questionable that the same
approach is right for the rest of the data that goes into the file.

If you use a different encoding on the stream, yet still use the
python-escape handler, you may end up with completely non-sensical
bytes. In practice, it probably won't be that bad - python-escape
has likely escaped all non-ASCII bytes, so that on re-encoding with
a different encoding, only the ASCII characters get encoded, which
likely will work fine.
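A sketch of that first case (assuming Python 3.1+, where the PEP's "python-escape" handler was released as 'surrogateescape'): if a value is known to be a file name, re-encode it with the filesystem encoding and the same handler, so the exact original bytes are written out.

import sys

def write_filename(binary_stream, filename):
    # Re-encode with the filesystem encoding and the escape handler,
    # recovering the original bytes of the name exactly.
    data = filename.encode(sys.getfilesystemencoding(), 'surrogateescape')
    binary_stream.write(data + b'\n')

# e.g. write_filename(open('names.log', 'wb'), some_listdir_entry)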

Regards,
Martin


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Martin v. Löwis
 I'm more concerned with your (yours? someone else's?) mention of shift
 characters. I'm unfamiliar with these encodings: to translate such a
 thing into a Latin example, is it the case that there are schemes with
 valid encodings that look like:
 
   [SHIFT] a b c
 
 which would produce ABC in unicode, which is ambiguous with:
 
   A B C
 
 which would also produce ABC?

No: the shift in shift-jis is not really about the shift key.
See http://en.wikipedia.org/wiki/Shift-JIS

Regards,
Martin




Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Glenn Linderman
On approximately 4/28/2009 10:52 PM, came the following characters from 
the keyboard of Martin v. Löwis:

C. File on disk with the invalid surrogate code, accessed via the str
interface, no decoding happens, matches in memory the file on disk with
the byte that translates to the same surrogate, accessed via the bytes
interface.  Ambiguity.

Is that an alternative to A and B?

I guess it is an adjunct to case B, the current PEP.

It is what happens when using the PEP on a system that provides both
bytes and str interfaces, and both get used.


Your formulation is a bit too stenographic to me, but please trust me
that there is *no* ambiguity in the case you construct.



No Martin, the point of reviewing the PEP is to _not_ trust you, even 
though you are generally very knowledgeable and very trustworthy.  It is 
much easier to find problems before something is released, or even 
coded, than it is afterwards.




By accessed via the str interface, I assume you do something like

  fn = "some string"
  open(fn)

You are wrong in assuming no decoding happens, and that matches
in memory the file on disk (whatever that means - how do I match
a file on disk in memory??). What happens instead is that fn
gets *encoded* with the file system encoding, and the python-escape
handler. This will *not* produce an ambiguity.



You assumed, and maybe I wasn't clear in my statement.

By accessed via the str interface I mean that (on Windows) the wide 
string interface would be used to obtain a file name.  Now, suppose that 
the file name returned contains abc followed by the half-surrogate 
U+DC10 -- four 16-bit codes.


Then, ask for the same filename via the bytes interface, using UTF-8 
encoding.  The PEP says that the above name would get translated to 
abc followed by 3 half-surrogates, corresponding to the 3 UTF-8 bytes 
used to represent the half-surrogate that is actually in the file name, 
specifically U+DCED U+DCB0 U+DC90.  This means that one name on disk can 
be seen as two different names in memory.


Now posit another file which, when accessed via the str interface, has 
the name abc followed by U+DCED U+DCB0 U+DC90.


Looks ambiguous to me.  Now if you have a scheme for handling this case, 
fine, but I don't understand it from what is written in the PEP.




If you think there is an ambiguity in that you can use both the
byte interface and the string interface to access the same file:
this would be a ridiculous interpretation. *Of course* you can
access /etc/passwd both as "/etc/passwd" and b"/etc/passwd",
there is nothing ambiguous about that.


Yes, this would be a ridiculous interpretation of ambiguous.


--
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Martin v. Löwis
 C. File on disk with the invalid surrogate code, accessed via the str
 interface, no decoding happens, matches in memory the file on disk
 with
 the byte that translates to the same surrogate, accessed via the bytes
 interface.  Ambiguity.
 Is that an alternative to A and B?
 I guess it is an adjunct to case B, the current PEP.

 It is what happens when using the PEP on a system that provides both
 bytes and str interfaces, and both get used.

 Your formulation is a bit too stenographic to me, but please trust me
 that there is *no* ambiguity in the case you construct.
 
 
 No Martin, the point of reviewing the PEP is to _not_ trust you, even
 though you are generally very knowledgeable and very trustworthy.  It is
 much easier to find problems before something is released, or even
 coded, than it is afterwards.

Sure. However, that requires you to provide meaningful, reproducible
counter-examples, rather than a stenographic formulation that might
hint some problem you apparently see (which I believe is just not
there).

 You assumed, and maybe I wasn't clear in my statement.
 
 By accessed via the str interface I mean that (on Windows) the wide
 string interface would be used to obtain a file name.

What does that mean? What specific interface are you referring to to
obtain file names? Most of the time, file names are obtained by the
user entering them on the keyboard. GUI applications are completely
out of the scope of the PEP.

 Now, suppose that
 the file name returned contains abc followed by the half-surrogate
 U+DC10 -- four 16-bit codes.

Ok, so perhaps you might be talking about os.listdir here. Communication
would be much easier if I would not need to guess what you may mean.

Also, why is U+DC10 four 16-bit codes?

 Then, ask for the same filename via the bytes interface, using UTF-8
 encoding.

How do you do that on Windows? You cannot just pick an encoding, such
as UTF-8, and pass that to the byte interface, and expect it to work.
If you use the byte interface, you need to encode in the file system
encoding, of course.

Also, what do you mean by "ask for"?? WHAT INTERFACE ARE YOU
USING? Please use specific python code.

 The PEP says that the above name would get translated to
 abc followed by 3 half-surrogates, corresponding to the 3 UTF-8 bytes
 used to represent the half-surrogate that is actually in the file name,
 specifically U+DCED U+DCB0 U+DC90.  This means that one name on disk can
 be seen as two different names in memory.

You are relying on false assumptions here, namely that the UTF-8
encoding would play any role.

What would happen instead is that the mbcs encoding would be used. The
mbcs encoding, by design from Microsoft, will never report an error,
so the error handler will not be invoked at all.

 Now posit another file which, when accessed via the str interface, has
 the name abc followed by U+DCED U+DCB0 U+DC90.
 
 Looks ambiguous to me.  Now if you have a scheme for handling this case,
 fine, but I don't understand it from what is written in the PEP.

You were just making false assumptions in your reasoning, assumptions
that are way beyond the scope of the PEP.

Regards,
Martin


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Baptiste Carvello

Glenn Linderman a écrit :


3. When an undecodable byte 0xPQ is found, decode to the escape 
codepoint, followed by codepoint U+01PQ, where P and Q are hex digits.




The problem with this strategy is: paths are often sliced, so your 2 codepoints 
could get separated. The good thing with the PEP's strategy is that 1 character 
stays 1 character.
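A hedged sketch of the two behaviours (ESCAPE is a purely illustrative placeholder, not anything from the PEP; the decoder is grossly simplified to ASCII-or-escape):

ESCAPE = '\u0100'    # illustrative escape codepoint only

def escape_decode(data):
    out = []
    for b in data:
        if b < 0x80:
            out.append(chr(b))
        else:
            out.append(ESCAPE + chr(0x0100 + b))   # byte 0xPQ -> ESCAPE, U+01PQ
    return ''.join(out)

name = escape_decode(b'ab\xff')    # 'a', 'b', ESCAPE, '\u01ff' -- four characters
# name[:3] slices between ESCAPE and '\u01ff', separating the pair; the
# PEP's one-byte-to-one-character mapping cannot be split this way.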


Baptiste



Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Hrvoje Niksic

Zooko O'Whielacronx wrote:
If you switch to iso8859-15 only in the presence of undecodable  
UTF-8, then you have the same round-trip problem as the PEP: both  
b'\xff' and b'\xc3\xbf' will be converted to u'\u00ff' without a  
way to unambiguously recover the original file name.


Why do you say that?  It seems to work as I expected here:

>>> '\xff'.decode('iso-8859-15')
u'\xff'
>>> '\xc3\xbf'.decode('iso-8859-15')
u'\xc3\xbf'


Here is what I mean by switch to iso8859-15 only in the presence of 
undecodable UTF-8:


def file_name_to_unicode(fn, encoding):
try:
return fn.decode(encoding)
except UnicodeDecodeError:
return fn.decode('iso-8859-15')

Now, assume a UTF-8 locale and try to use it on the provided example 
file names.


>>> file_name_to_unicode(b'\xff', 'utf-8')
'ÿ'
>>> file_name_to_unicode(b'\xc3\xbf', 'utf-8')
'ÿ'

That is the ambiguity I was referring to -- two different byte sequences 
result in the same unicode string.



Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Baptiste Carvello

Lino Mastrodomenico a écrit :


Only for the new utf-8b encoding (if Martin agrees), while the
existing utf-8 is fine as is (or at least waaay outside the scope of
this PEP).



This is questionable. This would have the consequence that \udcxx in a python 
string would sometimes mean a surrogate, and sometimes mean raw bytes, depending 
on the history of the string.


By contrast, if the new utf-8b codec would *supercede* the old one, \udcxx would 
always mean raw bytes (at least on UCS-4 builds, where surrogates are unused). 
Thus ambiguity could be avoided.


Baptiste



Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Baptiste Carvello

Glenn Linderman a écrit :



If there is going to be a required transformation from de novo strings 
to funny-encoded strings, then why not make one that people can actually 
see and compare and decode from the displayable form, by using 
displayable characters instead of lone surrogates?




The problem with your escape character scheme is that the meaning is lost with 
slicing of the strings, which is a very common operation.




I thought half-surrogates were illegal in well-formed Unicode. I confess
to being weak in this area. By "legitimate" above I meant things like
half-surrogates which, like quarks, should not occur alone?
  


Illegal just means violating the accepted rules.  In this case, the 
accepted rules are those enforced by the file system (at the bytes or 
str API levels), and by Python (for the str manipulations).  None of 
those rules outlaw lone surrogates.  [...]




Python could as well *specify* that lone surrogates are illegal, as their 
meaning is undefined by Unicode. If this rule is respected language-wise, there 
is no ambiguity. It might be unrealistic on windows, though.


This rule could even be specified only for strings that represent filesystem 
paths. Sure, they are the same type as other strings, but the programmer usually 
knows if a given string is intended to be a path or not.


Baptiste



Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Thomas Breuel
 Sure. However, that requires you to provide meaningful, reproducible
 counter-examples, rather than a stenographic formulation that might
 hint some problem you apparently see (which I believe is just not
 there).


Well, here's another one: PEP 383 would disallow UTF-8 encodings of half
surrogates.  But such encodings are currently supported by Python, and they
are used as part of CESU-8 coding.  That's, in fact, a common way of
converting UTF-16 to UTF-8.  How are you going to deal with existing code
that relies on being able to code half surrogates as UTF-8?

Tom


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Glenn Linderman
On approximately 4/29/2009 12:38 AM, came the following characters from 
the keyboard of Baptiste Carvello:

Glenn Linderman a écrit :


3. When an undecodable byte 0xPQ is found, decode to the escape 
codepoint, followed by codepoint U+01PQ, where P and Q are hex digits.




The problem with this strategy is: paths are often sliced, so your 2 
codepoints could get separated. The good thing with the PEP's strategy 
is that 1 character stays 1 character.


Baptiste



Except for half-surrogates that are in the file names already, which get 
converted to 3 characters.



--
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Glenn Linderman
On approximately 4/29/2009 12:29 AM, came the following characters from 
the keyboard of Martin v. Löwis:

C. File on disk with the invalid surrogate code, accessed via the str
interface, no decoding happens, matches in memory the file on disk
with
the byte that translates to the same surrogate, accessed via the bytes
interface.  Ambiguity.

Is that an alternative to A and B?

I guess it is an adjunct to case B, the current PEP.

It is what happens when using the PEP on a system that provides both
bytes and str interfaces, and both get used.

Your formulation is a bit too stenographic to me, but please trust me
that there is *no* ambiguity in the case you construct.


No Martin, the point of reviewing the PEP is to _not_ trust you, even
though you are generally very knowledgeable and very trustworthy.  It is
much easier to find problems before something is released, or even
coded, than it is afterwards.


Sure. However, that requires you to provide meaningful, reproducible
counter-examples, rather than a stenographic formulation that might
hint some problem you apparently see (which I believe is just not
there).


You assumed, and maybe I wasn't clear in my statement.

By "accessed via the str interface" I mean that (on Windows) the wide
string interface would be used to obtain a file name.


What does that mean? What specific interface are you referring to to
obtain file names? Most of the time, file names are obtained by the
user entering them on the keyboard. GUI applications are completely
out of the scope of the PEP.


Now, suppose that
the file name returned contains abc followed by the half-surrogate
U+DC10 -- four 16-bit codes.


Ok, so perhaps you might be talking about os.listdir here. Communication
would be much easier if I would not need to guess what you may mean.



os.listdir("")




Also, why is U+DC10 four 16-bit codes?



It isn't.

First 16-bit code is U+0061
Second 16-bit code is U+0062
Third 16-bit code is U+0063
Fourth 16-bit code is U+DC10




Then, ask for the same filename via the bytes interface, using UTF-8
encoding.


How do you do that on Windows? You cannot just pick an encoding, such
as UTF-8, and pass that to the byte interface, and expect it to work.
If you use the byte interface, you need to encode in the file system
encoding, of course.

Also, what do you mean by "ask for"?? WHAT INTERFACE ARE YOU
USING? Please use specific python code.



os.listdir(b"")

I find that on my Windows system, with all ASCII path file names, that I 
get quite different results when I pass os.listdir an empty str vs an 
empty bytes.


Rather than keep you guessing, I get the root directory contents from 
the empty str, and the current directory contents from an empty bytes. 
That is rather unexpected.


So I guess I'd better suggest that a specific, equivalent directory name 
be passed in either bytes or str form.




The PEP says that the above name would get translated to
abc followed by 3 half-surrogates, corresponding to the 3 UTF-8 bytes
used to represent the half-surrogate that is actually in the file name,
specifically U+DCED U+DCB0 U+DC90.  This means that one name on disk can
be seen as two different names in memory.


You are relying on false assumptions here, namely that the UTF-8
encoding would play any role.

What would happen instead is that the mbcs encoding would be used. The
mbcs encoding, by design from Microsoft, will never report an error,
so the error handler will not be invoked at all.



So what you are saying here is that Python doesn't use the A forms of 
the Windows APIs for filenames, but only the W forms, and uses lossy 
decoding (from MS) to the current code page (which can never be UTF-8 on 
Windows).


You are further saying that Python doesn't give the programmer control 
over the codec that is used to convert from W results to bytes, so that 
on Windows, it is impossible to obtain a bytes result containing UTF-8 
from os.listdir, even though sys.setfilesystemencoding exists, and 
sys.getfilesystemencoding is affected by it, and the latter is 
documented as returning mbcs, and as returning the codec that should 
be used by the application to convert str to bytes for filenames. 
(Python 3.0.1).


While I can hear a "that is outside the scope of the PEP" coming, this 
documentation is confusing, to say the least.




Now posit another file which, when accessed via the str interface, has
the name abc followed by U+DCED U+DCB0 U+DC90.

Looks ambiguous to me.  Now if you have a scheme for handling this case,
fine, but I don't understand it from what is written in the PEP.


You were just making false assumptions in your reasoning, assumptions
that are way beyond the scope of the PEP.



Absolutely correct.  I was making what seemed to be reasonable 
assumptions about Python internals on Windows, and several of them are 
false, including misleading documentation for listdir (which doesn't 
specify that bytes and str parameters affect whether or not the current 

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread R. David Murray

On Tue, 28 Apr 2009 at 20:29, Glenn Linderman wrote:
On approximately 4/28/2009 7:40 PM, came the following characters from the 
keyboard of R. David Murray:

 On Tue, 28 Apr 2009 at 13:37, Glenn Linderman wrote:
  C. File on disk with the invalid surrogate code, accessed via the str 
  interface, no decoding happens, matches in memory the file on disk with 
  the byte that translates to the same surrogate, accessed via the bytes 
  interface. Ambiguity.


 Unless I'm missing something, one of these is type str, and the other is
 type bytes, so no ambiguity.



You are missing that the bytes value would get decoded to a str; thus both 
are str; so ambiguity is possible.


Only if you as the programmer decode it.  Now, I don't understand the
subtleties of Unicode enough to know if Martin has already successfully
addressed this concern in another fashion, but personally I think that
if you as a programmer are comparing funnydecoded-str strings gotten
via a string interface with normal-decoded strings gotten via a bytes
interface, that we could claim that your program has a bug.

--David


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Cameron Simpson
On 29Apr2009 02:56, Glenn Linderman v+pyt...@g.nevcal.com wrote:
 os.listdir(b"")

 I find that on my Windows system, with all ASCII path file names, that I  
 get quite different results when I pass os.listdir an empty str vs an  
 empty bytes.

 Rather than keep you guessing, I get the root directory contents from  
 the empty str, and the current directory contents from an empty bytes.  
 That is rather unexpected.

 So I guess I'd better suggest that a specific, equivalent directory name  
 be passed in either bytes or str form.

I think you may have uncovered an implementation bug rather than an
encoding issue (because I'd expect "" and b"" to be equivalent).

In ancient times, "" was a valid UNIX name for the working directory.
POSIX disallows that, and requires people to use ".".

Maybe you're seeing an artifact; did python move from UNIX to Windows or the
other way around in its porting history? I'd guess the former.

Do you get differing results from listdir(".") and listdir(b".")?
How's python2 behave for ""? (Since there's no b"" in python2.)

Cheers,
-- 
Cameron Simpson c...@zip.com.au DoD#743
http://www.cskk.ezoshosting.com/cs/

'Supposing a tree fell down, Pooh, when we were underneath it?'
'Supposing it didn't,' said Pooh after careful thought.


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Glenn Linderman
On approximately 4/29/2009 4:07 AM, came the following characters from 
the keyboard of R. David Murray:

On Tue, 28 Apr 2009 at 20:29, Glenn Linderman wrote:
On approximately 4/28/2009 7:40 PM, came the following characters from 
the keyboard of R. David Murray:

 On Tue, 28 Apr 2009 at 13:37, Glenn Linderman wrote:
  C. File on disk with the invalid surrogate code, accessed via the 
str   interface, no decoding happens, matches in memory the file on 
disk with   the byte that translates to the same surrogate, accessed 
via the bytes   interface. Ambiguity.


 Unless I'm missing something, one of these is type str, and the 
other is

 type bytes, so no ambiguity.



You are missing that the bytes value would get decoded to a str; thus 
both are str; so ambiguity is possible.


Only if you as the programmer decode it.  Now, I don't understand the
subtleties of Unicode enough to know if Martin has already successfully
addressed this concern in another fashion, but personally I think that
if you as a programmer are comparing funnydecoded-str strings gotten
via a string interface with normal-decoded strings gotten via a bytes
interface, that we could claim that your program has a bug.


Hopefully Martin will clarify the PEP as I suggested in another branch 
of this thread.  He has eventually convinced me that this ambiguity is 
not possible, via email discussion, but the PEP is certainly less than 
sufficiently explanatory to make that obvious.



--
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Stephen J. Turnbull
Baptiste Carvello writes:

  By contrast, if the new utf-8b codec would *supercede* the old one,
  \udcxx would always mean raw bytes (at least on UCS-4 builds, where
  surrogates are unused). Thus ambiguity could be avoided.

Unfortunately, that's false.  It could have come from a literal string
(similar to the text above ;-), a C extension, or a string slice (on
16-bit builds), and there may be other ways to do it.  The only way to
avoid ambiguity is to change the definition of a Python string to be
*valid* Unicode (possibly with Python extensions such as PEP 383 for
internal use only).  But Guido has rejected that in the past;
validation is the application's problem, not Python's.

Nor is a UCS-4 build exempt.  IIRC Guido specifically envisioned
Python strings being used to build up code point sequences to be
directly output, which means that a UCS-4 string might none-the-less
contain surrogates being added to a string intended to be sent as
UTF-16 output simply by truncating the 32-bit code units to 16 bits.
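A minimal illustration of that last point, using an arbitrary
supplementary-plane code point:

    cp = 0x1DC80                 # arbitrary code point above the BMP (illustrative)
    unit = chr(cp & 0xFFFF)      # '\udc80' -- naive truncation yields a lone low surrogate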



Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Stephen J. Turnbull
Martin v. Löwis writes:

  I find the case pretty artificial, though: if the locale encoding
  changes, all file names will look incorrect to the user, so he'll
  quickly switch back, or rename all the files.

It's not necessarily the case that the locale encoding changes, but
rather the name of the file.  I have a couple of directories where I
have Japanese in both EUC-JP and UTF-8, for example.  (The
applications where I never bothered to do a conversion from EUC to
UTF-8 are things like stripping MIME attachments from messages and
saving them to files when I changed my default.)

So I have a little Emacs Lisp function that tries EUC or UTF8
depending on date and falls back to the other on a decode error.
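(A rough Python counterpart of that helper, simplified to drop the date
heuristic and just fall back on a decode error:)

    def decode_name(raw):
        # try EUC-JP first, fall back to UTF-8 -- illustrative only
        try:
            return raw.decode('euc_jp')
        except UnicodeDecodeError:
            return raw.decode('utf-8')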

Another possible situation would be a user program in the user's
locale communicating with a daemon running in some other locale (quite
likely POSIX).

So while out of scope of the PEP, I don't think it's at all
artificial.


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Martin v. Löwis
 Sure. However, that requires you to provide meaningful, reproducible
 counter-examples, rather than a stenographic formulation that might
 hint some problem you apparently see (which I believe is just not
 there).
 
 
 Well, here's another one: PEP 383 would disallow UTF-8 encodings of half
 surrogates.  But such encodings are currently supported by Python, and
 they are used as part of CESU-8 coding.  That's, in fact, a common way
 of converting UTF-16 to UTF-8.  How are you going to deal with existing
 code that relies on being able to code half surrogates as UTF-8?

Can you please elaborate? What code specifically are you talking about?

Regards,
Martin


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Martin v. Löwis
 C. File on disk with the invalid surrogate code, accessed via the
 str interface, no decoding happens, matches in memory the file on disk
 with the byte that translates to the same surrogate, accessed via the
 bytes interface.  Ambiguity.
 What does that mean? What specific interface are you referring to to
 obtain file names? 
 
 os.listdir("")
 
 os.listdir(b"")
 
 So I guess I'd better suggest that a specific, equivalent directory name
 be passed in either bytes or str form.

[Leaving the issue of the empty string apparently having different
meanings aside ...]

Ok. Now I understand the example. So you do

os.listdir("c:/tmp")
os.listdir(b"c:/tmp")

and you have a file in c:/tmp that is named "abc\uDC10".

 So what you are saying here is that Python doesn't use the A forms of
 the Windows APIs for filenames, but only the W forms, and uses lossy
 decoding (from MS) to the current code page (which can never be UTF-8 on
 Windows).

Actually, it does use the A form, in the second listdir example. This,
in turn (inside Windows), uses the lossy CP_ACP encoding. You get back
a byte string; the listdirs should give

["abc\uDC10"]
[b"abc?"]

(not quite sure about the second - I only guess that CP_ACP will replace
the half surrogate with a question mark).

So where is the ambiguity here?
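Spelled out as code, the scenario is (on Windows, with the results guessed
above; purely illustrative):

    import os
    os.listdir("c:/tmp")    # wide (W) API:   ['abc\udc10']
    os.listdir(b"c:/tmp")   # narrow (A) API: [b'abc?'] if CP_ACP substitutes '?'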

 You are further saying that Python doesn't give the programmer control
 over the codec that is used to convert from W results to bytes, so that
 on Windows, it is impossible to obtain a bytes result containing UTF-8
 from os.listdir, even though sys.setfilesystemencoding exists, and
 sys.getfilesystemencoding is affected by it, and the latter is
 documented as returning mbcs, and as returning the codec that should
 be used by the application to convert str to bytes for filenames.
 (Python 3.0.1).

Not exactly. You *can* do setfilesystemencoding on Windows, but it has
no effect, as the Python file system encoding is never used on Windows.
For a string, it passes it to the W API as is; for bytes, it passes it
to the A API as-is. Python never invokes any codec here.

 While I can hear a "that is outside the scope of the PEP" coming, this
 documentation is confusing, to say the least.

Only because you are apparently unaware of the status quo. If you would
study the current Python source code, it would be all very clear.

 Things are a little clearer in the documentation for
 sys.setfilesystemencoding, which does say the encoding isn't used by
 Windows -- so why is it permitted to change it, if it has no effect?).

As in many cases: because nobody contributed code to make it behave
otherwise. It's not that the file system encoding is mbcs - the
file system encoding is simply unused on Windows (but that wasn't
always the case, in particular not when Windows 9x still had to
be supported).

Regards,
Martin



Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Martin v. Löwis
 So while out of scope of the PEP, I don't think it's at all
 artificial.

Sure - but I see this as the same case as the file got renamed.
If you have a LRU list in your app, and a file gets renamed, then
the LRU list breaks (unless you also store the inode number in the
LRU list, and lookup the file by inode number - or object UUID
on NTFS, possibly using distributed link tracking).

Regards,
Martin


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Terry Reedy

Glenn Linderman wrote:
On approximately 4/29/2009 4:36 AM, came the following characters from 
the keyboard of Cameron Simpson:

On 29Apr2009 02:56, Glenn Linderman v+pyt...@g.nevcal.com wrote:
 

os.listdir(b"")

I find that on my Windows system, with all ASCII path file names, 
that I  get quite different results when I pass os.listdir an empty 
str vs an  empty bytes.


Rather than keep you guessing, I get the root directory contents 
from  the empty str, and the current directory contents from an empty 
bytes.  That is rather unexpected.


So I guess I'd better suggest that a specific, equivalent directory 
name  be passed in either bytes or str form.



I think you may have uncovered an implementation bug rather than an
encoding issue (because I'd expect  and b to be equivalent).
  


Me too.


Sounds like an issue for the tracker.



Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Terry Reedy

Thomas Breuel wrote:


Sure. However, that requires you to provide meaningful, reproducible
counter-examples, rather than a stenographic formulation that might
hint some problem you apparently see (which I believe is just not
there).


Well, here's another one: PEP 383 would disallow UTF-8 encodings of half 
surrogates. 


By my reading, the current Unicode 5.1 definition of 'UTF-8' disallows that.

But such encodings are currently supported by Python, and 
they are used as part of CESU-8 coding.  That's, in fact, a common way 
of converting UTF-16 to UTF-8.  How are you going to deal with existing 
code that relies on being able to code half surrogates as UTF-8?




Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Glenn Linderman
On approximately 4/29/2009 1:28 PM, came the following characters from 
the keyboard of Martin v. Löwis:

C. File on disk with the invalid surrogate code, accessed via the
str interface, no decoding happens, matches in memory the file on disk
with the byte that translates to the same surrogate, accessed via the
bytes interface.  Ambiguity.

What does that mean? What specific interface are you referring to to
obtain file names? 

 os.listdir("")

 os.listdir(b"")

So I guess I'd better suggest that a specific, equivalent directory name
be passed in either bytes or str form.


[Leaving the issue of the empty string apparently having different
meanings aside ...]

Ok. Now I understand the example. So you do

os.listdir("c:/tmp")
os.listdir(b"c:/tmp")

and you have a file in c:/tmp that is named "abc\uDC10".


So what you are saying here is that Python doesn't use the A forms of
the Windows APIs for filenames, but only the W forms, and uses lossy
decoding (from MS) to the current code page (which can never be UTF-8 on
Windows).


Actually, it does use the A form, in the second listdir example. This,
in turn (inside Windows), uses the lossy CP_ACP encoding. You get back
a byte string; the listdirs should give

["abc\uDC10"]
[b"abc?"]

(not quite sure about the second - I only guess that CP_ACP will replace
the half surrogate with a question mark).

So where is the ambiguity here?


None.  But not everyone can read all the Python source code to try to 
understand it; they expect the documentation to help them avoid that. 
Because the documentation is lacking in this area, it makes your 
concisely stated PEP rather hard to understand.


Thanks for clarifying the Windows behavior, here.  A little more 
clarification in the PEP could have avoided lots of discussion.  It 
would seem that a PEP, proposed to modify a poorly documented (and 
therefore likely poorly understood) area, should be educational about 
the status quo, as well as presenting the suggested change.  Or is it 
the Python philosophy that the PEPs should be as incomprehensible as 
possible, to generate large discussions?



--
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Cameron Simpson
On 29Apr2009 17:03, Terry Reedy tjre...@udel.edu wrote:
 Thomas Breuel wrote:
 Sure. However, that requires you to provide meaningful, reproducible
 counter-examples, rather than a stenographic formulation that might
 hint some problem you apparently see (which I believe is just not
 there).

 Well, here's another one: PEP 383 would disallow UTF-8 encodings of 
 half surrogates. 

 By my reading, the current Unicode 5.1 definition of 'UTF-8' disallows that.

5.0 also disallows it. No surprise I guess.
-- 
Cameron Simpson c...@zip.com.au DoD#743
http://www.cskk.ezoshosting.com/cs/

Out on the road, feeling the breeze, passing the cars.  - Bob Seger


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Cameron Simpson
On 29Apr2009 22:14, Stephen J. Turnbull step...@xemacs.org wrote:
| Baptiste Carvello writes:
|   By contrast, if the new utf-8b codec would *supercede* the old one,
|   \udcxx would always mean raw bytes (at least on UCS-4 builds, where
|   surrogates are unused). Thus ambiguity could be avoided.
| 
| Unfortunately, that's false.  It could have come from a literal string
| (similar to the text above ;-), a C extension, or a string slice (on
| 16-bit builds), and there may be other ways to do it.  The only way to
| avoid ambiguity is to change the definition of a Python string to be
| *valid* Unicode (possibly with Python extensions such as PEP 383 for
| internal use only).  But Guido has rejected that in the past;
| validation is the application's problem, not Python's.
| 
| Nor is a UCS-4 build exempt.  IIRC Guido specifically envisioned
| Python strings being used to build up code point sequences to be
| directly output, which means that a UCS-4 string might none-the-less
| contain surrogates being added to a string intended to be sent as
| UTF-16 output simply by truncating the 32-bit code units to 16 bits.

Wouldn't you then be bypassing the implicit encoding anyway, at least to
some extent, and thus not trip over the PEP?
-- 
Cameron Simpson c...@zip.com.au DoD#743
http://www.cskk.ezoshosting.com/cs/

Clemson is the Harvard of cardboard packaging.
- overhead by WIRED at the Intelligent Printing conference Oct2006


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Barry Scott


On 22 Apr 2009, at 07:50, Martin v. Löwis wrote:



If the locale's encoding is UTF-8, the file system encoding is set to
a new encoding "utf-8b". The UTF-8b codec decodes non-decodable bytes
(which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.



Forgive me if this has been covered. I've been reading this thread for a
long time and still have a 100 odd replies to go...

How do you get a printable unicode version of these path strings if they
contain non-unicode data?

I'm guessing that an app has to understand that filenames come in two
forms, unicode and bytes, if it's not utf-8 data. Why not simply return a
string if it's valid utf-8, otherwise return bytes? Then in the app you
check the type of the object, string or bytes, and deal with reporting
errors appropriately.

Barry



Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Cameron Simpson
On 29Apr2009 23:41, Barry Scott ba...@barrys-emacs.org wrote:
 On 22 Apr 2009, at 07:50, Martin v. Löwis wrote:
 If the locale's encoding is UTF-8, the file system encoding is set to
 a new encoding "utf-8b". The UTF-8b codec decodes non-decodable bytes
 (which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.

 Forgive me if this has been covered. I've been reading this thread for a 
 long time and still have a 100 odd replies to go...

 How do get a printable unicode version of these path strings if they  
 contain none unicode data?

Personally, I'd use repr(). One might ask, what would you expect to see
if you were printing such a string?
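For example, with a made-up funny-decoded name:

    name = 'caf\udce9.txt'    # e.g. Latin-1 b'caf\xe9.txt' decoded under the PEP
    print(repr(name))         # prints 'caf\udce9.txt' -- printable and unambiguous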

 I'm guessing that an app has to understand that filenames come in two  
 forms unicode and bytes if its not utf-8 data. Why not simply return string 
 if 
 its valid utf-8 otherwise return bytes? Then in the app you check for the 
 type for 
 the object, string or byte and deal with reporting errors appropriately.

Because it complicates the app enormously, for every app.

It would be _nice_ to just call os.listdir() et al with strings, get
strings, and not worry.

With strings becoming unicode in Python3, on POSIX you have an issue of
deciding how to get its filenames-are-bytes into a string and the
reverse. One could naively map the byte values to the same Unicode code
points, but that results in strings that do not contain the same
characters as the user/app expects for byte values above 127.

Since POSIX does not really have a filesystem level character encoding,
just a user environment setting that says how the current user encodes
characters into bytes (UTF-8 is increasingly common and useful, but
it is not universal), it is more useful to decode filenames on the
assumption that they represent characters in the user's (current) encoding
convention; that way when things are displayed they are meaningful,
and they interoperate well with strings made by the user/app. If all
the filenames were actually encoded that way when made, that works. But
different users may adopt different conventions, and indeed a user may
have used ASCII or an ISO8859-* coding in the past and be transitioning
to something else now, so they will have a bunch of files in different
encodings.

The PEP uses the user's current encoding with a handler for byte
sequences that don't decode to valid Unicode scalar values in
a fashion that is reversible. That is, you get strings out of
listdir() and those strings will go back in (eg to open()) perfectly
robustly.
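In code, the intended round trip is simply (a sketch that assumes the PEP's
decoding is in place for os.listdir() and open()):

    import os
    for name in os.listdir('.'):      # undecodable bytes show up as U+DCxx escapes
        with open(name, 'rb') as f:   # the same str re-encodes to the original bytes
            f.read()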

Previous approaches would either silently hide non-decodable names in
listdir() results or throw exceptions when the decode failed or mangle
things no reversably. I believe Python3 went with the first option
there.

The PEP at least lets programs naively access all files that exist,
and create a filename from any well-formed unicode string provided that
the filesystem encoding permits the name to be encoded.

The lengthy discussion mostly revolves around:

  - Glenn points out that strings that came _not_ from listdir, and that are
_not_ well-formed unicode (== have bare surrogates in them) but that
were intended for use as filenames will conflict with the PEP's scheme -
programs must know that these strings came from outside and must be
translated into the PEP's funny-encoding before use in the os.*
functions. Previous to the PEP they would get used directly and
encode differently after the PEP, thus producing different POSIX
filenames. Breakage.

  - Glenn would like the encoding to use Unicode scalar values only,
using a rare-in-filenames character.
That would avoid the issue with 'outside' strings that contain
surrogates. To my mind it just moves the punning from rare illegal
strings to merely uncommon but legal characters.

  - Some parties think it would be better to not return strings from
os.listdir but a subclass of string (or at least a duck-type of
string) that knows where it came from and is also handily
recognisable as not-really-a-string for purposes of deciding
whether it is PEP-funny-encoded by direct inspection.

Cheers,
-- 
Cameron Simpson c...@zip.com.au DoD#743
http://www.cskk.ezoshosting.com/cs/

The peever can look at the best day in his life and sneer at it.
- Jim Hill, JennyGfest '95


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Aahz
On Thu, Apr 30, 2009, Cameron Simpson wrote:

 The lengthy discussion mostly revolves around:
 
   - Glenn points out that strings that came _not_ from listdir, and that are
 _not_ well-formed unicode (== have bare surrogates in them) but that
 were intended for use as filenames will conflict with the PEP's scheme -
 programs must know that these strings came from outside and must be
 translated into the PEP's funny-encoding before use in the os.*
 functions. Previous to the PEP they would get used directly and
 encode differently after the PEP, thus producing different POSIX
 filenames. Breakage.
 
   - Glenn would like the encoding to use Unicode scalar values only,
 using a rare-in-filenames character.
 That would avoid the issue with 'outside' strings that contain
 surrogates. To my mind it just moves the punning from rare illegal
 strings to merely uncommon but legal characters.
 
   - Some parties think it would be better to not return strings from
 os.listdir but a subclass of string (or at least a duck-type of
 string) that knows where it came from and is also handily
 recognisable as not-really-a-string for purposes of deciding
 whether it is PEP-funny-encoded by direct inspection.

Assuming people agree that this is an accurate summary, it should be
incorporated into the PEP.
-- 
Aahz (a...@pythoncraft.com)   * http://www.pythoncraft.com/

If you think it's expensive to hire a professional to do the job, wait
until you hire an amateur.  --Red Adair


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Terry Reedy

Glenn Linderman wrote:
On approximately 4/29/2009 1:28 PM, came the following characters from 



So where is the ambiguity here?


None.  But not everyone can read all the Python source code to try to 
understand it; they expect the documentation to help them avoid that. 
Because the documentation is lacking in this area, it makes your 
concisely stated PEP rather hard to understand.


If you think a section of the doc is grossly inadequate, and there is no 
existing issue on the tracker, feel free to add one.


Thanks for clarifying the Windows behavior, here.  A little more 
clarification in the PEP could have avoided lots of discussion.  It 
would seem that a PEP, proposed to modify a poorly documented (and 
therefore likely poorly understood) area, should be educational about 
the status quo, as well as presenting the suggested change.


Where the PEP proposes to change, it should start with the status quo. 
But Martin's somewhat reasonable position is that since he is not 
proposing to change behavior on Windows, it is not his responsibility to 
document what he is not proposing to change more adequately.  This 
means, of course, that any observed change on Windows would then be a 
bug, or at least a break of the promise.  On the other hand, I can see 
that this is enough related to what he is proposing to change that 
better doc would help.


tjr



Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Martin v. Löwis
 How do get a printable unicode version of these path strings if they
 contain none unicode data?

Define "printable". One way would be to use a regular expression,
replacing all codes in a certain range with a question mark.
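For instance (one possible spelling of that idea, using the PEP's escape
range):

    import re
    def printable(name):
        # replace the PEP's escape code points U+DC80..U+DCFF with '?'
        return re.sub('[\udc80-\udcff]', '?', name)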

 I'm guessing that an app has to understand that filenames come in two forms
 unicode and bytes if its not utf-8 data. Why not simply return string if
 its valid utf-8 otherwise return bytes?

That would have been an alternative solution, and the one that 2.x uses
for listdir. People didn't like it.
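(That alternative would look roughly like this per name -- a sketch, not the
actual 2.x implementation:)

    def str_or_bytes(raw, encoding):
        # return str when the name decodes, otherwise hand back the raw bytes
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            return raw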

Regards,
Martin


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Martin v. Löwis
 Thanks for clarifying the Windows behavior, here.  A little more
 clarification in the PEP could have avoided lots of discussion.  It
 would seem that a PEP, proposed to modify a poorly documented (and
 therefore likely poorly understood) area, should be educational about
 the status quo, as well as presenting the suggested change.  Or is it
 the Python philosophy that the PEPs should be as incomprehensible as
 possible, to generate large discussions?

Certainly not. See PEP 277 for a description of a specification of
how file names are handled on Windows.

Large discussions could be reduced if readers would try to
constructively comment on the PEP, rather than making counter-proposals,
or making statements about the PEP without making their implied
assumptions explicit.

Regards,
Martin


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Martin v. Löwis
James Y Knight wrote:
 Hopefully it can be assumed that your locale encoding really is a
 non-overlapping superset of ASCII, as is required by POSIX...

Can you please point to the part of the POSIX spec that says that
such overlapping is forbidden?

 I'm a bit scared at the prospect that U+DCAF could turn into /, that
 just screams security vulnerability to me.  So I'd like to propose that
 only 0x80-0xFF - U+DC80-U+DCFF should ever be allowed to be
 encoded/decoded via the error handler.

It would actually be U+DC2F that would turn into '/'.
I'm happy to exclude that range from the mapping if POSIX really
requires an encoding not to be overlapping with ASCII.

Regards,
Martin


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Glenn Linderman
On approximately 4/27/2009 7:11 PM, came the following characters from 
the keyboard of Cameron Simpson:

On 27Apr2009 18:15, Glenn Linderman v+pyt...@g.nevcal.com wrote:
  

The problem with this, and other preceding schemes that have been
discussed here, is that there is no means of ascertaining whether a
particular file name str was obtained from a str API, or was funny-
decoded from a bytes API... and thus, there is no means of reliably
ascertaining whether a particular filename str should be passed to a
str API, or funny-encoded back to bytes.



Why is it necessary that you are able to make this distinction?
  
  
It is necessary that programs (not me) can make the distinction, so 
that  it knows whether or not to do the funny-encoding or not.



I would say this isn't so. It's important that programs know if they're
dealing with strings-for-filenames, but not that they be able to figure
that out a priori if handed a bare string (especially since they
can't:-)
  
So you agree they can't... that there are data puns.   (OK, you may not  
have thought that through)



I agree you can't examine a string and know if it came from the os.* munging
or from someone else's munging.

I totally disagree that this is a problem.

There may be puns. So what? Use the right strings for the right purpose
and all will be well.

I think what is missing here, and missing from Martin's PEP, is some
utility functions for the os.* namespace.

PROPOSAL: add to the PEP the following functions:

  os.fsdecode(bytes) -> funny-encoded Unicode
This is what os.listdir() does to produce the strings it hands out.
  os.fsencode(funny-string) -> bytes
This is what open(filename,..) does to turn the filename into bytes
for the POSIX open.
  os.pathencode(your-string) -> funny-encoded-Unicode
This is what you must do to a de novo string to turn it into a
string suitable for use by open.
Importantly, for most strings not hand crafted to have weird
sequences in them, it is a no-op. But it will recode your puns
for survival.

and for me, I would like to see:

  os.setfilesystemencoding(coding)

Currently os.getfilesystemencoding() returns you the encoding based on
the current locale, and (I trust) the os.* stuff encodes on that basis.
setfilesystemencoding() would override that, unless coding==None, in which
case it reverts to the former "use the user's current locale" behaviour.
(We have locale "C" for what one might otherwise expect None to mean:-)

The idea here is to let the program control the codec used for filenames
for special purposes, without working indirectly through the locale.
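A minimal sketch of the first two helpers, assuming the PEP's error handler
(spelled 'surrogateescape' here purely as a placeholder name, not an existing
API):

    import sys

    def fsdecode(raw):
        return raw.decode(sys.getfilesystemencoding(), 'surrogateescape')

    def fsencode(name):
        return name.encode(sys.getfilesystemencoding(), 'surrogateescape')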

  
If a name is  funny-decoded when the name is accessed by a directory 
listing, it needs  to be funny-encoded in order to open the file.


Hmm. I had thought that legitimate unicode strings already get transcoded
to bytes via the mapping specified by sys.getfilesystemencoding()
(the user's locale). That already happens I believe, and Martin's
scheme doesn't change this. He's just funny-encoding non-decodable byte
sequences, not the decoded stuff that surrounds them.
  
So assume a non-decodable sequence in a name.  That puts us into  
Martin's funny-decode scheme.  His funny-decode scheme produces a bare  
string, indistinguishable from a bare string that would be produced by a  
str API that happens to contain that same sequence.  Data puns.



See my proposal above. Does it address your concerns? A program still
must know the providence of the string, and _if_ you're working with
non-decodable sequences in a names then you should transmute then into
the funny encoding using the os.pathencode() function described above.

In this way the punning issue can be avoided.

_Lacking_ such a function, your punning concern is valid.
  


Seems like one would also desire os.pathdecode to do the reverse.  And 
also versions that take or produce bytes from funny-encoded strings.


Then, if programs were re-coded to perform these transformations on what 
you call de novo strings, then the scheme would work.


But I think a large part of the incentive for the PEP is to try to 
invent a scheme that intentionally allows for the puns, so that programs 
do not need to be recoded in this manner, and yet still work.  I don't 
think such a scheme exists.


If there is going to be a required transformation from de novo strings 
to funny-encoded strings, then why not make one that people can actually 
see and compare and decode from the displayable form, by using 
displayable characters instead of lone surrogates?



So when open is handed the string, should it open the file with the name  
that matches the string, or the file with the name that funny-decodes to  
the same string?  It can't know, unless it knows that the string is a  
funny-decoded string or not.



True. open() should always expect a funny-encoded name.

  

So it is already the case that strings get decoded to 

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Martin v. Löwis
 Does the PEP take into consideration the normalising behaviour of Mac
 OSX ? We've had some ongoing challenges in bzr related to this with bzr.

No, that's completely out of scope, AFAICT. I don't even know what the
issues are, so I'm not able to propose a solution, at the moment.

Regards,
Martin


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Paul Moore
2009/4/28 Glenn Linderman v+pyt...@g.nevcal.com:
 So assume a non-decodable sequence in a name.  That puts us into Martin's
 funny-decode scheme.  His funny-decode scheme produces a bare string,
 indistinguishable from a bare string that would be produced by a str API
 that happens to contain that same sequence.  Data puns.

 So when open is handed the string, should it open the file with the name
 that matches the string, or the file with the name that funny-decodes to the
 same string?  It can't know, unless it knows that the string is a
 funny-decoded string or not.

Sorry for picking on Glenn's comment - it's only one of many in this
thread. But it seems to me that there is an assumption that problems
will arise when code gets a potentially funny-decoded string and
doesn't know where it came from.

Is that a real concern? How many programs really don't know where
their data came from? Maybe a general-purpose library routine *might*
just need to document explicitly how it handles funny-encoded data (I
can't actually imagine anything that would, but I'll concede it may be
possible) but that's just a matter of documenting your assumptions -
no better or worse than many other cases.

This all sounds similar to the idea of tainted data in security - if
you lose track of untrusted data from the environment, you expose
yourself to potential security issues. So the same techniques should
be relevant here (including ignoring it if your application isn't such
that it's a concern!)

I've yet to hear anyone claim that they would have an actual problem
with a specific piece of code they have written. (NB, if such a claim
has been made, feel free to point me to it - I admit I've been
skimming this thread at times).

Paul.


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Paul Moore
2009/4/28 Antoine Pitrou solip...@pitrou.net:
 Paul Moore p.f.moore at gmail.com writes:

 I've yet to hear anyone claim that they would have an actual problem
 with a specific piece of code they have written.

 Yep, that's the problem. Lots of theoretical problems noone has ever 
 encountered
 brought up against a PEP which resolves some actual problems people encounter 
 on
 a regular basis.

 For the record, I'm +1 on the PEP being accepted and implemented as soon as
 possible (preferably before 3.1).

In case it's not clear, I am also +1 on the PEP as it stands.

Paul.


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Michael Foord

Paul Moore wrote:

2009/4/28 Antoine Pitrou solip...@pitrou.net:
  

Paul Moore p.f.moore at gmail.com writes:


I've yet to hear anyone claim that they would have an actual problem
with a specific piece of code they have written.
  

Yep, that's the problem. Lots of theoretical problems noone has ever encountered
brought up against a PEP which resolves some actual problems people encounter on
a regular basis.

For the record, I'm +1 on the PEP being accepted and implemented as soon as
possible (preferably before 3.1).



In case it's not clear, I am also +1 on the PEP as it stands.
  


Me 2

Michael

Paul.
  



--
http://www.ironpythoninaction.com/



Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Ronald Oussoren

For what it's worth, the OSX API's seem to behave as follows:

* If you create a file with an non-UTF8 name on a HFS+ filesystem the  
system automaticly encodes the name.


That is,  open(chr(255), 'w') will silently create a file named '%FF'  
instead of the name you'd expect on a unix system.


* If you mount an NFS filesystem from a linux host and that directory  
contains a file named chr(255)


- unix-level tools will see a file with the expected name (just like  
on linux)
- Cocoa's NSFileManager returns u"?" as the filename, that is when the  
filename cannot be decoded using UTF-8 the name returned by the high- 
level API is mangled. This is regardless of the setting of LANG.
- I haven't found a way yet to access files whose names are not valid  
UTF-8 using the high-level Cocoa API's.


The latter two are interesting because Cocoa has a unicode filesystem  
API on top of a POSIX C-API, just like Python 3.x. I guess the chosen  
behaviour works out on OSX (where users are unlikely to run into this  
issue), but could be more problematic on other POSIX systems.


Ronald

On 28 Apr, 2009, at 14:03, Michael Foord wrote:


Paul Moore wrote:

2009/4/28 Antoine Pitrou solip...@pitrou.net:


Paul Moore p.f.moore at gmail.com writes:

I've yet to hear anyone claim that they would have an actual  
problem

with a specific piece of code they have written.

Yep, that's the problem. Lots of theoretical problems noone has  
ever encountered
brought up against a PEP which resolves some actual problems  
people encounter on

a regular basis.

For the record, I'm +1 on the PEP being accepted and implemented  
as soon as

possible (preferably before 3.1).



In case it's not clear, I am also +1 on the PEP as it stands.



Me 2

Michael

Paul.




--
http://www.ironpythoninaction.com/



Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Thomas Breuel

 Yep, that's the problem. Lots of theoretical problems noone has ever
 encountered
 brought up against a PEP which resolves some actual problems people
 encounter on
 a regular basis.


How can you bring up practical problems against something that hasn't been
implemented?

The fact that no other language or library does this is perhaps an
indication that it isn't the right thing to do.

But the biggest problem with the proposal is that it isn't needed: if you
want to be able to turn arbitrary byte sequences into unicode strings and
back, just set your encoding to iso8859-15.  That already works and it
doesn't require any changes.

Tom


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Hrvoje Niksic

Thomas Breuel wrote:
But the biggest problem with the proposal is that it isn't needed: if 
you want to be able to turn arbitrary byte sequences into unicode 
strings and back, just set your encoding to iso8859-15.  That already 
works and it doesn't require any changes.


Are you proposing to unconditionally encode file names as iso8859-15, or 
to do so only when undecodeable bytes are encountered?


If you unconditionally set encoding to iso8859-15, then you are 
effectively reverting to treating file names as bytes, regardless of the 
locale.  You're also angering a lot of European users who expect 
iso8859-2, etc.


If you switch to iso8859-15 only in the presence of undecodable UTF-8, 
then you have the same round-trip problem as the PEP: both b'\xff' and 
b'\xc3\xbf' will be converted to u'\u00ff' without a way to 
unambiguously recover the original file name.
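The collision is easy to demonstrate; both byte strings decode to the same
character, so the mapping cannot be inverted:

    >>> b'\xc3\xbf'.decode('utf-8')
    'ÿ'
    >>> b'\xff'.decode('iso8859-15')
    'ÿ'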



Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Lino Mastrodomenico
2009/4/28 Glenn Linderman v+pyt...@g.nevcal.com:
 The switch from PUA to half-surrogates does not resolve the issues with the
 encoding not being a 1-to-1 mapping, though.  The very fact that you  think
 you can get away with use of lone surrogates means that other people might,
 accidentally or intentionally, also use lone surrogates for some other
 purpose.  Even in file names.

It does solve this issue, because (unlike e.g. U+F01FF) '\udcff' is
not a valid Unicode character (not a character at all, really) and the
only way you can put this in a POSIX filename is if you use a very
lenient  UTF-8 encoder that gives you b'\xed\xb3\xbf'.

Since this byte sequence doesn't represent a valid character when
decoded with UTF-8, it should simply be considered an invalid UTF-8
sequence of three bytes and decoded to '\udced\udcb3\udcbf' (*not*
'\udcff').

Martin: maybe the PEP should say this explicitly?

Note that the round-trip works without ambiguities between '\udcff' in
the filename:

b'\xed\xb3\xbf' -> '\udced\udcb3\udcbf' -> b'\xed\xb3\xbf'

and b'\xff' in the filename, decoded by Python to '\udcff':

b'\xff' - '\udcff' - b'\xff'
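As a concrete check (a sketch: 'surrogateescape' stands in for the utf-8b
behaviour the PEP proposes, and a strict UTF-8 decoder that rejects encoded
surrogates is assumed):

    for raw in (b'\xed\xb3\xbf', b'\xff'):
        name = raw.decode('utf-8', 'surrogateescape')   # '\udced\udcb3\udcbf', '\udcff'
        assert name.encode('utf-8', 'surrogateescape') == raw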

-- 
Lino Mastrodomenico


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Hrvoje Niksic

Lino Mastrodomenico wrote:

Since this byte sequence [b'\xed\xb3\xbf'] doesn't represent a valid character 
when
decoded with UTF-8, it should simply be considered an invalid UTF-8
sequence of three bytes and decoded to '\udced\udcb3\udcbf' (*not*
'\udcff').


"Should be considered" or "will be considered"?  Python 3.0's UTF-8 
decoder happily accepts it and returns u'\udcff':


>>> b'\xed\xb3\xbf'.decode('utf-8')
'\udcff'

If the PEP depends on this being changed, it should be mentioned in the PEP.


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Lino Mastrodomenico
2009/4/28 Hrvoje Niksic hrvoje.nik...@avl.com:
 Lino Mastrodomenico wrote:

 Since this byte sequence [b'\xed\xb3\xbf'] doesn't represent a valid
 character when
 decoded with UTF-8, it should simply be considered an invalid UTF-8
 sequence of three bytes and decoded to '\udced\udcb3\udcbf' (*not*
 '\udcff').

 Should be considered or will be considered?  Python 3.0's UTF-8 decoder
 happily accepts it and returns u'\udcff':

 b'\xed\xb3\xbf'.decode('utf-8')
 '\udcff'

Only for the new utf-8b encoding (if Martin agrees), while the
existing utf-8 is fine as is (or at least waaay outside the scope of
this PEP).

-- 
Lino Mastrodomenico


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Michael Urman
On Mon, Apr 27, 2009 at 23:43, Stephen J. Turnbull step...@xemacs.org wrote:
 Nobody said we were at the stage of *saving* the [attachment]!

But speaking of saving files, I think that's the biggest hole in this
that has been nagging at the back of my mind. This PEP intends to
allow easy access to filenames and other environment strings which are
not restricted to known encodings. What happens if the detected
encoding changes? There may be difficulties de/serializing these
names, such as for an MRU list.

Since the serialization of the Unicode string is likely to use UTF-8,
and the string for  such a file will include half surrogates, the
application may raise an exception when encoding the names for a
configuration file. These encoding exceptions will be as rare as the
unusual names (which the careful I18N aware developer has probably
eradicated from his system), and thus will appear late.
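(A sketch of that failure mode, assuming a strict UTF-8 encoder that rejects
lone surrogates:)

    name = 'report\udcff.txt'        # hypothetical funny-decoded name
    try:
        data = name.encode('utf-8')  # a lone surrogate has no standard UTF-8 form
    except UnicodeEncodeError:
        print('cannot serialize this name as UTF-8')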

Or say de/serialization succeeds. Since the resulting Unicode string
differs depending on the encoding (which is a good thing; it is
supposed to make most cases mostly readable), when the filesystem
encoding changes (say from legacy to UTF-8), the name changes, and
deserialized references to it become stale.

This can probably be handled through careful use of the same
encoding/decoding scheme, if relevant, but that sounds like we've just
moved the problem from fs/environment access to serialization. Is that
good enough? For other uses the API knew whether it was
environmentally aware, but serialization probably will not. Should
this PEP make recommendations about how to save filenames in
configuration files?

-- 
Michael Urman


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Stephen J. Turnbull
Paul Moore writes:

  But it seems to me that there is an assumption that problems will
  arise when code gets a potentially funny-decoded string and doesn't
  know where it came from.
  
  Is that a real concern?

Yes, it's a real concern.  I don't think it's possible to show a small
piece of code one could point at and say without a better API I bet
you can't write this correctly, though.  Rather, my experience with
Emacs and various mail packages is that without type information it is
impossible to keep track of the myriad bits and pieces of text that
are recombining like pig flu, and eventually one breaks out and causes
an error.  It's usually easy to fix, but so are the next hundred
similar regressions, and in the meantime a hundred users have suffered
more or less damage or at least annoyance.

There's no question that dealing with escapes of funny-decoded strings
to unprepared code paths is mission creep compared to Martin's stated
purpose for PEP 383, but it is also a real problem.


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Martin v. Löwis
 It does solve this issue, because (unlike e.g. U+F01FF) '\udcff' is
 not a valid Unicode character (not a character at all, really) and the
 only way you can put this in a POSIX filename is if you use a very
 lenient  UTF-8 encoder that gives you b'\xed\xb3\xbf'.
 
 Since this byte sequence doesn't represent a valid character when
 decoded with UTF-8, it should simply be considered an invalid UTF-8
 sequence of three bytes and decoded to '\udced\udcb3\udcbf' (*not*
 '\udcff').
 
 Martin: maybe the PEP should say this explicitly?

Sure, will do.
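
For concreteness, a minimal sketch of the intended round trips, assuming
the error handler behaves the way it eventually shipped in Python 3.1
under the name 'surrogateescape' (i.e. the PEP's utf-8b over a strict
UTF-8 decoder):

# b'\xff' is invalid UTF-8 and comes back as a lone low surrogate:
assert b'\xff'.decode('utf-8', 'surrogateescape') == '\udcff'
assert '\udcff'.encode('utf-8', 'surrogateescape') == b'\xff'

# b'\xed\xb3\xbf' (a lenient encoding of U+DCFF) is rejected by the strict
# decoder and therefore escaped byte-for-byte:
assert b'\xed\xb3\xbf'.decode('utf-8', 'surrogateescape') == '\udced\udcb3\udcbf'
assert '\udced\udcb3\udcbf'.encode('utf-8', 'surrogateescape') == b'\xed\xb3\xbf'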

Regards,
Martin


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Martin v. Löwis
 Since the serialization of the Unicode string is likely to use UTF-8,
 and the string for  such a file will include half surrogates, the
 application may raise an exception when encoding the names for a
 configuration file. These encoding exceptions will be as rare as the
 unusual names (which the careful I18N aware developer has probably
 eradicated from his system), and thus will appear late.

There are trade-offs to any solution; if there was a solution without
trade-offs, it would be implemented already.

The Python UTF-8 codec will happily encode half-surrogates; people argue
that it is a bug that it does so. However, it would help in this
specific case.

An alternative that doesn't suffer from the risk of not being able to
store decoded strings would have been the use of PUA characters, but
people rejected it because of the potential ambiguities. So they clearly
dislike one risk more than the other. UTF-8b is primarily meant as
an in-memory representation.

 Or say de/serialization succeeds. Since the resulting Unicode string
 differs depending on the encoding (which is a good thing; it is
 supposed to make most cases mostly readable), when the filesystem
 encoding changes (say from legacy to UTF-8), the name changes, and
 deserialized references to it become stale.

That problem has nothing to do with the PEP. If the encoding changes,
LRU entries may get stale even if there were no encoding errors at
all. Suppose the old encoding was Latin-1, and the new encoding is
KOI8-R, then all file names are decodable before and afterwards, yet
the string representation changes. Applications that want to protect
themselves against that happening need to store byte representations
of the file names, not character representations. Depending on the
configuration file format, that may or may not be possible.

I find the case pretty artificial, though: if the locale encoding
changes, all file names will look incorrect to the user, so he'll
quickly switch back, or rename all the files. As an application
supporting an LRU list, I would remove/hide all entries that don't
correlate to existing files - after all, the user may have as well
deleted the file in the LRU list.
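
To illustrate the Latin-1 / KOI8-R point above (a sketch; 'surrogateescape'
is the name the error handler eventually got):

raw = b"\xc4\xc5\xcd\xcf"     # one and the same on-disk name
raw.decode("latin-1")         # -> 'ÄÅÍÏ'
raw.decode("koi8-r")          # -> 'демо'

# An application that wants stable references can store raw itself, e.g.
# obtained via name.encode(locale_encoding, "surrogateescape"), and re-decode
# it under whatever the locale encoding is when the list is reloaded.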

Regards,
Martin



Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Martin v. Löwis

 If the PEP depends on this being changed, it should be mentioned in the
 PEP.

The PEP says that the utf-8b codec decodes invalid bytes into low
surrogates. I have now clarified that a strict definition of UTF-8
is assumed for utf-8b.

Regards,
Martin


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread James Y Knight


On Apr 28, 2009, at 2:50 AM, Martin v. Löwis wrote:


James Y Knight wrote:

Hopefully it can be assumed that your locale encoding really is a
non-overlapping superset of ASCII, as is required by POSIX...


Can you please point to the part of the POSIX spec that says that
such overlapping is forbidden?


I can't find it...I would've thought it would be on this page:
http://opengroup.org/onlinepubs/007908775/xbd/charset.html
but it's not (at least, not obviously). That does say (effectively)  
that all encodings must be supersets of ASCII and use the same  
codepoints, though.


However, ISO-2022 being inappropriate for LC_CTYPE usage is the entire  
reason why EUC-JP was created, so I'm pretty sure that it is in fact  
inappropriate, and I cannot find any evidence of it ever being used on  
any system.


From http://en.wikipedia.org/wiki/EUC-JP:
To get the EUC form of an ISO-2022 character, the most significant  
bit of each 7-bit byte of the original ISO 2022 codes is set (by  
adding 128 to each of these original 7-bit codes); this allows  
software to easily distinguish whether a particular byte in a  
character string belongs to the ISO-646 code or the ISO-2022 (EUC)  
code.


Also:
http://www.cl.cam.ac.uk/~mgk25/ucs/iso2022-wc.html


I'm a bit scared at the prospect that U+DCAF could turn into /,  
that
just screams security vulnerability to me.  So I'd like to propose  
that

only 0x80-0xFF -> U+DC80-U+DCFF should ever be allowed to be
encoded/decoded via the error handler.


It would be actually U+DC2f that would turn into /.


Yes, I meant to say DC2F, sorry for the confusion.


I'm happy to exclude that range from the mapping if POSIX really
requires an encoding not to be overlapping with ASCII.


I think it has to be excluded from mapping in order to not introduce  
security issues.


However...

There's also SHIFT-JIS to worry about...which apparently some people  
actually want to use as their default encoding, despite it being  
broken to do so. RedHat apparently refuses to provide it as a locale  
charset (due to its brokenness), and it's also not available by  
default on my Debian system. People do unfortunately seem to actually  
use it in real life.


https://bugzilla.redhat.com/show_bug.cgi?id=136290

So, I'd like to propose this:
The python-escape error handler when given a non-decodable byte from  
0x80 to 0xFF will produce values of U+DC80 to U+DCFF. When given a non- 
decodable byte from 0x00 to 0x7F, it will be converted to
U+0000-U+007F. On the encoding side, values from U+DC80 to U+DCFF are encoded  
into 0x80 to 0xFF, and all other characters are treated in whatever  
way the encoding would normally treat them.


This proposal obviously works for all non-overlapping ASCII supersets,  
where 0x00 to 0x7F always decode to U+00 to U+7F. But it also works  
for Shift-JIS and other similar ASCII-supersets with overlaps in  
trailing bytes of a multibyte sequence. So, a sequence like  
"\x81\xFD".decode("shift-jis", "python-escape") will turn into  
u"\uDC81\u00fd". Which will then properly encode back into "\x81\xFD".
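
A rough sketch of such a handler (hypothetical; registered here under the
draft name python-escape, and assuming the codec machinery accepts a bytes
replacement on the encode side, as CPython's UTF-8 and ASCII codecs do):

import codecs

def python_escape(exc):
    # Decode side: map each undecodable byte to a single code point.
    if isinstance(exc, UnicodeDecodeError):
        b = exc.object[exc.start]
        repl = chr(0xDC00 + b) if b >= 0x80 else chr(b)
        return repl, exc.start + 1
    # Encode side: only U+DC80..U+DCFF map back to raw bytes 0x80..0xFF.
    if isinstance(exc, UnicodeEncodeError):
        out = bytearray()
        for ch in exc.object[exc.start:exc.end]:
            if not 0xDC80 <= ord(ch) <= 0xDCFF:
                raise exc
            out.append(ord(ch) - 0xDC00)
        return bytes(out), exc.end
    raise exc

codecs.register_error("python-escape", python_escape)

# e.g. b'\xff'.decode('ascii', 'python-escape') == '\udcff'
#  and '\udcff'.encode('ascii', 'python-escape') == b'\xff'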


The character sets this *doesn't* work for are: ebcdic code pages  
(obviously completely unsuitable for a locale encoding on unix),  
iso2022-* (covered above), and shift-jisx0213 (because it has replaced  
\ with yen, and ~ with overline).  


If it's desirable to work with shift_jisx0213, a modification of the  
proposal can be made: Change the second sentence to: When given a non- 
decodable byte from 0x00 to 0x7F, that byte must be the second or  
later byte in a multibyte sequence. In such a case, the error handler  
will produce the encoding of that byte if it was standing alone (thus  
in most encodings, \x00-\x7f turn into U+00-U+7F).


It sounds from https://bugzilla.novell.com/show_bug.cgi?id=162501 like  
some people do actually use shift_jisx0213, unfortunately.


James


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Glenn Linderman
On approximately 4/28/2009 10:00 AM, came the following characters from 
the keyboard of Martin v. Löwis:



An alternative that doesn't suffer from the risk of not being able to
store decoded strings would have been the use of PUA characters, but
people rejected it because of the potential ambiguities. So they clearly
dislike one risk more than the other. UTF-8b is primarily meant as
an in-memory representation.


The UTF-8b representation suffers from the same potential ambiguities as 
the PUA characters... perhaps slightly less likely in practice, due to 
the use of Unicode-illegal characters, but exactly the same theoretical 
likelihood in the space of Python-acceptable character codes.


--
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread MRAB

James Y Knight wrote:


On Apr 28, 2009, at 2:50 AM, Martin v. Löwis wrote:


James Y Knight wrote:

Hopefully it can be assumed that your locale encoding really is a
non-overlapping superset of ASCII, as is required by POSIX...


Can you please point to the part of the POSIX spec that says that
such overlapping is forbidden?


I can't find it...I would've thought it would be on this page:
http://opengroup.org/onlinepubs/007908775/xbd/charset.html
but it's not (at least, not obviously). That does say (effectively) that 
all encodings must be supersets of ASCII and use the same codepoints, 
though.


However, ISO-2022 being inappropriate for LC_CTYPE usage is the entire 
reason why EUC-JP was created, so I'm pretty sure that it is in fact 
inappropriate, and I cannot find any evidence of it ever being used on 
any system.


 From http://en.wikipedia.org/wiki/EUC-JP:
To get the EUC form of an ISO-2022 character, the most significant bit 
of each 7-bit byte of the original ISO 2022 codes is set (by adding 128 
to each of these original 7-bit codes); this allows software to easily 
distinguish whether a particular byte in a character string belongs to 
the ISO-646 code or the ISO-2022 (EUC) code.


Also:
http://www.cl.cam.ac.uk/~mgk25/ucs/iso2022-wc.html



I'm a bit scared at the prospect that U+DCAF could turn into /, that
just screams security vulnerability to me.  So I'd like to propose that
only 0x80-0xFF -> U+DC80-U+DCFF should ever be allowed to be
encoded/decoded via the error handler.


It would be actually U+DC2f that would turn into /.


Yes, I meant to say DC2F, sorry for the confusion.


I'm happy to exclude that range from the mapping if POSIX really
requires an encoding not to be overlapping with ASCII.


I think it has to be excluded from mapping in order to not introduce 
security issues.


However...

There's also SHIFT-JIS to worry about...which apparently some people 
actually want to use as their default encoding, despite it being broken 
to do so. RedHat apparently refuses to provide it as a locale charset 
(due to its brokenness), and it's also not available by default on my 
Debian system. People do unfortunately seem to actually use it in real 
life.


https://bugzilla.redhat.com/show_bug.cgi?id=136290

So, I'd like to propose this:
The python-escape error handler when given a non-decodable byte from 
0x80 to 0xFF will produce values of U+DC80 to U+DCFF. When given a 
non-decodable byte from 0x00 to 0x7F, it will be converted to 
U+0000-U+007F. On the encoding side, values from U+DC80 to U+DCFF are 
encoded into 0x80 to 0xFF, and all other characters are treated in 
whatever way the encoding would normally treat them.


This proposal obviously works for all non-overlapping ASCII supersets, 
where 0x00 to 0x7F always decode to U+00 to U+7F. But it also works for 
Shift-JIS and other similar ASCII-supersets with overlaps in trailing 
bytes of a multibyte sequence. So, a sequence like 
"\x81\xFD".decode("shift-jis", "python-escape") will turn into 
u"\uDC81\u00fd". Which will then properly encode back into "\x81\xFD".


The character sets this *doesn't* work for are: ebcdic code pages 
(obviously completely unsuitable for a locale encoding on unix), 
iso2022-* (covered above), and shift-jisx0213 (because it has replaced \ 
with yen, and ~ with overline).


If it's desirable to work with shift_jisx0213, a modification of the 
proposal can be made: Change the second sentence to: When given a 
non-decodable byte from 0x00 to 0x7F, that byte must be the second or 
later byte in a multibyte sequence. In such a case, the error handler 
will produce the encoding of that byte if it was standing alone (thus in 
most encodings, \x00-\x7f turn into U+00-U+7F).


It sounds from https://bugzilla.novell.com/show_bug.cgi?id=162501 like 
some people do actually use shift_jisx0213, unfortunately.



I've been thinking of python-escape only in terms of UTF-8, the only
encoding mentioned in the PEP. In UTF-8, bytes 0x00 to 0x7F are
decodable.

But if you're talking about using it with other encodings, eg
shift-jisx0213, then I'd suggest the following:

1. Bytes 0x00 to 0xFF which can't normally be decoded are decoded to
half surrogates U+DC00 to U+DCFF.

2. Bytes which would have decoded to half surrogates U+DC00 to U+DCFF
are treated as though they are undecodable bytes.

3. Half surrogates U+DC00 to U+DCFF which can be produced by decoding
are encoded to bytes 0x00 to 0xFF.

4. Codepoints, including half surrogates U+DC00 to U+DCFF, which can't
be produced by decoding raise an exception.

I think I've covered all the possibilities. :-)


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Zooko O'Whielacronx

On Apr 28, 2009, at 6:46 AM, Hrvoje Niksic wrote:

Are you proposing to unconditionally encode file names as  
iso8859-15, or to do so only when undecodeable bytes are encountered?


For what it is worth, what we have previously planned to do for the  
Tahoe project is the second of these -- decode using some 1-byte  
encoding such as iso-8859-1, iso-8859-15, or windows-1252 only in the  
case that attempting to decode the bytes using the local alleged  
encoding failed.


If you switch to iso8859-15 only in the presence of undecodable  
UTF-8, then you have the same round-trip problem as the PEP: both  
b'\xff' and b'\xc3\xbf' will be converted to u'\u00ff' without a  
way to unambiguously recover the original file name.


Why do you say that?  It seems to work as I expected here:

 '\xff'.decode('iso-8859-15')
u'\xff'
 '\xc3\xbf'.decode('iso-8859-15')
u'\xc3\xbf'



 '\xff'.decode('cp1252')
u'\xff'
 '\xc3\xbf'.decode('cp1252')
u'\xc3\xbf'

Regards,

Zooko


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Glenn Linderman
On approximately 4/28/2009 10:53 AM, came the following characters from 
the keyboard of James Y Knight:


On Apr 28, 2009, at 2:50 AM, Martin v. Löwis wrote:


James Y Knight wrote:

Hopefully it can be assumed that your locale encoding really is a
non-overlapping superset of ASCII, as is required by POSIX...


Can you please point to the part of the POSIX spec that says that
such overlapping is forbidden?


I can't find it...I would've thought it would be on this page:
http://opengroup.org/onlinepubs/007908775/xbd/charset.html
but it's not (at least, not obviously). That does say (effectively) that 
all encodings must be supersets of ASCII and use the same codepoints, 
though.


However, ISO-2022 being inappropriate for LC_CTYPE usage is the entire 
reason why EUC-JP was created, so I'm pretty sure that it is in fact 
inappropriate, and I cannot find any evidence of it ever being used on 
any system.



It would seem from the definition of ISO-2022 that what it calls escape 
sequences is in your POSIX spec called locking-shift encoding. 
Therefore, the second bullet item under the Character Encoding heading 
prohibits use of ISO-2022, for whatever uses that document defines 
(which, since you referenced it, I assume means locales, and possibly 
file system encodings, but I'm not familiar with the structure of all 
the POSIX standards documents).


A locking-shift encoding (where the state of the character is determined 
by a shift code that may affect more than the single character following 
it) cannot be defined with the current character set description file 
format. Use of a locking-shift encoding with any of the standard 
utilities in the XCU specification or with any of the functions in the 
XSH specification that do not specifically mention the effects of 
state-dependent encoding is implementation-dependent.





 From http://en.wikipedia.org/wiki/EUC-JP:
To get the EUC form of an ISO-2022 character, the most significant bit 
of each 7-bit byte of the original ISO 2022 codes is set (by adding 128 
to each of these original 7-bit codes); this allows software to easily 
distinguish whether a particular byte in a character string belongs to 
the ISO-646 code or the ISO-2022 (EUC) code.


Also:
http://www.cl.cam.ac.uk/~mgk25/ucs/iso2022-wc.html



I'm a bit scared at the prospect that U+DCAF could turn into /, that
just screams security vulnerability to me.  So I'd like to propose that
only 0x80-0xFF -> U+DC80-U+DCFF should ever be allowed to be
encoded/decoded via the error handler.


It would be actually U+DC2f that would turn into /.


Yes, I meant to say DC2F, sorry for the confusion.


I'm happy to exclude that range from the mapping if POSIX really
requires an encoding not to be overlapping with ASCII.


I think it has to be excluded from mapping in order to not introduce 
security issues.


However...

There's also SHIFT-JIS to worry about...which apparently some people 
actually want to use as their default encoding, despite it being broken 
to do so. RedHat apparently refuses to provide it as a locale charset 
(due to its brokenness), and it's also not available by default on my 
Debian system. People do unfortunately seem to actually use it in real 
life.


https://bugzilla.redhat.com/show_bug.cgi?id=136290

So, I'd like to propose this:
The python-escape error handler when given a non-decodable byte from 
0x80 to 0xFF will produce values of U+DC80 to U+DCFF. When given a 
non-decodable byte from 0x00 to 0x7F, it will be converted to 
U+0000-U+007F. On the encoding side, values from U+DC80 to U+DCFF are 
encoded into 0x80 to 0xFF, and all other characters are treated in 
whatever way the encoding would normally treat them.


This proposal obviously works for all non-overlapping ASCII supersets, 
where 0x00 to 0x7F always decode to U+00 to U+7F. But it also works for 
Shift-JIS and other similar ASCII-supersets with overlaps in trailing 
bytes of a multibyte sequence. So, a sequence like 
"\x81\xFD".decode("shift-jis", "python-escape") will turn into 
u"\uDC81\u00fd". Which will then properly encode back into "\x81\xFD".


The character sets this *doesn't* work for are: ebcdic code pages 
(obviously completely unsuitable for a locale encoding on unix), 



Why is that obvious?  The only thing I saw that could exclude EBCDIC 
would be the requirement that the codes be positive in a char, but on a 
system where the C compiler treats char as unsigned, EBCDIC would qualify.


Of course, the use of EBCDIC would also restrict the other possible code 
pages to those derived from EBCDIC (rather than the bulk of code pages 
that are derived from ASCII), due to:


If the encoded values associated with each member of the portable 
character set are not invariant across all locales supported by the 
implementation, the results achieved by an application accessing those 
locales are unspecified.



iso2022-* (covered above), and shift-jisx0213 (because it has replaced \ 
with yen, and ~ with overline).


If it's 

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Martin v. Löwis
 The UTF-8b representation suffers from the same potential ambiguities as
 the PUA characters... 

Not at all the same ambiguities. Here, again, the two choices:

A. use PUA characters to represent undecodable bytes, in particular for
   UTF-8 (the PEP actually never proposed this to happen).
   This introduces an ambiguity: two different files in the same
   directory may decode to the same string name, if one has the PUA
   character, and the other has a non-decodable byte that gets decoded
   to the same PUA character.

B. use UTF-8b, representing the bytes with ill-formed surrogate codes.
   The same ambiguity does *NOT* exist. If a file on disk already
   contains an invalid surrogate code in its file name, then the UTF-8b
   decoder will recognize this as invalid, and decode it byte-for-byte,
   into three surrogate codes. Hence, the file names that are different
   on disk are also different in memory. No ambiguity.
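
For what it is worth, a short sketch of point B with the handler as it
later shipped ('surrogateescape'): the name that already contains the
encoded surrogate and the name with the raw undecodable byte decode to
different strings.

assert (b"\xed\xb3\xbf".decode("utf-8", "surrogateescape")
        != b"\xff".decode("utf-8", "surrogateescape"))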

Regards,
Martin


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Glenn Linderman
On approximately 4/28/2009 11:55 AM, came the following characters from 
the keyboard of MRAB:

I've been thinking of python-escape only in terms of UTF-8, the only
encoding mentioned in the PEP. In UTF-8, bytes 0x00 to 0x7F are
decodable.



UTF-8 is only mentioned in the sense of having special handling for 
re-encoding; all the other locales/encodings are implicit.  But I also 
went down that path to some extent.




But if you're talking about using it with other encodings, eg
shift-jisx0213, then I'd suggest the following:

1. Bytes 0x00 to 0xFF which can't normally be decoded are decoded to
half surrogates U+DC00 to U+DCFF.



This makes 256 different escape codes.



2. Bytes which would have decoded to half surrogates U+DC00 to U+DCFF
are treated as though they are undecodable bytes.



This provides escaping for the 256 different escape codes, which is 
lacking from the PEP.




3. Half surrogates U+DC00 to U+DCFF which can be produced by decoding
are encoded to bytes 0x00 to 0xFF.



This reverses the escaping.



4. Codepoints, including half surrogates U+DC00 to U+DCFF, which can't
be produced by decoding raise an exception.



This is confusing.  Did you mean excluding instead of including?



I think I've covered all the possibilities. :-)



You might have.  Seems like there could be a simpler scheme, though...

1. Define an escape codepoint.  It could be U+003F or U+DC00 or U+F817 
or pretty much any defined Unicode codepoint outside the range U+0100 to 
U+01FF (see rule 3 for why).  Only one escape codepoint is needed, this 
is easier for humans to comprehend.


2. When the escape codepoint is decoded from the byte stream for a bytes 
interface or found in a str on the str interface, double it.


3. When an undecodable byte 0xPQ is found, decode to the escape 
codepoint, followed by codepoint U+01PQ, where P and Q are hex digits.


4. When encoding, a sequence of two escape codepoints would be encoded 
as one escape codepoint, and a sequence of the escape codepoint followed 
by codepoint U+01PQ would be encoded as byte 0xPQ.  Escape codepoints 
not followed by the escape codepoint, or by a codepoint in the range 
U+0100 to U+01FF would raise an exception.


5. Provide functions that will perform the same decoding and encoding as 
would be done by the system calls, for both bytes and str interfaces.
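
A sketch of the encoding direction of this scheme (rules 2 and 4), plus the
trivial str-side escaping from rule 2; the function names and the choice of
U+003F are illustrative only:

ESC = "\u003f"   # any codepoint outside U+0100-U+01FF would do

def escape_str(name):
    # str interface: a literal escape codepoint is doubled (rule 2).
    return name.replace(ESC, ESC + ESC)

def unescape_to_bytes(text, encoding):
    # Reverse the escaping and recover the original byte sequence (rule 4).
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != ESC:
            out += ch.encode(encoding)
            i += 1
            continue
        if i + 1 >= len(text):
            raise ValueError("lone escape codepoint at end of string")
        nxt = text[i + 1]
        if nxt == ESC:                       # doubled escape -> one literal ESC
            out += ESC.encode(encoding)
        elif 0x0100 <= ord(nxt) <= 0x01FF:   # ESC + U+01PQ -> raw byte 0xPQ
            out.append(ord(nxt) - 0x0100)
        else:
            raise ValueError("escape codepoint not followed by ESC or U+01xx")
        i += 2
    return bytes(out)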



This differs from my previous proposal in three ways:

A. Doesn't put a marker at the beginning of the string (which I said 
wasn't necessary even then).


B. Allows for a choice of escape codepoint, the previous proposal 
suggested a specific one.  But the final solution will only have a 
single one, not a user choice, but an implementation choice.


C. Uses the range U+0100 to U+01FF for the escape codes, rather than 
U+ to U+00FF.  This avoids introducing the NULL character and escape 
characters into the decoded str representation, yet still uses 
characters for which glyphs are commonly available, are non-combining, 
and are easily distinguishable one from another.


Rationale:

The use of codepoints with visible glyphs makes the escaped string 
friendlier to display systems, and to people.  I still recommend using 
U+003F as the escape codepoint, but certainly one with a typically 
visible glyph available.  This avoids what I consider to be an annoyance 
with the PEP, that the codepoints used are not ones that are easily 
displayed, so undecodable names could easily result in long strings of 
indistinguishable substitution characters.


It, like MRAB's proposal, also avoids data puns, which is a major 
problem with the PEP.  I consider this proposal to be easier to 
understand than MRAB's proposal, or the PEP, because of the single 
escape codepoint and the use of visible characters.


This proposal, like my initial one, also decodes and encodes (just the 
escape codes) values on the str interfaces.  This is necessary to avoid 
data puns on systems that provide both types of interfaces.


This proposal could be used for programs that use str values, and easily 
migrates to a solution that provides an object that provides an 
abstraction for system interfaces that have two forms.



--
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Glenn Linderman
On approximately 4/28/2009 6:01 AM, came the following characters from 
the keyboard of Lino Mastrodomenico:

2009/4/28 Glenn Linderman v+pyt...@g.nevcal.com:

The switch from PUA to half-surrogates does not resolve the issues with the
encoding not being a 1-to-1 mapping, though.  The very fact that you  think
you can get away with use of lone surrogates means that other people might,
accidentally or intentionally, also use lone surrogates for some other
purpose.  Even in file names.


It does solve this issue, because (unlike e.g. U+F01FF) '\udcff' is
not a valid Unicode character (not a character at all, really) and the
only way you can put this in a POSIX filename is if you use a very
lenient  UTF-8 encoder that gives you b'\xed\xb3\xbf'.



Wrong.

An 8859-1 locale allows any byte sequence to be placed into a POSIX filename.

And while U+DCFF is illegal alone in Unicode, it is not illegal in 
Python str values.  And from my testing, Python 3's current UTF-8 
encoder will happily provide exactly the bytes value you mention when 
given U+DCFF.




Since this byte sequence doesn't represent a valid character when
decoded with UTF-8, it should simply be considered an invalid UTF-8
sequence of three bytes and decoded to '\udced\udcb3\udcbf' (*not*
'\udcff').

Martin: maybe the PEP should say this explicitly?

Note that the round-trip works without ambiguities between '\udcff' in
the filename:

b'\xed\xb3\xbf' -> '\udced\udcb3\udcbf' -> b'\xed\xb3\xbf'

and b'\xff' in the filename, decoded by Python to '\udcff':

b'\xff' -> '\udcff' -> b'\xff'



Others have made this suggestion, and it is helpful to the PEP, but not 
sufficient.  As implemented as an error handler, I'm not sure that the 
b'\xed\xb3\xbf' sequence would trigger the error handler, if the UTF-8 
decoder is happy with it.  Which, in my testing, it is.



--
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Glenn Linderman
On approximately 4/28/2009 1:25 PM, came the following characters from 
the keyboard of Martin v. Löwis:

The UTF-8b representation suffers from the same potential ambiguities as
the PUA characters... 


Not at all the same ambiguities. Here, again, the two choices:

A. use PUA characters to represent undecodable bytes, in particular for
   UTF-8 (the PEP actually never proposed this to happen).
   This introduces an ambiguity: two different files in the same
   directory may decode to the same string name, if one has the PUA
   character, and the other has a non-decodable byte that gets decoded
   to the same PUA character.

B. use UTF-8b, representing the bytes with ill-formed surrogate codes.
   The same ambiguity does *NOT* exist. If a file on disk already
   contains an invalid surrogate code in its file name, then the UTF-8b
   decoder will recognize this as invalid, and decode it byte-for-byte,
   into three surrogate codes. Hence, the file names that are different
   on disk are also different in memory. No ambiguity.


C. File on disk with the invalid surrogate code, accessed via the str 
interface, no decoding happens, matches in memory the file on disk with 
the byte that translates to the same surrogate, accessed via the bytes 
interface.  Ambiguity.


--
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Martin v. Löwis
 Others have made this suggestion, and it is helpful to the PEP, but not
 sufficient.  As implemented as an error handler, I'm not sure that the
 b'\xed\xb3\xbf' sequence would trigger the error handler, if the UTF-8
 decoder is happy with it.  Which, in my testing, it is.

Rest assured that the utf-8b codec will work the way it is specified.

Regards,
Martin


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread MRAB

Glenn Linderman wrote:
On approximately 4/28/2009 11:55 AM, came the following characters from 
the keyboard of MRAB:

I've been thinking of python-escape only in terms of UTF-8, the only
encoding mentioned in the PEP. In UTF-8, bytes 0x00 to 0x7F are
decodable.



UTF-8 is only mentioned in the sense of having special handling for 
re-encoding; all the other locales/encodings are implicit.  But I also 
went down that path to some extent.




But if you're talking about using it with other encodings, eg
shift-jisx0213, then I'd suggest the following:

1. Bytes 0x00 to 0xFF which can't normally be decoded are decoded to
half surrogates U+DC00 to U+DCFF.



This makes 256 different escape codes.



Speaking personally, I won't call them 'escape codes'. I'd use the term
'escape code' to mean a character that changes the interpretation of the
next character(s).


2. Bytes which would have decoded to half surrogates U+DC00 to U+DCFF
are treated as though they are undecodable bytes.



This provides escaping for the 256 different escape codes, which is 
lacking from the PEP.




3. Half surrogates U+DC00 to U+DCFF which can be produced by decoding
are encoded to bytes 0x00 to 0xFF.



This reverses the escaping.



4. Codepoints, including half surrogates U+DC00 to U+DCFF, which can't
be produced by decoding raise an exception.



This is confusing.  Did you mean excluding instead of including?


Perhaps I should've said "Any codepoint which can't be produced by
decoding should raise an exception."

For example, decoding with UTF-8b will never produce U+DC00, therefore
attempting to encode U+DC00 should raise an exception and not produce
0x00.
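
For what it's worth, this matches the behaviour of the handler as it later
shipped ('surrogateescape'); a sketch:

# '\udcff' can be produced by decoding, so it encodes back to the raw byte:
assert "\udcff".encode("utf-8", "surrogateescape") == b"\xff"

# '\udc00' can never be produced by decoding, so encoding it is refused:
try:
    "\udc00".encode("utf-8", "surrogateescape")
except UnicodeEncodeError:
    pass    # raised, as the rule above asks for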




I think I've covered all the possibilities. :-)



You might have.  Seems like there could be a simpler scheme, though...

1. Define an escape codepoint.  It could be U+003F or U+DC00 or U+F817 
or pretty much any defined Unicode codepoint outside the range U+0100 to 
U+01FF (see rule 3 for why).  Only one escape codepoint is needed, this 
is easier for humans to comprehend.


2. When the escape codepoint is decoded from the byte stream for a bytes 
interface or found in a str on the str interface, double it.


3. When an undecodable byte 0xPQ is found, decode to the escape 
codepoint, followed by codepoint U+01PQ, where P and Q are hex digits.


4. When encoding, a sequence of two escape codepoints would be encoded 
as one escape codepoint, and a sequence of the escape codepoint followed 
by codepoint U+01PQ would be encoded as byte 0xPQ.  Escape codepoints 
not followed by the escape codepoint, or by a codepoint in the range 
U+0100 to U+01FF would raise an exception.


5. Provide functions that will perform the same decoding and encoding as 
would be done by the system calls, for both bytes and str interfaces.



This differs from my previous proposal in three ways:

A. Doesn't put a marker at the beginning of the string (which I said 
wasn't necessary even then).


B. Allows for a choice of escape codepoint, the previous proposal 
suggested a specific one.  But the final solution will only have a 
single one, not a user choice, but an implementation choice.


C. Uses the range U+0100 to U+01FF for the escape codes, rather than 
U+ to U+00FF.  This avoids introducing the NULL character and escape 
characters into the decoded str representation, yet still uses 
characters for which glyphs are commonly available, are non-combining, 
and are easily distinguishable one from another.


Rationale:

The use of codepoints with visible glyphs makes the escaped string 
friendlier to display systems, and to people.  I still recommend using 
U+003F as the escape codepoint, but certainly one with a typically 
visible glyph available.  This avoids what I consider to be an annoyance 
with the PEP, that the codepoints used are not ones that are easily 
displayed, so undecodable names could easily result in long 
indistinguishable substitution characters.



Perhaps the escape character should be U+005C. ;-)

It, like MRAB's proposal, also avoids data puns, which is a major 
problem with the PEP.  I consider this proposal to be easier to 
understand than MRAB's proposal, or the PEP, because of the single 
escape codepoint and the use of visible characters.


This proposal, like my initial one, also decodes and encodes (just the 
escape codes) values on the str interfaces.  This is necessary to avoid 
data puns on systems that provide both types of interfaces.


This proposal could be used for programs that use str values, and easily 
migrates to a solution that provides an object that provides an 
abstraction for system interfaces that have two forms.






Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Martin v. Löwis
Glenn Linderman wrote:
 On approximately 4/28/2009 1:25 PM, came the following characters from
 the keyboard of Martin v. Löwis:
 The UTF-8b representation suffers from the same potential ambiguities as
 the PUA characters... 

 Not at all the same ambiguities. Here, again, the two choices:

 A. use PUA characters to represent undecodable bytes, in particular for
UTF-8 (the PEP actually never proposed this to happen).
This introduces an ambiguity: two different files in the same
directory may decode to the same string name, if one has the PUA
character, and the other has a non-decodable byte that gets decoded
to the same PUA character.

 B. use UTF-8b, representing the bytes with ill-formed surrogate codes.
The same ambiguity does *NOT* exist. If a file on disk already
contains an invalid surrogate code in its file name, then the UTF-8b
decoder will recognize this as invalid, and decode it byte-for-byte,
into three surrogate codes. Hence, the file names that are different
on disk are also different in memory. No ambiguity.
 
 C. File on disk with the invalid surrogate code, accessed via the str
 interface, no decoding happens, matches in memory the file on disk with
 the byte that translates to the same surrogate, accessed via the bytes
 interface.  Ambiguity.

Is that an alternative to A and B?

Regards,
Martin
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Glenn Linderman
On approximately 4/28/2009 2:02 PM, came the following characters from 
the keyboard of Martin v. Löwis:

Glenn Linderman wrote:

On approximately 4/28/2009 1:25 PM, came the following characters from
the keyboard of Martin v. Löwis:

The UTF-8b representation suffers from the same potential ambiguities as
the PUA characters... 

Not at all the same ambiguities. Here, again, the two choices:

A. use PUA characters to represent undecodable bytes, in particular for
   UTF-8 (the PEP actually never proposed this to happen).
   This introduces an ambiguity: two different files in the same
   directory may decode to the same string name, if one has the PUA
   character, and the other has a non-decodable byte that gets decoded
   to the same PUA character.

B. use UTF-8b, representing the bytes with ill-formed surrogate codes.
   The same ambiguity does *NOT* exist. If a file on disk already
   contains an invalid surrogate code in its file name, then the UTF-8b
   decoder will recognize this as invalid, and decode it byte-for-byte,
   into three surrogate codes. Hence, the file names that are different
   on disk are also different in memory. No ambiguity.

C. File on disk with the invalid surrogate code, accessed via the str
interface, no decoding happens, matches in memory the file on disk with
the byte that translates to the same surrogate, accessed via the bytes
interface.  Ambiguity.


Is that an alternative to A and B?


I guess it is an adjunct to case B, the current PEP.

It is what happens when using the PEP on a system that provides both 
bytes and str interfaces, and both get used.


On a Windows system, perhaps the ambiguous case would be the use of the 
str API and bytes APIs producing different memory names for the same 
file that contains a (Unicode-illegal) half surrogate.  The 
half-surrogate would seem to get decoded to 3 half surrogates if 
accessed via the bytes interface, but only one via the str interface. 
The version with 3 half surrogates could match another name that 
actually contains 3 half surrogates, that is accessed via the str interface.


I can't actually tell by reading the PEP whether it affects Windows 
bytes interfaces or is only implemented on POSIX, so that POSIX has a 
str interface.


If it is only implemented on POSIX, then the current scheme (now 
escaping the hundreds of escape codes) could work, within a single 
platform... but it would still suffer from displaying garbage (sequences 
of replacement characters) in file listings displayed or printed.  There 
is no way, once the string is adjusted to contain replacement characters 
for display, to distinguish one file name from another, if they are 
identical except for a same-length sequence of different undecodable bytes.


The concept of a function that allows the same decoding and encoding 
process for 3rd party interfaces is still missing from the PEP; 
implementation of the PEP would require that all interfaces to 3rd party 
software that accept file names would have to be transcoded by the 
interface layer.  Or else such software would have to use the bytes 
interfaces directly, and if they do, there is no need for the PEP.


So I see the PEP as a partial solution to a limited problem, that on the 
one hand potentially produces indistinguishable sequences of replacement 
characters in filenames, rather than the mojibake (which is at least 
distinguishable), and on the other hand, doesn't help software that also 
uses 3rd party libraries to avoid the use of bytes APIs for accessing 
file names.  There are other encodings that produce more distinguishable 
mojibake, and would work in the same situations as the PEP.


--
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Cameron Simpson
I think I may be able to resolve Glenn's issues with the scheme lower
down (through careful use of definitions and hand waving).

On 27Apr2009 23:52, Glenn Linderman v+pyt...@g.nevcal.com wrote:
 On approximately 4/27/2009 7:11 PM, came the following characters from  
 the keyboard of Cameron Simpson:
[...]
 There may be puns. So what? Use the right strings for the right purpose
 and all will be well.

 I think what is missing here, and missing from Martin's PEP, is some
 utility functions for the os.* namespace.

 PROPOSAL: add to the PEP the following functions:

   os.fsdecode(bytes) - funny-encoded Unicode
 This is what os.listdir() does to produce the strings it hands out.
   os.fsencode(funny-string) - bytes
 This is what open(filename,..) does to turn the filename into bytes
 for the POSIX open.
   os.pathencode(your-string) - funny-encoded-Unicode
 This is what you must do to a de novo string to turn it into a
 string suitable for use by open.
 Importantly, for most strings not hand crafted to have weird
 sequences in them, it is a no-op. But it will recode your puns
 for survival.
[...]
 So assume a non-decodable sequence in a name.  That puts us into   
 Martin's funny-decode scheme.  His funny-decode scheme produces a 
 bare  string, indistinguishable from a bare string that would be 
 produced by a  str API that happens to contain that same sequence.  
 Data puns.
 

 See my proposal above. Does it address your concerns? A program still
 must know the provenance of the string, and _if_ you're working with
 non-decodable sequences in names then you should transmute them into
 the funny encoding using the os.pathencode() function described above.

 In this way the punning issue can be avoided.
 _Lacking_ such a function, your punning concern is valid.

 Seems like one would also desire os.pathdecode to do the reverse.

Yes.

 And  
 also versions that take or produce bytes from funny-encoded strings.

Isn't that the first two functions above?

 Then, if programs were re-coded to perform these transformations on what  
 you call de novo strings, then the scheme would work.
 But I think a large part of the incentive for the PEP is to try to  
 invent a scheme that intentionally allows for the puns, so that programs  
 do not need to be recoded in this manner, and yet still work.  I don't  
 think such a scheme exists.

I agree no such scheme exists. I don't think it can, just using strings.

But _unless_ you have made a de novo handcrafted string with
ill-formed sequences in it, you don't need to bother because you
won't _have_ puns. If Martin's using half surrogates to encode
undecodable bytes, then no normal string should conflict because a
normal string will contain _only_ Unicode scalar values. Half surrogate
code points are not such.

The advantage here is that unless you've deliberately constructed an
ill-formed unicode string, you _do_not_ need to recode into
funny-encoding, because you are already compatible. Somewhat like one
doesn't need to recode ASCII into UTF-8, because ASCII is unchanged.
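
A sketch of the proposed helpers (hypothetical signatures, essentially what
later appeared in the stdlib as os.fsencode/os.fsdecode, assuming the PEP's
handler under its eventual name 'surrogateescape'):

import sys

def fsdecode(raw: bytes) -> str:
    # What os.listdir() would do to hand out str names.
    return raw.decode(sys.getfilesystemencoding(), "surrogateescape")

def fsencode(name: str) -> bytes:
    # What open() would do to turn a (possibly funny-encoded) name into bytes.
    return name.encode(sys.getfilesystemencoding(), "surrogateescape")

# A well-formed Unicode string passes through unchanged (for a UTF-8 locale),
# which is the no-op property argued for above.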

 If there is going to be a required transformation from de novo strings  
 to funny-encoded strings, then why not make one that people can actually  
 see and compare and decode from the displayable form, by using  
 displayable characters instead of lone surrogates?

Because that would _not_ be a no-op for well formed Unicode strings.

That reason is sufficient for me.

I consider the fact that well-formed Unicode -> funny-encoded is a no-op
to be an enormous feature of Martin's scheme.

Unless I'm missing something, there _are_no_puns_ between funny-encoded
strings and well formed unicode strings.

 I suppose if your program carefully constructs a unicode string riddled
 with half-surrogates etc and imagines something specific should happen
 to them on the way to being POSIX bytes then you might have a problem...
   
 Right.  Or someone else's program does that.

I've just spent a cosy 20 minutes with my copy of Unicode 5.0 and a
coffee, reading section 3.9 (Unicode Encoding Forms).

I now do not believe your scenario makes sense.

Someone can construct a Python3 string containing code points that
includes surrogates. Granted.

However such a string is not meaningful because it is not well-formed
(D85).  It's ill-formed (D84). It is not sane to expect it to
translate into a POSIX byte sequence, be it UTF-8 or anything else,
unless it is accompanied by some kind of explicit mapping provided by
the programmer.  Absent that mapping, it's nonsense in much the same
way that a non-decodable UTF-8 byte sequence is nonsense.

For example, Martin's funny-encoding is such an explicit mapping.

I only want to use 
 Unicode  file names.  But if those other file names exist, I want to 
 be able to  access them, and not accidentally get a different file.

But those other names _don't_ exist.

 Also, by avoiding reuse of legitimate characters in the encoding we can
 

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Glenn Linderman
On approximately 4/28/2009 2:01 PM, came the following characters from 
the keyboard of MRAB:

Glenn Linderman wrote:
On approximately 4/28/2009 11:55 AM, came the following characters 
from the keyboard of MRAB:

I've been thinking of python-escape only in terms of UTF-8, the only
encoding mentioned in the PEP. In UTF-8, bytes 0x00 to 0x7F are
decodable.



UTF-8 is only mentioned in the sense of having special handling for 
re-encoding; all the other locales/encodings are implicit.  But I also 
went down that path to some extent.




But if you're talking about using it with other encodings, eg
shift-jisx0213, then I'd suggest the following:

1. Bytes 0x00 to 0xFF which can't normally be decoded are decoded to
half surrogates U+DC00 to U+DCFF.



This makes 256 different escape codes.



Speaking personally, I won't call them 'escape codes'. I'd use the term
'escape code' to mean a character that changes the interpretation of the
next character(s).



OK, I won't be offended if you don't call them 'escape codes'. :)  But 
what else to call them?


My use of that term is a bit backwards, perhaps... what happens is that 
because these 256 half surrogates are used to decode otherwise 
undecodable bytes, they themselves must be escaped or translated into 
something different, when they appear in the byte sequence.  The process 
 described reserves a set of codepoints for use, and requires that that 
same set of codepoints be translated using a similar mechanism to avoid 
their untranslated appearance in the resulting str.  Escape codes have 
the same sort of characteristic... by replacing their normal use for 
some other use, they must themselves have a replacement.


Anyway, I think we are communicating successfully.



2. Bytes which would have decoded to half surrogates U+DC00 to U+DCFF
are treated as though they are undecodable bytes.



This provides escaping for the 256 different escape codes, which is 
lacking from the PEP.




3. Half surrogates U+DC00 to U+DCFF which can be produced by decoding
are encoded to bytes 0x00 to 0xFF.



This reverses the escaping.



4. Codepoints, including half surrogates U+DC00 to U+DCFF, which can't
be produced by decoding raise an exception.



This is confusing.  Did you mean excluding instead of including?


Perhaps I should've said "Any codepoint which can't be produced by
decoding should raise an exception."



Yes, your rephrasing is clearer, regarding your intention.



For example, decoding with UTF-8b will never produce U+DC00, therefore
attempting to encode U+DC00 should raise an exception and not produce
0x00.



Decoding with UTF-8b might never produce U+DC00, but then again, it 
won't handle the random byte string, either.




I think I've covered all the possibilities. :-)



You might have.  Seems like there could be a simpler scheme, though...

1. Define an escape codepoint.  It could be U+003F or U+DC00 or U+F817 
or pretty much any defined Unicode codepoint outside the range U+0100 
to U+01FF (see rule 3 for why).  Only one escape codepoint is needed, 
this is easier for humans to comprehend.


2. When the escape codepoint is decoded from the byte stream for a 
bytes interface or found in a str on the str interface, double it.


3. When an undecodable byte 0xPQ is found, decode to the escape 
codepoint, followed by codepoint U+01PQ, where P and Q are hex digits.


4. When encoding, a sequence of two escape codepoints would be encoded 
as one escape codepoint, and a sequence of the escape codepoint 
followed by codepoint U+01PQ would be encoded as byte 0xPQ.  Escape 
codepoints not followed by the escape codepoint, or by a codepoint in 
the range U+0100 to U+01FF would raise an exception.


5. Provide functions that will perform the same decoding and encoding 
as would be done by the system calls, for both bytes and str interfaces.



This differs from my previous proposal in three ways:

A. Doesn't put a marker at the beginning of the string (which I said 
wasn't necessary even then).


B. Allows for a choice of escape codepoint, the previous proposal 
suggested a specific one.  But the final solution will only have a 
single one, not a user choice, but an implementation choice.


C. Uses the range U+0100 to U+01FF for the escape codes, rather than 
U+ to U+00FF.  This avoids introducing the NULL character and 
escape characters into the decoded str representation, yet still uses 
characters for which glyphs are commonly available, are non-combining, 
and are easily distinguishable one from another.


Rationale:

The use of codepoints with visible glyphs makes the escaped string 
friendlier to display systems, and to people.  I still recommend using 
U+003F as the escape codepoint, but certainly one with a typically 
visible glyph available.  This avoids what I consider to be an 
annoyance with the PEP, that the codepoints used are not ones that are 
easily displayed, so undecodable names could easily result in long 
strings of indistinguishable 

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Toshio Kuratomi
Zooko O'Whielacronx wrote:
 On Apr 28, 2009, at 6:46 AM, Hrvoje Niksic wrote:
 If you switch to iso8859-15 only in the presence of undecodable UTF-8,
 then you have the same round-trip problem as the PEP: both b'\xff' and
 b'\xc3\xbf' will be converted to u'\u00ff' without a way to
 unambiguously recover the original file name.
 
 Why do you say that?  It seems to work as I expected here:
 
 '\xff'.decode('iso-8859-15')
 u'\xff'
 '\xc3\xbf'.decode('iso-8859-15')
 u'\xc3\xbf'



 '\xff'.decode('cp1252')
 u'\xff'
 '\xc3\xbf'.decode('cp1252')
 u'\xc3\xbf'
 

You're not showing that this is a fallback path.  What won't work is
first trying a local encoding (in the following example, utf-8) and then
if that doesn't work, trying a one-byte encoding like iso8859-15:

try:
file1 = '\xff'.decode('utf-8')
except UnicodeDecodeError:
file1 = '\xff'.decode('iso-8859-15')
print repr(file1)

try:
file2 = '\xc3\xbf'.decode('utf-8')
except UnicodeDecodeError:
file2 = '\xc3\xbf'.decode('iso-8859-15')
print repr(file2)


That prints:
  u'\xff'
  u'\xff'

The two encodings can map different bytes to the same unicode code point
 so you can't do this type of thing without recording what encoding was
used in the translation.
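
For contrast, a sketch of what the PEP's handler (as it later shipped,
'surrogateescape') gives for the same two byte strings; the undecodable byte
stays distinguishable because it maps to an escape surrogate rather than to
a real character:

b"\xff".decode("utf-8", "surrogateescape")      # -> '\udcff'
b"\xc3\xbf".decode("utf-8", "surrogateescape")  # -> '\xff', i.e. U+00FF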

-Toshio





Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread R. David Murray

On Tue, 28 Apr 2009 at 13:37, Glenn Linderman wrote:
C. File on disk with the invalid surrogate code, accessed via the str 
interface, no decoding happens, matches in memory the file on disk with the 
byte that translates to the same surrogate, accessed via the bytes interface. 
Ambiguity.


Unless I'm missing something, one of these is type str, and the other is 
type bytes, so no ambiguity.


--David


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Cameron Simpson
On 28Apr2009 14:37, Thomas Breuel tmb...@gmail.com wrote:
| But the biggest problem with the proposal is that it isn't needed: if you
| want to be able to turn arbitrary byte sequences into unicode strings and
| back, just set your encoding to iso8859-15.  That already works and it
| doesn't require any changes.

No it doesn't. It does transcode without throwing exceptions. On POSIX.
(On Windows? I doubt it - windows isn't using an 8-bit scheme. I
believe.) But it utterly destroys any hope of working in any other locale
nicely. The PEP lets you work losslessly in other locales.

It _may_ require some app care for particular very weird strings
that don't come from the filesystem, but as far as I can see only in
circumstances where such care would be needed anyway i.e. you've got to
do special stuff for weirdness in the first place. Weird == ill-formed
unicode string here.

Cheers,
-- 
Cameron Simpson c...@zip.com.au DoD#743
http://www.cskk.ezoshosting.com/cs/

I just kept it wide-open thinking it would correct itself.
Then I ran out of talent.   - C. Fittipaldi


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Toshio Kuratomi
Martin v. Löwis wrote:
 Since the serialization of the Unicode string is likely to use UTF-8,
 and the string for  such a file will include half surrogates, the
 application may raise an exception when encoding the names for a
 configuration file. These encoding exceptions will be as rare as the
 unusual names (which the careful I18N aware developer has probably
 eradicated from his system), and thus will appear late.
 
 There are trade-offs to any solution; if there was a solution without
 trade-offs, it would be implemented already.
 
 The Python UTF-8 codec will happily encode half-surrogates; people argue
 that it is a bug that it does so, however, it would help in this
 specific case.

Can we use this encoding scheme for writing into files as well?  We've
turned the filename with undecodable bytes into a string with half
surrogates.  Putting that string into a file has to turn them into bytes
at some level.  Can we use the python-escape error handler to achieve
that somehow?
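
A sketch of the round trip being asked about, assuming the PEP's
python-escape handler behaves like the 'surrogateescape' handler that
later Python 3 releases ship, and using a hypothetical file 'names.cfg':

    raw = b'caf\xe9.txt'                            # not valid UTF-8
    name = raw.decode('utf-8', 'surrogateescape')   # -> 'caf\udce9.txt'
    with open('names.cfg', 'wb') as f:
        # encoding with the same handler restores the original bytes verbatim
        f.write(name.encode('utf-8', 'surrogateescape'))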

-Toshio





Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Cameron Simpson
On 28Apr2009 13:37, Glenn Linderman v+pyt...@g.nevcal.com wrote:
 On approximately 4/28/2009 1:25 PM, came the following characters from  
 the keyboard of Martin v. Löwis:
 The UTF-8b representation suffers from the same potential ambiguities as
 the PUA characters... 

 Not at all the same ambiguities. Here, again, the two choices:

 A. use PUA characters to represent undecodable bytes, in particular for
UTF-8 (the PEP actually never proposed this to happen).
This introduces an ambiguity: two different files in the same
directory may decode to the same string name, if one has the PUA
character, and the other has a non-decodable byte that gets decoded
to the same PUA character.

 B. use UTF-8b, representing the bytes with ill-formed surrogate codes.
The same ambiguity does *NOT* exist. If a file on disk already
contains an invalid surrogate code in its file name, then the UTF-8b
decoder will recognize this as invalid, and decode it byte-for-byte,
into three surrogate codes. Hence, the file names that are different
on disk are also different in memory. No ambiguity.

 C. File on disk with the invalid surrogate code, accessed via the str  
 interface, no decoding happens, matches in memory the file on disk with  
 the byte that translates to the same surrogate, accessed via the bytes  
 interface.  Ambiguity.

Is this a Windows example, or (now I think on it) an equivalent POSIX example
of using the PEP where the locale encoding is UTF-16?

In either case, I would say one could make an argument for being stricter
in reading in OS-native sequences. Grant that NTFS doesn't prevent
half-surrogates in filenames, and likewise that POSIX won't because to
the OS they're just bytes. On decoding, require well-formed data. When
you hit ill-formed data, treat the nasty half surrogate as a PAIR of
bytes to be escaped in the resulting decode.

Ambiguity avoided.
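
In the UTF-8 case (item B above) a strict decoder plus the escape handler
already behaves this way; a sketch, assuming a UTF-8 decoder that rejects
surrogate byte sequences and the handler under its later name
'surrogateescape':

    >>> b'\xed\xb3\xbf'.decode('utf-8', 'surrogateescape')   # encoded lone surrogate on disk
    '\udced\udcb3\udcbf'                                      # escaped byte-for-byte: three codes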

I'm more concerned with your (yours? someone else's?) mention of shift
characters. I'm unfamiliar with these encodings: to translate such a
thing into a Latin example, is it the case that there are schemes with
valid encodings that look like:

  [SHIFT] a b c

which would produce ABC in unicode, which is ambiguous with:

  A B C

which would also produce ABC?
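
A concrete sketch of the kind of ambiguity being asked about, using a real
shift-based encoding (Python's iso2022_jp codec): a redundant "designate
ASCII" escape sequence yields the same decoded text as the bare ASCII
bytes, so two distinct byte strings map to one str.

    >>> b'abc'.decode('iso2022_jp')
    'abc'
    >>> b'\x1b(Babc'.decode('iso2022_jp')   # leading ESC ( B designates ASCII
    'abc'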

Cheers,
-- 
Cameron Simpson c...@zip.com.au DoD#743
http://www.cskk.ezoshosting.com/cs/

Helicopters are considerably more expensive [than fixed wing aircraft],
which is only right because they don't actually fly, but just beat
the air into submission.- Paul Tomblin


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Glenn Linderman
On approximately 4/28/2009 7:40 PM, came the following characters from 
the keyboard of R. David Murray:

On Tue, 28 Apr 2009 at 13:37, Glenn Linderman wrote:
C. File on disk with the invalid surrogate code, accessed via the str 
interface, no decoding happens, matches in memory the file on disk 
with the byte that translates to the same surrogate, accessed via the 
bytes interface. Ambiguity.


Unless I'm missing something, one of these is type str, and the other is 
type bytes, so no ambiguity.



You are missing that the bytes value would get decoded to a str; thus 
both are str; so ambiguity is possible.


--
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Glenn Linderman
On approximately 4/28/2009 4:06 PM, came the following characters from 
the keyboard of Cameron Simpson:

I think I may be able to resolve Glenn's issues with the scheme lower
down (through careful use of definitions and hand waving).
  


Close.  You at least resolved what you thought my issue was.  And, you 
did make me more comfortable with the idea that I, in programs I write, 
would not be adversely affected by the PEP if implemented.  While I can 
see that the PEP no doubt solves the os.listdir / open problem on POSIX 
systems for Python 3 + PEP programs that don't use 3rd party libraries, 
it does require programs that do use 3rd party libraries to be recoded 
with your functions -- which so far the PEP hasn't embraced.  Or, to use 
the bytes APIs directly to get file names for 3rd party libraries -- but 
the directly ported, filenames-as-strings type of applications that 
could call 3rd party filenames-as-bytes libraries in 2.x must be tweaked 
to do something different than they did before.




On 27Apr2009 23:52, Glenn Linderman v+pyt...@g.nevcal.com wrote:
  
On approximately 4/27/2009 7:11 PM, came the following characters from  
the keyboard of Cameron Simpson:


[...]
  

There may be puns. So what? Use the right strings for the right purpose
and all will be well.

I think what is missing here, and missing from Martin's PEP, is some
utility functions for the os.* namespace.

PROPOSAL: add to the PEP the following functions:

  os.fsdecode(bytes) - funny-encoded Unicode
This is what os.listdir() does to produce the strings it hands out.
  os.fsencode(funny-string) - bytes
This is what open(filename,..) does to turn the filename into bytes
for the POSIX open.
  os.pathencode(your-string) - funny-encoded-Unicode
This is what you must do to a de novo string to turn it into a
string suitable for use by open.
Importantly, for most strings not hand crafted to have weird
sequences in them, it is a no-op. But it will recode your puns
for survival.
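
A minimal sketch of what these helpers might look like, assuming the PEP's
python-escape handler is exposed under the name later Python 3 releases
use ('surrogateescape') and that sys.getfilesystemencoding() names the
locale encoding (Python 3.2 eventually shipped os.fsencode/os.fsdecode
along roughly these lines):

    import sys

    def fsdecode(b):
        # bytes from the OS -> funny-encoded str; undecodable bytes become
        # lone surrogates in the U+DC80..U+DCFF range
        return b.decode(sys.getfilesystemencoding(), 'surrogateescape')

    def fsencode(s):
        # funny-encoded str -> the exact bytes to hand back to the OS
        return s.encode(sys.getfilesystemencoding(), 'surrogateescape')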
  

[...]
  
So assume a non-decodable sequence in a name.  That puts us into   
Martin's funny-decode scheme.  His funny-decode scheme produces a 
bare  string, indistinguishable from a bare string that would be 
produced by a  str API that happens to contain that same sequence.  
Data puns.



See my proposal above. Does it address your concerns? A program still
must know the provenance of the string, and _if_ you're working with
non-decodable sequences in a name then you should transmute them into
the funny encoding using the os.pathencode() function described above.

In this way the punning issue can be avoided.
_Lacking_ such a function, your punning concern is valid.
  

Seems like one would also desire os.pathdecode to do the reverse.



Yes.

  
And  
also versions that take or produce bytes from funny-encoded strings.



Isn't that the first two functions above?
  


Yes, sorry.

Then, if programs were re-coded to perform these transformations on what  
you call de novo strings, then the scheme would work.
But I think a large part of the incentive for the PEP is to try to  
invent a scheme that intentionally allows for the puns, so that programs  
do not need to be recoded in this manner, and yet still work.  I don't  
think such a scheme exists.



I agree no such scheme exists. I don't think it can, just using strings.

But _unless_ you have made a de novo handcrafted string with
ill-formed sequences in it, you don't need to bother because you
won't _have_ puns. If Martin's using half surrogates to encode
undecodable bytes, then no normal string should conflict because a
normal string will contain _only_ Unicode scalar values. Half surrogate
code points are not such.

The advantage here is that unless you've deliberately constructed an
ill-formed unicode string, you _do_not_ need to recode into
funny-encoding, because you are already compatible. Somewhat like one
doesn't need to recode ASCII into UTF-8, because ASCII is unchanged.
  


Right.  And I don't intend to generate ill-formed Unicode strings, in my 
programs.  But I might well read their names from other sources.


It is nice, and thank you for emphasizing (although I already did 
realize it, back there in the far reaches of the brain) that all the 
data puns are between ill-formed Unicode strings, and undecodable bytes 
strings.  That is a nice property of the PEP's encoding/decoding 
method.  I'm not sure it outweighs the disadvantage of taking unreadable 
gibberish, and producing indecipherable gibberish (codepoints with no 
glyphs), though, when there are ways to produce decipherable gibberish 
instead... or at least mostly-decipherable gibberish.  Another idea 
forms, described below.


If there is going to be a required transformation from de novo strings  
to funny-encoded strings, then why not make one that people can actually  
see and compare and decode from the displayable form, by using  
displayable 

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Martin v. Löwis
 C. File on disk with the invalid surrogate code, accessed via the str
 interface, no decoding happens, matches in memory the file on disk with
 the byte that translates to the same surrogate, accessed via the bytes
 interface.  Ambiguity.

 Is that an alternative to A and B?
 
 I guess it is an adjunct to case B, the current PEP.
 
 It is what happens when using the PEP on a system that provides both
 bytes and str interfaces, and both get used.

Your formulation is a bit too stenographic for me, but please trust me
that there is *no* ambiguity in the case you construct.

By accessed via the str interface, I assume you do something like

  fn = some string
  open(fn)

You are wrong in assuming no decoding happens, and that matches
in memory the file on disk (whatever that means - how do I match
a file on disk in memory??). What happens instead is that fn
gets *encoded* with the file system encoding, and the python-escape
handler. This will *not* produce an ambiguity.
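
A sketch of that encode step, assuming a UTF-8 locale and the handler
under its later name 'surrogateescape'; the hypothetical name below stands
in for something returned by os.listdir():

    >>> fn = 'report-\udcff.txt'            # e.g. funny-decoded from b'report-\xff.txt'
    >>> fn.encode('utf-8', 'surrogateescape')
    b'report-\xff.txt'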

If you think there is an ambiguity in that you can use both the
byte interface and the string interface to access the same file:
this would be a ridiculous interpretation. *Of course* you can
access /etc/passwd both as '/etc/passwd' and b'/etc/passwd';
there is nothing ambiguous about that.

Regards,
Martin


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Glenn Linderman
On approximately 4/25/2009 5:35 AM, came the following characters from 
the keyboard of Martin v. Löwis:

Because the encoding is not reliably reversible.


Why do you say that? The encoding is completely reversible
(unless we disagree on what reversible means).


I'm +1 on the concept, -1 on the PEP, due solely to the lack of a
reversible encoding.


Then please provide an example for a setup where it is not reversible.

Regards,
Martin


It is reversible if you know that it has been decoded, and apply the 
encoding.  But if you don't know whether it has been decoded, then applying 
the reverse transform can convert an un-decoded str that happens to match a 
decoded str into a form that it could have, but never did, take.


The problem is that there is no guarantee that the str interface 
provides only strictly conforming Unicode, so decoding bytes to 
non-strictly conforming Unicode can result in a data pun between 
non-strictly conforming Unicode coming from the str interface vs bytes 
being decoded to non-strictly conforming Unicode coming from the bytes 
interface.
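
A sketch of the pun being described, assuming the escape handler under its
later name 'surrogateescape'; the hand-crafted string below is
hypothetical:

    de_novo = 'x\udcffy'                                    # ill-formed str from some str API
    decoded = b'x\xffy'.decode('utf-8', 'surrogateescape')  # undecodable byte, funny-decoded
    assert de_novo == decoded                               # the pun: indistinguishable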


Any particular problem that always consistently uses one or the other 
(bytes vs str) APIs under the covers might never be affected by such a 
data pun, but programs that may use both types of interface could 
potentially see a data pun.


If your PEP depends on consistent use of one or the other type of 
interface, you should say so, and if the platform only provides that 
type of interface, maybe all is well.  Both types of interfaces are 
available on Windows, perhaps POSIX only provides native bytes 
interfaces, and if the PEP is the only way to provide str interfaces, 
then perhaps consistency use is required.


There are still issues regarding how Windows and POSIX programs that are 
sharing cross-mounted file systems might communicate file names between 
each other, which is not at all clear from the PEP.  If this is an 
insoluble or un-addressed issue, it should be stated.  (It is probably 
insoluble, due to there being multiple ways that the cross-mounted file 
systems might translate names; but if there are, can we learn something 
from the rules the mounting systems use, to be compatible with (one of) 
them, or not.)


Together with your change to avoid using PUA characters, the rule 
suggested by MRAB in another branch of this thread, of treating 
half-surrogates as invalid byte sequences, may avoid the data puns I'm 
concerned about.


It is not clear how half-surrogate characters would be displayed, when 
the user prints or displays such a file name string.  It would seem that 
programs that display file names to users might still have issues with 
such; an escaping mechanism that uses displayable characters would have 
an advantage there.
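
A sketch of the display problem, assuming a stdout that encodes strictly
to UTF-8 and the escape handler under its later name 'surrogateescape';
the ascii() fallback shows one escaped-but-displayable rendering:

    name = b'caf\xe9.txt'.decode('utf-8', 'surrogateescape')  # -> 'caf\udce9.txt'
    try:
        print(name)          # a strict UTF-8 stdout refuses the lone surrogate
    except UnicodeEncodeError:
        print(ascii(name))   # displayable escape form: 'caf\udce9.txt'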



--
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Glenn Linderman
On approximately 4/25/2009 5:22 AM, came the following characters from 
the keyboard of Martin v. Löwis:

The problem with this, and other preceding schemes that have been
discussed here, is that there is no means of ascertaining whether a
particular file name str was obtained from a str API, or was funny-
decoded from a bytes API... and thus, there is no means of reliably
ascertaining whether a particular filename str should be passed to a
str API, or funny-encoded back to bytes.


Why is it necessary that you are able to make this distinction?



It is necessary that programs (not me) can make the distinction, so that 
they know whether or not to do the funny-encoding.  If a name is 
funny-decoded when the name is accessed by a directory listing, it needs 
to be funny-encoded in order to open the file.




Picking a character (I don't find U+F01xx in the
Unicode standard, so I don't know what it is)


It's a private use area. It will never carry an official character
assignment.



I know that U+F0000 - U+FFFFD is a private use area.  I don't find a 
definition of U+F01xx to know what the notation means.  Are you picking 
a particular character within the private use area, or a particular 
range, or what?




As I realized in the email-sig, in talking about decoding corrupted
headers, there is only one way to guarantee this... to encode _all_
character sequences, from _all_ interfaces.  Basically it requires
reserving an escape character (I'll use ? in these examples -- yes, an
ASCII question mark -- happens to be illegal in Windows filenames so
all the better on that platform, but the specific character doesn't
matter... avoiding / \ and . is probably good, though).


I think you'll have to write an alternative PEP if you want to see
something like this implemented throughout Python.



I'm certainly not experienced enough in Python development processes or 
internals to attempt such, as yet.  But somewhere in 25 years of 
programming, I picked up the knowledge that if you want to have a 1-to-1 
reversible mapping, you have to avoid data puns, mappings of two 
different data values into a single data value.  Your PEP, as first 
written, didn't seem to do that... since there are two interfaces from 
which to obtain data values, one performing a mapping from bytes to 
funny invalid Unicode, and the other performing no mapping, but 
accepting any sort of Unicode, possibly including funny invalid 
Unicode, the possibility of data puns seems to exist.  I may be 
misunderstanding something about the use cases that prevent these two 
sources of funny invalid Unicode from ever coexisting, but if so, 
perhaps you could point it out, or clarify the PEP.  I'll try to reread 
it again... could you post a URL to the most up-to-date version of the 
PEP, since I haven't seen such appear here, and the version I found via 
a Google search seems to be the original?



--
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Cameron Simpson
On 26Apr2009 23:39, Glenn Linderman v+pyt...@g.nevcal.com wrote:
[...snip...]
 There are still issues regarding how Windows and POSIX programs that are  
 sharing cross-mounted file systems might communicate file names between  
 each other, which is not at all clear from the PEP.  If this is an  
 insoluble or un-addressed issue, it should be stated.  (It is probably  
 insoluble, due to there being multiple ways that the cross-mounted file  
 systems might translate names; but if there are, can we learn something  
 from the rules the mounting systems use, to be compatible with (one of)  
 them, or not.

I'd say that's out of scope. A windows filesystem mounted on a UNIX host
should probably be mounted with a mapping to translate the Windows
Unicode names into whatever the sysadmin deems the locally most apt
byte encoding. But sys.getfilesystemencoding() is based on the current user's
locale settings, which need not be the same.

 Together with your change to avoid using PUA characters, and the rule  
 suggested by MRAB in another branch of this thread, of treating  
 half-surrogates as invalid byte sequences may avoid the data puns I'm  
 concerned about.

 It is not clear how half-surrogate characters would be displayed, when  
 the user prints or displays such a file name string.  It would seem that  
 programs that display file names to users might still have issues with  
 such; an escaping mechanism that uses displayable characters would have  
 an advantage there.

Wouldn't any escaping mechanism that uses displayable characters
require visually mangling occurrences of those characters that
legitimately occur in the original?
-- 
Cameron Simpson c...@zip.com.au DoD#743
http://www.cskk.ezoshosting.com/cs/


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Glenn Linderman
On approximately 4/27/2009 12:55 AM, came the following characters from 
the keyboard of Cameron Simpson:

On 26Apr2009 23:39, Glenn Linderman v+pyt...@g.nevcal.com wrote:
[...snip...]
  
There are still issues regarding how Windows and POSIX programs that are  
sharing cross-mounted file systems might communicate file names between  
each other, which is not at all clear from the PEP.  If this is an  
insoluble or un-addressed issue, it should be stated.  (It is probably  
insoluble, due to there being multiple ways that the cross-mounted file  
systems might translate names; but if there are, can we learn something  
from the rules the mounting systems use, to be compatible with (one of)  
them, or not.



I'd say that's out of scope. A windows filesystem mounted on a UNIX host
should probably be mounted with a mapping to translate the Windows
Unicode names into whatever the sysadmin deems the locally most apt
byte encoding. But sys.getfilesystemencoding() is based on the current user's
locale settings, which need not be the same.
  


And if it were, what would it do with files that can't be encoded with 
the locally most apt byte encoding?  That's where we might learn 
something about what behaviors are deemed acceptable.  Would such files 
be inaccessible?  Accessible with mangled names?  or what?


And for a Unix filesystem mounted on a Windows host?  Or accessed via 
some network connection?



Together with your change to avoid using PUA characters, and the rule  
suggested by MRAB in another branch of this thread, of treating  
half-surrogates as invalid byte sequences may avoid the data puns I'm  
concerned about.


It is not clear how half-surrogate characters would be displayed, when  
the user prints or displays such a file name string.  It would seem that  
programs that display file names to users might still have issues with  
such; an escaping mechanism that uses displayable characters would have  
an advantage there.



Wouldn't any escaping mechanism that uses displayable characters
require visually mangling occurrences of those characters that
legitimately occur in the original?
  


Yes.  My suggested use of ? is a visible character that is illegal in 
Windows file names, thus causing no valid Windows file names to be 
visually mangled.  It is also a character that should be avoided in 
POSIX names because:


1) it is known to be illegal on Windows, and thus non-portable
2) it is hard to write globs that match ? without allowing matches of 
other characters as well

3) it must be quoted to specify it on a command line

That said, someone provided a case where it is easy to get ? in POSIX 
file names.  The remaining question is whether that is a reasonable use 
case, a frequent use case, or a stupid use case; and whether the 
resulting visible mangling is more or less understandable and disruptive 
than using half-surrogates which are:


1) invalid Unicode
2) non-displayable
3) indistinguishable using normal non-displayable character substitution 
rules


--
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking



Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Antoine Pitrou
Stephen J. Turnbull stephen at xemacs.org writes:
 
 If
 you see a broken encoding once, you're likely to see it a million times
 (spammers have the most broken software) or maybe have it raise an
 unhandled Exception a dozen times (in rate of using busted software,
 the spammers are closely followed by bosses---which would be very bad,
 eh, if you 2/3 of the mail from your boss ends up in an undeliverables
 queue due to encoding errors that are unhandled by your some filter in
 your mail pipeline).

I'm not sure how mail being stuck in a pipeline has anything to do with Martin's
proposal (which deals with file paths, not with SMTP...).
Besides, I don't care about spammers and their broken software.

 Again, that's not the point.  The point is that six-sigma reliability
 world-wide is not going to be very comforting to the poor souls who
 happen to have broken software in their environment sending broken
 encodings regularly, because they're going to be dealing with one or
 two sigmas, and that's just not good enough in a production
 environment.

So you're arguing that whatever solution which isn't 100% perfect but only
99.999% perfect shouldn't be implemented at all, and leave the status quo at
98%? This sounds disturbing to me.

(especially given you probably sent this mail using TCP/IP...)

Regards

Antoine.




Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Stephen J. Turnbull
Antoine Pitrou writes:

  I'm not sure how mail being stuck in a pipeline has anything to do
  with Martin's proposal (which deals with file paths, not with
  SMTP...).

I hate to break it to you, but most stages of mail processing have
very little to do with SMTP.  In particular, processing MIME
attachments often requires dealing with file names.  Would practical
problems arise?  I expect they would.  Can I tell you what they are?
No; if I could I'd write a better PEP.  I'm just saying that my
experience is that Murphy's Law applies more to encoding processing
than any other area of software I've worked in (admittedly, I don't do
threads ;-).

  Besides, I don't care about spammers and their broken software.

That's precisely my point.  The PEP's solution will be very
appealing to people who just don't care as long as it works for them,
in the subset of corner cases they happen to encounter.  A lot of
software, including low-level components, will be written using these
APIs, and they will result in escapes of uninterpreted bytes (encoded
as Unicode) into the textual world.

  So you're arguing that whatever solution which isn't 100% perfect
  but only 99.999% perfect shouldn't be implemented at all, and leave
  the status quo at 98%?

No, I'm not talking about whatever solution.  I'm only arguing about
PEP 383.  The point is that Martin's proposal is not just a solution
to the problem he posed.  It's also going to be the one obvious way to
make the usual mistakes, i.e., the return values will escape into code
paths they're not intended for.  And the APIs won't be killable until
Python 4000.  If we find a better way (which I think Python 3's move
to text as Unicode is likely to inspire!), we'll have to wait 10-15
years or more before it becomes the OOWTDI.  The only real hope about
that is that Unicode will become universal before that, and only
archaeologists will ever encounter malformed text.

I believe there are solutions that don't have that problem.
Specifically, if the return values were bytes, or (better for 2.x,
where bytes are strings as far as most programmers are concerned) as a
new data type, to indicate that they're not text until the client
acknowledges them as such.  EIBTI.

Unfortunately, Martin clearly doesn't intend to make such a change to
the PEP.  I don't have the time or the Python expertise to generate an
alternative PEP. :-(  I do have long experience with the pain of
dealing with encoding issues caused by APIs that are intended to DTRT,
conveniently.  Martin's is better than most, but I just don't think
convenience and robustness can be combined in this area.

  This sounds disturbing to me.

BTW, I'm on record as +0 on the PEP.  I don't think the better
proposals have a chance, because most people *want* the non-solution
that they can just use as a habit, allowing Python to make decisions
that should be made by the application, and not have to do
unnecessary conversions and the like.  It's not obvious to me that
it should not be given to them, but I don't much like it.


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Paul Moore
2009/4/27 Stephen J. Turnbull step...@xemacs.org:
 I believe there are solutions that don't have that problem.
 Specifically, if the return values were bytes, or (better for 2.x,
 where bytes are strings as far as most programmers are concerned) as a
 new data type, to indicate that they're not text until the client
 acknowledges them as such.  EIBTI.

I think you're ignoring the fact that under Windows, it's the *bytes*
APIs that are lossy.

Can I at least assume that you aren't recommending that only the bytes
API exists on Unix, and only the Unicode API on Windows?

So what's your suggestion?

 Unfortunately, Martin clearly doesn't intend to make such a change to
 the PEP.  I don't have the time or the Python expertise to generate an
 alternative PEP. :-(  I do have long experience with the pain of
 dealing with encoding issues caused by APIs that are intended to DTRT,
 conveniently.  Martin's is better than most, but I just don't think
 convenience and robustness can be combined in this area.

The *only* robust solution is to completely separate the 2
platforms. Which helps no-one, and is at least as bad as the 2.x
situation. (Probably worse).

 BTW, I'm on record as +0 on the PEP.  I don't think the better
 proposals have a chance, because most people *want* the non-solution
 that they can just use as a habit, allowing Python to make decisions
 that should be made by the application, and not have to do
 unnecessary conversions and the like.  It's not obvious to me that
 it should not be given to them, but I don't much like it.

People *want* a solution that doesn't require every application
developer to sweat blood to write working code, simply to cover corner
cases that they don't believe will happen. Not every application is a
24x7 server, and all that. Similarly, not every application is a
backup program. Such applications have unique issues, which the
developers should (but don't always, admittedly!) understand. The rest
of us don't want to be made to care.

It's not sloppiness. It's a realistic appreciation of the requirements
of the application. (And an acceptance that not every bug must be
fixed before release).

Paul.


  1   2   >