Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-08 Thread Stephen J. Turnbull

M.-A. Lemburg writes:

  I'd use allowlonesurrogates as name for the surrogates error
  handler and lonesurrogatereplace for the utf8b one.

+1
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-08 Thread Walter Dörwald

Stephen J. Turnbull wrote:
 Walter Dörwald writes:
 
   surrogatepass (for the don't complain about lone half surrogates
   handler) and surrogatereplace sound OK to me. However the other
   ...replace handlers are destructive (i.e. when such a ...replace
   handler is used for encoding, decoding will not produce the original
   unicode string).
 
 That doesn't bother me in the slightest.  Replace does not connote
 destructive or non-destructive to me; it connotes substitution.
 The fact that other error handlers happen to be destructive doesn't
 affect that at all for me.  YMMV.
 
   The purpose of the PEP 383 error handler however is to be roundtrip
   safe, so maybe we should choose a slightly different name?  How
   about surrogateescape?
 
 To me, escape has a strong connotation of a multicharacter
 representation of a single character, and that's not true here.
 
 How about surrogatetranslate?  I still prefer surrogatereplace, as
 it's slightly easier for me to type.

I like surrogatetranslate better than surrogateescape better than
surrogatereplace.

But I'll stop bikeshedding now and let Martin decide.

Servus,
   Walter



Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-07 Thread Martin v. Löwis

 By the way, what are the ASCII characters that are not supported by
 Shift-JIS?
 Not many I suppose? (if I read the Wikipedia entry correctly, it's only the
 backslash and the tilde).

The problem with this encoding is that bytes below 128 appear as second
bytes of a two-byte encoding:

py> "\x81@".decode("shift-jis")
u'\u3000'
py> "\x81A".decode("shift-jis")
u'\u3001'

So on decoding, it may be the second byte (i.e. the ASCII byte) that
causes a problem:

py> "\x81/".decode("shift-jis")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'shift_jis' codec can't decode bytes in position
0-1: illegal multibyte sequence

For the shift-jis codec, that's actually not a problem, though:

py> b"\x81/".decode("shift-jis","utf8b")
'\udc81/'

so the utf8b error handler will escape the first of the two bytes,
and then pass the second byte to the codec again, which then decodes
as ASCII.
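
The same behaviour can be sketched with the handler name this thread
later settles on. This assumes Python 3.1 or later, where the handler
called utf8b above is registered as "surrogateescape"; the variable
names are only illustrative.

```python
# Assumption: Python 3.1+, where "utf8b" is available as "surrogateescape".

# A valid two-byte Shift-JIS sequence whose second byte is below 128.
assert b"\x81@".decode("shift_jis") == "\u3000"

# 0x81 is a lead byte, but "/" is not a valid trail byte for it.
raw = b"\x81/"
text = raw.decode("shift_jis", "surrogateescape")

# The lead byte is escaped as the lone surrogate U+DC81; the "/" is then
# handed back to the codec and decoded as plain ASCII.
assert text == "\udc81/"
```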

Regards,
Martin

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-07 Thread Martin v. Löwis

 So are you proposing that I should rename the PEP 383 handler
 to utf_8b_encoder_invalid_codepoints?
 
 
 No, he's saying that your algorithm for choosing the PEP 383 handler
 should have come up with that name, rather than utf8b.  But since PEP
 383 applies to other codecs besides UTF-8, it should have a different
 name.  And one that is less cumbersome than
 utf_8b_encoder_invalid_codepoints

I'm still at a loss what name to give it, though. I understand that
I have to rename both error handlers, but I'm uncertain what I should
rename them to. So proposals that rename only one of them aren't
that helpful. It would be helpful if people would indicate support
for Antoine's proposal.

Regards,
Martin

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-07 Thread Glenn Linderman

On approximately 5/6/2009 10:53 PM, came the following characters from 
the keyboard of Martin v. Löwis:

The error handler designed with utf-8 in mind has no name in the encode
direction and is called utf_8b_decoder_invalid_bytes in the decode
direction.  By your reasoning, *that* should be its name in Python.  The
encoding error handler would then be named analogously
utf_8b_encoder_invalid_codepoints.  Even these, to me, would be better
than confusingly giving them the same name as the codec.


So are you proposing that I should rename the PEP 383 handler
to utf_8b_encoder_invalid_codepoints?



No, he's saying that your algorithm for choosing the PEP 383 handler 
should have come up with that name, rather than utf8b.  But since PEP 
383 applies to other codecs besides UTF-8, it should have a different 
name.  And one that is less cumbersome than 
utf_8b_encoder_invalid_codepoints


--
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-07 Thread Martin v. Löwis

 Wouldn't renaming the existing surrogates handler be an incompatible
 change, and thus inappropriate?

No - it's new in Python 3.1.

So what do you think about Antoine's proposal?

Regards,
Martin

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-07 Thread Glenn Linderman

On approximately 5/6/2009 11:16 PM, came the following characters from 
the keyboard of Martin v. Löwis:

So are you proposing that I should rename the PEP 383 handler
to utf_8b_encoder_invalid_codepoints?


No, he's saying that your algorithm for choosing the PEP 383 handler
should have come up with that name, rather than utf8b.  But since PEP
383 applies to other codecs besides UTF-8, it should have a different
name.  And one that is less cumbersome than
utf_8b_encoder_invalid_codepoints


I'm still at a loss what name to give it, though. I understand that
I have to rename both error handlers, but I'm uncertain what I should
rename them to. So proposals that rename only one of them aren't
that helpful. It would be helpful if people would indicate support
for Antoine's proposal.



Wouldn't renaming the existing surrogates handler be an incompatible 
change, and thus inappropriate?  I assume that is the second handler you 
are referring to?



bytes-as-lone-surrogates

That would be very descriptive of the decode case for PEP 383, but very 
long.  One problem with the word surrogates is that anything you add 
to it makes it too long.


bytes-ls

This is short, but meaningless as is -- however, adding the
understanding via documentation that ls means lone surrogates would 
make it meaningful, and mnemonic.



--
Glenn -- http://nevcal.com/

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-07 Thread Walter Dörwald

M.-A. Lemburg wrote:
 Antoine Pitrou wrote:
 Martin v. Löwis martin at v.loewis.de writes:
 py> b'\xed\xa0\x80'.decode("utf-8","surrogates")
 '\ud800'
 The point is, surrogates does not mean anything intuitive for an /error
 handler/. You seem to be the only one who finds this name explicit enough,
 perhaps because you chose it.
 Most other handlers' names have verbs in them (ignore, replace,
 xmlcharrefreplace, etc.).
 
 Correct.
 
 The purpose of an error handler name is to indicate to the user
 what it does, hence the use of verbs.
 
 Walter started with xmlcharrefreplace, ie. no space names, so
 surrogatereplace would be the logically correct name for the
 replace with lone surrogates scheme invented by Markus Kuhn.

surrogatepass (for the don't complain about lone half surrogates
handler) and surrogatereplace sound OK to me. However the other
...replace handlers are destructive (i.e. when such a ...replace
handler is used for encoding, decoding will not produce the original
unicode string). The purpose of the PEP 383 error handler however is to
be roundtrip safe, so maybe we should choose a slightly different name?
How about surrogateescape?

 The error handler for undoing this operation (ie. when converting
 a Unicode string to some other encoding) should probably use the
 same name based on symmetry and the fact that the escaping
 scheme is meant to be used for enabling round-trip safety.

We have only one error handler registry, but we *can* have one error
handler for both directions (encoding and decoding) as the error handler
can simply check whether it got passed a UnicodeEncodeError or
UnicodeDecodeError object.
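
A minimal sketch of such a two-way handler follows. It is not the PEP
383 implementation; the handler name "demo_escape" and the function
are hypothetical, and it assumes Python 3, where an encoding error
handler may return bytes.

```python
import codecs

def demo_escape(exc):
    """Hypothetical handler that serves both directions from one registration."""
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        if any(b < 0x80 for b in bad):
            raise exc  # do not escape ASCII bytes
        # Map each undecodable byte 0xXX to the lone surrogate U+DCXX.
        return "".join(chr(0xDC00 + b) for b in bad), exc.end
    if isinstance(exc, UnicodeEncodeError):
        out = bytearray()
        for ch in exc.object[exc.start:exc.end]:
            code = ord(ch)
            if not 0xDC80 <= code <= 0xDCFF:
                raise exc  # not one of our escapes: keep the original error
            out.append(code - 0xDC00)
        return bytes(out), exc.end
    raise exc

# One entry in the single error handler registry covers both directions.
codecs.register_error("demo_escape", demo_escape)

raw = b"abc\xff"                          # 0xff can never appear in UTF-8
text = raw.decode("utf-8", "demo_escape")
assert text == "abc\udcff"
assert text.encode("utf-8", "demo_escape") == raw
```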

 BTW: It would also be appropriate to reference Markus Kuhn in the PEP
 as the inventor of the escaping scheme.
 
 Even if only to give the reader an idea of how that scheme works and
 why (the PEP on python.org currently doesn't explain this).
 
 It should also explain that the scheme is meant to assure round-trip
 safety and doesn't necessarily work when using transcoding, ie.
 reading using one encoding, writing using another.

Servus,
   Walter

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-07 Thread MRAB


Martin v. Löwis wrote:

Wouldn't renaming the existing surrogates handler be an incompatible
change, and thus inappropriate?


No - it's new in Python 3.1.

So what do you think about Antoine's proposal?


+1

Although it looks like it would be without the '-' for consistency with
existing error handlers.

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-07 Thread Michael Urman

On Thu, May 7, 2009 at 00:43, Martin v. Löwis mar...@v.loewis.de wrote:
 Michael Urman wrote:
 On Wed, May 6, 2009 at 15:42, Martin v. Löwis mar...@v.loewis.de wrote:
 Despite there being also an error handler called surrogates.

 Not that I have to be, but I'm not sold on the previous UTF-8 codec
 behavior becoming an error handler of the name surrogates for two
 reasons (I do respect the obvious PBP argument for the implementation,
 and have no better name - lenient?).

 PBP?

Practicality beats purity. From a purity standpoint, the legacy
invalid utf-8 seems more like an encoding than an error handler to me.
From a practicality standpoint, it's presumably much more convenient
to implement it on top of the new valid UTF-8 codec's behavior. And
then any error handler needs a name.

 Well, there is a way to stack error handlers, although it's not pretty:
 [...]
 codecs.register_error("surrogates_then_replace",
                       surrogates_then_replace)

That mitigates my arguments significantly, although I'd rather see
something like errors=('surrogates', 'replace') chain the handlers
without additional registrations. But that's a different PEP or
arbitrary change. :)

 The stacking argument also applies to the new utf8b behavior on encode
 (only, as it handles all errors on decode). This may be a YAGNI

 Indeed - in particular, as, in the primary application of this error
 handler (i.e. file IO operations), there is no way of specifying
 an additional error handler anyway.

Would it be useful to allow setting this somewhere? It'd be analogous
to setfsencoding, perhaps a setfsencodingerrors. It's not hard to
imagine an application working on Windows where all Unicode characters
are valid, and constructing backup filenames by adding some arbitrary
character, or receiving them from a user who doesn't understand
encodings. When this application is taken to a non-Unicode filesystem,
without the ability to say "I really want a valid filename, so
replace", that could get messy. But it may still be a YAGNI, or a
"don't do that".

-- 
Michael Urman

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-07 Thread Michael Urman

On Thu, May 7, 2009 at 01:16, Martin v. Löwis mar...@v.loewis.de wrote:
 I'm still at a loss what name to give it, though. I understand that
 I have to rename both error handlers, but I'm uncertain what I should
 rename them to. So proposals that rename only one of them aren't
 that helpful. It would be helpful if people would indicate support
 for Antoine's proposal.

Part of the problem is they both allow byte sequences to decode to
invalid Unicode strings, and in particular they both affect the same
byte subsequences, and that brought us to the crossroads where we
wanted to name both of them surrogates. So I'll offer a few more
colors, and try to get out of the way of choosing between them or the
other proposed ones. :)

I haven't come up with anything I like better than errors="lenient"
for the old utf8 behavior handler; would errors="nonvalidating" be
correct? It still seems to me that a new codec, perhaps
"utf8-lenient", reads better.

For the utf8b error handler, I could see any of errors="roundtrip",
errors="roundtripreplace", errors="tosurrogate",
errors="surrogatereplace", errors="surrogateescape",
errors="binaryreplace", errors="binaryescape". This includes Antoine's
proposal (sans hyphen).

-- 
Michael Urman

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-07 Thread Walter Dörwald

Michael Urman wrote:

 [...]
 Well, there is a way to stack error handlers, although it's not pretty:
 [...]
 codecs.register_error("surrogates_then_replace",
                       surrogates_then_replace)
 
 That mitigates my arguments significantly, although I'd rather see
 something like errors=('surrogates', 'replace') chain the handlers
 without additional registrations. But that's a different PEP or
 arbitrary change. :)

The first version of PEP 293 changed the errors argument to be a string
or callable. This would have simplified handler stacking somewhat
(because you don't have to register or lookup handlers) but it had the
disadvantage that many char * arguments in the C API would have had to
be changed to PyObject *. Changing the errors argument to a list of
strings would have the same problem.

Servus,
   Walter

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-07 Thread MRAB


Walter Dörwald wrote:

Michael Urman wrote:


[...]

Well, there is a way to stack error handlers, although it's not pretty:
[...]
codecs.register_error("surrogates_then_replace",
                      surrogates_then_replace)

That mitigates my arguments significantly, although I'd rather see
something like errors=('surrogates', 'replace') chain the handlers
without additional registrations. But that's a different PEP or
arbitrary change. :)


The first version of PEP 293 changed the errors argument to be a string
or callable. This would have simplified handler stacking somewhat
(because you don't have to register or lookup handlers) but it had the
disadvantage that many char * arguments in the C API would have had to
be changed to PyObject *. Changing the errors argument to a list of
strings would have the same problem.


A comma-separated or space-separated string, eg 'surrogates replace' or
'surrogates,replace'? It could be treated as handler stacking
internally.

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-07 Thread Martin v. Löwis

 Well, there is a way to stack error handlers, although it's not pretty:
 [...]
 codecs.register_error("surrogates_then_replace",
                       surrogates_then_replace)
 
 That mitigates my arguments significantly, although I'd rather see
 something like errors=('surrogates', 'replace') chain the handlers
 without additional registrations. But that's a different PEP or
 arbitrary change. :)

I think you can provide something like

errors=combine_errors('surrogates', 'replace')

as a library function, and it doesn't have to be part of the
standard library.
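
One way such a helper might look is sketched below. Only the name
combine_errors comes from the message above; its signature, the
"+"-joined registered name, and the fallback logic are assumptions.
The handler names are the ones this thread eventually adopts
(Python 3.1+), since 'surrogates' was renamed before release.

```python
import codecs

def combine_errors(*names):
    """Register a handler that tries each named handler in turn.

    Returns the name under which the combined handler is registered,
    so the result can be passed directly as errors=...
    """
    handlers = [codecs.lookup_error(name) for name in names]
    combined_name = "+".join(names)

    def combined(exc):
        last_error = exc
        for handler in handlers:
            try:
                return handler(exc)
            except UnicodeError as err:
                last_error = err  # this handler declined; try the next one
        raise last_error

    codecs.register_error(combined_name, combined)
    return combined_name

name = combine_errors("surrogateescape", "replace")

# U+DC80 is undone by surrogateescape; U+20AC is not an escaped byte,
# so surrogateescape declines and "replace" substitutes "?".
encoded = "a\udc80b\u20ac".encode("ascii", errors=name)
assert encoded == b"a\x80b?"
```

The chain is applied per error range reported by the codec, so the two
problem characters are deliberately kept apart in this example.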

 The stacking argument also applies to the new utf8b behavior on encode
 (only, as it handles all errors on decode). This may be a YAGNI
 Indeed - in particular, as, in the primary application of this error
 handler (i.e. file IO operations), there is no way of specifying
 an additional error handler anyway.
 
 Would it be useful to allow setting this somewhere?

I'm deliberately not proposing this as part of the PEP. First, it
has enough features already, and is approved as-is; plus YAGNI.

Regards,
Martin

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-07 Thread Martin v. Löwis

 I haven't come up with anything I like better than errors="lenient"
 for the old utf8 behavior handler; would errors="nonvalidating" be
 correct?

I think either is fairly unspecific.

 For the utf8b error handler, I could see any of errors="roundtrip",
 errors="roundtripreplace", errors="tosurrogate",
 errors="surrogatereplace", errors="surrogateescape",
 errors="binaryreplace", errors="binaryescape". This includes Antoine's
 proposal (sans hyphen).

Giving multiple choices does not exactly make this proposal readily
implementable :-)

Regards,
Martin

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-07 Thread Martin v. Löwis

 The error handler for undoing this operation (ie. when converting
 a Unicode string to some other encoding) should probably use the
 same name based on symmetry and the fact that the escaping
 scheme is meant to be used for enabling round-trip safety.

Could you please familiarize yourself with the implementation
before commenting further?

Thanks,
Martin

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-07 Thread Stephen J. Turnbull

Walter Dörwald writes:

  surrogatepass (for the don't complain about lone half surrogates
  handler) and surrogatereplace sound OK to me. However the other
  ...replace handlers are destructive (i.e. when such a ...replace
  handler is used for encoding, decoding will not produce the original
  unicode string).

That doesn't bother me in the slightest.  Replace does not connote
destructive or non-destructive to me; it connotes substitution.
The fact that other error handlers happen to be destructive doesn't
affect that at all for me.  YMMV.

  The purpose of the PEP 383 error handler however is to be roundtrip
  safe, so maybe we should choose a slightly different name?  How
  about surrogateescape?

To me, escape has a strong connotation of a multicharacter
representation of a single character, and that's not true here.

How about surrogatetranslate?  I still prefer surrogatereplace, as
it's slightly easier for me to type.



Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-07 Thread Terry Reedy


Martin v. Löwis wrote:

So are you proposing that I should rename the PEP 383 handler
to utf_8b_encoder_invalid_codepoints?


No, he's saying that your algorithm for choosing the PEP 383 handler
should have come up with that name, rather than utf8b.  But since PEP
383 applies to other codecs besides UTF-8, it should have a different
name.  And one that is less cumbersome than
utf_8b_encoder_invalid_codepoints


Correct.  Thank you Glenn.


I'm still at a loss what name to give it, though. I understand that
I have to rename both error handlers, but I'm uncertain what I should
rename them to. So proposals that rename only one of them aren't
that helpful. It would be helpful if people would indicate support
for Antoine's proposal.


Given your explanation of what the new 'surrogates' handler does (pass 
rather than reject erroneous surrogates), I think 'surrogates_pass' is 
fine.  Thus, I consider that and 'surrogates_excape' the best proposal
so far and suggest that you make this pair the current status
quo to be argued against and improved ... or not.


tjr


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-07 Thread Martin v. Löwis

 Given your explanation of what the new 'surrogates' handler does (pass
 rather than reject erroneous surrogates), I think 'surrogates_pass' is
 fine.  Thus, I consider that and 'surrogates_excape' the best proposal
 so far and suggest that you make this pair the current status
 quo to be argued against and improved ... or not.

That's exactly what I want to avoid: more bike-shedding. If this is now
changed, it cannot possibly be argued against and improved - it would
be final, end of discussion (please!!!).

So I'm happy to make it surrogatepass and surrogateescape as
proposed by Walter. I'm sure you didn't really mean the spelling of
excape to be taken literally - whether or not you meant the plural
and the underscore literally, I cannot tell. Stephen Turnbull approved
singular, so that's good enough for me.
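
For reference, a short sketch of how the two handlers behave under
these final names, assuming Python 3.1 or later; the variable names
are only illustrative.

```python
# surrogatepass: do not complain about lone surrogates in UTF-8 data.
lone = b"\xed\xa0\x80"                    # UTF-8-style encoding of U+D800
try:
    lone.decode("utf-8")                  # strict decoding rejects it
    strict_rejected = False
except UnicodeDecodeError:
    strict_rejected = True
assert strict_rejected
assert lone.decode("utf-8", "surrogatepass") == "\ud800"
assert "\ud800".encode("utf-8", "surrogatepass") == lone

# surrogateescape: smuggle undecodable bytes through as U+DC80..U+DCFF
# so that encoding with the same handler restores them.
raw = b"caf\xe9"                          # Latin-1 bytes, not valid UTF-8
text = raw.decode("utf-8", "surrogateescape")
assert text == "caf\udce9"
assert text.encode("utf-8", "surrogateescape") == raw
```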

Regards,
Martin

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-07 Thread Gregory P. Smith

On Thu, May 7, 2009 at 12:39 PM, Martin v. Löwis mar...@v.loewis.de wrote:
 Given your explanation of what the new 'surrogates' handler does (pass
 rather than reject erroneous surrogates), I think 'surrogates_pass' is
 fine.  Thus, I consider that and 'surrogates_excape' the best proposal
 so far and suggest that you make this pair the current status
 quo to be argued against and improved ... or not.

 That's exactly what I want to avoid: more bike-shedding. If this is now
 changed, it cannot possibly be argued against and improved - it would
 be final, end of discussion (please!!!).

 So I'm happy to make it surrogatepass and surrogateescape as
 proposed by Walter. I'm sure you didn't really mean the spelling of
 excape to be taken literally - whether or not you meant the plural
 and the underscore literally, I cannot tell. Stephen Turnbull approved
 singular, so that's good enough for me.

singular is good.

+1 on these names.

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-07 Thread Terry Reedy


Martin v. Löwis wrote:

Given your explanation of what the new 'surrogates' handler does (pass
rather than reject erroneous surrogates), I think 'surrogates_pass' is
fine.  Thus, I consider that and 'surrogates_excape' the best proposal
so far and suggest that you make this pair the current status
quo to be argued against and improved ... or not.


That's exactly what I want to avoid: more bike-shedding. If this is now
changed, it cannot possibly be argued against and improved - it would
be final, end of discussion (please!!!).

So I'm happy to make it surrogatepass and surrogateescape as
proposed by Walter. I'm sure you didn't really mean the spelling of
excape to be taken literally - whether or not you meant the plural
and the underscore literally, I cannot tell. Stephen Turnbull approved
singular, so that's good enough for me.


Those minor tweaks for consistency with existing names are what I meant 
by 'improve' (with good arguments) and I approve of them also. +1 on 
stopping here.



Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-07 Thread MRAB


Terry Reedy wrote:

Martin v. Löwis wrote:

Given your explanation of what the new 'surrogates' handler does (pass
rather than reject erroneous surrogates), I think 'surrogates_pass' is
fine.  Thus, I consider that and 'surrogates_excape' the best proposal
so far and suggest that you make this pair the current status
quo to be argued against and improved ... or not.


That's exactly what I want to avoid: more bike-shedding. If this is now
changed, it cannot possibly be argued against and improved - it would
be final, end of discussion (please!!!).

So I'm happy to make it surrogatepass and surrogateescape as
proposed by Walter. I'm sure you didn't really mean the spelling of
excape to be taken literally - whether or not you meant the plural
and the underscore literally, I cannot tell. Stephen Turnbull approved
singular, so that's good enough for me.


Those minor tweaks for consistency with existing names are what I meant 
by 'improve' (with good arguments) and I approve of them also. +1 on 
stopping here.



We argue because we care. :-)

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-07 Thread M.-A. Lemburg

Martin v. Löwis wrote:
 The error handler for undoing this operation (ie. when converting
 a Unicode string to some other encoding) should probably use the
 same name based on symmetry and the fact that the escaping
 scheme is meant to be used for enabling round-trip safety.
 
 Could you please familiarize yourself with the implementation
 before commenting further?

I did and it already uses the same (wrong) name for both
encoding and decoding handlers which is good.

The reason for my above comment was that the thread mentions
two different names for the handler depending on the direction,
e.g. surrogatereplace and surrogatepass.

I guess that surrogatepass was just an attempt to find a new
name for the surrogates error handler (which also doesn't
match the naming scheme) and that got me confused.

I'd use allowlonesurrogates as name for the surrogates error
handler and lonesurrogatereplace for the utf8b one.

Thanks,
-- 
Marc-Andre Lemburg
eGenix.com


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-07 Thread Glenn Linderman

On approximately 5/7/2009 3:27 PM, came the following characters from 
the keyboard of MRAB:

Terry Reedy wrote:

Martin v. Löwis wrote:



So I'm happy to make it surrogatepass and surrogateescape as



These seem adequate.  It is not what I would choose or suggest, but it 
is adequate, and it is unlikely you can delight everyone with your 
choice of names, or even someone else's choice of names.  These at least 
 have a logical justification for their meaning, and can be documented 
reasonably.



--
Glenn -- http://nevcal.com/

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Martin v. Löwis

 The name utf8b suggested in the PEP is not in line with the codec
 design

Where is that design documented, and how exactly does the name violate
the design (chapter and verse, please)?

 Error handlers and codecs are two different things, so the namespaces
 need to be clearly separate.

They *are* separate namespaces; that's guaranteed by the implementation.
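
A small sketch of that separation as seen from Python code: codecs and
error handlers are looked up through different functions in the codecs
module, so a name known to one registry is unknown to the other. The
two helper functions are hypothetical.

```python
import codecs

def is_codec(name):
    """True if `name` is known to the codec registry."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False

def is_error_handler(name):
    """True if `name` is known to the error handler registry."""
    try:
        codecs.lookup_error(name)
        return True
    except LookupError:
        return False

# "utf-8" is a codec but not an error handler; "replace" is the reverse.
assert is_codec("utf-8") and not is_error_handler("utf-8")
assert is_error_handler("replace") and not is_codec("replace")
```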

Regards,
Martin

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Martin v. Löwis

Stephen J. Turnbull wrote:
 Martin v. Löwis writes:
It occurs to me that the PEP maybe should say that it is an error
to have your POSIX locale set to UTF-16 or something like that.
   
   No. It is *impossible* to have UTF-16 as the locale character set,
   not an error. Your statement is like saying it is an error to
   breathe in the vacuum.
 
 I realize this is not useful, so maybe you don't need to mention it.
 However, it certainly is possible to set LANG with an absurd, or
 merely dangerous, encoding.

How so? The C library will filter it out.

   In any case, the discussion says
   
   # Encodings that are not compatible with ASCII are not supported by
   # this specification; bytes in the ASCII range that fail to decode
   # will cause an exception. It is widely agreed that such encodings
   # should not be used as locale charsets.
 
 Which is your excuse for not supporting Shift JIS fully.  It doesn't
 stop people from setting LC_ALL=ja_JP.shift_jis, 

Well, it *does* stop them from doing so if their systems don't support
the locale setting.

In any case, if they do this, PEP 383 will not support them.

 or using Shift JIS as the default encoding for certain media.

I fail to see how this could ever matter. If, by media, you mean
things like removable disks, and the file name encoding used on them,
it's fairly irrelevant for the PEP, since Python won't start using
Shift JIS as its file system encoding just because that's the encoding
used on the disk.

Regards,
Martin


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Martin v. Löwis

Second, I suggest surrogate-replace as the name of the error handler
rather than utf8b.
   
   I think this is bike-shedding.
 
 I don't personally care (I already was aware of UTF-8B), but there are
 plenty of others who do. 

I think it is a fairly bad name, because it is easy to confuse it with
the surrogates error handler (unless you suggest to rename that also).

 You have to fix the existing uses of
 the obsolete python-escape, anyway.

Indeed - but only in the PEP. In the implementation, it's already utf8b
throughout. Now it is also in the PEP; thanks for pointing that out.

   It's a security risk. If U+DCXX would map to \xXX, then somebody could
   embed U+DC2E U+DC2E U+DC2F into a character string; even if this gets
   sanitized, nobody would expect that this will actually access ../
 
 The odds that anybody will actually take notice of U+002E U+002E
 U+002F in a string are sufficiently small that any number of exploits
 have already been based on it.  I agree that there is some additional
 risk from this if people make the check for ../ before they prepend
 \ucd2e\udc2e\udc2f, but I think that risk is very small compared to
 the pain of having a error handler whose raison d'etre is to not raise
 exceptions go ahead and raise them anyway.

The problem is that functions like normpath will recognize ../, and
that applications rely on them for file name sanitation. If they could
be tricked into writing outside of their target folders, this would
be a huge security risk.
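
To make the concern concrete, here is a minimal sketch. It assumes a
current Python 3, where the handler discussed in this thread is
available under the name 'surrogateescape' (not 'utf8b'). It shows that
normpath does fold ".." segments, and that the handler refuses to turn
U+DC2E or U+DC2F back into the ASCII bytes "." and "/", so a smuggled
"../" raises instead of reaching the file system. The path strings are
invented for the example.

```python
import posixpath

# normpath collapses "..", which is what sanitizing code relies on.
collapsed = posixpath.normpath("uploads/../etc/passwd")

# Only U+DC80..U+DCFF map back to bytes. U+DC2E would become "." and
# U+DC2F would become "/", so the handler rejects them on encoding.
smuggled = "\udc2e\udc2e\udc2f" + "etc/passwd"
try:
    smuggled.encode("utf-8", "surrogateescape")
    rejected = False
except UnicodeEncodeError:
    rejected = True

# A genuinely undecodable (non-ASCII) byte still round-trips.
escaped = b"\xff".decode("utf-8", "surrogateescape")
roundtrip = escaped.encode("utf-8", "surrogateescape")
```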

OTOH, I don't care about breaking applications on misconfigured systems.
People using SJIS as their locale encodings have bigger problems
than Python raising exceptions.

 See also my reply to Lino Mastrodomenico.

URL?

 But you're writing the PEP, so this battle will have to be deferred.
 Eventually Python will have to take a stand on Unicode conformance,
 but it's not urgent yet.

I think it's always applications that are conforming or not, rather
than libraries. Libraries should allow writing conforming applications.
They may refuse to write certain non-conforming applications (although
users then replace the library with one that does allow them to do
what they want). Libraries can never enforce that applications conform
to some standard.

 Sorry!  I suggest substituting the paragraph above for the paragraph
 which begins The encode error handler interface presently requires...
 at line 129.

Ah, ok. This was Glenn Linderman's text before - now it's yours :-)

 I think I forgot to do this before:  I hereby dedicate all text
 I suggest for inclusion in the PEP to the public domain.

:-)

Martin

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Martin v. Löwis

 Yeah, yeah, this is the same old same old from PEP 3131.  Anything
 that handles the various attacks based on ASCII-alike characters
 should at least rule out invalid Unicode, too!
 
 And where is this U+DC2F supposed to be coming from, anyway?  The
 user's *local* environment or the user's *local* filesystem! 

Why is that not a threat? Suppose you have a setuid application, and
you pass some string on the command line that decodes to /../. Then
the setuid application will be tricked into modifying files it didn't
mean to modify.

Likewise, it might come from a relational database. Use a relational
database that supports unicode code units, or lone surrogates through
utf-8, and fill in some bogus data. Then have the Python application
(running as root) read it.

 Of course I can't prove that there's no vector for an exploit here (in
 fact, I'm sure there is one with sufficiently careless handling of
 input), but I think consenting adults covers the Shift JIS use case.
 Make it an option, but it should be explicitly part of the PEP.

Nothing is lost at the moment. If users complain, we can still think
of ways to enhance the experience.

In any case, Python 3.1b1 may get released today, so it's way too late
for new features in the PEP. They can wait for Python 3.2.

Regards,
Martin

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Antoine Pitrou

Martin v. Löwis martin at v.loewis.de writes:
 
  I don't personally care (I already was aware of UTF-8B), but there are
  plenty of others who do. 
 
 I think it is a fairly bad name, because it is easy to confuse it with
 the surrogates error handler (unless you suggest to rename that also).

I didn't bother to say it at the time, but I think surrogates is a pretty bad
name. It should be more indicative of what it does, e.g. surrogates-pass, or
surrogates-accept.

It's a security risk. If U+DCXX would map to \xXX, then somebody could
embed U+DC2E U+DC2E U+DC2F into a character string; even if this gets
sanitized, nobody would expect that this will actually access ../

Agreed this is an annoying security breach. The whole point of the PEP is that
application developers do not have to care about filename encoding issues,
which is defeated if they have to check for strange (illegal) combinations of
characters.

By the way, what are the ASCII characters that are not supported by Shift-JIS?
Not many I suppose? (if I read the Wikipedia entry correctly, it's only the
backslash and the tilde).

Regards

Antoine.



Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Stephen J. Turnbull

Martin v. Löwis writes:

  I fail to see how this could ever matter. If, by media, you mean
  things like removable disks, and the file name encoding used on them,
  it's fairly irrelevant for the PEP, since Python won't start using
  Shift JIS as its file system encoding just because that's the encoding
  used on the disk.

I'm sorry for the lack of clarity of my posts, but somehow you're
completely missing the point.  The point is precisely that Python
*won't* use Shift JIS as the file system encoding (if it did there
would be no problem with reading Shift JIS), but the people who
created the media *did*.

Now, with Python's file system encoding == UTF-8 or any packed EUC,
and more than a handful of Shift JIS or Big5 characters in file names,
one is *almost certain* to encounter ASCII as the second byte of a
multibyte sequence.  PEP 383 can't handle this, but it is sure to be
the most common use case for PEP 383 in East Asia.


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread M.-A. Lemburg

Martin v. Löwis wrote:
 The name utf8b suggested in the PEP is not in line with the codec
 design
 
 Where is that design documented, and how exactly does the name violate
 the design (chapter and verse, please)?

Martin, I designed the whole Python codec machinery, so even if
this is not explicitly written down somewhere, you can take my
word for it.

I don't want users to be confused by such an error handler
name, so please change it !

Here's a list of the currently available error handlers (taken from
codecs.py):

The .encode()/.decode() methods may use different error
handling schemes by providing the errors argument. These
string values are predefined:

 'strict' - raise a ValueError error (or a subclass)
 'ignore' - ignore the character and continue with the next
 'replace' - replace with a suitable replacement character;
Python will use the official U+FFFD REPLACEMENT
CHARACTER for the builtin Unicode codecs on
decoding and '?' on encoding.
 'xmlcharrefreplace' - Replace with the appropriate XML
   character reference (only for encoding).
 'backslashreplace'  - Replace with backslashed escape sequences
   (only for encoding).

The set of allowed values can be extended via register_error.
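
As an illustration of extending that set, here is a small sketch using
codecs.register_error. The handler name 'example-hexmarker' and the
function hex_marker are hypothetical, chosen only for this example; the
handler itself is a lossy "replace"-style one, not the PEP 383 handler.

```python
import codecs

def hex_marker(exc):
    """Replace each undecodable byte with a visible <XX> marker."""
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        replacement = "".join("<%02X>" % b for b in bad)
        # Return the replacement text and the position to resume at.
        return replacement, exc.end
    # This sketch handles decoding only; re-raise anything else.
    raise exc

# The name lives in the error-handler registry, not the codec registry.
codecs.register_error("example-hexmarker", hex_marker)

result = b"ab\xffcd".decode("utf-8", "example-hexmarker")
print(result)
```

Once registered, the name can be passed as the errors argument of any
.decode() call, exactly like the predefined values listed above.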

 Error handlers and codecs are two different things, so the namespaces
 need to be clearly separate.
 
 They *are* separate namespaces; that's guaranteed by the implementation.

In the implementation, yes, but not in the head of a typical user:
the 'utf8b' looks more like a codec name than an error handler
name.

I want to avoid any such confusion with Python codecs and don't
understand why you are making a problem out of this.

-- 
Marc-Andre Lemburg
eGenix.com


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread MRAB


M.-A. Lemburg wrote:

Martin v. Löwis wrote:

The name utf8b suggested in the PEP is not in line with the codec
design

Where is that design documented, and how exactly does the name violate
the design (chapter and verse, please)?


Martin, I designed the whole Python codec machinery, so even if
this is not explicitly written down somewhere, you can take my
word for it.

I don't want users to be confused by such an error handler
name, so please change it !

Here's a list of the currently available error handlers (taken from
codecs.py):

The .encode()/.decode() methods may use different error
handling schemes by providing the errors argument. These
string values are predefined:

 'strict' - raise a ValueError error (or a subclass)
 'ignore' - ignore the character and continue with the next
 'replace' - replace with a suitable replacement character;
Python will use the official U+FFFD REPLACEMENT
CHARACTER for the builtin Unicode codecs on
decoding and '?' on encoding.
 'xmlcharrefreplace' - Replace with the appropriate XML
   character reference (only for encoding).
 'backslashreplace'  - Replace with backslashed escape sequences
   (only for encoding).

The set of allowed values can be extended via register_error.


Error handlers and codecs are two different things, so the namespaces
need to be clearly separate.

They *are* separate namespaces; that's guaranteed by the implementation.


In the implementation, yes, but not in the head of a typical user:
the 'utf8b' looks more like a codec name than an error handler
name.


Judging by the existing names, I think that 'surrogate' would be
reasonable. It already contains the meaning of substitute, it's not too
long, and the codes which act as replacements are already called
surrogates.


I want to avoid any such confusion with Python codecs and don't
understand why you are making a problem out of this.




Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Antoine Pitrou

MRAB google at mrabarnett.plus.com writes:
 
 Judging by the existing names, I think that 'surrogate' would be
 reasonable. It already contains the meaning of substitute,

Only if you are a native English-speaker I suppose... For me it's just a
technical term denoting a certain class of unicode code points (I'm not sure of
the latter terminology ;-)).

Regards

Antoine.



Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Lino Mastrodomenico

2009/5/6 Antoine Pitrou solip...@pitrou.net:
 By the way, what are the ASCII characters that are not supported by 
 Shift-JIS?
 Not many I suppose? (if I read the Wikipedia entry correctly, it's only the
 backslash and the tilde).

The biggest problem with Shift-JIS is that a perfectly valid unicode
character above 127 can be encoded to a byte sequence that includes
bytes in range(128).

E.g. the character 掛 (a.k.a. '\u639b') when encoded with Shift-JIS
becomes the two bytes sequence b'\x8a|'. Notice that the second byte
is 124, which on POSIX is usually interpreted as the pipe character
and can have security implications.
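
The example above can be reproduced in a few lines. This is only a
sketch, assuming Python 3 and its bundled 'shift_jis' codec; the
variable names are made up for the illustration.

```python
text = "\u639b"  # the character cited above

# Encode with the Shift-JIS codec that ships with Python.
encoded = text.encode("shift_jis")

# The second (trailing) byte falls inside the ASCII range.
trailing = encoded[-1]
trailing_is_ascii = trailing < 128
trailing_char = chr(trailing)

print(encoded, trailing, trailing_char)
```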

It's a known problem with Shift-JIS and was fixed in UTF-8.

-- 
Lino Mastrodomenico

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Lennart Regebro

On Wed, May 6, 2009 at 09:31, Martin v. Löwis mar...@v.loewis.de wrote:
 They *are* separate namespaces; that's guaranteed by the implementation.

Yes. But utf8b *sounds like* an encoding. When it isn't. I sure
thought it was when it was first mentioned. I agree that it would be
better to find another name.

'utf8-binary-replace'?

Is it only usable with utf8 as an encoding?
-- 
Lennart Regebro: Python, Zope, Plone, Grok
http://regebro.wordpress.com/
+33 661 58 14 64

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Stephen J. Turnbull

Lino Mastrodomenico writes:

  It's a known problem with Shift-JIS and was fixed in UTF-8.

It was fixed in EUC before Shift-JIS was invented by Microsoft or Big5
was invented by the Taiwanese clone makers.  Guido's not the only
language designer with a time machine.



Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Stephen J. Turnbull

Martin v. Löwis writes:

   Yeah, yeah, this is the same old same old from PEP 3131.  Anything
   that handles the various attacks based on ASCII-alike characters
   should at least rule out invalid Unicode, too!
   
   And where is this U+DC2F supposed to be coming from, anyway?  The
   user's *local* environment or the user's *local* filesystem! 
  
  Why is that not a threat? Suppose you have a setuid application, and
  you pass some string on the command line that decodes to /../. Then
  the setuid application will be tricked into modifying files it didn't
  mean to modify.

Of course this is a threat, assuming that the application takes no
precautions.  But first, it should be stopped by any of several
standard precautions.  For example, applying os.path.realpath (come to
think of it, PEP 383 should say something about realpath, shouldn't
it?) and os.path.normpath (PEP 383 should definitely say something
about this function; maybe PEP 3131 should, too) before checking
access restrictions.  If you're not running your paths through those,
you're already vulnerable to symlink attacks, and maybe other forms of
spoofing.
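
One way such a precaution might look, as a rough sketch rather than a
vetted security routine: resolve the candidate path with
os.path.realpath (which also normalizes "..") and only then compare it
with the allowed base directory. The function name resolves_inside and
the directory name sandbox_root are hypothetical.

```python
import os

def resolves_inside(base_dir, untrusted_name):
    """True if untrusted_name, joined to base_dir, stays under base_dir."""
    # Resolve symlinks and ".." in the base first, so both sides of the
    # comparison are canonical absolute paths.
    base_real = os.path.realpath(base_dir)
    target_real = os.path.realpath(os.path.join(base_real, untrusted_name))
    # commonpath compares whole path components, not string prefixes.
    return os.path.commonpath([base_real, target_real]) == base_real

# Hypothetical base directory; it does not need to exist for the check.
base = os.path.join(os.getcwd(), "sandbox_root")
print(resolves_inside(base, "reports/2009.txt"))
print(resolves_inside(base, "../outside.txt"))
```

The access check is applied to the resolved path, so a name that only
looks harmless before normalization does not slip through.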

Second, it's a threat already enabled by your restricted version of
PEP 383.  Access control applies to subdirectories as well as to
parent directories.  Since you can insert arbitrary non-ASCII bytes
into the path using the current definition of 'utf8b', name-based
access restrictions can be bypassed in exactly the same way for any
directory whose name is not 100.00% ASCII, and the setuid application
will be tricked into modifying files it didn't mean to modify.

Also, on Mac OS X, system directories, including directories
containing system libraries, frameworks, and executables, may be
accessible via locale-specific names (I don't have a Japanese-
localized Mac at hand to check, but I'm pretty sure in my old Mac the
Japanese names appeared in ls in Terminal.app, which means it may be
possible to access system directories containing libraries,
frameworks, and executables this way).  Those can be spoofed in
exactly the same way.

  Nothing is lost at the moment.

Nothing is lost compared to 'strict', true, but under the PEP as it is
a large fraction of Shift JIS and Big5 filenames cannot be read under
ASCII-compatible file system encodings using 'utf8b'.  Yet it is those
users who are placed at risk by PEP 383.

  In any case, Python 3.1b1 may get released today, so it's way too late
  for new features in the PEP. They can wait for Python 3.2.

You have convinced me that the PEP should wait as well.

In its current form it is incomplete and dangerous.


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread R. David Murray


On Wed, 6 May 2009 at 13:40, Antoine Pitrou wrote:

Stephen J. Turnbull stephen at xemacs.org writes:


Nothing is lost compared to 'strict', true, but under the PEP as it is
a large fraction of Shift JIS and Big5 filenames cannot be read under
ASCII-compatible file system encodings using 'utf8b'.


You should really be more specific. I'm not sure about others, but I don't
understand what filenames you are talking about.


Seems to me that the best thing to do would be to file a bug report with
test cases that demonstrate the problems when run against the current
py3k trunk.

Especially the security issues you cite (which I don't understand).

--David

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Zooko Wilcox-O'Hearn


On May 6, 2009, at 7:33 AM, Stephen J. Turnbull wrote:


You have convinced me that the PEP should wait as well.

In its current form it is incomplete and dangerous.


+1 on delaying PEP 383

I think PEP 383 is a good idea in principle, but I'm still struggling  
to understand it myself, and it seems to offer new hazards for the  
unwary programmer.


On the other hand, maybe the wary programmers are waiting for Python  
3.2 anyway <wink>.


On the gripping hand, if PEP 383 is released in Python 3.1, will that  
obligate python-dev to support it indefinitely, at least in backwards- 
compatibility mode?  I'm not thinking of API compatibility as much as  
data compatibility -- someone used Python 3.1 to write down some  
filenames, and now a few years later they are trying to use the  
latest and greatest Python release to read those filenames...


Regards,

Zooko

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread James Y Knight


On May 6, 2009, at 5:39 AM, Stephen J. Turnbull wrote:

Now, with Python's file system encoding == UTF-8 or any packed EUC,
and more than a handful of Shift JIS or Big5 characters in file names,
one is *almost certain* to encounter ASCII as the second byte of a
multibyte sequence.  PEP 383 can't handle this


Hm, I haven't tried the implementation, but I thought that what would  
happen is:
'\x85a'.decode('utf-8', 'utf8b/surrogate-replace/whateveritscalled') ->
 u'\uDC85a'


If that indeed doesn't happen, that's certainly a defect and should be  
remedied.
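
For reference, the same expectation written for a current Python 3, as
a sketch. It assumes the handler is available under the name
'surrogateescape', which is not the 'utf8b' spelling used in this
thread. The second sample reuses the Shift-JIS byte pair b'\x8a|' cited
earlier in the thread.

```python
# A stray non-ASCII byte followed by an ASCII letter.
raw = b"\x85a"
decoded = raw.decode("utf-8", "surrogateescape")
restored = decoded.encode("utf-8", "surrogateescape")

# A Shift-JIS pair whose trailing byte is ASCII "|", read as UTF-8.
sjis_raw = b"\x8a|"
sjis_decoded = sjis_raw.decode("utf-8", "surrogateescape")
sjis_restored = sjis_decoded.encode("utf-8", "surrogateescape")

print(ascii(decoded), ascii(sjis_decoded))
```

Only the undecodable high byte is escaped; the ASCII byte that follows
decodes as an ordinary character, and both samples round-trip.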



, but it is sure to be
the most common use case for PEP 383 in East Asia.


Yes.

James

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Antoine Pitrou

Zooko Wilcox-O'Hearn zooko at zooko.com writes:
 
 I'm not thinking of API compatibility as much as  
 data compatibility -- someone used Python 3.1 to write down some  
 filenames, and now a few years later they are trying to use the  
 latest and greatest Python release to read those filenames...

Well, if the filenames are generated by Python (as opposed to read from an
existing directory on disk), they should be regular unicode objects without any
lone surrogates, so I don't see the compatibility problem.

Regards

Antoine.



Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Glenn Linderman

On approximately 5/6/2009 6:33 AM, came the following characters from 
the keyboard of Stephen J. Turnbull:

Martin v. Löwis writes:
  In any case, Python 3.1b1 may get released today, so it's way too late
  for new features in the PEP. They can wait for Python 3.2.

You have convinced me that the PEP should wait as well.

In its current form it is incomplete and dangerous.



I see nothing in this thread that suggests that the PEP is dangerous in 
its current form.


While I (still) think that more readable transcodings could have been 
used, and while I had difficulty fully understanding the PEP at first, 
now that I think I do understand the PEP, and it has been somewhat 
clarified and amended, I cannot see how it could be dangerous.  A 
specific case of danger should be included with such a statement.


Regarding incomplete, I agree it won't brush my teeth for me, but I 
think it does solve the problem it sets out to solve.



--
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Glenn Linderman

On approximately 5/6/2009 3:08 AM, came the following characters from 
the keyboard of MRAB:

M.-A. Lemburg wrote:

Martin v. Löwis wrote:



Judging by the existing names, I think that 'surrogate' would be
reasonable. It already contains the meaning of substitute, it's not too
long, and the codes which act as replacements are already called
surrogates.


I want to avoid any such confusion with Python codecs and don't
understand why you are making a problem out of this.



+1 for surrogate as the name for the error handler.


--
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Glenn Linderman

On approximately 5/6/2009 12:53 AM, came the following characters from 
the keyboard of Martin v. Löwis:



Sorry!  I suggest substituting the paragraph above for the paragraph
which begins The encode error handler interface presently requires...
at line 129.


Ah, ok. This was Glenn Linderman's text before - now it's yours :-)



Which is fine by me.  Stephen's is more explanatory than mine, but says 
the same thing.


--
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Terry Reedy


Glenn Linderman wrote:
On approximately 5/6/2009 3:08 AM, came the following characters from 
the keyboard of MRAB:

M.-A. Lemburg wrote:

Martin v. Löwis wrote:



Judging by the existing names, I think that 'surrogate' would be
reasonable. It already contains the meaning of substitute, it's not too
long, and the codes which act as replacements are already called
surrogates.


I want to avoid any such confusion with Python codecs and don't
understand why you are making a problem out of this.



+1 for surrogate as the name for the error handler.



+1 from me also


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Zooko Wilcox-O'Hearn


On May 6, 2009, at 10:54 AM, Antoine Pitrou wrote:


Zooko Wilcox-O'Hearn zooko at zooko.com writes:


I'm not thinking of API compatibility as much as data  
compatibility -- someone used Python 3.1 to write down some  
filenames, and now a few years later they are trying to use the  
latest and greatest Python release to read those filenames...


Well, if the filenames are generated by Python (as opposed to read  
from an existing directory on disk), they should be regular unicode  
objects without any lone surrogates, so I don't see the  
compatibility problem.


I meant that the application reads filenames from an existing  
directory on disk, saves those filenames, and then later, using a  
future version of Python, wants to read them and use them.


I'm not saying that I know this would be a problem.  I'm saying that  
I personally can't tell whether it would be a problem or not, and the  
extensive discussions so far have not convinced me that there is  
anyone who both understands PEP 383 and considers this use case.


Many people who apparently understand encoding issues well have said  
something to the effect that there is no problem, but those people  
haven't yet managed to get through my thick skull how I would use PEP  
383 safely for this sort of use case -- the one where data generated  
by os.listdir() travels forward in time or the one where that data  
travels sideways to other systems, including Windows or other systems  
that validate incoming unicode.
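
One possible shape for the "sideways" case, offered as a sketch under
assumptions rather than a recommendation: before a name leaves the
process, check whether it carries PEP 383 escapes, and if so ship the
original bytes instead of text. It assumes a current Python 3, where
the handler is named 'surrogateescape'. The helper names
has_escaped_bytes and portable_form are hypothetical.

```python
def has_escaped_bytes(name):
    """True if name carries PEP 383 escapes (U+DC80..U+DCFF)."""
    return any(0xDC80 <= ord(ch) <= 0xDCFF for ch in name)

def portable_form(name, fs_encoding="utf-8"):
    """Return ('text', str) for valid Unicode, else ('bytes', raw bytes)."""
    if has_escaped_bytes(name):
        # Recover the exact bytes the file system originally supplied.
        return ("bytes", name.encode(fs_encoding, "surrogateescape"))
    return ("text", name)

# Simulate what os.listdir() would return for two raw file names.
clean = b"report.txt".decode("utf-8", "surrogateescape")
dirty = b"caf\xe9.txt".decode("utf-8", "surrogateescape")

# A strict encode is what a validating peer would effectively perform.
try:
    dirty.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

Whether to send bytes, reject the name, or rename the file is an
application decision; the sketch only shows that the escaped names can
be detected reliably.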


That's why I am a bit uncomfortable about PEP 383 being quickly  
implemented and deployed in Python 3.1.


By the way, much of the detailed discussion about what Tahoe requires  
and how that may or may not benefit from PEP 383 has now moved to the  
tahoe-dev mailing list: http://allmydata.org/cgi-bin/mailman/listinfo/ 
tahoe-dev .


Regards,

Zooko


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Glenn Linderman

On approximately 5/6/2009 12:18 PM, came the following characters from 
the keyboard of Zooko Wilcox-O'Hearn:

On May 6, 2009, at 10:54 AM, Antoine Pitrou wrote:


Zooko Wilcox-O'Hearn zooko at zooko.com writes:


I'm not thinking of API compatibility as much as data compatibility 
-- someone used Python 3.1 to write down some filenames, and now a 
few years later they are trying to use the latest and greatest Python 
release to read those filenames...


Well, if the filenames are generated by Python (as opposed to read 
from an existing directory on disk), they should be regular unicode 
objects without any lone surrogates, so I don't see the compatibility 
problem.


I meant that the application reads filenames from an existing directory 
on disk, saves those filenames, and then later, using a future version 
of Python, wants to read them and use them.



Regarding future versions of Python.  In the worst case, even if 
Python's default behavior changes, the transcoding done by PEP 383 can 
be done in other software too... it is a straightforward, fully 
specified, 1-to-1, reversible transcoding process, affecting and 
generating only invalid byte encodings on one side, and invalid Unicode 
sequences on the other.


So if Python's default behavior should change, the transcoding 
implemented by PEP 383 could be easily reimplemented to enable a future 
version of a Python application to manipulate the transcoded, saved, 
filenames.


By easily, I mean that I could code it in a couple hours, max.
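
To give a sense of the size of such a reimplementation, here is a rough
sketch written as a custom error handler for a current Python 3. The
handler name 'example-pep383' and the function escape_bytes are made up
for the illustration, and the code is not the implementation that ships
with Python. It is compared against the built-in 'surrogateescape'
handler only as a sanity check.

```python
import codecs

def escape_bytes(exc):
    """Map undecodable bytes 0x80..0xFF to U+DC80..U+DCFF and back."""
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        # Bytes in the ASCII range are never escaped.
        if any(b < 0x80 for b in bad):
            raise exc
        return "".join(chr(0xDC00 + b) for b in bad), exc.end
    if isinstance(exc, UnicodeEncodeError):
        out = bytearray()
        for ch in exc.object[exc.start:exc.end]:
            code = ord(ch)
            # Only the escape range maps back to a byte.
            if not 0xDC80 <= code <= 0xDCFF:
                raise exc
            out.append(code - 0xDC00)
        return bytes(out), exc.end
    raise exc

codecs.register_error("example-pep383", escape_bytes)

raw = b"caf\xe9 \x8a| ok"
text = raw.decode("utf-8", "example-pep383")
back = text.encode("utf-8", "example-pep383")
```

The mapping is one byte to one code point in each direction, which is
what makes it reversible.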


I'm not saying that I know this would be a problem.  I'm saying that I 
personally can't tell whether it would be a problem or not, and the 
extensive discussions so far have not convinced me that there is anyone 
who both understands PEP 383 and considers this use case.



Does the above help?


Many people who apparently understand encoding issues well have said 
something to the effect that there is no problem, but those people 
haven't yet managed to get through my thick skull how I would use PEP 
383 safely for this sort of use case -- the one where data generated by 
os.listdir() travels forward in time or the one where that data travels 
sideways to other systems, including Windows or other systems that 
validate incoming unicode.



Regarding data traveling sideways, some comments:

1) PEP 383's effect could be recoded in other languages as easily as it 
is in Python (or the C in which Python is implemented).  So that could be 
a solution.


2) You mention "Windows" and "other systems that validate incoming 
unicode" in the same phrase, as if you think that Windows qualifies as 
one of the "other systems that validate incoming unicode", but it does 
not (at least not universally).



That's why I am a bit uncomfortable about PEP 383 being quickly 
implemented and deployed in Python 3.1.



Does the above help?


By the way, much of the detailed discussion about what Tahoe requires 
and how that may or may not benefit from PEP 383 has now moved to the 
tahoe-dev mailing list: 
http://allmydata.org/cgi-bin/mailman/listinfo/tahoe-dev .



I have no background with Tahoe, nor particular interest, although it 
sounds like a useful project... so I won't be joining that list.  I have 
no idea if there is an installed base of existing Tahoe file systems; my 
suggestions below assume that there is not, and that you are presently 
inventing them.  Therefore, I provide no migration path, although I 
could invent one, but it would take longer to describe.


However, since I'm responding here, and have read what you have posted 
here, it seems like the following could be true.


Assumptions from your emails:

A) Tahoe wants to provide a UTF-8 file name system
B) Tahoe wants to interface to POSIX systems that use (and do not 
validate) byte interfaces.
C) Tahoe wants to interface to non-POSIX systems that use 16-bit file 
name interfaces, with no validation.
D) Tahoe wants to interface to non-POSIX systems that use 16-bit file 
name interfaces, with validation.


Uncertainties: I'm not clear on what your goals are for Tahoe filenames. 
 There seem to be 2 possibilities:


1) you want to reject attempts to use non-validating Unicode, be it from 
a 16-bit interface, or a bytes interface.
2) you don't want to reject non-validating Unicode, but you want to 
convert it to valid Unicode for (D) systems.


3) Orthogonally, you might want to store only Valid Unicode in the 
names, or you might not care, if you can meet the other goals.


Truisms:

If you want to support (D), and (2), then you must transform names at 
some point, using some scheme, because not all names supplied by (B) 
systems will be acceptable to (D) systems.  You can choose to do this 
transformation when a (B) system provides an invalid (per Unicode) name, 
or you can choose to do the transformation when a (D) system accesses a 
file with an invalid (per Unicode) name.


If the (B) and (D) systems talk to each other outside of Tahoe, they 
will have to do similar

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Martin v. Löwis

 The name utf8b suggested in the PEP is not in line with the codec
 design
 Where is that design documented, and how exactly does the name violate
 the design (chapter and verse, please)?
 
 Martin, I designed the whole Python codec machinery

Not true. PEP 293 was written and designed by Walter Dörwald.

 so even if
 this is not explicitly written down somewhere, you can take my
 word for it.

If the design was specified in writing somewhere, I would probably
challenge it as obsolete. If it isn't described anywhere, I'll have
to ignore it.

 I want to avoid any such confusion with Python codecs and don't
 understand why you are making a problem out of this.

Because utf8b (or, perhaps UTF-8b) is the official name for this
algorithm:

http://hyperreal.org/~est/utf-8b/

Regards,
Martin
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Martin v. Löwis

 I'm sorry for the lack of clarity of my posts, but somehow you're
 completely missing the point.  The point is precisely that Python
 *won't* use Shift JIS as the file system encoding (if it did there
 would be no problem with reading Shift JIS), but the people who
 created the media *did*.
 
 Now, with Python's file system encoding == UTF-8 or any packed EUC,
 and more than a handful of Shift JIS or Big5 characters in file names,
 one is *almost certain* to encounter ASCII as the second byte of a
 multibyte sequence.  PEP 383 can't handle this

Not true. PEP 383 handles this very example just fine, with no problems
that I can see. Can you propose a specific example that you think might
cause problems? By specific, I mean: what file names (exact bytes,
please), what locale charset, what API calls.

Regards,
Martin

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Martin v. Löwis

 Judging by the existing names, I think that 'surrogate' would be
 reasonable

MAL's list of existing names is incomplete. "surrogates" is already
an existing name as well, and it means something different (similar,
but different).

Regards,
Martin

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Martin v. Löwis

Terry Reedy wrote:
 Glenn Linderman wrote:
 On approximately 5/6/2009 3:08 AM, came the following characters from
 the keyboard of MRAB:
 M.-A. Lemburg wrote:
 Martin v. Löwis wrote:

 Judging by the existing names, I think that 'surrogate' would be
 reasonable. It already contains the meaning of substitute, it's not too
 long, and the codes which act as replacements are already called
 surrogates.

 I want to avoid any such confusion with Python codecs and don't
 understand why you are making a problem out of this.


 +1 for surrogate as the name for the error handler.


 +1 from me also

Despite there being also an error handler called surrogates.

Are you serious?

Regards,
Martin


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Martin v. Löwis

 Is it only usable with utf8 as an encoding?

No, it applies to any codec which potentially cannot decode
all bytes > 127.
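
A minimal sketch of that codec independence, for readers on a released Python 3. It assumes the handler's eventual name, surrogateescape (the name utf8b used in this thread is not the one that shipped), and the sample bytes are made up for illustration. The same handler round-trips an undecodable byte under two different codecs:

```python
raw = b"caf\xe9.txt"  # Latin-1 bytes: neither valid UTF-8 nor pure ASCII

for codec in ("utf-8", "ascii"):
    text = raw.decode(codec, "surrogateescape")
    # The undecodable byte 0xE9 becomes the lone surrogate U+DCE9.
    assert text == "caf\udce9.txt"
    # Encoding with the same codec and handler restores the original bytes.
    assert text.encode(codec, "surrogateescape") == raw
```

The round trip is only guaranteed when decoding and encoding use the same codec.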

Regards,
Martin

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Antoine Pitrou

Martin v. Löwis martin at v.loewis.de writes:
 
 Despite there being also an error handler called surrogates.

People, perhaps we could end all the bikeshedding and call one of those handlers
surrogates-pass and the other surrogates-escape, which sounds quite faithful
to what they actually /do/?

Regards

Antoine.



Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Martin v. Löwis

 But first, it should be stopped by any of several
 standard precautions.  For example, applying os.path.realpath (come to
 think of it, PEP 383 should say something about realpath, shouldn't
 it?)

Why do you think so? I think the existing documentation of realpath
is correct and complete.

 and os.path.normpath (PEP 383 should definitely say something
 about this function

Precisely what?

 maybe PEP 3131 should, too)

How can this be of relevance?

   Nothing is lost at the moment.
 
 Nothing is lost compared to 'strict', true, but under the PEP as it is
 a large fraction of Shift JIS and Big5 filenames cannot be read under
 ASCII-compatible file system encodings using 'utf8b'.  Yet it is those
 users who are placed at risk by PEP 383.

I think this statement is incorrect. Those filenames *can* be read just
fine.

Regards,
Martin

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Martin v. Löwis

Antoine Pitrou wrote:
 Martin v. Löwis martin at v.loewis.de writes:
 Despite there being also an error handler called surrogates.
 
 People, perhaps we could end all the bikeshedding and call one of those 
 handlers
 surrogates-pass and the other surrogates-escape, which sounds quite 
 faithful
 to what they actually /do/?

The problem with these bike-shedding discussions is that you cannot stop
them with a proposal. People will counter-propose.

I would be willing to accept a ruling from someone who a) is a native
speaker of English, and b) has demonstrated to fully understand what
these do, and c) has understood why I insist on calling it utf8b.

Regards,
Martin

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Terry Reedy


Martin v. Löwis wrote:


+1 for surrogate as the name for the error handler.



+1 from me also


Despite there being also an error handler called surrogates.


Given that additional information which MAL apparently omitted, I would 
revise.



Are you serious?


Are you? ;-?  You are the one naming a codec-agnostic error handler (if 
I understand correctly, and correct me if I do not) after a particular 
codec, and denying that that could cause confusion.  See other message.


Terry Jan Reedy




Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Terry Reedy


Martin v. Löwis wrote:


Because utf8b (or, perhaps UTF-8b) is the official name for this
algorithm:
http://hyperreal.org/~est/utf-8b/


Thank you for the link.  It starts:
This directory contains a C implementation of a UTF-8b codec.
A Python codec based on it is provided as well.

'UTF-8b' consists, obviously, of 'UTF-8' plus 'b', with the 'b' signifying 
a variation of or addition to UTF-8.  The 'b', and only the 'b', refers 
to the innovative error-handler that was added to the existing 'UTF-8' 
codec/algorithm.  The name of the combined whole is not the name of the 
part.


If you were incorporating the Python-wrapped utf-8b *codec* as a codec, 
which is what I once thought *because you used that name*, then calling 
it 'utf-8b' would be fine.  But you apparently instead proposed and 
implemented an *error-handler*, which seems to me to be something else, 
and which will not be specific to utf-8 but usable with any codec. 
Hence some of us think it should have a different name.


I gather that you lifted the error-handler part of the algorithm and 
propose to use it with *any* ascii-respecting codec.  I could claim that 
the 'official name' of that part is 'b', but I think we can find a 
better name.


Terry Jan Reedy



Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Paul Moore

2009/5/6 Antoine Pitrou solip...@pitrou.net:
 Martin v. Löwis martin at v.loewis.de writes:

 Despite there being also an error handler called surrogates.

 People, perhaps we could end all the bikeshedding and call one of those 
 handlers
 surrogates-pass and the other surrogates-escape, which sounds quite 
 faithful
 to what they actually /do/?

We could also stop the bikeshedding by sticking with the name utf8b.
Martin's comment that it is the official name for this algorithm seems
compelling to me (even if it is confusing because of its similarity
with utf-8).

Paul.

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Terry Reedy


Martin v. Löwis wrote:

Antoine Pitrou wrote:

Martin v. Löwis martin at v.loewis.de writes:

Despite there being also an error handler called surrogates.

People, perhaps we could end all the bikeshedding and call one of those handlers
surrogates-pass and the other surrogates-escape, which sounds quite faithful
to what they actually /do/?


The problem with these bike-shedding discussions is that you cannot stop
them with a proposal. People will counter-propose.

I would be willing to accept a ruling from someone who a) is a native
speaker of English, and b) has demonstrated to fully understand what
these do, and c) has understood why I insist on calling it utf8b.


I qualify with a). I believe I understand c) but, as explained in my 
other post, I do not think your reason applies.  In fact, I think 
concern for naming rights might suggest that you *not* reuse the name 
for something different.  I would have to learn more about the existing 
'surrogates' handler to judge Antoine's suggestion 'surrogates-pass'. 
'Surrogates-escape' is pretty good for the new handler since, to my 
understanding, it 'escapes' 'bad bytes' by prefixing them with bits that 
push them to the surrogates plane.


I have been supportive of the idea and, as well as I understood them, 
the particulars of your proposal, from the beginning.  Reusing the name 
of a codec as the name of an error-handler confused me and I believe it 
will confuse others, even though, but also because, the error handler 
was extracted and generalized from the codec.


Terry Jan Reedy



Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Martin v. Löwis

 Are you serious?
 
 Are you? ;-?  You are the one naming a codec-agnostic error handler (if
 I understand correctly, and correct me if I do not) after a particular
 codec, and denying that that could cause confusion.  See other message.

I can only repeat what I said before: I call it utf8b because that's
the established name for the algorithm it implements.

That algorithm was originally designed with UTF-8 in mind (and only
meant to be applied for UTF-8), however, it remains the same algorithm
even though PEP 383 widens its application.

Regards,
Martin

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread MRAB


Antoine Pitrou wrote:

Martin v. Löwis martin at v.loewis.de writes:

Despite there being also an error handler called surrogates.


People, perhaps we could end all the bikeshedding and call one of those handlers
surrogates-pass and the other surrogates-escape, which sounds quite faithful
to what they actually /do/?


After having read about the existing error handler called surrogates
and having thought about it, I've decided that calling one just
surrogates isn't very helpful to the user; it has something to do with
surrogates, but what?

So +1 for Antoine's suggestion from me.

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Martin v. Löwis

 I qualify with a). I believe I understand c) but, as explained in my
 other post, I do not think your reason applies.  In fact, I think
 concern for naming rights might suggest that you *not* reuse the name
 for something different.  I would have to learn more about the existing
 'surrogates' handler to judge Antoine's suggestion 'surrogates-pass'.
 'Surrogates-escape' is pretty good for the new handler since, to my
 understanding, it 'escapes' 'bad bytes' by prefixing them with bits that
 push them to the surrogates plane.

See issue 3672. In essence, in python 2.5:

py> u"\ud800".encode("utf-8")
'\xed\xa0\x80'
py> '\xed\xa0\x80'.decode("utf-8")
u'\ud800'

In 3.1,

py> "\ud800".encode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in
position 0: surrogates not allowed
py> "\ud800".encode("utf-8","surrogates")
b'\xed\xa0\x80'
py> b'\xed\xa0\x80'.decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-2:
illegal encoding
py> b'\xed\xa0\x80'.decode("utf-8","surrogates")
'\ud800'
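
For readers on a released Python 3: as far as I know, the handler called surrogates in this session shipped under the name surrogatepass. A small sketch of the same round trip under that assumed name (the variable names are illustrative only):

```python
lone = "\ud800"  # a lone high surrogate

# The strict UTF-8 codec refuses to encode a lone surrogate.
try:
    lone.encode("utf-8")
except UnicodeEncodeError:
    strict_encode_ok = False
else:
    strict_encode_ok = True

# With surrogatepass the surrogate is encoded as a three-byte sequence
# and decoded back again.
encoded = lone.encode("utf-8", "surrogatepass")
decoded = encoded.decode("utf-8", "surrogatepass")
```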

Regards,
Martin

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Antoine Pitrou

Martin v. Löwis martin at v.loewis.de writes:
 py> b'\xed\xa0\x80'.decode("utf-8","surrogates")
 '\ud800'

The point is, surrogates does not mean anything intuitive for an /error
handler/. You seem to be the only one who finds this name explicit enough,
perhaps because you chose it.
Most other handlers' names have verbs in them (ignore, replace,
xmlcharrefreplace, etc.).

Regards

Antoine.



Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Michael Urman

On Wed, May 6, 2009 at 15:42, Martin v. Löwis mar...@v.loewis.de wrote:
 Despite there being also an error handler called surrogates.

Not that I have to be, but I'm not sold on the previous UTF-8 codec
behavior becoming an error handler of the name surrogates for two
reasons (I do respect the obvious PBP argument for the implementation,
and have no better name - lenient?).

First, unless there's a way to stack error handlers, there's no way to
access the old behavior combined with the replace handler. Second,
errors="surrogates" reads like surrogates should be an error, not an
additionally allowed pattern. Neither of these are deal breakers or
hard to learn, but they are non-obvious. I think the utf8b behavior
makes a lot more sense with the name surrogates, through the
mnemonic that errors become surrogates.

The stacking argument also applies to the new utf8b behavior on encode
(only, as it handles all errors on decode). This may be a YAGNI, but
for a non-UTF-8 encode, it may be useful to allow xmlcharrefreplace
handling for unavailable non-surrogate-escaped characters. But without
stacking that's unmaintainable, as we clearly don't want ${codec}b for
all current codecs.

I'd be perfectly happy with utf8b or UTF-8b, as either a codec or an
error handler (do we want both? YAGNI?). So what if it smells a little
inaccurate as a handler when used with codecs other than UTF-8, no big
deal. I could also see something like errors="roundtrip" which
explains the intention of the handler rather than the algorithm, but
is awkward on encode when it encounters unavailable Unicode
characters.

-- 
Michael Urman

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread M.-A. Lemburg

Martin v. Löwis wrote:
 The name utf8b suggested in the PEP is not in line with the codec
 design
 Where is that design documented, and how exactly does the name violate
 the design (chapter and verse, please)?
 Martin, I designed the whole Python codec machinery
 
 Not true. PEP 293 was written and designed by Walter Dörwald.

Walter added the generic error handler callback mechanism and
we both worked on its design.

I designed and wrote the codec implementation back in 2000,
which included the whole idea of having codec error handlers in the
first place.

The original implementation only allowed per-codec
error handlers. Walter extended this to build general-purpose
handlers that could be used by many codecs. His original
motivation was to be able to do XML character reference
escaping.

If you don't believe me, go look this up in the repository, the
mailing list archives and the trackers.

 so even if
 this is not explicitly written down somewhere, you can take my
 word for it.
 
 If the design was specified in writing somewhere, I would probably
 challenge it as obsolete. If it isn't described anywhere, I'll have
 to ignore it.

Ah, lovely attitude.

 I want to avoid any such confusion with Python codecs and don't
 understand why you are making a problem out of this.
 
 Because utf8b (or, perhaps UTF-8b) is the official name for this
 algorithm:
 
 http://hyperreal.org/~est/utf-8b/

That's a codec implementing the escaping idea proposed by Markus
Kuhn, not an official reference. AFAIK, the term UTF-8B originated
from a UTF-8 + binary codec written for iconv:

http://mail.nl.linux.org/linux-utf8/2006-04/msg2.html

If it were the official name of an escape algorithm, as you are
suggesting, the inventor Markus Kuhn would probably have chosen
it, but he hasn't... the only reference to it is an email where it
is described as option D for ways of dealing with malformed
UTF-8 data in a decoder:

http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html

Note that this escape method is not applicable for data that
you decode from UTF-8 and then e.g. encode as Latin-1. It only
works as general purpose method if you are decoding and encoding
using the same codec, since it is specifically designed to
assure round-trip safety.

Martin, please stop being silly and just change the name.

Or drop the idea of using an error handler altogether and just let
people use the utf-8b codec you referenced above to solve their
problems wherever and if needed.

Thanks,
-- 
Marc-Andre Lemburg
eGenix.com


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Stephen J. Turnbull

Martin v. Löwis writes:

   Now, with Python's file system encoding == UTF-8 or any packed EUC,
   and more than a handful of Shift JIS or Big5 characters in file names,
   one is *almost certain* to encounter ASCII as the second byte of a
   multibyte sequence.  PEP 383 can't handle this

Ah, I see.  Of course, the algorithm not only has to handle the ASCII
octet which is erroneous because it can't be a trailing byte, but
*also the leading byte that signalled to expect a trailing byte > 127*.
So the algorithm backs up to the character boundary (which is
well-defined for all the sane encodings), encodes the high byte(s) in
the character with lone surrogates, and encodes the ASCII as itself
(promoted to a Unicode code point).
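
A minimal sketch of this case, assuming a UTF-8 file system encoding and the handler's released name, surrogateescape (not the utf8b name used in this thread). The file name is a made-up example whose Shift JIS encoding has an ASCII backslash as its second byte:

```python
fs_encoding = "utf-8"  # assumed file system encoding

# U+30BD (katakana SO) in Shift JIS is 0x83 0x5C; 0x5C is the ASCII backslash.
raw = "\u30bd.txt".encode("shift_jis")
assert raw == b"\x83\\.txt"

name = raw.decode(fs_encoding, "surrogateescape")
# 0x83 cannot start a UTF-8 sequence, so it is escaped; 0x5C stays ASCII.
assert name == "\udc83\\.txt"
# Encoding with the same codec and handler gives back the original bytes.
assert name.encode(fs_encoding, "surrogateescape") == raw

# A byte that announces a multibyte UTF-8 sequence but is followed by ASCII:
# only the lead byte is escaped, and the ASCII byte is decoded as itself.
lead_then_ascii = b"\xe3A".decode(fs_encoding, "surrogateescape")
assert lead_then_ascii == "\udce3A"
```

The decoded name is not readable Japanese, but the original bytes are recoverable, so the file can still be opened.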

Sorry, you're right, I was just confused.  I withdraw the objection as
completely mistaken, and apologize for not thinking more carefully in
the first place.

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Terry Reedy


Martin v. Löwis wrote:

Are you serious?

Are you? ;-?  You are the one naming a codec-agnostic error handler (if
I understand correctly, and correct me if I do not) after a particular
codec, and denying that that could cause confusion.  See other message.


I can only repeat what I said before: I call it


What, specifically, is 'it'?


utf8b because that's
the established name for the algorithm


Which algorithm?


it implements.


Again, what is 'it'?

As *I* read the sentence above, it is not true.

I went to the site you referred to as the source of your reasoning and 
specifically

http://hyperreal.org/~est/utf-8b/releases/utf-8b-20060413043934/utf_8b.c

The algorithm called utf-8b *IS* utf-8 with the addition or replacement 
(of an error return) of essentially one line in each direction:


# encode
if 0xDC00 <= codepoint <= 0xDCFF:
    byte = codepoint - 0xDC00 #encode

Note: for security reasons, you are raising the lower limit to 
0xDC80. The comment at the top of utf_8b.c suggests that that is 
what it should be, and should have been in the file, with the other half 
of that surrogate area being an error along with the other surrogate area.


#decode
if (0x80 <= byte <= 0xFF) and utf-8-invalid(byte):
    codepoint = byte + 0xDC00 # decode
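
The two one-line mappings can be written out as a runnable sketch. The function names are hypothetical, the lower limit is 0xDC80 as discussed for PEP 383, and the cross-check assumes the handler's released name, surrogateescape:

```python
def escape_byte(byte: int) -> str:
    """Map an undecodable byte (0x80..0xFF) to a lone low surrogate."""
    if not 0x80 <= byte <= 0xFF:
        raise ValueError("only bytes 0x80..0xFF are escaped")
    return chr(0xDC00 + byte)


def unescape_char(ch: str) -> int:
    """Map a lone surrogate U+DC80..U+DCFF back to the original byte."""
    codepoint = ord(ch)
    if not 0xDC80 <= codepoint <= 0xDCFF:
        raise ValueError("not an escaped byte")
    return codepoint - 0xDC00


# Cross-check against the built-in handler for every byte it can escape.
for byte in range(0x80, 0x100):
    ch = escape_byte(byte)
    assert bytes([byte]).decode("ascii", "surrogateescape") == ch
    assert unescape_char(ch) == byte
```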


That algorithm was originally designed with UTF-8 in mind (and only
meant to be applied for UTF-8), however, it remains the same algorithm
even though PEP 383 widens its application.


The error handler designed with utf-8 in mind has no name in the encode 
direction and is called utf_8b_decoder_invalid_bytes in the decode 
direction.  By your reasoning, *that* should be its name in Python.  The 
encoding error handler would then be named analogously 
utf_8b_encoder_invalid_codepoints.  Even these, to me, would be better 
than confusingly giving them the same name as the codec.


Terry Jan Reedy


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Glenn Linderman

On approximately 5/6/2009 6:06 PM, came the following characters from 
the keyboard of M.-A. Lemburg:



Martin, please stop being silly and just change the name.



Yes, please.  If indeed Marc-Andre invented the codec business as he 
claims, he would be an appropriate person to give a fiat name to the 
error handler.




Or drop the idea of using an error handler altogether and just let
people use the utf-8b codec you referenced above to solve their
problems wherever and if needed.



The design as an error handler is clever in leveraging the same error 
handler for multiple codecs, which cannot be done by using utf-8b alone, 
if I understand correctly.



--
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Martin v. Löwis

Michael Urman wrote:
 On Wed, May 6, 2009 at 15:42, Martin v. Löwis mar...@v.loewis.de wrote:
 Despite there being also an error handler called surrogates.
 
 Not that I have to be, but I'm not sold on the previous UTF-8 codec
 behavior becoming an error handler of the name surrogates for two
 reasons (I do respect the obvious PBP argument for the implementation,
 and have no better name - lenient?).

PBP?

 First, unless there's a way to stack error handlers, there's no way to
 access the old behavior combined with the replace handler.

Well, there is a way to stack error handlers, although it's not pretty:

import codecs

_surrogates = codecs.lookup_error("surrogates")
_replace = codecs.lookup_error("replace")
def surrogates_then_replace(exc):
    # Try the surrogates handler first; fall back to replace if it fails.
    try:
        return _surrogates(exc)
    except UnicodeError:
        return _replace(exc)
codecs.register_error("surrogates_then_replace",
                      surrogates_then_replace)
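
A runnable variant of the same stacking idea for a released Python 3, offered as a sketch only. It assumes the PEP 383 handler's released name, surrogateescape; the combined handler name and the sample string are made up for illustration:

```python
import codecs

_escape = codecs.lookup_error("surrogateescape")
_replace = codecs.lookup_error("replace")


def surrogateescape_then_replace(exc):
    # Let surrogateescape restore escaped bytes; anything it rejects
    # falls through to the ordinary replace handler.
    try:
        return _escape(exc)
    except UnicodeError:
        return _replace(exc)


codecs.register_error("surrogateescape_then_replace",
                      surrogateescape_then_replace)

# U+DCE9 is an escaped byte 0xE9; U+20AC (euro sign) is not encodable in ASCII.
result = "caf\udce9 \u20ac".encode("ascii", "surrogateescape_then_replace")
```

The escaped byte comes back as 0xE9, while the euro sign is replaced with a question mark.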

 The stacking argument also applies to the new utf8b behavior on encode
 (only, as it handles all errors on decode). This may be a YAGNI

Indeed - in particular, as, in the primary application of this error
handler (i.e. file IO operations), there is no way of specifying
an additional error handler anyway.

Regards,
Martin

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Martin v. Löwis

 The error handler designed with utf-8 in mind has no name in the encode
 direction and is called utf_8b_decoder_invalid_bytes in the decode
 direction.  By your reasoning, *that* should be its name in Python.  The
 encoding error handler would then be named analogously
 utf_8b_encoder_invalid_codepoints.  Even these, to me, would be better
 than confusingly giving them the same name as the codec.

So are you proposing that I should rename the PEP 383 handler
to utf_8b_encoder_invalid_codepoints?

Regards,
Martin

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-05 Thread M.-A. Lemburg

On 2009-05-03 19:39, Martin v. Löwis wrote:
 If the error handler is supposed to be used for codecs other than utf-8,
 perhaps it should be renamed something more generic, e.g. surrogate-escape?
 
 Perhaps. However, utf-8b doesn't really have anything to do with utf-8 -
 it's an algorithm based on 16-bit or 32-bit code points.

If the error handler doesn't have anything to do with UTF-8, then why
do you use utf8 in the name?

Please use a more descriptive name for the handler which does not cause
confusion with an existing codec.

-- 
Marc-Andre Lemburg
eGenix.com


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-05 Thread Terry Reedy


M.-A. Lemburg wrote:

On 2009-05-03 19:39, Martin v. Löwis wrote:

If the error handler is supposed to be used for codecs other than utf-8,
perhaps it should be renamed something more generic, e.g. surrogate-escape?

Perhaps. However, utf-8b doesn't really have anything to do with utf-8 -
it's an algorithm based on 16-bit or 32-bit code points.


If the error handler doesn't have anything to do with UTF-8, then why
do you use utf8 in the name?

Please use a more descriptive name for the handler which does not cause
confusion with an existing codec.


Having already been confused, I agree.



Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-05 Thread Stephen J. Turnbull

M.-A. Lemburg writes:
  On 2009-05-03 19:39, Martin v. Löwis wrote:
   If the error handler is supposed to be used for codecs other than utf-8,
   perhaps it should be renamed to something more generic, e.g. surrogate-escape?
   
   Perhaps. However, utf-8b doesn't really have to do anything with utf-8 -
   it's an algorithm based on 16-bit or 32-bit code points.

I don't understand this phrasing.  The algorithm is only applicable to
ASCII-compatible octet streams.  It results in code points by a simple
displacement of octet -> octet + 0xDC00.  It cannot be used on (say)
UTF-32 to deal with embedded surrogates.

Certainly, the computation requires (at least) 16 bit numbers, but the
input must be restricted to a stream of 8-bit code points, while the
output is 16- or 32-bit code points.
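
The displacement can be checked directly. What follows is a minimal sketch, assuming a Python 3 in which this handler is registered under the name 'surrogateescape' (the name it eventually shipped under; in this thread it is still called 'utf8b'). The file name is made up for illustration.

```python
# Assumption: the handler discussed here is registered as "surrogateescape".
raw = b"caf\xe9.txt"  # hypothetical name: Latin-1 bytes, not valid UTF-8

name = raw.decode("utf-8", errors="surrogateescape")

# The undecodable octet 0xE9 is displaced into the low-surrogate range.
escaped = ord(name[3])
print(hex(escaped))

# Encoding with the same handler reverses the displacement.
restored = name.encode("utf-8", errors="surrogateescape")
print(restored == raw)
```

Only the octet that fails to decode is displaced; the ASCII bytes around it decode as usual.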

  Please use a more descriptive name [than utf-8b] for the handler
  which does not cause confusion with an existing codec.

But please don't use "surrogate-escape" or (as in the current PEP)
"python-escape"; it's not an escaping (quotation) mechanism.
"surrogate-replace", "surrogate-substitute", or "surrogate-translate"
would be better names.

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-05 Thread Zooko O'Whielacronx

On Tue, May 5, 2009 at 8:57 AM, Stephen J. Turnbull step...@xemacs.org wrote:

 2.  The specification should state, and the discussion emphasize, that
    strings which were produced by surrogate replacement *must not* be
    used in data interchange with systems that do not specifically
    accept such strings, and that this is the responsibility of the
    application.[2]

That sounds like a useful statement to make.  How would an application
make sure that it was producing only valid unicode?  How about adding
an option to os.listdir() named "errors" with default value 'utf8b'
(or 'surrogate-replace', or whatever the name is)?  Then applications
which need to produce only valid unicode strings could pass
errors='strict', errors='ignore', or errors='replace'.  (If anyone really
wants behavior like Python 3.0 then we could perhaps also add a new
one just for os.listdir() named errors='skipfilename'.)

My most recent plan for Tahoe, as of the letter that I sent last
night, is to emulate the behavior of Nautilus and GNU ls by using the
'replace' error handler and (emulating Nautilus) to append " (invalid
encoding)" to the end of the string.  (screenshot:
http://zooko.com/Nautilus_vs_invalid_encoding.png )

So if I could ask os.listdir to return filenames with U+FFFD in place
of undecodable characters, then I could subsequently do something
like:

    for f in os.listdir(d, errors='replace'):
        if u"\ufffd" in f:
            f += u" (invalid encoding)"

(On top of that I would have to check for collisions, but that's out of scope.)
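
The same idea can be tried without the proposed errors= argument by decoding byte names by hand. This is only a sketch: display_names is a hypothetical helper, the sample names are invented, and a real caller would obtain the bytes from something like os.listdir(b".").

```python
def display_names(raw_names, encoding="utf-8"):
    """Decode byte file names for display, flagging undecodable ones."""
    shown = []
    for raw in raw_names:
        # 'replace' substitutes U+FFFD for each undecodable sequence.
        text = raw.decode(encoding, errors="replace")
        if "\ufffd" in text:
            text += " (invalid encoding)"
        shown.append(text)
    return shown


# Invented sample data: one valid UTF-8 name, one Latin-1 name.
names = display_names([b"ok.txt", b"caf\xe9.txt"])
for n in names:
    print(n)
```

A name that legitimately contains U+FFFD would be flagged as well, and the result is for display only: it cannot be turned back into the original bytes.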

Regards,

Zooko

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-05 Thread MRAB


Stephen J. Turnbull wrote:

Martin v. Löwis writes:

  I've updated the PEP accordingly.

I have three substantive comments.  First, although consequences for
Python 3 byte interfaces (ie, none) are explicitly stated, as far as
I can see this PEP could apply to Python 2 as well.  I don't think
it's intended that way.  Either way, I think you should clarify that
point.

Second, I suggest surrogate-replace as the name of the error handler
rather than utf8b.  (Elsewhere I've suggested others, but I think
this is the best of the bunch.)


+1


Third, it is not clear to me why non-decodable ASCII should be an
error.  There are plenty of low surrogates for the purpose.  Is there
another technical reason?  Stupid or not, Shift-JIS- and Big5-encoded
file systems are quite common in Asia still (including non-rewritable
media).  I think surrogate-replacement of ASCII should at least be an
option.

I don't think "people shouldn't be using non-ASCII-compatible
encodings for locale encodings" is a sufficient rationale for a hard
error here.  I mean, of course they *should* be using UTF-8.  Maybe
Python 3.1 should just go ahead and error on any other encoding on
POSIX platforms? <wink>


I don't see why the error handler couldn't in principle be used with
encodings other than UTF-8, although in that case all of the low
surrogates should be open to use.


I have a number of nitpicking comments and technical clarifications on
the PEP.  Rationale is in footnotes.  There were also a few typos I
noticed.

1.  There is no such thing as a "half-surrogate" in Unicode.  "Lone
surrogate" is clear enough.  Or for somewhat fancier English,
"isolated surrogate" or "non-syntactic surrogate".  To emphasize
that Python codecs will only produce them in contexts where a
Unicode character or high surrogate (for UTF-16 Python) is
syntactically required, "isolated low surrogate" or "isolated
trailing surrogate" might be good.[1]

2.  The specification should state, and the discussion emphasize, that
strings which were produced by surrogate replacement *must not* be
used in data interchange with systems that do not specifically
accept such strings, and that this is the responsibility of the
application.[2]

Rather than saying that dealing with such conflicts is out of
scope of this PEP, I would say

Dealing with such conflicts is the responsibility of the
application.  Since this PEP's mechanism produces valid Unicode
where possible, and produces *invalid* code points only via the
error handler, one strategy is for the application to validate all
other sources of strings as Unicode conforming.  There may be
other useful application-specific strategies, as well.

3.  In the discussion, the transition from the example of alternative
use of 'python-escape' to discussion of the error handler
interface extension is a bit abrupt.  I suggest rewriting as:

The extension to the encode error handler interface proposed by
this PEP is necessary to implement the 'utf8b' error handler,
because there are required byte sequences which cannot be
generated from replacement Unicode.  However, the encode error
handler interface presently requires replacement Unicode to be
provided in lieu of the non-encodable Unicode from the source
string.  Then it promptly encodes that replacement Unicode.  In
some error handlers, such as the 'utf8b' proposed here, it is also
simpler and more efficient for the error handler to provide a
pre-encoded replacement byte string, rather than forcing it to
calculating Unicode from which the encoder would create the
desired bytes.
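
The point about pre-encoded replacements can be illustrated with a toy handler. This is a sketch under assumptions, not the PEP's implementation: it relies on a Python 3 in which encode error handlers may return bytes, and the handler name "demo-escape-bytes" and function escape_bytes are invented here.

```python
import codecs


def escape_bytes(exc):
    """Toy encode error handler that hands back pre-encoded bytes."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    out = bytearray()
    for ch in exc.object[exc.start:exc.end]:
        code = ord(ch)
        if 0xDC80 <= code <= 0xDCFF:
            # Undo the displacement: U+DCxx becomes the single byte 0xxx.
            out.append(code - 0xDC00)
        else:
            raise exc
    # Returning bytes lets the encoder copy them through unchanged.
    return bytes(out), exc.end


# "demo-escape-bytes" is a made-up name used only for this sketch.
codecs.register_error("demo-escape-bytes", escape_bytes)

encoded = "caf\udce9".encode("ascii", errors="demo-escape-bytes")
print(encoded)
```

No replacement string could make the ASCII encoder emit the byte 0xE9, which is why the handler has to supply bytes itself.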

Typos (line references are to pep-0383.txt svn r72332):

l.  86: Byte-orientied -> Byte-oriented
l.  98, 118, 124, 127, 132, 136: python-escape -> utf8b
l. 130: provide -> provided
l. 134: calculating -> calculate


Footnotes:
[1] Unicode 5.0 uses the terms "high-half" and "low-half" at least
once, in section 16.6, but the context is such that I take it to
refer to "half of the surrogate area".  Section 3.8 doesn't use
these, instead noting that "leading" and "trailing" are sometimes
used instead of "high" and "low".  Better to avoid the word "half"
in PEP 383, I think.


Leading and trailing simply state the order, not the set (high or
low), so are not good terms to use.


[2] Since this error handler is going to be the default for POSIX I/O,
of course people are going to mostly ignore that restriction.  The
point is, passing such strings to systems that don't expect them
is a bug, and the PEP should make it clear that it's the app's
bug, not the other system's.  On the other hand, using those
strings in a context of consenting adults (and I do mean
double-opt-in here) is perfectly acceptable.  I'm specifically
thinking of use in the Tahoe protocol discussed by Zooko
O'Whielacronx; it may not be usable there for backward
compatibility reasons, but Unicode

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-05 Thread Stephen J. Turnbull

Zooko O'Whielacronx writes:

  How would an application make sure that they were producing only
  valid unicode?

That's very difficult.  There are a couple of sources that I can think
of, in Python: C modules, chr(), \u literals, and now codecs with the
'utf8b'.  There may be others.  You'd need to review your own code for
all of them very carefully, and you'd have to validate all strings
returned by non-validating APIs (which is all of them in Python now,
although many of them can probably be trusted, such as codecs not
using the 'utf8b' error handler).
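
One narrow check is cheap to write: strings carrying lone surrogates fail a strict UTF-8 encode. A minimal sketch, assuming Python 3; is_valid_for_interchange is a hypothetical helper, and it only catches lone surrogates, not every possible conformance problem.

```python
def is_valid_for_interchange(text):
    """Return True if text survives a strict UTF-8 encode."""
    try:
        text.encode("utf-8", errors="strict")
    except UnicodeEncodeError:
        # Lone surrogates (e.g. from the error handler) end up here.
        return False
    return True


clean = is_valid_for_interchange("caf\u00e9.txt")
tainted = is_valid_for_interchange("caf\udce9.txt")
print(clean, tainted)
```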

  How about add an option to os.listdir() named errors with default
  value 'utf8b'

Seems reasonable to me, but Martin's probably thought more carefully
about it.  I don't think it's applicable to your use case, though,
because you want to be able to *access* those files as well as display
the names to the users, right?  You won't be able to access those
files if you receive the names already munged by the error handler.
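
The access problem shows up as soon as the munged name is turned back into bytes. A small sketch with an invented file name, assuming a Python 3 in which the round-trip handler is registered as 'surrogateescape' (it is still called 'utf8b' in this thread):

```python
raw = b"caf\xe9.txt"  # hypothetical on-disk name, not valid UTF-8

# 'replace' discards the original byte, so the on-disk name is gone.
munged = raw.decode("utf-8", errors="replace")
munged_bytes = munged.encode("utf-8")

# Assumption: the round-trip handler is available as "surrogateescape".
escaped = raw.decode("utf-8", errors="surrogateescape")
escaped_bytes = escaped.encode("utf-8", errors="surrogateescape")

print(munged_bytes == raw, escaped_bytes == raw)
```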





Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-05 Thread Stephen J. Turnbull

MRAB writes:

   I don't think "people shouldn't be using non-ASCII-compatible
   encodings for locale encodings" is a sufficient rationale for a hard
   error here.  I mean, of course they *should* be using UTF-8.  Maybe
   Python 3.1 should just go ahead and error on any other encoding on
   POSIX platforms? <wink>
   
  I don't see why the error handler couldn't in principle be used with
  encodings other than UTF-8, although in that case all of the low
  surrogates should be open to use.

I should have been more clear here, I guess.  The error handler *can*,
and in the PEP *will be* by default, used with all sane locale
encodings on POSIX.

It occurs to me that the PEP maybe should say that it is an error
to have your POSIX locale set to UTF-16 or something like that.

What sane means in this context is

1.  ASCII NUL is the bytearray terminator, and can't be used as a byte
in a file name.  This rules out UTF-16, UTF-32, and widechar EUC
encodings, as well as some very rare ones.

2.  An ASCII character always translates to the Unicode character with
the same code (ie, to itself).  It is not a part of other
sequences (control sequences, or a trailing byte).  This rules out
EBCDIC, ISO-2022-*, Shift JIS, and Big5, among the encodings I'm
familiar with.  EBCDIC because only by accident will an EBCDIC
character map to the same ASCII character with the same code.  The
ISO-2022-* encodings are out because ASCII characters are used in
escape sequences.  Shift JIS and Big5 because in those encodings,
a high-bit-set octet signals the start of a multibyte sequence,
and some of the trailing bytes may be in the ASCII range.
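
The trailing-byte problem in point 2 is easy to see with Python's bundled codecs. A small sketch, using U+8868 as the sample character (chosen here for illustration, it is not taken from the thread):

```python
ch = "\u8868"  # a kanji whose Shift JIS trail byte falls in the ASCII range

sjis = ch.encode("shift_jis")
eucjp = ch.encode("euc_jp")

# Shift JIS: the second byte is 0x5C, the ASCII code for backslash.
sjis_has_ascii_byte = any(b < 0x80 for b in sjis)

# Packed EUC: every byte of a non-ASCII character has the high bit set.
eucjp_has_ascii_byte = any(b < 0x80 for b in eucjp)

print(sjis.hex(), sjis_has_ascii_byte)
print(eucjp.hex(), eucjp_has_ascii_byte)
```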

What's left?  Well, UTF-8, all of the ISO-8859 sets, several national
standards (such as the KOI8 family for Cyrillic), IBM and Microsoft
code pages, and the packed EUC encodings used for Japanese,
Chinese, and Korean.  These all have the property that ASCII is
ASCII, and all non-ASCII characters are encoded using only
high-bit-set octets.  In fact, in practice, on Unix these are
invariably what you encounter.

So what's the problem?  Backward compatibility for Microsoft OSes,
which not only used to use MBCS national character sets, but
cleverly packed more characters into the encoding by using ASCII as
trailing bytes.  Ie, the aforementioned insane Shift JIS (which is
mandated by the leading Japanese cellphone service provider even
today) and Big5 (the leading encoding for Chinese until very
recently).  These are very commonly found on archival media, and even
on USB keys and so on which tend to be FAT-formatted.  This doesn't
prevent usage of the Unicode APIs, but up to Windows 2000 most
Japanese vendors' OEM version of Windows used FAT format and Shift JIS
as the file system encoding, and I know of Japanese offices where
Windows 98 systems were in use as recently as early 2007.

It's the removable media which are the problem, because on Windows you
just use the Unicode APIs.  But they're not available on Unix, so you
need the byte-oriented APIs.

Is this a real problem?  I don't know, I don't do Windows, I don't do
computing with my cellphone, and I don't need to get Japanese (that
might be mixed with Russian ones!!) filenames off of ancient media or
CIFS fileshares using Shift JIS.  I guess it's possible that
cellphones do everything *except* add filenames to directories in
Shift JIS, but the filenames are in UTF-16.

OTOH, it seems to me that an *optional* extension to handling error on
ASCII is technically feasible and would be nearly trivial to add to
the PEP.  The biggest cost would be adding the error argument to
various functions (as Zooko requested) so that
surrogate-replace-extended could be specified if needed.

   Footnotes: 
   [1] Unicode 5.0 uses the terms high-half and low-half at least
   once, in section 16.6, but the context is such that I take it to
   refer to half of the surrogate area.  Section 3.8 doesn't use
   these, instead noting that leading and trailing are sometimes
   used instead of high and low.  Better to avoid the word half
   in PEP 383, I think.
   
  Leading and trailing simply state the order, not the set (high or
  low), so are not good terms to use.

But it's the order that's important.  If you've just finished reading
a character, and encounter a trailing surrogate, then it was produced
by the 'utf8b' error handler; nothing else in a Python codec can do
that.  If you've just finished reading a character, are in a UTF-16
Python, and encounter a leading surrogate, then you immediately gobble
the following code, which must be a trailing surrogate, and combine
them to produce a character.  The remaining case is that you encounter
a valid character.  Anything else is an error, and (assuming no bugs),
no Python codec will produce anything else.
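
The three cases can be written out as a small classifier over 16-bit code units. This is an illustrative sketch only: classify_units is a hypothetical helper, not part of any codec, and it treats every isolated trailing surrogate as an escaped byte.

```python
def classify_units(units):
    """Walk UTF-16 code units and label each item as described above."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # Leading surrogate: the next unit must be a trailing surrogate.
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                raise ValueError("leading surrogate without trailing surrogate")
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            out.append(("char", cp))
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            # Isolated trailing surrogate: an escaped byte from the handler.
            out.append(("escaped-byte", u - 0xDC00))
            i += 1
        else:
            out.append(("char", u))
            i += 1
    return out


result = classify_units([0x61, 0xDCE9, 0xD83D, 0xDE00])
print(result)
```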

   This does imply that programs that take advantage of the error
   handler specified in this PEP are on their own if they accept

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-05 Thread MRAB


Stephen J. Turnbull wrote:

MRAB writes:

   I don't think "people shouldn't be using non-ASCII-compatible
   encodings for locale encodings" is a sufficient rationale for a hard
   error here.  I mean, of course they *should* be using UTF-8.  Maybe
   Python 3.1 should just go ahead and error on any other encoding on
   POSIX platforms? <wink>
   
  I don't see why the error handler couldn't in principle be used with

  encodings other than UTF-8, although in that case all of the low
  surrogates should be open to use.

I should have been more clear here, I guess.  The error handler *can*,
and in the PEP *will be* by default, used with all sane locale
encodings on POSIX.

It occurs to me that the PEP maybe should say that it is an error
to have your POSIX locale set to UTF-16 or something like that.

What sane means in this context is

1.  ASCII NUL is the bytearray terminator, and can't be used as a byte
in a file name.  This rules out UTF-16, UTF-32, and widechar EUC
encodings, as well as some very rare ones.


[snip]
It might be slightly OT, but sometimes strict UTF-8 encoding is violated
by encoding U+0000 using 2 bytes (0xC0 0x80) so that 0x00 can be used as
a terminator. I think I read that Microsoft sometimes does this.

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-05 Thread Stephen J. Turnbull

MRAB writes:

  [snip]
  It might be slightly OT, but sometimes strict UTF-8 encoding is violated
  by encoding U+0000 using 2 bytes (0xC0 0x80) so that 0x00 can be used as
  a terminator. I think I read that Microsoft sometimes does this.

Nice hack! as long as you don't let it escape.  But if 'strict' errors
on this, then PEP 383 'utf8b' will do the right thing, I think.
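
This can be checked with the overlong pair embedded in a short byte string. A minimal sketch, assuming a Python 3 whose strict UTF-8 decoder rejects overlong forms and in which the handler is registered as 'surrogateescape' ('utf8b' in this thread):

```python
overlong = b"a\xc0\x80b"  # 0xC0 0x80 is the overlong two-byte form of NUL

try:
    overlong.decode("utf-8")
    strict_rejects = False
except UnicodeDecodeError:
    strict_rejects = True

# Assumption: the handler is available as "surrogateescape".
escaped = overlong.decode("utf-8", errors="surrogateescape")
restored = escaped.encode("utf-8", errors="surrogateescape")

print(strict_rejects, [hex(ord(c)) for c in escaped], restored == overlong)
```

Each of the two bytes is escaped separately, so no real U+0000 appears in the decoded string.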

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-05 Thread Martin v. Löwis

Perhaps. However, utf-8b doesn't really have to do anything with utf-8 -
it's an algorithm based on 16-bit or 32-bit code points.
 
 I don't understand this phrasing.  The algorithm is only applicable to
 ASCII-compatible octet streams.  It results in code points by a simple
 displacement of octet -> octet + 0xDC00.  It cannot be used on (say)
 UTF-32 to deal with embedded surrogates.
 
 Certainly, the computation requires (at least) 16 bit numbers, but the
 input must be restricted to a stream of 8-bit code points, while the
 output is 16- or 32-bit code points.

Right - the algorithm maps between bytes and 16/32-bit code units.
It works, in particular, for UTF-8, and was originally proposed to apply
to UTF-8 - but it can work in any other place that converts bytes to
16/32-bit code units as well.
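
For instance, the same mapping works with the plain ASCII codec. A minimal sketch, assuming a Python 3 in which the handler is registered as 'surrogateescape' (it is 'utf8b' in this thread):

```python
raw = b"caf\xe9"  # 0xE9 is outside what the ASCII codec can decode

# Assumption: the handler is available as "surrogateescape".
text = raw.decode("ascii", errors="surrogateescape")
restored = text.encode("ascii", errors="surrogateescape")

print([hex(ord(c)) for c in text], restored == raw)
```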

Regards,
Martin

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-05 Thread Martin v. Löwis

 I have three substantive comments.  First, although consequences for
 Python 3 byte interfaces (ie, none) are explicitly stated, as far as
 I can see this PEP could apply to Python 2 as well.  I don't think
 it's intended that way.  Either way, I think you should clarify that
 point.

Done: the Python-Version header already clarifies that point.

 Second, I suggest surrogate-replace as the name of the error handler
 rather than utf8b.

I think this is bike-shedding.

 Third, it is not clear to me why non-decodable ASCII should be an
 error.  There are plenty of low surrogates for the purpose.  Is there
 another technical reason?  Stupid or not, Shift-JIS- and Big5-encoded
 file systems are quite common in Asia still (including non-rewritable
 media).  I think surrogate-replacement of ASCII should at least be an
 option.

It's a security risk. If U+DCXX would map to \xXX, then somebody could
embed U+DC2E U+DC2E U+DC2F into a character string; even if this gets
sanitized, nobody would expect that this will actually access ../
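
The restriction can be observed from the encoding side. A minimal sketch, assuming a Python 3 in which the handler is registered as 'surrogateescape' and only reverses U+DC80..U+DCFF:

```python
# Would spell "../" if U+DCxx were mapped back to byte xx for ASCII values.
sneaky = "\udc2e\udc2e\udc2f"

try:
    sneaky.encode("utf-8", errors="surrogateescape")
    encoded_ok = True
except UnicodeEncodeError:
    # ASCII-range surrogates are refused rather than turned into "../".
    encoded_ok = False

print(encoded_ok)
```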

 1.  There is no such thing as a "half-surrogate" in Unicode.  "Lone
 surrogate" is clear enough.  Or for somewhat fancier English,
 "isolated surrogate" or "non-syntactic surrogate".  To emphasize
 that Python codecs will only produce them in contexts where a
 Unicode character or high surrogate (for UTF-16 Python) is
 syntactically required, "isolated low surrogate" or "isolated
 trailing surrogate" might be good.[1]

Fixed. I removed the word "half" everywhere. It really doesn't mean
anything to me (it could have been called sunnygate instead, making
no difference).

I tried to understand "surrogate", and it was explained to me that
a surrogate is something that stands for something - but then I
would argue that the two subsequent codes form a surrogate - they
stand for something else. The individual surrogate code (in Unicode
terminology) doesn't stand for anything. So don't you agree that
it is the Unicode terminology that is in error, not the PEP?

 2.  The specification should state, and the discussion emphasize, that
 strings which were produced by surrogate replacement *must not* be
 used in data interchange with systems that do not specifically
 accept such strings, and that this is the responsibility of the
 application.[2]

No. The specification puts no requirements on applications whatsoever.
So if you propose to use MUST NOT in the RFC 2119 sense, I strongly
disagree.

Applications that desire mojibake are free to produce it; we are
consenting adults; and all that.

 3.  In the discussion, the transition from the example of alternative
 use of 'python-escape' to discussion of the error handler
 interface extension is a bit abrupt.  I suggest rewriting as:
 
 The extension to the encode error handler interface proposed by
 this PEP is necessary to implement the 'utf8b' error handler,
 because there are required byte sequences which cannot be
 generated from replacement Unicode.  However, the encode error
 handler interface presently requires replacement Unicode to be
 provided in lieu of the non-encodable Unicode from the source
 string.  Then it promptly encodes that replacement Unicode.  In
 some error handlers, such as the 'utf8b' proposed here, it is also
 simpler and more efficient for the error handler to provide a
 pre-encoded replacement byte string, rather than forcing it to
 calculating Unicode from which the encoder would create the
 desired bytes.

Unfortunately, I failed to understand where you want this text to
go. What paragraphs should I remove, or (if none), after which
paragraph should I insert this text?

Regards,
Martin

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-05 Thread Martin v. Löwis

 It occurs to me that the PEP maybe should say that it is an error
 to have your POSIX locale set to UTF-16 or something like that.

No. It is *impossible* to have UTF-16 as the locale character set,
not an error. Your statement is like saying it is an error to
breathe in the vacuum.

In any case, the discussion says

# Encodings that are not compatible with ASCII are not supported by
# this specification; bytes in the ASCII range that fail to decode
# will cause an exception. It is widely agreed that such encodings
# should not be used as locale charsets.

Regards,
Martin

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-05 Thread M.-A. Lemburg

Martin v. Löwis wrote:
 I have three substantive comments.  First, although consequences for
 Python 3 byte interfaces (ie, none) are explicitly stated, as far as
 I can see this PEP could apply to Python 2 as well.  I don't think
 it's intended that way.  Either way, I think you should clarify that
 point.
 
 Done: the Python-Version header already clarifies that point.
 
 Second, I suggest surrogate-replace as the name of the error handler
 rather than utf8b.
 
 I think this is bike-shedding.

The name utf8b suggested in the PEP is not in line with the codec
design and causes confusion with an existing codec of a similar name.

Error handlers and codecs are two different things, so the namespaces
need to be clearly separate.

Please change the name of the error handler to a different name that
does not resemble or cause confusion with a codec name and fits the
scheme of error handler names we already have in place in Python for
replacing error handlers, i.e. XYZreplace.

Thanks,
-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, May 06 2009)
 Python/Zope Consulting and Support ...http://www.egenix.com/
 mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
 mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-05 Thread Stephen J. Turnbull

Martin v. Löwis writes:
   It occurs to me that the PEP maybe should say that it is an error
   to have your POSIX locale set to UTF-16 or something like that.
  
  No. It is *impossible* to have UTF-16 as the locale character set,
  not an error. Your statement is like saying it is an error to
  breathe in the vacuum.

I realize this is not useful, so maybe you don't need to mention it.
However, it certainly is possible to set LANG with an absurd, or
merely dangerous, encoding.

  In any case, the discussion says
  
  # Encodings that are not compatible with ASCII are not supported by
  # this specification; bytes in the ASCII range that fail to decode
  # will cause an exception. It is widely agreed that such encodings
  # should not be used as locale charsets.

Which is your excuse for not supporting Shift JIS fully.  It doesn't
stop people from setting LC_ALL=ja_JP.shift_jis, or using Shift JIS as
the default encoding for certain media.


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-05 Thread Stephen J. Turnbull

Lino Mastrodomenico writes:
  2009/5/5 Stephen J. Turnbull step...@xemacs.org:
   Third, it is not clear to me why non-decodable ASCII should be an
   error.
  
  The PEP originally allowed the conversion to U+DCxx of bytes below 128
  that cannot be decoded by the encoding used, but this creates
  potential security problems.
  
  See: http://mail.python.org/pipermail/python-dev/2009-April/089102.html

Yeah, yeah, this is the same old same old from PEP 3131.  Anything
that handles the various attacks based on ASCII-alike characters
should at least rule out invalid Unicode, too!

And where is this U+DC2F supposed to be coming from, anyway?  The
user's *local* environment or the user's *local* filesystem!  Codecs
not using 'utf8b' can't produce it, so the only other cases are chr()
and \u literals in the *local* process, or an already broken module in
your code.  I really can't imagine that any sane programmer these days
would be using 'utf8b' on bytes received from the Internet!

Of course I can't prove that there's no vector for an exploit here (in
fact, I'm sure there is one with sufficiently careless handling of
input), but I think consenting adults covers the Shift JIS use case.
Make it an option, but it should be explicitly part of the PEP.

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-05 Thread Stephen J. Turnbull

Martin v. Löwis writes:

  Done: the Python-Version header already clarifies that point.

Ah, OK.  I wish my day job required reading more PEPs so I'd be more
familiar with these formalities. :-)

   Second, I suggest surrogate-replace as the name of the error handler
   rather than utf8b.
  
  I think this is bike-shedding.

I don't personally care (I already was aware of UTF-8B), but there are
plenty of others who do.  I think that's a good name to make
Marc-Andre and Terry happier.  You have to fix the existing uses of
the obsolete python-escape, anyway.

  It's a security risk. If U+DCXX would map to \xXX, then somebody could
  embed U+DC2E U+DC2E U+DC2F into a character string; even if this gets
  sanitized, nobody would expect that this will actually access ../

The odds that anybody will actually take notice of U+002E U+002E
U+002F in a string are sufficiently small that any number of exploits
have already been based on it.  I agree that there is some additional
risk from this if people make the check for ../ before they prepend
\udc2e\udc2e\udc2f, but I think that risk is very small compared to
the pain of having an error handler whose raison d'etre is to not raise
exceptions go ahead and raise them anyway.

See also my reply to Lino Mastrodomenico.  Again, an option is good
enough for my purposes as long as interfaces for os.listdir() and the
like support setting the error handler (cf. Zooko's proposal), but I
think the option should be available.

  I tried to understand "surrogate", and it was explained to me that
  a surrogate is something that stands for something - but then I
  would argue that the two subsequent codes form a surrogate - they
  stand for something else. The individual surrogate code (in Unicode
  terminology) doesn't stand for anything. So don't you agree that
  it is the Unicode terminology that is in error, not the PEP?

Plausibly so.  Keep making comments like that and nobody will ever let
you off the hook for being a non-native speaker!

However, "surrogate" in English is typically used in situations that
are too complex to be covered by simply "substitution".  I've always
read "surrogate" as "alternative form of encoding", and "surrogate
code point" as "code point in that alternative form of encoding",
where it's an alternative to code-point-is-scalar-value.  I think
probably the authors of the terminology just made the best of a bad
situation; I can't think of a better single word for this.

  No. The specification puts no requirements on applications whatsoever.
  So if you propose to use MUST NOT in the RFC 2119 sense, I strongly
  disagree.

I do propose that.

But you're writing the PEP, so this battle will have to be deferred.
Eventually Python will have to take a stand on Unicode conformance,
but it's not urgent yet.

   3.  In the discussion, the transition from the example of alternative
   use of 'python-escape' to discussion of the error handler
   interface extension is a bit abrupt.  I suggest rewriting as:
   
   The extension to the encode error handler interface proposed by
   this PEP is necessary to implement the 'utf8b' error handler,
   because there are required byte sequences which cannot be
   generated from replacement Unicode.  However, the encode error
   handler interface presently requires replacement Unicode to be
   provided in lieu of the non-encodable Unicode from the source
   string.  Then it promptly encodes that replacement Unicode.  In
   some error handlers, such as the 'utf8b' proposed here, it is also
   simpler and more efficient for the error handler to provide a
   pre-encoded replacement byte string, rather than forcing it to
   calculating Unicode from which the encoder would create the
   desired bytes.
  
  Unfortunately, I failed to understand where you want this text to
  go. What paragraphs should I remove, or (if none), after which
  paragraph should I insert this text?

Sorry!  I suggest substituting the paragraph above for the paragraph
which begins "The encode error handler interface presently requires..."
at line 129.

I think I forgot to do this before:  I hereby dedicate all text
I suggest for inclusion in the PEP to the public domain.




Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-03 Thread Lino Mastrodomenico

2009/5/3 Martin v. Löwis mar...@v.loewis.de:
 With issue 3672 resolved, it is now unnecessary to introduce
 an utf-8b codec, since the utf-8 codec will properly report errors
 for all byte sequences invalid in UTF-8, including lone surrogates.
 Therefore, utf-8b can be implemented solely through the error handler.

That's even nicer. One minor detail though, in the sentence:

non-decodable bytes < 128 will be represented as lone half surrogate

"<" should be ">=".
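
A small sketch of the byte-to-surrogate mapping under discussion. It
assumes a released Python 3, where this error handler is available under
the name "surrogateescape" (the thread still calls it utf8b), and uses
the ascii codec only to keep the example short.

```python
# Bytes >= 128 that the codec cannot decode become lone surrogates
# U+DC80..U+DCFF; bytes < 128 are decoded normally.
raw = bytes([0x41, 0x80, 0xFF])
text = raw.decode("ascii", "surrogateescape")
code_points = [hex(ord(c)) for c in text]

# Encoding with the same handler restores the original bytes.
restored = text.encode("ascii", "surrogateescape")
```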

-- 
Lino Mastrodomenico

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-03 Thread Antoine Pitrou

Martin v. Löwis martin at v.loewis.de writes:
 
 Glenn Linderman suggested that the name python-escape is not very
 descriptive, so I've changed the name to utf8b.

If the error handler is supposed to be used for codecs other than utf-8,
perhaps it should be renamed to something more generic, e.g. surrogate-escape?

Also, if utf8-b is not provided as a codec, will there be an easy way for user
code to use the same encoding as the IO layer does? (e.g. 
os.fsdecode/os.fsencode)?



Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-03 Thread Michael Urman

On Sun, May 3, 2009 at 08:43, Antoine Pitrou solip...@pitrou.net wrote:
 Also, if utf8-b is not provided as a codec, will there be an easy way for user
 code to use the same encoding as the IO layer does? (e.g.
 os.fsdecode/os.fsencode)?

I like the idea of fsencode/fsdecode functions, but we need to be
careful deciding what they accept and produce on Windows. I'd expect
them to be identity functions, but then the difference in platform
behavior suggests perhaps they should be in os.path.

Unicode to Unicode on Windows would further mean fsencode wouldn't be
useful for sending filenames over sockets, and utf8 will be prone to
exceptions on the very names we're trying to support right now. Is
there an advantage to not providing the utf8b behavior as a
registered codec?

-- 
Michael Urman

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-03 Thread Martin v. Löwis

 That's even nicer. One minor detail though, in the sentence:
 
 non-decodable bytes < 128 will be represented as lone half surrogate
 
 "<" should be ">=".

Thanks, fixed.

Martin


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-03 Thread Martin v. Löwis

 If the error handler is supposed to be used for codecs other than utf-8,
 perhaps it should be renamed to something more generic, e.g. surrogate-escape?

Perhaps. However, utf-8b doesn't really have anything to do with utf-8 -
it's an algorithm based on 16-bit or 32-bit code points.

 Also, if utf8-b is not provided as a codec, will there be an easy way for user
 code to use the same encoding as the IO layer does? 

s.encode(os.getfilesystemencoding(), "utf8b") will do just that (in
fact, that's exactly what the IO layer does).

Regards,
Martin

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-03 Thread Gregory P. Smith

On Sun, May 3, 2009 at 10:39 AM, Martin v. Löwis mar...@v.loewis.de wrote:

  If the error handler is supposed to be used for codecs other than utf-8,
  perhaps it should be renamed to something more generic, e.g. surrogate-escape?

 Perhaps. However, utf-8b doesn't really have to do anything with utf-8 -
 it's an algorithm based on 16-bit or 32-bit code points.


To me that lack of relationship with utf8 suggests that it should not be
called utf8b...  But I don't have any good suggestions.



  Also, if utf8-b is not provided as a codec, will there be an easy way for
  user code to use the same encoding as the IO layer does?

 s.encode(os.getfilesystemencoding(), "utf8b") will do just that (in
 fact, that's exactly what the IO layer does).

 Regards,
 Martin


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-03 Thread Martin v. Löwis

  If the error handler is supposed to be used for codecs other than utf-8,
  perhaps it should be renamed to something more generic, e.g. surrogate-escape?
 
 Perhaps. However, utf-8b doesn't really have to do anything with utf-8 -
 it's an algorithm based on 16-bit or 32-bit code points.
 
 
 To me that lack of relationship with utf8 suggests that it should not be
 called utf8b

Perhaps. However, giving it that name was Markus Kuhn's choice - and
while it may be confusing, it's (IMO) useful to be consistent with this
background.

Regards,
Martin


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-03 Thread Gregory P. Smith

On Sun, May 3, 2009 at 1:27 PM, Martin v. Löwis mar...@v.loewis.de wrote:

   If the error handler is supposed to be used for codecs other than utf-8,
   perhaps it should be renamed to something more generic, e.g.
   surrogate-escape?
 
  Perhaps. However, utf-8b doesn't really have to do anything with utf-8 -
  it's an algorithm based on 16-bit or 32-bit code points.
 
 
  To me that lack of relationship with utf8 suggests that it should not be
  called utf8b

 Perhaps. However, giving it that name was Markus Kuhn's choice - and
 while it may be confusing, it's (IMO) useful to be consistent with this
 background.

 Regards,
 Martin


Ah, right.  My original searches for utf8b didn't turn up much but searching
on his name turns some up.  Good choice of name then.

 http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html
 http://bsittler.livejournal.com/10381.html
 http://hyperreal.org/~est/utf-8b/

-gps
