Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-07 Thread Hagen Fürstenau
> If the Unicode APIs only have correct unicode, sure.  If not you'll
> get errors translating to UTF-8 (and the byte APIs are supposed to
> pass bad names through unaltered.)  Kinda ironic, no?

As far as I can see all Python Unicode strings can be encoded to UTF-8,
even things like lone surrogates because Python doesn't care about them.
So both the Unicode API and the binary API would be fail-safe on Windows.

- Hagen

___
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-07 Thread Adam Olsen
On Sun, Dec 7, 2008 at 2:07 AM, Hagen Fürstenau <[EMAIL PROTECTED]> wrote:
>> If the Unicode APIs only have correct unicode, sure.  If not you'll
>> get errors translating to UTF-8 (and the byte APIs are supposed to
>> pass bad names through unaltered.)  Kinda ironic, no?
>
> As far as I can see all Python Unicode strings can be encoded to UTF-8,
> even things like lone surrogates because Python doesn't care about them.
> So both the Unicode API and the binary API would be fail-safe on Windows.

Python is broken and needs to be fixed.

http://bugs.python.org/issue3672
http://bugs.python.org/issue3297


-- 
Adam Olsen, aka Rhamphoryncus


Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-07 Thread Hagen Fürstenau
>> As far as I can see all Python Unicode strings can be encoded to UTF-8,
>> even things like lone surrogates because Python doesn't care about them.
>> So both the Unicode API and the binary API would be fail-safe on Windows.
> 
> Python is broken and needs to be fixed.
> 
> http://bugs.python.org/issue3672
> http://bugs.python.org/issue3297

But the question of whether Python should care about lone surrogates or
not is at best tangential to the issue at hand.  If you have lone
surrogates in the Unicode API (and didn't raise an exception on the way
getting there), then the sensible thing is to encode them into lone
UTF-8 surrogates.  Even if you wanted to prevent lone surrogates,
encoding to UTF-8 for the binary API would not be the place to enforce it.

- Hagen


Re: [Python-Dev] Rewrite map for old URLs in place

2008-12-07 Thread Nick Coghlan
Georg Brandl wrote:
> Hi,
> 
> with a bit of delay I finally got around to creating a mod_rewrite map of
> the 2.5 URLs.  URLs like http://docs.python.org/tut/node3.html will now
> point permanently to the new URL.
> 
> Let me know if you find a problem.

Excellent news!

Cheers,
Nick.

-- 
Nick Coghlan   |   [EMAIL PROTECTED]   |   Brisbane, Australia
---


[Python-Dev] Rewrite map for old URLs in place

2008-12-07 Thread Georg Brandl
Hi,

with a bit of delay I finally got around to creating a mod_rewrite map of
the 2.5 URLs.  URLs like http://docs.python.org/tut/node3.html will now
point permanently to the new URL.

Let me know if you find a problem.

Georg

-- 
Thus spake the Lord: Thou shalt indent with four spaces. No more, no less.
Four shall be the number of spaces thou shalt indent, and the number of thy
indenting shall be four. Eight shalt thou not indent, nor either indent thou
two, excepting that thou then proceed to four. Tabs are right out.



Re: [Python-Dev] 3.0.1 possibilities

2008-12-07 Thread Steve Holden
Brett Cannon wrote:
> On Sat, Dec 6, 2008 at 15:41, Barry Warsaw <[EMAIL PROTECTED]> wrote:
>> On Dec 6, 2008, at 6:25 PM, Guido van Rossum wrote:
>>
>>> On Sat, Dec 6, 2008 at 3:18 PM, Benjamin Peterson
>>> <[EMAIL PROTECTED]> wrote:
>>>> Since the release of 3.0, several critical issues have come to our
>>>> attention. Namely, the builtin cmp function wasn't removed [1] and the
>>>> new IO library proved to be (as expected) abysmally slow [2][3][4].
>>>> Christian proposed that we release 3.0.1 within the next week to patch
>>>> up these critical issues. Thoughts?
>>>>
>>>>
>>>> [1] http://bugs.python.org/1717
>>>> [2] http://bugs.python.org/4533
>>>> [3] http://bugs.python.org/4561
>>>> [4] http://bugs.python.org/4565
>> I've set the priority on all these to release blockers, but I have my
>> reservations about 4561 and 4565.  Resolution of those seem like more than a
>> week or so away.
>>
>> If we want to do a bug fix release for 3.0.1, I'd like to do it no later
>> than the 19th.
>>
> 
> +1 just to get rid of cmp(). And if io speedups can happen, great, but
> they can also wait for 3.0.2.
> 
A point release just to remove a function whose withdrawal has been
advertised as a 3.0 change hardly seems worth the substantial effort of
cutting a release. If cmp() shouldn't have been in 3.0 and was, then
there's surely no problem about removing it later as promised: anyone
who uses it in 3.0 code shouldn't be.

If it doesn't have to wait for a major release then is there any real
need to cut the minor release immediately?

regards
 Steve
-- 
Steve Holden        +1 571 484 6266   +1 800 494 3119
Holden Web LLC              http://www.holdenweb.com/



[Python-Dev] distutils patches, request for review

2008-12-07 Thread Tarek Ziadé
Hi,

I am looking for a core developer to review a few patches for distutils.

#1 is mandatory (it removes a bad bug)
#2 is very nice to have
#3 to #5 are test coverage and code beautification

In order:

1. #4400 : the default generated .pypirc is broken. This patch fixes
it: http://bugs.python.org/issue4400
2. #4394 : no need to store the password in pypirc anymore : using the
prompt if not stored. http://bugs.python.org/issue4394
3. #2461 : more test coverage. http://bugs.python.org/issue2461
4. #3992 : removes custom log implementation -> uses logging instead.
http://bugs.python.org/issue3992
5. #3985 : more cleanup. http://bugs.python.org/issue3985
6. #3986 : http://bugs.python.org/issue3986

Some of them are a few months old so I can refresh the patch on the
current trunk(s) as soon as they are picked.

Regards
Tarek

-- 
Tarek Ziadé | Association AfPy | www.afpy.org
Blog FR | http://programmation-python.org
Blog EN | http://tarekziade.wordpress.com/


Re: [Python-Dev] 3.0.1 possibilities

2008-12-07 Thread Guido van Rossum
On Sun, Dec 7, 2008 at 5:38 AM, Steve Holden <[EMAIL PROTECTED]> wrote:
> A point release just to remove a function whose withdrawal has been
> advertised as a 3.0 change hardly seems worth the substantial effort of
> cutting a release. If cmp() shouldn't have been in 3.0 and was, then
> there's surely no problem about removing it later as promised: anyone
> who uses it in 3.0 code shouldn't be.
>
> If it doesn't have to wait for a major release then is there any real
> need to cut the minor release immediately?

Well, since 2to3 doesn't remove cmp, and it actually works, it's
likely that people will be accidentally depending on it in code
converted from 2.x. In the past, where there was a discrepancy between
docs and code, we've often ruled in favor of the code using arguments
like "it always worked like this so we'll break working code if we
change it now". There's clearly an argument of timeliness there, which
is why we'd like to get this fixed ASAP. The alternative, which nobody
likes, would be to keep it around, deprecate it in 3.1, and remove it
in 3.2 or 3.3.
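
For readers converting 2.x code, here is a minimal sketch of what replacing
cmp() can look like once the builtin is gone. The helper names cmp_compat and
by_length_then_alpha are hypothetical, chosen only for illustration; the
(a > b) - (a < b) idiom and functools.cmp_to_key are standard Python 3
(cmp_to_key arrived in 3.2, after this thread).

```python
from functools import cmp_to_key


def cmp_compat(a, b):
    """Return -1, 0 or 1, mimicking the 2.x builtin cmp() for orderable values."""
    return (a > b) - (a < b)


def by_length_then_alpha(a, b):
    # An old-style comparison function of the kind 2.x code passed to sort().
    return cmp_compat(len(a), len(b)) or cmp_compat(a, b)


words = ["pear", "fig", "apple", "kiwi"]
# cmp_to_key adapts the comparison function to the key= protocol.
ordered = sorted(words, key=cmp_to_key(by_length_then_alpha))
```

Code that only called cmp() to drive a sort can usually drop the comparison
function altogether and pass a plain key function instead.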

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-07 Thread Adam Olsen
On Sun, Dec 7, 2008 at 2:35 AM, Hagen Fürstenau <[EMAIL PROTECTED]> wrote:
>>> As far as I can see all Python Unicode strings can be encoded to UTF-8,
>>> even things like lone surrogates because Python doesn't care about them.
>>> So both the Unicode API and the binary API would be fail-safe on Windows.
>>
>> Python is broken and needs to be fixed.
>>
>> http://bugs.python.org/issue3672
>> http://bugs.python.org/issue3297
>
> But the question of whether Python should care about lone surrogates or
> not is at best tangential to the issue at hand.  If you have lone
> surrogates in the Unicode API (and didn't raise an exception on the way
> getting there), then the sensible thing is to encode them into lone
> UTF-8 surrogates.  Even if you wanted to prevent lone surrogates,
> encoding to UTF-8 for the binary API would not be the place to enforce it.

No.  Unicode *requires* them to be treated as errors.  If you want to
pass them through then you're creating a custom encoding... which you
might argue for in this case, but it needs to be clearly separate from
the real UTF-8.


-- 
Adam Olsen, aka Rhamphoryncus


Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-07 Thread Toshio Kuratomi
[EMAIL PROTECTED] wrote:
> 
> On 06:07 am, [EMAIL PROTECTED] wrote:
>> Most apps aren't file managers or ftp clients but when they interact
>> with files (for instance, a file selection dialog) they need to be able
>> to show the user all the relevant files.  So on an app-by-app basis the
>> need for this is high.
> 
> While I tend to agree emphatically with this, the *real* solution here
> is a path-abstraction library.

Why don't you send me some information offlist.  I'm not sure I agree
that a path-abstraction library can work correctly but if it can it
would be nice to have that at a level higher than the file-dialog
libraries that I was envisioning.

[snip]

>> ... but that still
>> doesn't help me identify when someone would expect that asking python
>> for a list of all files in a directory or a specific set of files in a
>> directory should, without warning, return only a subset of them.  In
>> what situations is this appropriate behaviour?
> 
> If you say listdir(unicode) on a POSIX OS, your program is saying "I
> only know how to deal with unicode results from this function, so please
> only give me those.".

No.  (explained below)

>  If your program is smart enough to deal with
> bytes, then you would have asked for bytes, no?

Yes (explained below)

>  Returning only
> filenames which can be properly decoded makes sense.  Otherwise everyone
> needs to learn about this highly confusing issue, even for the simplest
> scripts.
>
os.listdir(unicode) (currently) means that the *programmer* is asking
that the stdlib return the decodable filenames from this directory.  The
question is whether the programmer understood that this is what they
were asking for and whether it is what they most likely want.  I would
make the following statements with regard to this:

1) The programmer most likely does not want decodable filenames and only
decodable filenames.  If they did, we'd see a lot of python2.x code that
turns pathnames into unicode and discards everything that wasn't
decodable.  No one has given a use case for finding only the *decodable*
subset of files.  If I request to see all *.py files in a directory, I
want to see all of the *.py files in the directory, decodable or not.
If you can show how programmers intend "90%" of their calls to
os.listdir()/glob.glob('*.txt') to show only the decodable subset of the
results, then the foundation of my arguments is gone.  So please, give
examples to prove this wrong.

  - If this is true, a definition of os.listdir(str) that would
better meet programmer expectation would be: "Give me all files in a
directory with the output as str type".  The definition of
os.listdir(bytes) would be "Give me all files in a directory
with the output as bytes type".  Raising an exception when the filenames
are undecodable is perfectly reasonable in this situation.

2) For the programmer to understand the difference between
os.listdir(str) and os.listdir(bytes) they have to
understand the "highly confusing issue" and what it means for their
code.  So the current method is forcing programmers to understand it
even for the simplest scripts if their environment is not uniform with
no clue from the interpreter that there is an issue.

  - Similarly, raising an exception on undecodable values means that the
programmer can ignore the issue in any scripts in sane environments and
will be told that they need to deal with it (via an exception) when
their script runs in a non-sane environment.

3) The usage of unicode vs bytes is easy to miss for someone starting
with py2.x or Windows and moving to a multi-platform or Unix project.
Even simple testing won't reveal the problem unless the programmer knows
that they have to test what happens when encodings are mixed.  Once
again, this is requiring the programmer to understand the encoding issue
without help from the interpreter.

> Skipping undecodable values is good enough that it will work 90% of the
> time.

You and Guido have now made this claim to defend not raising an
exception but I still don't have a use case.

Here are use cases that I see:

* Bill is coding an application for use inside his company.  His company
only uses utf-8.  His code naively uses os.listdir().

  - The code does not throw an exception whether we use the current
os.listdir() or one that could throw an exception because the system
admins have sanitised the environment.  Bill did not need to understand
the implications of encoding for his code to work in this script whether
simple or complex.

* Mary is coding an application for use inside her company.  It finds
all html files on a system and updates her company's copyright, privacy
policy, and other legal boilerplate.  Her expectation is that after her
program runs every file will have been updated.  Her environment is a
mixture of different filename encodings due to having many legacy
documents for users in different locales.  Mary's code also naively uses
os.listdir().  Her test case checks that the code does the
right thing on m

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-07 Thread Michael Urman
On Sun, Dec 7, 2008 at 11:35, Adam Olsen <[EMAIL PROTECTED]> wrote:
>>> http://bugs.python.org/issue3672
>>> http://bugs.python.org/issue3297
>
> No.  Unicode *requires* them to be treated as errors.  If you want to
> pass them through then you're creating a custom encoding... which you
> might argue for in this case, but it needs to be clearly separate from
> the real UTF-8.

I suspect it is a common and convenient but (according to what you
say) misconceived expectation that using UTF-8 to encode any Unicode
string will not raise an exception. This behavior is not something
which should be discarded lightly.

I see little reason that this couldn't be a new codec or error handler
that allowed people to choose between correct pure UTF-8 behavior or
the technically incorrect but very practical behavior it currently
has.
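
A sketch of what such an opt-in error handler looks like in practice. It
assumes a current CPython 3, where the strict UTF-8 codec rejects lone
surrogates and the 'surrogatepass' error handler (added in 3.1, after this
thread) restores the permissive behaviour. Python 3.0 as discussed here
behaved differently, so this is an illustration rather than a description of
the interpreter under discussion.

```python
lone = "\ud800"  # a lone high surrogate, not valid in well-formed Unicode text

try:
    lone.encode("utf-8")
    strict_accepts = True
except UnicodeEncodeError:
    # The strict codec treats the lone surrogate as an error.
    strict_accepts = False

# Opting in: encode the surrogate as if it were an ordinary code point.
permissive = lone.encode("utf-8", "surrogatepass")
# The same handler is needed to read those bytes back.
round_trip = permissive.decode("utf-8", "surrogatepass")
```

The default stays strict, and callers who need the practical behaviour have
to ask for it by name.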

[My apologies, Adam, for sending this only to you the first time]
-- 
Michael Urman


Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-07 Thread Adam Olsen
On Sun, Dec 7, 2008 at 11:18 AM, Michael Urman <[EMAIL PROTECTED]> wrote:
> On Sun, Dec 7, 2008 at 11:35, Adam Olsen <[EMAIL PROTECTED]> wrote:
>>>> http://bugs.python.org/issue3672
>>>> http://bugs.python.org/issue3297
>>
>> No.  Unicode *requires* them to be treated as errors.  If you want to
>> pass them through then you're creating a custom encoding... which you
>> might argue for in this case, but it needs to be clearly separate from
>> the real UTF-8.
>
> I suspect it is a common and convenient but (according to what you
> say) misconceived expectation that using UTF-8 to encode any Unicode
> string will not raise an exception. This behavior is not something
> which should be discarded lightly.

It is *not* a valid Unicode string in the first place.  Therein lies
the problem.


> I see little reason that this couldn't be a new codec or error handler
> that allowed people to choose between correct pure UTF-8 behavior or
> the technically incorrect but very practical behavior it currently
> has.

Note that many of the restrictions were added for security reasons.
You might receive a UTF-8 encoded file name from a malicious user,
check if it contains something dangerous (like
"../../../../../etc/password"), then decode it.  If your decoder isn't
compliant (i.e. doesn't check for overlong sequences) then a
b'\xC0\xAF' gets translated into u'/', bypassing your previous check.
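
The scenario above can be sketched as follows, assuming a current CPython 3
whose strict UTF-8 decoder rejects overlong forms. is_safe_name is a
hypothetical helper, not something from this thread. The check runs on the
decoded text, and the decoder refuses b'\xC0\xAF' instead of turning it
into '/'.

```python
def is_safe_name(raw: bytes) -> bool:
    """Decode strictly first, then check the decoded text for path tricks."""
    try:
        text = raw.decode("utf-8")  # strict: overlong sequences raise
    except UnicodeDecodeError:
        return False
    return "/" not in text and ".." not in text


overlong_slash = b"\xc0\xaf"  # overlong (non-shortest-form) encoding of '/'
overlong_rejected = not is_safe_name(overlong_slash)
plain_accepted = is_safe_name(b"report.txt")
traversal_rejected = not is_safe_name(b"../../etc/passwd")
```

Checking the raw bytes and decoding afterwards is the ordering that a
non-compliant decoder would make exploitable.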

However, in this context we only need to allow lone surrogates.
CESU-8 comes to mind.  (It is a perverse world we live in.)

-- 
Adam Olsen, aka Rhamphoryncus


Re: [Python-Dev] "as" keyword woes

2008-12-07 Thread Paul Boddie
On Sat Dec 6 21:29:09 CET 2008, Guido van Rossum wrote:
>
> On Sat, Dec 6, 2008 at 11:38 AM, Warren DeLano wrote:
> > As someone somewhat knowledgeable of how parsers work, I do not
> > understand why a method/attribute name "object_name.as(...)" must
> > necessarily conflict with a standalone keyword " as ".  It seems to me
> > that it should be possible to unambiguously separate the two without
> > ambiguity or undue complication of the parser.
>
> That's possible with sufficiently powerful parser technology, but
> that's not how the Python parser (and most parsers, in my experience)
> treat reserved words. Reserved words are reserved in all contexts,
> regardless of whether ambiguity could arise.

Just a quick aside from someone who merely lurks on this list: in SQL, it's 
quite possible to use keywords in a fashion similar to that desired by the 
inquirer, and it's actually possible to double-quote keywords and use them as 
names for things. I'm not advocating more complicated parsing technology for 
any Python implementation, but I think it's pertinent to point out that the 
technology isn't particularly obscure.
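
A small sketch of the Python side of this, assuming an interpreter where
"as" is a full keyword (2.6 and later). The Converter class and the _as
function are hypothetical. Dotted access to an attribute named "as" fails at
compile time, while getattr() with a string still works, loosely analogous
to double-quoting a keyword in SQL.

```python
import keyword


class Converter:
    """Hypothetical class that wants a method called 'as'."""


def _as(self, kind):
    return kind(42)


# The name cannot follow 'def' or a dot, but it can be attached as a string.
setattr(Converter, "as", _as)

is_reserved = keyword.iskeyword("as")

try:
    compile("obj.as(str)", "<sketch>", "eval")
    dotted_form_compiles = True
except SyntaxError:
    # Reserved words are rejected in every context, including after a dot.
    dotted_form_compiles = False

result = getattr(Converter(), "as")(str)
```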

Apologies for the interruption,

Paul


Re: [Python-Dev] 3.0.1 possibilities

2008-12-07 Thread Martin v. Löwis
> There's clearly an argument of timeliness there, which
> is why we'd like to get this fixed ASAP.

I think it is still timely when fixed in January or February.
In fact, releasing it still in December might not be possible,
due to the limited time available.

Regards,
Martin


Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-07 Thread Terry Reedy

Toshio Kuratomi wrote:
>   - If this is true, a definition of os.listdir(str) that would
> better meet programmer expectation would be: "Give me all files in a
> directory with the output as str type".  The definition of
> os.listdir(bytes) would be "Give me all files in a directory
> with the output as bytes type".  Raising an exception when the filenames
> are undecodable is perfectly reasonable in this situation.


Your examples (snipped) pretty well convince me that there is a use case 
for raising exceptions.  We should move beyond arguing over which one 
way is right.  I think there should be a second argument 
'ignorebad=False' to ignore undecodable files rather than raise the 
exception (or 'strict=True' to stop and raise exception on non-decodable 
names -- then code is 'if strict: raise ...').  I believe other 
functions have a similar parameter.


tjr



Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-07 Thread Guido van Rossum
On Sun, Dec 7, 2008 at 1:20 PM, Terry Reedy <[EMAIL PROTECTED]> wrote:
> Toshio Kuratomi wrote:
>
>>  - If this is true, a definition of os.listdir(str) that would
>> better meet programmer expectation would be: "Give me all files in a
>> directory with the output as str type".  The definition of
>> os.listdir(bytes) would be "Give me all files in a directory
>> with the output as bytes type".  Raising an exception when the filenames
>> are undecodable is perfectly reasonable in this situation.
>
> Your examples (snipped) pretty well convince me that there is a use case for
> raising exceptions.  We should move beyond arguing over which one way is
> right.  I think there should be a second argument 'ignorebad=False' to
> ignore undecodable files rather than raise the exception (or 'strict=True'
> to stop and raise exception on non-decodable names -- then code is 'if
> strict: raise ...').  I believe other functions have a similar parameter.

If you want the exceptions, just use the bytes API and try to decode
the byte strings using the system encoding.
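
A minimal sketch of that suggestion, using only standard-library calls.
listdir_strict is a hypothetical name, and os.fsencode is a later addition
(3.2) used here for brevity.

```python
import os
import sys


def listdir_strict(path):
    """List a directory via the bytes API, decoding every name strictly.

    Raises UnicodeDecodeError on the first name that does not decode with
    the filesystem encoding, instead of silently dropping it.
    """
    encoding = sys.getfilesystemencoding()
    raw_names = os.listdir(os.fsencode(path))  # bytes in, bytes out
    return [name.decode(encoding) for name in raw_names]
```

The caller decides what an undecodable name means, at the cost of handling
bytes explicitly.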

My problem with raising exceptions *by default* when an undecodable
name exists is that it may render an app completely useless in a
situation where the developer is no longer around. This happened all
the time with the 2.x Unicode API, where the developer hadn't
anticipated a particular input potentially containing non-ASCII bytes,
and the user fed the application non-ASCII text. Making os.listdir
raise an exception when a directory contains a single undecodable file
means that the entire directory can't be read, and most likely the
entire app crashes at that point. Most likely the developer never
anticipated this situation (since in most places it is either
impossible or very unlikely) -- after all, if they had anticipated it
they would have used the bytes API in the first place. (It's worse
because the exception being raised would be UnicodeError -- most
people expect os.listdir to raise OSError, not other errors.)

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


[Python-Dev] Nonlocal shortcut

2008-12-07 Thread Fabio Zadrozny
Hi,

I'm currently implementing a parser to handle Python 3.0, and one of
the points I found conflicting with the grammar specification is the
PEP 3104.

It says that a shortcut would be added to Python 3.0 so that "nonlocal
x = 0" can be written. However, the latest grammar specification
(http://docs.python.org/dev/3.0/reference/grammar.html?highlight=full%20grammar)
doesn't seem to take that into account... So, can someone enlighten me
on what should be the correct treatment for that on a grammar that
wants to support Python 3.0?

Thanks,

Fabio


Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-07 Thread Nick Coghlan
Terry Reedy wrote:
> Toshio Kuratomi wrote:
> 
>>   - If this is true, a definition of os.listdir(str) that would
>> better meet programmer expectation would be: "Give me all files in a
>> directory with the output as str type".  The definition of
>> os.listdir(bytes) would be "Give me all files in a directory
>> with the output as bytes type".  Raising an exception when the filenames
>> are undecodable is perfectly reasonable in this situation.
> 
> Your examples (snipped) pretty well convince me that there is a use case
> for raising exceptions.  We should move beyond arguing over which one
> way is right.  I think there should be a second argument
> 'ignorebad=False' to ignore undecodable files rather than raise the
> exception (or 'strict=True' to stop and raise exception on non-decodable
> names -- then code is 'if strict: raise ...').  I believe other
> functions have a similar parameter.

If we were going to do anything like that for os.listdir() and other
filesystem APIs (like glob) that return multiple paths, we'd probably be
best advised to just have a normal Unicode 'errors' parameter which allowed:

'strict' - raise an Exception for malformed binary data
'replace' - insert '?' or some other symbol in place of malformed binary
data
'ignore' - simply leave out the malformed binary data
'skip' - run the underlying codec in strict mode, but skip over any
items which raise UnicodeDecodeError (default/current Py3k behaviour)

Obviously, 'skip' doesn't make any sense for APIs like getcwd() that
return a single value - a case could be made for those defaulting to
either replace or strict.
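
The four modes can be sketched over a list of raw byte names. decode_names
and its 'skip' mode are hypothetical illustrations of the proposal, not an
existing os API; 'strict', 'replace' and 'ignore' are the standard codec
error handlers. Note that Python's 'replace' inserts U+FFFD rather than '?'
when decoding.

```python
import sys


def decode_names(raw_names, errors="skip", encoding=None):
    """Decode a list of bytes filenames according to an 'errors' policy.

    'skip' is the hypothetical mode from the proposal: decode strictly and
    drop whole names that fail. Other values go straight to bytes.decode().
    """
    encoding = encoding or sys.getfilesystemencoding()
    if errors != "skip":
        return [name.decode(encoding, errors) for name in raw_names]
    decoded = []
    for name in raw_names:
        try:
            decoded.append(name.decode(encoding, "strict"))
        except UnicodeDecodeError:
            continue  # leave the undecodable name out entirely
    return decoded


raw = [b"ok.txt", b"bad\xff.txt"]
skipped = decode_names(raw, errors="skip", encoding="utf-8")
replaced = decode_names(raw, errors="replace", encoding="utf-8")
ignored = decode_names(raw, errors="ignore", encoding="utf-8")
```

With 'ignore' the result names a file that may not exist ("bad.txt"), which
is one reason 'skip' and 'ignore' need to be distinct modes.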

Cheers,
Nick.

-- 
Nick Coghlan   |   [EMAIL PROTECTED]   |   Brisbane, Australia
---


Re: [Python-Dev] Nonlocal shortcut

2008-12-07 Thread Amaury Forgeot d'Arc
Hello,

Fabio Zadrozny  wrote:
> Hi,
>
> I'm currently implementing a parser to handle Python 3.0, and one of
> the points I found conflicting with the grammar specification is the
> PEP 3104.
>
> It says that a shortcut would be added to Python 3.0 so that "nonlocal
> x = 0" can be written. However, the latest grammar specification
> (http://docs.python.org/dev/3.0/reference/grammar.html?highlight=full%20grammar)
> doesn't seem to take that into account... So, can someone enlighten me
> on what should be the correct treatment for that on a grammar that
> wants to support Python 3.0?

An issue was already filed about this:
http://bugs.python.org/issue4199
It should be ready for inclusion in 3.0.1.

-- 
Amaury Forgeot d'Arc


Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-07 Thread Greg Ewing

Nick Coghlan wrote:
> For binary wrappers around the Windows Unicode APIs, I was thinking
> specifically of using UTF-8, since that should be able to encode
> anything the Unicode APIs can handle.


Why shouldn't the binary interface just expose the raw
utf16 as bytes?

--
Greg


Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-07 Thread Terry Reedy

Guido van Rossum wrote:
> On Sun, Dec 7, 2008 at 1:20 PM, Terry Reedy <[EMAIL PROTECTED]> wrote:
>> Toshio Kuratomi wrote:
>>
>>>  - If this is true, a definition of os.listdir(str) that would
>>> better meet programmer expectation would be: "Give me all files in a
>>> directory with the output as str type".  The definition of
>>> os.listdir(bytes) would be "Give me all files in a directory
>>> with the output as bytes type".  Raising an exception when the filenames
>>> are undecodable is perfectly reasonable in this situation.
>>
>> Your examples (snipped) pretty well convince me that there is a use case for
>> raising exceptions.  We should move beyond arguing over which one way is
>> right.  I think there should be a second argument 'ignorebad=False' to
>> ignore undecodable files rather than raise the exception (or 'strict=True'
>> to stop and raise exception on non-decodable names -- then code is 'if
>> strict: raise ...').  I believe other functions have a similar parameter.


I was thinking of the "normal Unicode 'errors' parameter", as described 
by Nick.



> If you want the exceptions, just use the bytes API and try to decode
> the byte strings using the system encoding.


If it was a matter of adding a new method, I might agree.  But:

1. We already have a method that does exactly what you describe.  It is 
only a matter of adding flexibility to the response to problems, for 
which there is already precedent.


2. Suggesting that people who want strings and not bytes should have to 
deal with bytes, just to get an error notification, seems to negate the 
point of moving to 3.0.


3. A builtin would probably do so better than most programmers would, 
with little touches such as the one suggested below.


4. An error parameter would ALERT programmers to the possibility of a 
PROBLEM, both in the present and future.  As you say below, people need 
to better anticipate the future.



> My problem with raising exceptions *by default* when an undecodable
> name exists is that it may render an app completely useless in a
> situation where the developer is no longer around. This happened all
> the time with the 2.x Unicode API, where the developer hadn't
> anticipated a particular input potentially containing non-ASCII bytes,
> and the user fed the application non-ASCII text. Making os.listdir
> raise an exception when a directory contains a single undecodable file
> means that the entire directory can't be read, and most likely the
> entire app crashes at that point. Most likely the developer never
> anticipated this situation (since in most places it is either
> impossible or very unlikely) -- after all, if they had anticipated it
> they would have used the bytes API in the first place. (It's worse
> because the exception being raised would be UnicodeError -- most
> people expect os.listdir to raise OSError, not other errors.)


This to me is an argument for keeping the default the current behavior, 
but not for rejecting flexibility.  The computing world seems to be 
messier than we would like and worse than I realized until this week. 
As you say below, people need to better anticipate the future, and an 
errors parameter would help do that.



Is Windows really immune?  What about when it reads the directory of 
possibly old removable media with whatever byte name encodings?  Is this 
a possible source of 'unanticipated' problems?


As to your last sentence, os.listdir() with an errors parameter could 
convert a decoding UnicodeError to "OSError: undecodable file name 
", thereby supplying the expected exception as well as 
an extractable representation of the problematic raw bytes.


Here is a possible use case: I want filenames as 3.0 strings and I 
anticipate no problems at present but, as you say above, something might 
happen years in the future.  I am using 3.0 *because* of the strings == 
unicode feature.  I would like to write


try:
  files = os.listdir(somedir, errors='strict')
except OSError as e:
  log()
  files = os.listdir(somedir)

and go on without the problem file but not without logging the problem 
so a future maintainer can consider what to do about it, but only when 
there is an actual need to think about it.
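
The errors parameter used above is only a proposal and is not an existing argument of os.listdir. A minimal sketch of a similar log-and-continue approach with calls that do exist, assuming Python 3.2 or later (for os.fsencode) and that the interpreter's reported filesystem encoding is the right one to decode with. The helper names split_decodable and list_decodable are hypothetical:

```python
import logging
import os
import sys

log = logging.getLogger(__name__)


def split_decodable(raw_names, encoding):
    """Split raw byte file names into (decoded, undecodable) lists."""
    good, bad = [], []
    for raw in raw_names:
        try:
            good.append(raw.decode(encoding, "strict"))
        except UnicodeDecodeError:
            bad.append(raw)
    return good, bad


def list_decodable(path):
    """List a directory as str names, logging names that fail to decode."""
    encoding = sys.getfilesystemencoding()
    # Passing a bytes path makes os.listdir return raw bytes names.
    raw_names = os.listdir(os.fsencode(path))
    good, bad = split_decodable(raw_names, encoding)
    for raw in bad:
        log.warning("undecodable file name in %r: %r", path, raw)
    return good
```

The undecodable names are logged rather than raised, so the rest of the directory stays readable and the raw bytes remain available to whoever reads the log.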


Terry Jan Reedy



Re: [Python-Dev] Nonlocal shortcut

2008-12-07 Thread Terry Reedy

Fabio Zadrozny wrote:

> Hi,
>
> I'm currently implementing a parser to handle Python 3.0, and one of
> the points I found conflicting with the grammar specification is the
> PEP 3104.
>
> It says that a shortcut would be added to Python 3.0 so that "nonlocal
> x = 0" can be written.

As near as I can tell from testing, that did not happen. The PEP needs 
revision to delete that or push it to a later version.

> However, the latest grammar specification
> (http://docs.python.org/dev/3.0/reference/grammar.html?highlight=full%20grammar)
> doesn't seem to take that into account... So, can someone enlighten me
> on what should be the correct treatment for that on a grammar that
> wants to support Python 3.0?
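
One way to check what a given interpreter's grammar accepts is to hand the source to compile() and see whether it raises SyntaxError. The sketch below does that for the "nonlocal x = 0" shortcut and also shows the two-statement form, which is what works on the Python 3 releases I am aware of. The names shortcut_is_accepted, SHORTCUT, and make_counter are made up for the illustration:

```python
def shortcut_is_accepted(source):
    """Return True if the running interpreter's parser accepts `source`."""
    try:
        compile(source, "<nonlocal-test>", "exec")
    except SyntaxError:
        return False
    return True


# The shortcut form described in PEP 3104.
SHORTCUT = (
    "def outer():\n"
    "    x = 1\n"
    "    def inner():\n"
    "        nonlocal x = 0\n"
    "    return inner\n"
)


def make_counter():
    """The two-statement form: declare first, then assign."""
    count = 0

    def bump():
        nonlocal count
        count += 1
        return count

    return bump
```

A parser that follows the published grammar would treat the shortcut form as a syntax error and accept only the declaration on its own line.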




Re: [Python-Dev] 3.0.1 possibilities

2008-12-07 Thread Christian Heimes

Martin v. Löwis wrote:

> I think it is still timely when fixed in January or February.
> In fact, releasing it still in December might not be possible,
> due to the limited time available.


The cmp() / PyObject_Compare() removal patch is almost done. With some 
help I can finish it by Tuesday evening. We can have another release 
by Monday Dec 15th. Python 3.0.0 has some defects that should be fixed 
before people are spending their Xmas holidays with 3.0. The defects include


* cmp(), PyObject_Compare() and friends
* global/nonlocal shortcuts (global x = 0) aren't working
* unnecessary slowdown of read() due to slow buffer resizing.

An early 3.0.1 release makes it possible to sync 2.6 and 3.0 releases 
again. If we release it now we can have a combined release of 2.6.2 and 
3.0.2 in two months from now. Two months are quite some time to fix the 
performance issue of the new IO library.


If Guido and Barry are fine with a lax policy on performance fixes we 
can integrate more tweaks. I believe performance patches were 
considered as features in the past. For this reason they weren't allowed 
for minor releases. Mark's work on long integer optimizations and json 
speedup are good candidates.


Christian


Re: [Python-Dev] 3.0.1 possibilities

2008-12-07 Thread Benjamin Peterson
On Sun, Dec 7, 2008 at 6:05 PM, Christian Heimes <[EMAIL PROTECTED]> wrote:
> Martin v. Löwis wrote:
>>
>> I think it is still timely when fixed in January or February.
>> In fact, releasing it still in December might not be possible,
>> due to the limited time available.
>
> The cmp() / PyObject_Compare() removal patch is almost done. With some help
> I can finish it by Tuesday evening. We can have another release by Monday
> Dec 15th. Python 3.0.0 has some defects that should be fixed before people
> are spending their Xmas holidays with 3.0. The defects include
>
> * cmp(), PyObject_Compare() and friends
> * global/nonlocal shortcuts (global x = 0) aren't working

I have a patch for this [1], but I don't think this should be
considered a release blocker or even backported to 3.0. It's merely a
convenience feature and doesn't inhibit the usefulness of the PEP in
any way.

> * unnecessary slowdown of read() due to slow buffer resizing.




-- 
Cheers,
Benjamin Peterson
"There's nothing quite as beautiful as an oboe... except a chicken
stuck in a vacuum cleaner."


Re: [Python-Dev] 3.0.1 possibilities

2008-12-07 Thread Christian Heimes

Benjamin Peterson wrote:

> I have a patch for this [1], but I don't think this should be
> considered a release blocker or even backported to 3.0. It's merely a
> convenience feature and doesn't inhibit the usefulness of the PEP in
> any way.


Amaury said:
> An issue was already filed about this:
> http://bugs.python.org/issue4199
> It should be ready for inclusion in 3.0.1.

I'm +0 for the patch. Given the nature of Python 3.0 I'm fine with 
getting it right.


Christian


Re: [Python-Dev] 3.0.1 possibilities

2008-12-07 Thread Barry Warsaw

On Dec 7, 2008, at 7:05 PM, Christian Heimes wrote:


> Martin v. Löwis wrote:
>> I think it is still timely when fixed in January or February.
>> In fact, releasing it still in December might not be possible,
>> due to the limited time available.
>
> The cmp() / PyObject_Compare() removal patch is almost done. With
> some help I can finish it by Tuesday evening. We can have another
> release by Monday Dec 15th. Python 3.0.0 has some defects that
> should be fixed before people are spending their Xmas holidays with
> 3.0. The defects include
>
> * cmp(), PyObject_Compare() and friends
> * global/nonlocal shortcuts (global x = 0) aren't working
> * unnecessary slowdown of read() due to slow buffer resizing.
>
> An early 3.0.1 release makes it possible to sync 2.6 and 3.0 releases
> again. If we release it now we can have a combined release of 2.6.2
> and 3.0.2 in two months from now. Two months are quite some time to
> fix the performance issue of the new IO library.
>
> If Guido and Barry are fine with a lax policy on performance fixes
> we can integrate more tweaks. I believe performance patches were
> considered as features in the past. For this reason they weren't
> allowed for minor releases. Mark's work on long integer
> optimizations and json speedup are good candidates.


I'm personally okay with performance fixes in point releases, as long  
as it doesn't change API or add additional features.


-Barry



Re: [Python-Dev] 3.0.1 possibilities

2008-12-07 Thread Christian Heimes

Barry Warsaw wrote:
> I'm personally okay with performance fixes in point releases, as long as it
> doesn't change API or add additional features.


Does your okay include or exclude new internal APIs like new helper 
functions or new C modules?


Christian


Re: [Python-Dev] Nonlocal shortcut

2008-12-07 Thread Fabio Zadrozny
>> I'm currently implementing a parser to handle Python 3.0, and one of
>> the points I found conflicting with the grammar specification is the
>> PEP 3104.
>>
>> It says that a shortcut would be added to Python 3.0 so that "nonlocal
>> x = 0" can be written. However, the latest grammar specification
>> (http://docs.python.org/dev/3.0/reference/grammar.html?highlight=full%20grammar)
>> doesn't seem to take that into account... So, can someone enlighten me
>> on what should be the correct treatment for that on a grammar that
>> wants to support Python 3.0?
>
> An issue was already filed about this:
> http://bugs.python.org/issue4199
> It should be ready for inclusion in 3.0.1.
>

Thanks for pointing that out.

Fabio


Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-07 Thread Glenn Linderman
On approximately 12/7/2008 10:56 AM, came the following characters from 
the keyboard of Adam Olsen:



> You might receive a UTF-8 encoded file name from a malicious user,
> check if it contains something dangerous (like
> "../../../../../etc/password"), then decode it.  If your decoder isn't
> compliant (i.e. doesn't check for overly long sequences) then a
> b'\xC0\xAF' gets translated into u'/', bypassing your previous check.



You might indeed.

But if you are interested in checking for security issues, shouldn't you 
 _first_ decode into some canonical form, specifying what sorts of 
Unicode strictness (such as overlong sequences) to check for during the 
decode process, and once the string is in canonical form, _then_ do 
checks for various attacks, such as the ../ sequence you mention?


And with that order of operation, even if you don't reject overlong 
sequences, you have canonicalized them, and can recognize the resulting 
characters as good or bad.
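
The decode-first ordering can be sketched as follows. This is only an illustration, assuming the name arrives as UTF-8 bytes and that "dangerous" means an absolute path or a ".." component; the function name is_safe_relative_name is hypothetical. It uses Python's strict UTF-8 decoder, so malformed input is refused outright and the path check only ever sees decoded text:

```python
def is_safe_relative_name(raw, encoding="utf-8"):
    """Decode strictly first, then inspect the decoded text."""
    try:
        name = raw.decode(encoding, "strict")
    except UnicodeDecodeError:
        # Malformed input, including overlong forms, never reaches the check.
        return False
    parts = name.split("/")
    return not name.startswith("/") and ".." not in parts
```

Because the check runs on the decoded string, there is no second decoding step after it that could turn harmless-looking bytes into a slash.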



--
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


Re: [Python-Dev] RELEASED Python 3.0 final

2008-12-07 Thread Stephen J. Turnbull
[EMAIL PROTECTED] writes:

 > But still, you can't honestly expect me to recommend 3.0 until someone 
 > has gotten at least a basic skeleton of Twisted up and running under it 
 > :).  My own attempts to do so have failed miserably, to the point where 
 > I can't even produce a useful bug report without a lot more work.

How about an issue in the Python tracker---or the Twisted one, with a
xref from the Python tracker to the Twisted tracker where the work
will be done---that says "Twisted wants to be ported but we don't have
enough developers, please help"?  Maybe with some encouraging
statement about how you can provide X amount of advice.

In general, maybe there should be some sort of (semi-)formal process
for proposing ports of libraries and coordinating work on them.  Even
just a focal point for where to make such requests, and a way to
search for them so you can find others with similar interests.

 > I don't think there's anything about the 3.0 language which
 > couldn't be supported in a VM that understood both 2 and 3.

Strings vs. bytes.  It can't do both 2-style "bytes are text"
and 3-style "no way are bytes text" simultaneously AFAICS.

 > I also don't think 3.0 is perfect, and five years on, there will be
 > a temptation to make more "just this once" incompatible changes.
 > Of course, you've promised these changes won't be made, and *this*
 > set of design mistakes will be with us forever.

For values of "forever" approximating ten years.

 > It would be nice if there were a way for evolution to continue
 > without another reboot of the world.

Stephen J. Gould says not.

I think Java is a very different case from Python.  It is the product
of a language evolution that goes back to the early 1970s or so, and
the standardization effort was carefully shepherded by a powerful
company which provided resources to ensure that things went its way.

For that reason, I think it's a remarkable compliment to Python and to
Python 3 in particular that you consider Java an appropriate standard
of comparison for Python.

There's also the danger of stasis.  I think Lisp will never die, and
Common Lisp has done a good job of avoiding reboots.  But for
precisely that reason there continues to be a lively evolution of
seriously incompatible dialects, both Lisp-1 (Scheme) and Lisp-2.  I
see Python 3 as an attempt to bridle and ride this tiger, without
turning the rope into a noose and strangling the beast.

 > >If they're that easily convinced that Java is better they probably
 > >were a lost cause anyway, so I won't mourn their departure too much.
 > 
 > I really believe that *all* new users are fickle, if they don't have a 
 > mandate as to what they need to be learning.  Personally, I learned 
 > Python because of a memory leak in Swing.

Sure, but what Guido is saying, I think, is that as long as prominent
Python developers don't announce its funeral, the other things we
could do to encourage them are going to get lost in the noise of
inherent fickleness.  Which isn't just random, it depends on things
like availability of just the right library for one's app, etc.  But
there are too many of those to do them all, or even just to list them
up and try to prioritize them "objectively"---might as well be random.



Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-07 Thread Stephen J. Turnbull
Glenn Linderman writes:

 > But if you are interested in checking for security issues, shouldn't you 
 >   _first_ decode into some canonical form,

Yes.  That's all that is being asked for: that Python do strict
decoding to a canonical form by default.  That's a lot to ask, as it
turns out, but that is what we (the minority of strict Unicode
adherents, that is) want.

If you want the convenience and risk, I believe you should ask for it
by name (I suggest a name like "own_me" for the relaxed decoding
flag).  Failing that, it would be nice to have a global flag to
change the default.
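
For comparison, the existing errors argument of bytes.decode already works in this opt-in way: strict is the default, and anything more lenient has to be requested by name. A small sketch (there is no "own_me" flag; the function names decode_strict and decode_lenient are invented for the example):

```python
SUSPECT = b"..\xc0\xafetc"  # contains the overlong form of "/"


def decode_strict(raw):
    """Default behaviour: ill-formed UTF-8 raises UnicodeDecodeError."""
    return raw.decode("utf-8")  # errors="strict" is the default


def decode_lenient(raw):
    """Leniency requested by name: bad bytes become U+FFFD."""
    return raw.decode("utf-8", errors="replace")
```

Even the lenient handler substitutes a replacement character instead of producing the character the overlong sequence was pretending to be.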



Re: [Python-Dev] 3.0.1 possibilities

2008-12-07 Thread Martin v. Löwis
>> I think it is still timely when fixed in January or February.
>> In fact, releasing it still in December might not be possible,
>> due to the limited time available.
> 
> The cmp() / PyObject_Compare() removal patch is almost done.

I wasn't (primarily) talking about fixing this particular issue.
Time needs to be made available also for the upcoming 2.4.6 and 2.5.3
releases (which should, IMO, get priority over a 3.0 bugfix release
at this point).

> With some
> help I can finish it by Tuesday evening. We can have another release
> by Monday Dec 15th. Python 3.0.0 has some defects that should be fixed
> before people are spending their Xmas holidays with 3.0. The defects
> include
> 
> * cmp(), PyObject_Compare() and friends
> * global/nonlocal shortcuts (global x = 0) aren't working
> * unnecessary slowdown of read() due to slow buffer resizing.

I think 3.0.1 should also address other serious bugs in 3.0, such
as
- various IDLE bugs with non-ASCII characters (2827, 4008, 4323, 4410)
- various ways to crash Python through the buffer protocol
  (4583, 4509; also 4580)

> An early 3.0.1 release makes it possible to sync 2.6 and 3.0 releases
> again.

IIUC, you want the bugfix version number to be sync'ed. I don't
think that is a useful thing to have.

> If Guido and Barry are fine with a lax policy on performance fixes we
> can integrate more tweaks. I believe performance patches were
> considered as features in the past. For this reason they weren't allowed
> for minor releases. Mark's work on long integer optimizations and json
> speedup are good candidates.

I don't recall such policy, and I can't see anything wrong with
including performance fixes in a bug fix release. Maybe you were
confusing this with whether performance fixes can be considered
release-critical (which they shouldn't, IMO)?

Regards,
Martin


Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-07 Thread Glenn Linderman
On approximately 12/7/2008 8:13 PM, came the following characters from 
the keyboard of Stephen J. Turnbull:

> Glenn Linderman writes:
>
>  > But if you are interested in checking for security issues, shouldn't you 
>  >   _first_ decode into some canonical form,
>
> Yes.  That's all that is being asked for: that Python do strict
> decoding to a canonical form by default.  That's a lot to ask, as it
> turns out, but that is what we (the minority of strict Unicode
> adherents, that is) want.



I have no problem with having strict validation available.  But doesn't 
validation take significantly longer than decoding?  So I think it 
should be logically decoupled... do validation when/where it is needed 
for security reasons, and allow internal [de]coding to be faster.


I'm mostly indifferent about which should be the default... maybe there 
shouldn't be a default!  Use the "vUTF-8" decoder for strict validation, 
and the "fUTF-8" decoder for the faster, non-validating version.  Or 
something like that.  With appropriate documentation.  Of course, 
"UTF-8" already exists... as "fUTF-8", so for compatibility, I guess it 
shouldn't change... but it could be deprecated.



You didn't address the issue that if the decoding to a canonical form is 
done first, many of the insecurities just go away, so why throw errors?



--
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-07 Thread Adam Olsen
On Sun, Dec 7, 2008 at 9:45 PM, Glenn Linderman <[EMAIL PROTECTED]> wrote:
> On approximately 12/7/2008 8:13 PM, came the following characters from the
> keyboard of Stephen J. Turnbull:
>>
>> Glenn Linderman writes:
>>
>>  > But if you are interested in checking for security issues, shouldn't you
>>  >   _first_ decode into some canonical form,
>>
>> Yes.  That's all that is being asked for: that Python do strict
>> decoding to a canonical form by default.  That's a lot to ask, as it
>> turns out, but that is what we (the minority of strict Unicode
>> adherents, that is) want.
>
>
> I have no problem with having strict validation available.  But doesn't
> validation take significantly longer than decoding?  So I think it should be
> logically decoupled... do validation when/where it is needed for security
> reasons, and allow internal [de]coding to be faster.

I'd like to see benchmarks of such a claim.


> I'm mostly indifferent about which should be the default... maybe there
> shouldn't be a default!  Use the "vUTF-8" decoder for strict validation, and
> the "fUTF-8" decoder for the faster, non-validating version.  Or something
> like that.  With appropriate documentation.  Of course, "UTF-8" already
> exists... as "fUTF-8", so for compatibility, I guess it shouldn't change...
> but it could be deprecated.
>
>
> You didn't address the issue that if the decoding to a canonical form is
> done first, many of the insecurities just go away, so why throw errors?

Unicode is intended to allow interaction between various bits of
software.  It may be that a library checked it in UTF-8, then passed
it to python.  It would be nice if the library validated too, but a
major advantage of UTF-8 is older libraries (or protocols!) intended
for ASCII need only be 8-bit clean to be repurposed for UTF-8.  Their
security checks continue to work, so long as nobody downstream
introduces problems with a non-validating decoder.


-- 
Adam Olsen, aka Rhamphoryncus


Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-07 Thread Glenn Linderman
On approximately 12/7/2008 9:11 PM, came the following characters from 
the keyboard of Adam Olsen:

> On Sun, Dec 7, 2008 at 9:45 PM, Glenn Linderman <[EMAIL PROTECTED]> wrote:
>> On approximately 12/7/2008 8:13 PM, came the following characters from the
>> keyboard of Stephen J. Turnbull:
>>>
>>> Glenn Linderman writes:
>>>
>>>  > But if you are interested in checking for security issues, shouldn't you
>>>  >   _first_ decode into some canonical form,
>>>
>>> Yes.  That's all that is being asked for: that Python do strict
>>> decoding to a canonical form by default.  That's a lot to ask, as it
>>> turns out, but that is what we (the minority of strict Unicode
>>> adherents, that is) want.
>>
>> I have no problem with having strict validation available.  But doesn't
>> validation take significantly longer than decoding?  So I think it should be
>> logically decoupled... do validation when/where it is needed for security
>> reasons, and allow internal [de]coding to be faster.
>
> I'd like to see benchmarks of such a claim.



"significantly" seems to be the only word at question; it seems that 
there are a fair number of validation checks that could be performed; 
the numeric part of UTF-8 decoding is just a sequence of shifts, masks, 
and ORs, so can be coded pretty tightly in C or assembly language.


Anything extra would be slower; how much slower is hard to predict prior 
to the implementation.  My "significantly" was just the expectation that 
the larger code with more conditional branches that is required for 
validation is less likely to stay in cache, takes longer to load into 
cache, and takes longer to execute.  This also seems to be supported by 
Stephen's comment "That's a lot to ask, as it turns out."


Once upon a time I did write an unvalidated UTF-8 encoder/decoder in C, 
I wonder if I could find that code?  Can you supply a validated decoder? 
 Then we could run some benchmarks, eh?
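
To make the comparison concrete, here is a rough sketch of both approaches for two-byte sequences only, written in Python for readability instead of C. It is an illustration, not anyone's actual codec, and the names decode_2byte_naive and decode_2byte_checked are invented for it:

```python
def decode_2byte_naive(b0, b1):
    """Shifts, masks, and an OR only; no validation at all."""
    return chr(((b0 & 0x1F) << 6) | (b1 & 0x3F))


def decode_2byte_checked(b0, b1):
    """Same arithmetic plus the checks a conforming decoder needs."""
    # Lead byte must look like 110xxxxx, trail byte like 10xxxxxx.
    if b0 & 0xE0 != 0xC0 or b1 & 0xC0 != 0x80:
        raise ValueError("not a well-formed 2-byte sequence")
    code = ((b0 & 0x1F) << 6) | (b1 & 0x3F)
    # Anything below U+0080 fits in one byte, so this form is overlong.
    if code < 0x80:
        raise ValueError("overlong encoding")
    return chr(code)
```

For two-byte sequences the extra work is two comparisons, and the naive version is the one that turns 0xC0 0xAF into "/". Whether the difference matters in practice would still need measuring.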




>> I'm mostly indifferent about which should be the default... maybe there
>> shouldn't be a default!  Use the "vUTF-8" decoder for strict validation, and
>> the "fUTF-8" decoder for the faster, non-validating version.  Or something
>> like that.  With appropriate documentation.  Of course, "UTF-8" already
>> exists... as "fUTF-8", so for compatibility, I guess it shouldn't change...
>> but it could be deprecated.
>>
>>
>> You didn't address the issue that if the decoding to a canonical form is
>> done first, many of the insecurities just go away, so why throw errors?
>
> Unicode is intended to allow interaction between various bits of
> software.  It may be that a library checked it in UTF-8, then passed
> it to python.  It would be nice if the library validated too, but a
> major advantage of UTF-8 is older libraries (or protocols!) intended
> for ASCII need only be 8-bit clean to be repurposed for UTF-8.  Their
> security checks continue to work, so long as nobody downstream
> introduces problems with a non-validating decoder.



So I don't understand how this is responsive to the "decoding removes 
many insecurities" issue?


Yes, you might use libraries.  Either they have insecurities, or not. 
Either they validate, or not.  Either they decode, or not.  They may be 
immune to certain attacks, because of their structure and code, or not.


So when you examine a library for potential use, you have documentation 
or code to help you set your expectations about what it does, and 
whether or not it may have vulnerabilities, and whether or not those 
vulnerabilities are likely or unlikely, whether you can reduce the 
likelihood or prevent the vulnerabilities by wrapping the API, etc.  And 
so you choose to use the library, or not.


This whole discussion about libraries seems somewhat irrelevant to the 
question at hand, although it is certainly true that understanding how a 
library handles Unicode is an important issue for the potential user of 
a library.


So how does a non-validating decoder introduce problems?  I can see that 
it might not solve all problems, but how does it introduce problems? 
Wouldn't the problems be introduced by something else, and the use of a 
non-validating decoder may not catch the problem... but not be the cause 
of the problem?


And then, if you would like to address the original issue, that would be 
fine too.



--
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-07 Thread Adam Olsen
On Sun, Dec 7, 2008 at 11:04 PM, Glenn Linderman <[EMAIL PROTECTED]> wrote:
> On approximately 12/7/2008 9:11 PM, came the following characters from the
> keyboard of Adam Olsen:
>> On Sun, Dec 7, 2008 at 9:45 PM, Glenn Linderman <[EMAIL PROTECTED]>
>> wrote:
>
> Once upon a time I did write an unvalidated UTF-8 encoder/decoder in C, I
> wonder if I could find that code?  Can you supply a validated decoder?  Then
> we could run some benchmarks, eh?

There is no point for me, as the behaviour of a real UTF-8 codec is
clear.  It is you who needs to justify a second non-standard UTF-8-ish
codec.  See below.


>>> You didn't address the issue that if the decoding to a canonical form is
>>> done first, many of the insecurities just go away, so why throw errors?
>>
>> Unicode is intended to allow interaction between various bits of
>> software.  It may be that a library checked it in UTF-8, then passed
>> it to python.  It would be nice if the library validated too, but a
>> major advantage of UTF-8 is older libraries (or protocols!) intended
>> for ASCII need only be 8-bit clean to be repurposed for UTF-8.  Their
>> security checks continue to work, so long as nobody downstream
>> introduces problems with a non-validating decoder.
>
>
> So I don't understand how this is responsive to the "decoding removes many
> insecurities" issue?
>
> Yes, you might use libraries.  Either they have insecurities, or not. Either
> they validate, or not.  Either they decode, or not.  They may be immune to
> certain attacks, because of their structure and code, or not.
>
> So when you examine a library for potential use, you have documentation or
> code to help you set your expectations about what it does, and whether or
> not it may have vulnerabilities, and whether or not those vulnerabilities
> are likely or unlikely, whether you can reduce the likelihood or prevent the
> vulnerabilities by wrapping the API, etc.  And so you choose to use the
> library, or not.
>
> This whole discussion about libraries seems somewhat irrelevant to the
> question at hand, although it is certainly true that understanding how a
> library handles Unicode is an important issue for the potential user of a
> library.
>
> So how does a non-validating decoder introduce problems?  I can see that it
> might not solve all problems, but how does it introduce problems? Wouldn't
> the problems be introduced by something else, and the use of a
> non-validating decoder may not catch the problem... but not be the cause of
> the problem?
>
> And then, if you would like to address the original issue, that would be
> fine too.

Your non-validating decoder is translating an invalid sequence into a
valid one, thus you are introducing the problem.  A completely naive
environment (8-bit clean ASCII) would leave it as an invalid sequence
throughout.

This is not a theoretical problem.  See
http://tools.ietf.org/html/rfc3629#section-10 .  We MUST reject
invalid sequences, or else we are not using UTF-8.  There is no wiggle
room, no debate.

(The absoluteness is why the standard behaviour doesn't need a
benchmark.  You are essentially arguing that, when logging in as root
over the internet, it's a lot faster if you use telnet rather than
ssh.  One is simply not an option.)
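
As a quick check of what rejecting invalid sequences looks like in practice, the sketch below probes the utf-8 codec of a current Python 3 interpreter with a few ill-formed inputs. The helper name strictly_decodes and the sample table are mine, and the behaviour described is that of present-day Python 3, not necessarily of 3.0 as discussed in this thread:

```python
def strictly_decodes(raw):
    """Return True if the default (strict) utf-8 codec accepts `raw`."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


ILL_FORMED = {
    "overlong 2-byte '/'": b"\xc0\xaf",
    "overlong 3-byte '/'": b"\xe0\x80\xaf",
    "encoded surrogate U+D800": b"\xed\xa0\x80",
}

# Collect the cases the codec refuses.
rejected = [label for label, raw in ILL_FORMED.items()
            if not strictly_decodes(raw)]
```

On such an interpreter all three samples end up in rejected, while ordinary text such as "é/" round-trips through the same codec without complaint.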


-- 
Adam Olsen, aka Rhamphoryncus