> Bad news: it looks like Linux doesn't normalize filenames. So if you used NFC
> to create a file, you have to reuse NFC to open your file (and the same for
> NFD).
That's not news to me. Of course it does: Unix is completely agnostic of
encodings in file APIs. On the implementation level, it's
> However, Martin, I can promise you that I will _never_ ask for any
> convenience functions related to bytes as a result of this decision.
:-)
Regards,
Martin
___
Python-3000 mailing list
Python-3000@python.org
http://mail.python.org/mailman/listinfo/p
Martin v. Löwis wrote:
Guido van Rossum wrote:
However
the *proposed* behavior (returns bytes if the arg was bytes, and
returns str when the arg was str) is IMO sane, and no different than
the polymorphism found in len() or many builtin operations.
My concern still is that it brings the bytes
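
The polymorphism being debated here (bytes in, bytes out; str in, str out) can be illustrated with os.path.join as it behaves in current Python 3 releases. This is only a sketch of the idea, not the patch discussed in the thread; the variable names are made up for illustration.

```python
import os.path

# str arguments give a str result
joined_str = os.path.join("dir", "file.txt")

# bytes arguments give a bytes result
joined_bytes = os.path.join(b"dir", b"file.txt")

# Mixing the two types is rejected rather than silently converted
try:
    os.path.join("dir", b"file.txt")
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(type(joined_str).__name__, type(joined_bytes).__name__, mixed_ok)
```

The result type follows the argument type, much like len() works on any sized object, and the mixed call fails with a TypeError.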
On Sep 30, 2008, at 10:06 PM, [EMAIL PROTECTED] wrote:
However, Martin, I can promise you that I will _never_ ask for any
convenience functions related to bytes as a result of this
decision. I want bytes to come back from filesystem APIs because I
intend to have a wrapper layer which knows
On Tue, Sep 30, 2008 at 8:06 PM, <[EMAIL PROTECTED]> wrote:
> The proposal of using U+ seems like it would have been almost the same
> from such a wrapper's perspective, except (A) people using the filesystem
> APIs without the benefit of such a wrapper would have been even more
> screwed, and
James Y Knight wrote:
Since from what I've tried, things seem to work, I'd really like to
know what precisely does fail from the opponents of utf-8b.
Seems like what will fail is taking one of these utf-8b
decoded names and passing it to some external library
that uses it as a filename withou
On Sep 30, 2008, at 5:51 PM, Martin v. Löwis wrote:
While I can sympathize with people having non-ASCII file names on
their
disks, I can't sympathize with this example. Normal users just don't
put \x90 into their command lines, and those who do deserve the error
message they get.
That's just
Le Wednesday 01 October 2008 00:28:22 Martin v. Löwis, vous avez écrit :
> I don't think we will manage to release Python 3.0 this year if that
> change is to be implemented. And then, I don't think the release manager
> will agree to such a delay.
The minimum change is to disallow bytes/str mix:
Martin answered a similar question from Jack Jansen in another thread.
OSX doesn't normalize either. It's unlikely to confuse users in
practice.
On Tue, Sep 30, 2008 at 4:11 PM, Victor Stinner
<[EMAIL PROTECTED]> wrote:
> Since it's hard to follow the filename thread on two mailing lists, I'm
> sta
Since it's hard to follow the filename thread on two mailing lists, I'm
starting a new thread only on python-3000 about unicode normalization of the
filenames.
Bad news: it looks like Linux doesn't normalize filenames. So if you used NFC
to create a file, you have to reuse NFC to open your file
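
A minimal sketch of why the normalization form matters, assuming a filesystem that stores names as raw bytes without normalizing them: the NFC and NFD spellings of the same visible name are different strings and encode to different byte sequences. The file name is borrowed from elsewhere in the thread; nothing here touches the disk.

```python
import unicodedata

# U+00FC is the precomposed "u with diaeresis"
nfc_name = unicodedata.normalize("NFC", "\u00fcmlaut.txt")
nfd_name = unicodedata.normalize("NFD", "\u00fcmlaut.txt")

# Same visible text, different code point sequences
print(nfc_name == nfd_name)
print(len(nfc_name), len(nfd_name))

# Different bytes once encoded, so a byte-oriented filesystem
# treats them as two distinct names
print(nfc_name.encode("utf-8"))
print(nfd_name.encode("utf-8"))

# Normalizing both to one form makes them comparable again
same_after_nfc = unicodedata.normalize("NFC", nfd_name) == nfc_name
```

An application that wants to find a file regardless of form would have to normalize names itself before comparing, which is the "left to the application" position taken later in the thread.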
On Sep 30, 2008, at 6:21 PM, Martin v. Löwis wrote:
IOW, Java hasn't solved the problem in the last 10 years.
Java is already really bad at being a small little language to write
cooperating tools in. I'd never even attempt to write a little
pipeline filter in Java -- I've already pretty mu
On Tue, Sep 30, 2008 at 3:21 PM, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
>>> My concern still is that it brings the bytes type into the status of
>>> another character string type, which is really bad, and will require
>>> further modifications to Python for the lifetime of 3.x.
>>
>> I'd like
> How does windows (and Python on windows) handle NFC versus NFD issues?
That's left to the application.
> Can I have two files called "ümlaut.txt", one in NFD and one NFC form?
Yes, you can. It sounds confusing, but only in a theoretical way. You
never have combining characters on Windows (at
> Yes! If there is a byte-string access method for Windows, pretty please
> make it decode from UTF-8 internally and call the Unicode version of the
> Windows APIs. The non-unicode windows APIs are pretty much just broken
> -- Ideally, Python should never be calling those.
I don't think we will ma
Le mardi 30 septembre 2008 à 23:33 +0200, "Martin v. Löwis" a écrit :
> > By the way, doesn't all this controversy yearn for a PEP?
>
> There must be a solution for 3.0 (which *could* be "it's a bug,
> don't use Python 3.0 on such broken systems"); we can't wait for
> a PEP to resolve this issue f
>> My concern still is that it brings the bytes type into the status of
>> another character string type, which is really bad, and will require
>> further modifications to Python for the lifetime of 3.x.
>
> I'd like to understand why this is "really bad". I thought it was by
> design that the str
On Tue, Sep 30, 2008 at 11:47 AM, <[EMAIL PROTECTED]> wrote:
>
> On 05:56 pm, [EMAIL PROTECTED] wrote:
>>
>> On Tue, Sep 30, 2008 at 10:59 AM, <[EMAIL PROTECTED]> wrote:
>>>
>>> On 02:32 pm, [EMAIL PROTECTED] wrote:
>
>>> In the absence of a 2.6 getcwdb, perhaps the fixer could just drop the
>>>
Guido van Rossum wrote:
> On Tue, Sep 30, 2008 at 2:31 PM, Nick Coghlan <[EMAIL PROTECTED]> wrote:
>> I'm also starting to wonder if allowing mixed types might be the way to
>> go for these interfaces - leaving the bytes objects in place if the
>> Unicode decode operation fails.
>
> No, no, no
On Sep 30, 2008, at 5:40 PM, Martin v. Löwis wrote:
On Windows, we might reject bytes filenames for all file
operations: open(),
unlink(), os.path.join(), etc. (raise a TypeError or UnicodeError)
Since I've seen no objections to this yet: please no. If we offer a
"lower-level" bytes filename
> $ ./python -c "import sys; print(sys.argv)" "$(echo -e 'filename\x90\x90')"
> Could not convert argument 3 to str
> $ ./python -c "import os; print(os.environ['DUMMY'])"
> Traceback (most recent call last):
>   File "<string>", line 1, in <module>
> File "/home/ncoghlan/devel/py3k/Lib/os.py", line 389, in __ge
Jan Althaus wrote:
> Please correct me if I'm wrong, but it doesn't seem like there is a full
> documentation of PyModuleDef's members available?
That's most likely the case, yes.
> While some of them are intuitive, others aren't. The usage of m_size in
> particular isn't clear to me.
See PEP 31
> Oh, ok. I had assumed Windows just uses a fixed encoding without the problem
> of misencoded filenames.
It's the other way 'round: On Windows, Unicode file names are the
natural choice, and byte strings have limitations. In a sense, Windows
got it right - but then, they started later. Unix misse
>> On Windows, we might reject bytes filenames for all file operations: open(),
>> unlink(), os.path.join(), etc. (raise a TypeError or UnicodeError)
>
> Since I've seen no objections to this yet: please no. If we offer a
> "lower-level" bytes filename API, it should work for all platforms.
Unfo
On Tue, Sep 30, 2008 at 2:31 PM, Nick Coghlan <[EMAIL PROTECTED]> wrote:
> I'm also starting to wonder if allowing mixed types might be the way to
> go for these interfaces - leaving the bytes objects in place if the
> Unicode decode operation fails.
No, no, no!
--
--Guido van Rossum (home p
On Tue, Sep 30, 2008 at 1:29 PM, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> Guido van Rossum wrote:
>> However
>> the *proposed* behavior (returns bytes if the arg was bytes, and
>> returns str when the arg was str) is IMO sane, and no different than
>> the polymorphism found in len() or many b
2008/9/30 Glenn Linderman <[EMAIL PROTECTED]>:
> So the problem is that a Unicode file system interface can't deal with
> non-UTF-8 byte streams as file names.
>
> So it seems there are four suggested approaches, all of which have aspects
> that are inconvenient.
Let's not forget what happens whe
> By the way, doesn't all this controversy yearn for a PEP?
There must be a solution for 3.0 (which *could* be "it's a bug,
don't use Python 3.0 on such broken systems"); we can't wait for
a PEP to resolve this issue for 3.0.
Most likely, the solution for 3.0 arrives through BDFL pronouncement,
i
James Y Knight wrote:
> Those aren't good behaviors, and can't be solved simply by pretending
> certain files don't exist.
A couple of output comparisons for two of James's examples (system
Python is 2.5.2, the Python :
$ python -V
Python 2.5.2
$ python -c "import sys; print sys.argv" "$(echo -e
On Tue, Sep 30, 2008 at 1:12 PM, Terry Reedy <[EMAIL PROTECTED]> wrote:
> Terry Reedy wrote:
>>
>> Guido van Rossum wrote:
>
>>> I'm not sure either way. I've heard it claimed that Windows filesystem
>>> APIs use Unicode natively. Does Python 3.0 on Windows currently
>>> support filenames expressed a
On Tue, Sep 30, 2008 at 1:04 PM, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> Guido van Rossum wrote:
>> On Mon, Sep 29, 2008 at 11:00 PM, "Martin v. Löwis" <[EMAIL PROTECTED]>
>> wrote:
Change the default file system encoding to store bytes in Unicode is like
introducing a new Python
On Tue, Sep 30, 2008 at 12:42 PM, Terry Reedy <[EMAIL PROTECTED]> wrote:
> Guido van Rossum wrote:
>>
>> On Tue, Sep 30, 2008 at 11:13 AM, Georg Brandl <[EMAIL PROTECTED]> wrote:
>>>
>>> Victor Stinner schrieb:
On Windows, we might reject bytes filenames for all file operations:
open
Please correct me if I'm wrong, but it doesn't seem like there is a
full documentation of PyModuleDef's members available?
While some of them are intuitive, others aren't. The usage of m_size
in particular isn't clear to me. I understand this is the size of
additional per-interpreter storage,
> I'm not sure either way. I've heard it claimed that Windows filesystem
> APIs use Unicode natively. Does Python 3.0 on Windows currently
> support filenames expressed as bytes?
Yes, it does (at least, os.open, os.stat support them, builtin open
doesn't).
> Are they encoded first before
> passing
Guido van Rossum wrote:
> However
> the *proposed* behavior (returns bytes if the arg was bytes, and
> returns str when the arg was str) is IMO sane, and no different than
> the polymorphism found in len() or many builtin operations.
My concern still is that it brings the bytes type into the statu
> I didn't get an answer to my question: what is the result characters) stored in unicode> + ? I guess that the result is
> instead of raising an error
> (invalid types). So again: why introducing a new type instead of reusing
> existing Python types?
I didn't mean to introduce a new data typ
Terry Reedy wrote:
Guido van Rossum wrote:
I'm not sure either way. I've heard it claimed that Windows filesystem
APIs use Unicode natively. Does Python 3.0 on Windows currently
support filenames expressed as bytes? Are they encoded first before
passing to the Unicode APIs? Using what encoding?
Martin v. Löwis v.loewis.de> writes:
>
> True. I try to outweigh the need for simplicity in the API against the
> need to support all cases. So I see two solutions:
>
> a) (...)
>
> b) (...)
By the way, doesn't all this controversy yearn for a PEP?
Guido van Rossum wrote:
> On Mon, Sep 29, 2008 at 11:00 PM, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
>>> Change the default file system encoding to store bytes in Unicode is like
>>> introducing a new Python type: .
>> Exactly. Seems like the best solution to me, despite your polemics.
>
> Mar
2008/9/30 Marcin 'Qrczak' Kowalczyk <[EMAIL PROTECTED]>:
> I've experimentally implemented (not for Python) a different escaping
> scheme with a similar goal as UTF-8b: undecodable bytes are prefixed
> with U+ instead of being converted to unpaired surrogates, and
> '\x00' decodes as U+ U+
Guido van Rossum wrote:
On Tue, Sep 30, 2008 at 11:13 AM, Georg Brandl <[EMAIL PROTECTED]> wrote:
Victor Stinner schrieb:
On Windows, we might reject bytes filenames for all file operations: open(),
unlink(), os.path.join(), etc. (raise a TypeError or UnicodeError)
Since I've seen no objection
Guido van Rossum schrieb:
> On Tue, Sep 30, 2008 at 11:13 AM, Georg Brandl <[EMAIL PROTECTED]> wrote:
>> Victor Stinner schrieb:
>>> On Windows, we might reject bytes filenames for all file operations: open(),
>>> unlink(), os.path.join(), etc. (raise a TypeError or UnicodeError)
>>
>> Since I've s
On Sep 30, 2008, at 12:57 PM, Guido van Rossum wrote:
And again: if utf-8b isn't acceptable, because it does break things
in some
unknown-to-me way, I really can't imagine anything working but just
going
back to byte-string access as the only API. It's really not okay
for the
"obvious" API
On Tue, Sep 30, 2008 at 11:13 AM, Georg Brandl <[EMAIL PROTECTED]> wrote:
> Victor Stinner schrieb:
>> On Windows, we might reject bytes filenames for all file operations: open(),
>> unlink(), os.path.join(), etc. (raise a TypeError or UnicodeError)
>
> Since I've seen no objections to this yet: pl
Victor Stinner schrieb:
> Hi,
>
> After reading the previous discussion, here is a new proposition.
>
> Python 2.x and Windows are not affected by this issue. Only Python3 on POSIX
> (eg. Linux or *BSD) is affected.
>
> Some systems are broken, but Python has to be able to open/copy/move/remove
On Tue, Sep 30, 2008 at 10:59 AM, <[EMAIL PROTECTED]> wrote:
> On 02:32 pm, [EMAIL PROTECTED] wrote:
>> If 2.6 weren't pretty much released already I'd ask to add
>> os.getcwdb() there, as an alias for os.getcwd(), and add a 2to3 fixer
>> that converts os.getcwdu() to os.getcwd(), leaves os.getcwd
On Tue, Sep 30, 2008 at 10:41 AM, Bill Janssen <[EMAIL PROTECTED]> wrote:
> Guido van Rossum <[EMAIL PROTECTED]> wrote:
>> On Tue, Sep 30, 2008 at 8:47 AM, Bill Janssen <[EMAIL PROTECTED]> wrote:
>> > Victor Stinner <[EMAIL PROTECTED]> wrote:
>> >
>> >> - listdir(unicode) -> only unicode, *skip* i
On 2008-09-30 18:46, Guido van Rossum wrote:
> On Tue, Sep 30, 2008 at 8:20 AM, M.-A. Lemburg <[EMAIL PROTECTED]> wrote:
>> In the end, I think it's better not to be clever and just return
>> the filenames that cannot be decoded as bytes objects in os.listdir().
>
> Unfortunately that's going to b
On Sep 30, 2008, at 1:37 PM, Marcin 'Qrczak' Kowalczyk wrote:
I've experimentally implemented (not for Python) a different escaping
scheme with a similar goal as UTF-8b: undecodable bytes are prefixed
with U+ instead of being converted to unpaired surrogates, and
'\x00' decodes as U+ U+00
On Tue, Sep 30, 2008 at 10:28 AM, Georg Brandl <[EMAIL PROTECTED]> wrote:
>> How can it *regularly* drive you crazy when "the majority of file names
>> [...] encoded correctly" (as you assert above)?
>
> Because Office files are a) often named with long, seemingly descriptive
> filenames, which inva
Guido van Rossum <[EMAIL PROTECTED]> wrote:
> On Tue, Sep 30, 2008 at 8:47 AM, Bill Janssen <[EMAIL PROTECTED]> wrote:
> > Victor Stinner <[EMAIL PROTECTED]> wrote:
> >
> >> - listdir(unicode) -> only unicode, *skip* invalid filenames
> >>(as asked by Guido)
> >
> > Is there an option listdir(
2008/9/30 James Y Knight <[EMAIL PROTECTED]>:
u'\udc90\udc90'.encode('utf-8')
> '\xed\xb2\x90\xed\xb2\x90'
This is wrong: UTF-8 (like other UTF-x) encodes Unicode scalar values,
not Unicode code points, i.e. surrogates as such are unencodable.
'\xed\xb2\x90' is invalid UTF-8.
I've experimen
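
The point about surrogates can be checked in a current Python 3 interpreter. This is a sketch under the assumption that the "surrogatepass" and "surrogateescape" error handlers are available; they come from later Python 3 releases and were not part of the Python being discussed in this thread.

```python
name = "\udc90\udc90"

# Strict UTF-8 refuses lone surrogates
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# "surrogatepass" reproduces the byte sequence quoted above
passed = name.encode("utf-8", "surrogatepass")

# "surrogateescape" maps each lone surrogate back to one raw byte,
# which is the round-trip idea behind UTF-8b
escaped = name.encode("utf-8", "surrogateescape")
roundtrip = escaped.decode("utf-8", "surrogateescape")

print(strict_ok, passed, escaped, roundtrip == name)
```

With the strict codec the encode call fails, so the bytes shown in the quoted example only appear when surrogates are explicitly allowed through.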
Guido van Rossum wrote:
On Mon, Sep 29, 2008 at 8:55 PM, Terry Reedy <[EMAIL PROTECTED]> wrote:
Le Monday 29 September 2008 19:06:01 Guido van Rossum, vous avez écrit :
I know I keep flipflopping on this one, but the more I think about it
the more I believe it is better to drop those names than
On Tue, Sep 30, 2008 at 9:20 AM, James Y Knight <[EMAIL PROTECTED]> wrote:
>
> On Sep 29, 2008, at 11:11 PM, Stephen J. Turnbull wrote:
>
>>> Except...that one over there. That's the whole point of UTF-8b:
>>> correctly encoded names get decoded correctly and readably, and the
>>> other cases get d
On Tue, Sep 30, 2008 at 8:47 AM, Bill Janssen <[EMAIL PROTECTED]> wrote:
> Victor Stinner <[EMAIL PROTECTED]> wrote:
>
>> - listdir(unicode) -> only unicode, *skip* invalid filenames
>>(as asked by Guido)
>
> Is there an option listdir(bytes) which will return *all* filenames (as
> byte sequen
On Tue, Sep 30, 2008 at 8:20 AM, M.-A. Lemburg <[EMAIL PROTECTED]> wrote:
> In the end, I think it's better not to be clever and just return
> the filenames that cannot be decoded as bytes objects in os.listdir().
Unfortunately that's going to break most code that is using
os.listdir(), so it's ha
On Sep 29, 2008, at 11:11 PM, Stephen J. Turnbull wrote:
Except...that one over there. That's the whole point of UTF-8b:
correctly encoded names get decoded correctly and readably, and the
other cases get decoded into something unique that cannot possibly
conflict.
Sure. But there are lots o
On 12:47 am, [EMAIL PROTECTED] wrote:
This is the most sane contribution I've seen so far :).
See attached patch: python3_bytes_filename.patch
Using the patch, you will get:
- open() supports bytes
- listdir(unicode) -> only unicode, *skip* invalid filenames
(as asked by Guido)
Forgive me fo
Victor Stinner <[EMAIL PROTECTED]> wrote:
> - listdir(unicode) -> only unicode, *skip* invalid filenames
>(as asked by Guido)
Is there an option listdir(bytes) which will return *all* filenames (as
byte sequences)? Otherwise, this seems troubling to me; *something*
should be returned for f
On 2008-09-30 16:05, Guido van Rossum wrote:
> On Tue, Sep 30, 2008 at 3:31 AM, M.-A. Lemburg <[EMAIL PROTECTED]> wrote:
>> On 2008-09-30 08:00, Martin v. Löwis wrote:
Change the default file system encoding to store bytes in Unicode is like
introducing a new Python type: .
>>> Exactly. S
On Sep 29, 2008, at 7:50 PM, Adam Olsen wrote:
I'd rather the 1% of cases that need to handle bad file names make an
explicit effort to do so, via alternate byte APIs or (if necessary)
the 8859-1 hack.
So are you okay with python failing to run properly if the current
directory has strange by
On Tue, Sep 30, 2008 at 6:21 AM, <[EMAIL PROTECTED]> wrote:
> On 12:47 am, [EMAIL PROTECTED] wrote:
>
> This is the most sane contribution I've seen so far :).
Thanks. I'll review it later today (after coffee+breakfast :) and will
apply it assuming the code is reasonably sane, otherwise I'll go
a
On Tue, Sep 30, 2008 at 3:31 AM, M.-A. Lemburg <[EMAIL PROTECTED]> wrote:
> On 2008-09-30 08:00, Martin v. Löwis wrote:
>>> Change the default file system encoding to store bytes in Unicode is like
>>> introducing a new Python type: .
>>
>> Exactly. Seems like the best solution to me, despite your
Le Tuesday 30 September 2008 15:53:09 Guido van Rossum, vous avez écrit :
> On Mon, Sep 29, 2008 at 11:00 PM, "Martin v. Löwis" <[EMAIL PROTECTED]>
wrote:
> >> Change the default file system encoding to store bytes in Unicode is
> >> like introducing a new Python type: .
> >
> > Exactly. Seems lik
On Tue, Sep 30, 2008 at 2:28 AM, Antoine Pitrou <[EMAIL PROTECTED]> wrote:
> Adam Olsen gmail.com> writes:
>>
>> The only way to display that file would be to transform it into some
>> other valid unicode string. However, as that string is already valid,
>> you've just made any files named after
On Mon, Sep 29, 2008 at 11:22 PM, Georg Brandl <[EMAIL PROTECTED]> wrote:
> No, that was not what I meant (although it is another possibility). As I
> wrote,
> Martin's proposal that I support here is using the modified UTF-8 codec that
> successfully roundtrips otherwise invalid UTF-8 data.
I th
Hi,
> This is the most sane contribution I've seen so far :).
Oh thanks.
> Do I understand properly that (listdir(bytes) -> bytes)?
Yes, os.listdir(bytes)->bytes. It's already the current behaviour.
But with Python3 trunk, os.listdir(str) -> str ... or bytes (if unicode
conversion fails).
>
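
The listdir(bytes) -> bytes and listdir(str) -> str behaviour can be sketched with a temporary directory. This shows current Python 3 behaviour with a plain ASCII name only; it does not reproduce the "or bytes if the conversion fails" case described above, and the file name is just an example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create one file with an ASCII-only name
    open(os.path.join(tmp, "example.txt"), "w").close()

    # A str path yields str names
    str_names = os.listdir(tmp)

    # A bytes path yields bytes names
    bytes_names = os.listdir(os.fsencode(tmp))

print(str_names, bytes_names)
```

The type of the argument selects the type of every returned name, so a caller that needs to see all entries regardless of encoding can pass a bytes path.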
On Mon, Sep 29, 2008 at 11:00 PM, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
>> Change the default file system encoding to store bytes in Unicode is like
>> introducing a new Python type: .
>
> Exactly. Seems like the best solution to me, despite your polemics.
Martin, I don't understand why you
On Mon, Sep 29, 2008 at 8:55 PM, Terry Reedy <[EMAIL PROTECTED]> wrote:
>
>> Le Monday 29 September 2008 19:06:01 Guido van Rossum, vous avez écrit :
>
>>> I know I keep flipflopping on this one, but the more I think about it
>>> the more I believe it is better to drop those names than to raise an
Le lundi 29 septembre 2008 à 17:50 -0600, Adam Olsen a écrit :
> It's correct in the sense that it can roundtrip all filenames. UTF-8b
> is lossy, so certain filenames are not roundtripped properly.
Why do you say UTF-8b is lossy? From what I've read it claims to be
lossless (i.e. the range of ch
On Tue, Sep 30, 2008 at 5:24 AM, Stephen J. Turnbull <[EMAIL PROTECTED]> wrote:
> Adam Olsen writes:
>
> > [1] You could argue that Unicode should add new scalars to handle all
> > currently invalid UTF-8 sequences.
>
> AFAIK there are about 2^31 of these, though!
They've promised to never alloc
On Tue, Sep 30, 2008 at 3:28 AM, Antoine Pitrou <[EMAIL PROTECTED]> wrote:
> Adam Olsen gmail.com> writes:
>>
>> The only way to display that file would be to transform it into some
>> other valid unicode string. However, as that string is already valid,
>> you've just made any files named after
Adam Olsen writes:
> [1] You could argue that Unicode should add new scalars to handle all
> currently invalid UTF-8 sequences.
AFAIK there are about 2^31 of these, though!
On 2008-09-30 08:00, Martin v. Löwis wrote:
>> Change the default file system encoding to store bytes in Unicode is like
>> introducing a new Python type: .
>
> Exactly. Seems like the best solution to me, despite your polemics.
Not a bad idea... have os.listdir() return Unicode subclasses that
Adam Olsen gmail.com> writes:
>
> The only way to display that file would be to transform it into some
> other valid unicode string. However, as that string is already valid,
> you've just made any files named after it impossible to open.
Not if those valid sequences are also properly escaped t