Re: [Python-Dev] Import and unicode: part two

2011-01-26 Thread Stephen J. Turnbull
Toshio Kuratomi writes:

 > Sure ... but with these systems, neither read-modules-as-locale or
 > read-modules-as-utf-8 are a good solution to work, correct?

Good solution, no, but I believe that read-modules-as-locale *should*
work to a great extent.  AFAIK Python 3 reads Python programs as str
(ie, converting to Unicode -- if it doesn't, it *should*).

 > Especially if the OS does get upgraded but the filesystems with
 > user data (and user created modules) are migrated as-is, you'll run
 > into situations where system installed modules are in utf-8 and
 > user created modules are shift-jis and so something will always be
 > broken.

I don't know what you mean by "system-installed modules".  If you're
talking about Python itself, it's not a problem.  Python doesn't have
any Japanese-named modules in any encoding.

On the other hand, *everything* that involves scripting (shell
scripts, make, etc) related to those filesystems will be broken
*unless* the system, after upgrade but before going live, is converted
to have an appropriate locale encoding.  So I don't really see a
problem here.

The problem is portability across systems, and that is a problem that
only the third-party transports can really deal with.  tar and unzip
need to be taught how to convert file names to the locale encoding, etc.
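A transport-level fix would amount to re-writing member names while repacking an archive. A minimal sketch, assuming the archive's names are known to be UTF-8 and the target locale is, say, Shift-JIS (both encoding names, and the function name, are assumptions the operator supplies, not an existing tool):

```python
import tarfile

def retag_names(src, dst, from_enc="utf-8", to_enc="shift_jis"):
    # Repack a tar archive, re-encoding each member name from the
    # archive's encoding to the target locale's encoding.  A sketch:
    # a real tool would also handle undecodable names and collisions.
    with tarfile.open(src, "r", encoding=from_enc) as tin, \
         tarfile.open(dst, "w", encoding=to_enc,
                      format=tarfile.GNU_FORMAT) as tout:
        for member in tin:
            fobj = tin.extractfile(member) if member.isfile() else None
            tout.addfile(member, fobj)
```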

 > The only way to make sure that modules work is to restrict them to ASCII-only
 > on the filesystem.  But because unicode module names are seen as
 > a necessary feature, the question is which way forward is going to lead to
 > the least brokenness.  Which could be locale... but from the python2
 > locale-related bugs that I get to look at, I doubt.

AFAICS this is going to be site-specific.  End of story.  Or, if you
prefer, "maru-nage".

IMHO, Python 2 locale bugs are unlikely to be a good guide to Python 3
locale bugs because in Python 2 most people just ignore locale and use
"native" strings (~= bytes in Python 3), and that typically "just
works".  In Python 3 that just *doesn't* work any more because you get
a UnicodeError on import, etc, etc.

IMHO, YMMV, and all that.  I know *of* such systems (there remain
quite a few here used by student and research labs), but the ones I
maintain were easy to convert to UTF-8 because I don't export file
systems (except my private files for my own use); everything is
mediated by Apache and Zope, and browsers are happy to cope if I
change from EUC-JP to UTF-8 and then flip the Apache switch to change
default encodings.


___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Import and unicode: part two

2011-01-26 Thread Victor Stinner
On Monday 24 January 2011 at 19:26 -0800, Toshio Kuratomi wrote:
> Why not locale:
> * Relying on locale is simply not portable. (...)
> * Mixing of modules from different locales won't work. (...)

I don't understand what you are talking about.

When you import a module, the module name becomes a filename. On
Windows, you can reuse the Unicode name directly as a filename. On the
other OSes, you have to encode the name to filesystem encoding. During
Python 3.2 development, we tried to be able to use a filesystem encoding
different than the locale encoding (PYTHONFSENCODING environment
variable), but it doesn't work, simply because Python is not alone in the
OS. Except for Python, all programs speak the same "language": the locale
encoding. To give an example: if you create a module with a
name encoded in UTF-8, your file browser will display mojibake.
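The lookup Victor describes boils down to a couple of lines. A simplified illustration, not the actual importlib code (the helper name is mine, and real imports also consult sys.getfilesystemencodeerrors()):

```python
import sys

def module_filename(modname):
    # Sketch: turn a (possibly non-ASCII) module name into the byte
    # string actually looked up on a POSIX filesystem.  On POSIX the
    # filesystem encoding is derived from the locale.
    fs_encoding = sys.getfilesystemencoding()
    return (modname + ".py").encode(fs_encoding, "surrogateescape")
```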

I don't understand the relation between the local filesystem encoding
and portability. I suppose that you are talking about the
distribution of a module to other computers. Here the question is how
the filenames are stored during the transfer. The user is free to use
any tool, and should find one that handles Unicode correctly :-) But that
is no longer Python's problem.

Each computer uses a different locale encoding. You have to use it to
cooperate with other programs and avoid mojibake. But I don't understand
why you write that "Mixing of modules from different locales won't
work". If you use a tool storing filenames in your locale encoding (eg.
TAR file format... and sometimes the ZIP format), the problem comes from
your tool and you should use another tool.

I created http://bugs.python.org/issue10972 to work around ZIP tools that
assume ZIP files use the locale encoding instead of cp437: the
issue adds an option to force use of the Unicode flag (and thus
store filenames as UTF-8). Initially, though, I created the issue to
work around a bootstrap problem (#10955).

Victor



Re: [Python-Dev] [Python-checkins] r88197 - python/branches/py3k/Lib/email/generator.py

2011-01-26 Thread Victor Stinner
Hi,

On Tuesday 25 January 2011 at 18:07 -0800, Brett Cannon wrote:
> This broke the buildbots (R. David Murray thinks you may have
> forgotten to call super() in the 'payload is None' branch). Are you
> getting code reviews and fully running the test suite before
> committing? We are in RC.
> (...)
> > -        if _has_surrogates(msg._payload):
> > -            self.write(msg._payload)
> > +        payload = msg.get_payload()
> > +        if payload is None:
> > +            return
> > +        if _has_surrogates(payload):
> > +            self.write(payload)

I didn't realize that such a minor change could do any harm: the
parent method (Generator._handle_text) has exactly the same test. If
msg._payload is None, calling the parent method with None does nothing. But
_has_surrogates() doesn't support None.

The problem is not the None test, but replacing msg._payload with
msg.get_payload(). I thought that get_payload() was a trivial getter
just reading self._payload, but I was completely wrong :-)

I was stupid not to run at least test_email, sorry. And no, I didn't ask
for a review, because I thought that such a minor change could not be
harmful.

FYI the commit is related indirectly to #9124 (Mailbox module should use
binary I/O, not text I/O).

Victor



Re: [Python-Dev] Import and unicode: part two

2011-01-26 Thread Martin v. Löwis
On 26.01.2011 10:40, Victor Stinner wrote:
> On Monday 24 January 2011 at 19:26 -0800, Toshio Kuratomi wrote:
>> Why not locale:
>> * Relying on locale is simply not portable. (...)
>> * Mixing of modules from different locales won't work. (...)
> 
> I don't understand what you are talking about.

I think by "portability", he means "moving files from one computer to
another". He argues that if Python would mandate UTF-8 for all file
names on Unix, moving files in such a way would support portability,
whereas using the locale's filename encoding might not (if the locale uses a
different charset on the target system).

While this is technically true, I don't think it's a helpful way of
thinking: by mandating that file names are UTF-8 when accessed from
Python, we make the actual files inaccessible on both the source and
the target system.

> I don't understand the relation between the local filesystem encoding
> and the portability. I suppose that you are talking about the
> distribution of a module to other computers. Here the question is how
> the filenames are stored during the transfer. The user is free to use
> any tool, and try to find a tool handling Unicode correctly :-) But it's
> no more the Python problem.

There are cases where there is no real "transfer", in the sense in which
you are using the word. For example, with NFS, you can access the very
same file simultaneously on two systems, with no file name conversion
(unless you are using NFSv4, and unless your NFSv4 implementations
support the UTF-8 mandate in NFS well).

Also, if two users of the same machine have different locale settings,
the same file name might be interpreted differently.

Regards,
Martin


Re: [Python-Dev] Import and unicode: part two

2011-01-26 Thread Oleg Broytman
On Wed, Jan 26, 2011 at 11:12:02AM +0100, "Martin v. Löwis" wrote:
> There are cases where there is no real "transfer", in the sense in which
> you are using the word. For example, with NFS, you can access the very
> same file simultaneously on two systems, with no file name conversion
> (unless you are using NFSv4, and unless your NFSv4 implementations
> support the UTF-8 mandate in NFS well).
> 
> Also, if two users of the same machine have different locale settings,
> the same file name might be interpreted differently.

   I have a solution for all these problems, with a price, of course.
Let's use utf8+base64. Base64 uses a very restricted subset of ASCII, so
filenames will never be misinterpreted, whatever the filesystem encoding
may be. The price is that users lose standard OS tools like ls and find.
   I am partially joking, of course, but only partially.
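Half-joking or not, the scheme is trivial to sketch (the function names here are mine, for illustration, not a proposal):

```python
import base64

def safe_name(name):
    # Encode to UTF-8 first, then URL-safe Base64 (no '/' can appear),
    # so the stored name is plain ASCII on any filesystem.
    return base64.urlsafe_b64encode(name.encode("utf-8")).decode("ascii")

def real_name(encoded):
    # Invert the transformation to recover the original Unicode name.
    return base64.urlsafe_b64decode(encoded.encode("ascii")).decode("utf-8")
```

The cost Oleg mentions is obvious in practice: `ls` shows opaque Base64 strings instead of readable names.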

Oleg.
-- 
 Oleg Broytman          http://phdru.name/          p...@phdru.name
   Programmers don't die, they just GOSUB without RETURN.


Re: [Python-Dev] Import and unicode: part two

2011-01-26 Thread Victor Stinner
On Wednesday 26 January 2011 at 11:12 +0100, "Martin v. Löwis" wrote:
> There are cases where there is no real "transfer", in the sense in which
> you are using the word. For example, with NFS, you can access the very
> same file simultaneously on two systems, with no file name conversion
> (unless you are using NFSv4, and unless your NFSv4 implementations
> support the UTF-8 mandate in NFS well).

Python encodes the module name to the locale encoding to create a
filename. If the locale encoding is not the encoding used on the NFS
server, it doesn't work, but I don't think that Python has to work around
this issue. If a user plays with non-ASCII module names, (s)he has to
understand that (s)he will have to fight badly configured
systems and tools unable to handle Unicode correctly. We might warn
him/her in the documentation.

If NFSv3 doesn't reencode filenames for each client and the clients
don't reencode filenames, all clients have to use the same locale
encoding as the server. Otherwise, I don't see how it can work.

> Also, if two users of the same machine have different locale settings,
> the same file name might be interpreted differently.

Apart from Mac OS X and Windows, no kernel supports Unicode, so all users
of the same computer have to use the same locale encoding, or they will
not be able to share non-ASCII filenames.

--

Again, I don't think that Python should do anything special to
work around these issues.

(Hardcoding the module filename encoding to UTF-8 doesn't work, for all the
reasons explained in other emails.)

Victor



Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-26 Thread Nick Coghlan
On Wed, Jan 26, 2011 at 11:50 AM, Dj Gilcrease  wrote:
> On Tue, Jan 25, 2011 at 5:43 PM, M.-A. Lemburg  wrote:
>> I also don't see how this could save a lot of memory. As an example
>> take a French text with say 10mio code points. This would end up
>> appearing in memory as 3 copies on Windows: one copy stored as UCS2 (20MB),
>> one as Latin-1 (10MB) and one as UTF-8 (probably around 15MB, depending
>> on how many accents are used). That's a saving of -10MB compared to
>> today's implementation :-)
>
> If I am reading the pep right, which I may not be as I am no expert on
> unicode, the new implementation would actually give a 10MB saving
> since the wchar field is optional, so only the str (Latin-1) and utf8
> fields would need to be stored. How it decides not to store one field
> or another would need to be clarified in the pep, if I am right.

The PEP actually does define that already:

PyUnicode_AsUTF8 populates the utf8 field of the existing string,
while PyUnicode_AsUTF8String creates a *new* string with that field
populated.

PyUnicode_AsUnicode will populate the wstr field (but doing so
generally shouldn't be necessary).

For a UCS4 build, my reading of the PEP puts the memory savings for a
100 code point string as follows:

Current size: 400 bytes (regardless of max code point)

New initial size (max code point < 256): 100 bytes (75% saving)
New initial size (max code point < 65536): 200 bytes (50% saving)
New initial size (max code point >= 65536): 400 bytes (no saving)

For each of the "new" strings, they may consume additional storage if
the utf8 or wstr fields get populated. The maximum possible size would
be a UCS4 string (max code point >= 65536) on a sizeof(wchar_t) == 2
system with the utf8 string populated. In such cases, you would
consume at least 700 bytes, plus whatever additional memory is needed
to encode the non-BMP characters into UTF-8 and UTF-16.
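Nick's arithmetic is easy to check mechanically. A sketch of the PEP's sizing rule for the character payload only (the fixed object header and the optional utf8/wstr fields are ignored; the helper name is mine):

```python
def pep393_payload(s):
    # Bytes needed for the character data under PEP 393's
    # narrowest-sufficient-kind rule: 1 byte per char if every code
    # point fits in Latin-1, 2 if everything is in the BMP, else 4.
    maxchar = max(map(ord, s), default=0)
    width = 1 if maxchar < 256 else 2 if maxchar < 0x10000 else 4
    return width * len(s)

# A 100-char Latin-1 string needs 100 bytes, a 100-char BMP string
# 200 bytes, and a 100-char astral string 400 bytes.
```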

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] [Python-checkins] r88197 - python/branches/py3k/Lib/email/generator.py

2011-01-26 Thread Nick Coghlan
On Wed, Jan 26, 2011 at 7:57 PM, Victor Stinner
 wrote:
> I was stupid to not run at least test_email, sorry. And no, I didn't ask
> for a review, because I thought that such minor change cannot be
> harmful.

During the RC period, *everything* that touches the code base should
be reviewed by a second committer before checkin, and sanctioned by
the RM as well. This applies even for apparently trivial changes.

Docs checkins are slightly less strict (especially Raymond finishing
off the What's New), but even there it's preferable to be cautious in
the run up to a final release.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-26 Thread Paul Moore
On 26 January 2011 12:30, Nick Coghlan  wrote:
> The PEP actually does define that already:
>
> PyUnicode_AsUTF8 populates the utf8 field of the existing string,
> while PyUnicode_AsUTF8String creates a *new* string with that field
> populated.
>
> PyUnicode_AsUnicode will populate the wstr field (but doing so
> generally shouldn't be necessary).

AIUI, another point is that the PEP deprecates the use of the calls
that populate the utf8 and wstr fields, in favour of the calls that
expect the caller to manage the extra memory (PyUnicode_AsUTF8String
rather than PyUnicode_AsUTF8, ??? rather than PyUnicode_AsUnicode). So
in the long term, the extra fields should never be populated -
although this could take some time as extensions have to be recoded.
Ultimately, the extra fields and older APIs could even be removed.

So any space cost (which I concede could be non-trivial in some cases)
is expected to be short-term.

Paul.


Re: [Python-Dev] Import and unicode: part two

2011-01-26 Thread James Y Knight
On Jan 26, 2011, at 4:40 AM, Victor Stinner wrote:
> During
> Python 3.2 development, we tried to be able to use a filesystem encoding
> different than the locale encoding (PYTHONFSENCODING environment
> variable): but it doesn't work simply because Python is not alone in the
> OS. Except Python, all programs speak the same "language": the locale
> encoding. Let's try to give you an example: if create a module with a
> name encoded to UTF-8, your file browser will display mojibake.

Is that really true? I'm pretty sure GTK+ treats all filenames as UTF-8 no
matter what the locale says (overridable by G_FILENAME_ENCODING or
G_BROKEN_FILENAMES).

James


Re: [Python-Dev] Import and unicode: part two

2011-01-26 Thread Victor Stinner
On Wednesday 26 January 2011 at 08:24 -0500, James Y Knight wrote:
> On Jan 26, 2011, at 4:40 AM, Victor Stinner wrote:
> > During
> > Python 3.2 development, we tried to be able to use a filesystem encoding
> > different than the locale encoding (PYTHONFSENCODING environment
> > variable): but it doesn't work simply because Python is not alone in the
> > OS. Except Python, all programs speak the same "language": the locale
> > encoding. Let's try to give you an example: if create a module with a
> > name encoded to UTF-8, your file browser will display mojibake.
> 
> Is that really true? I'm pretty sure GTK+ treats all filenames as
> UTF-8 no matter what the locale says. (over-rideable by
> G_FILENAME_ENCODING or G_BROKEN_FILENAMES)

Not exactly. Gtk+ uses the glib library, and to encode/decode filenames,
the glib library uses:

 - UTF-8 on Windows
 - G_FILENAME_ENCODING environment variable if set (comma-separated list
of encodings)
 - UTF-8 if G_BROKEN_FILENAMES env var is set
 - or the locale encoding

glib has no type to store a filename: a filename is a raw byte string
(char*). It has a nice function to work around mojibake issues,
g_filename_display_name(). This function tries to decode the filename
with each encoding in the filename encoding list; if all decodings
fail, it uses UTF-8 and escapes the undecodable bytes.

So yes, if you set G_FILENAME_ENCODING you can fix mojibake issues. But
you have to pass the raw byte filenames to other libraries and
programs.
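The fallback g_filename_display_name() implements can be approximated in Python (a sketch: the encoding tuple plays the role of G_FILENAME_ENCODING, and the escaping style differs from glib's, which substitutes rather than backslash-escapes):

```python
def display_name(raw, encodings=("utf-8", "shift_jis")):
    # Try each candidate encoding in turn; if every decode fails,
    # force UTF-8 and escape the undecodable bytes so *something*
    # displayable always comes back.
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", "backslashreplace")
```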

The problem with PYTHONFSENCODING is that sys.getfilesystemencoding() is
not only used for the filenames, but also for the command line arguments
and the environment variables.

For more information about glib, see g_filename_to_utf8(),
g_filename_display_name() and g_get_filename_charsets() documentation:

http://library.gnome.org/devel/glib/2.26/glib-Character-Set-Conversion.html

Victor



Re: [Python-Dev] [Python-checkins] r88197 - python/branches/py3k/Lib/email/generator.py

2011-01-26 Thread Brett Cannon
On Wed, Jan 26, 2011 at 04:34, Nick Coghlan  wrote:
> On Wed, Jan 26, 2011 at 7:57 PM, Victor Stinner
>  wrote:
>> I was stupid to not run at least test_email, sorry. And no, I didn't ask
>> for a review, because I thought that such minor change cannot be
>> harmful.
>
> During the RC period, *everything* that touches the code base should
> be reviewed by a second committer before checkin, and sanctioned by
> the RM as well. This applies even for apparently trivial changes.

Especially as this is not the first slip-up; Raymond had a
copy-and-paste slip that broke the buildbots. Luckily he was in
#python-dev when it happened, and it was noticed fast enough that he fixed
it in under a minute.

So yes, even stuff we would all consider minor **must** have a review.
Time to update the devguide I think.

-Brett


Re: [Python-Dev] [Python-checkins] r88197 - python/branches/py3k/Lib/email/generator.py

2011-01-26 Thread Georg Brandl
On 26.01.2011 10:57, Victor Stinner wrote:
> Hi,
> 
> On Tuesday 25 January 2011 at 18:07 -0800, Brett Cannon wrote:
>> This broke the buildbots (R. David Murray thinks you may have
>> forgotten to call super() in the 'payload is None' branch). Are you
>> getting code reviews and fully running the test suite before
>> committing? We are in RC.
>> (...)
>> > -        if _has_surrogates(msg._payload):
>> > -            self.write(msg._payload)
>> > +        payload = msg.get_payload()
>> > +        if payload is None:
>> > +            return
>> > +        if _has_surrogates(payload):
>> > +            self.write(payload)
> 
> I didn't realize that such minor change can do anything harmful:

That's why the rule is that *every change needs to be reviewed*, not
*every change that doesn't look harmful needs to be reviewed*.

(This is true only for code changes, of course.  Doc changes rarely have
hidden bugs, nor are they embarrassing when a bug slips into the release.
And I get the "test suite" (building the docs) results twice a day and
can fix problems myself.)

> the
> parent method (Generator._handle_text) has exactly the same test. If
> msg._payload is None, call the parent method with None does nothing. But
> _has_surrogates() doesn't support None.
> 
> The problem is not the test of None, but replacing msg._payload by
> msg.get_payload(). I thought that get_payload() was a dummy getter
> reading self._payload, but I was completely wrong :-)
>
> I was stupid to not run at least test_email, sorry. And no, I didn't ask
> for a review, because I thought that such minor change cannot be
> harmful.

I hope you know better now :)  *Always* run the test suite *before* even
asking for review.

Georg



Re: [Python-Dev] Import and unicode: part two

2011-01-26 Thread James Y Knight

On Jan 26, 2011, at 11:47 AM, Victor Stinner wrote:
> Not exactly. Gtk+ uses the glib library, and to encode/decode filenames,
> the glib library uses:
> 
> - UTF-8 on Windows
> - G_FILENAME_ENCODING environment variable if set (comma-separated list
> of encodings)
> - UTF-8 if G_BROKEN_FILENAMES env var is set
> - or the locale encoding


But the documentation says:

> On Unix, the character sets are determined by consulting the environment 
> variables G_FILENAME_ENCODING and G_BROKEN_FILENAMES. On Windows, the 
> character set used in the GLib API is always UTF-8 and said environment 
> variables have no effect.
> 
> G_FILENAME_ENCODING may be set to a comma-separated list of character set 
> names. The special token "@locale" is taken to mean the character set for 
> the current locale. If G_FILENAME_ENCODING is not set, but G_BROKEN_FILENAMES 
> is, the character set of the current locale is taken as the filename 
> encoding. If neither environment variable is set, UTF-8 is taken as the 
> filename encoding, but the character set of the current locale is also put in 
> the list of encodings.

Which indicates to me that (unless you override the behavior with env vars) it 
encodes filenames in UTF-8 regardless of the locale, and attempts decoding 
primarily as UTF-8; only when a filename doesn't make sense as UTF-8 will it 
also try decoding it in the locale encoding.

James


Re: [Python-Dev] r88178 - python/branches/py3k/Lib/test/crashers/underlying_dict.py

2011-01-26 Thread Andreas Stührk
> It gets at the dict of a class, circumventing dictproxy. It's still unclear
> why it segfaults.

The crash, as well as the output "1", is caused by the fact that updating
the class dictionary directly doesn't invalidate the method cache.
When the new value for "f" is assigned to the dict, the old "f" gets
garbage collected (because the method cache uses borrowed references),
but there is still an entry in the cache for the (now
garbage-collected) function. When "a.f" is executed next, the entry of
the cache is used and a new method is created. When that method gets
called, it returns "1" and when the interpreter tries to garbage
collect the new method on interpreter finalization, it segfaults
because the referenced "f" is already collected.
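For reference, the usual way such crashers reach the real dict behind the read-only mappingproxy is through the garbage collector's referent list. This sketch shows only the access step (the segfault itself depended on the 2011-era borrowed-reference method cache and is not reproduced here):

```python
import gc

class A:
    def f(self):
        return 1

# A.__dict__ is a read-only mappingproxy; gc.get_referents() returns
# the objects it references -- i.e. the real dict it wraps -- so the
# proxy's write protection can be bypassed.
underlying = gc.get_referents(A.__dict__)[0]
```

Writing into `underlying` mutates the class without going through `type.__setattr__`, which is exactly why the interpreter's caches were never told anything changed.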

Regards,
Andreas


Re: [Python-Dev] Import and unicode: part two

2011-01-26 Thread Martin v. Löwis
> If NFSv3 doesn't reencode filenames for each client and the clients
> don't reencode filenames, all clients have to use the same locale
> encoding as the server. Otherwise, I don't see how it can work.

In practice, users accept that they get mojibake - their editors can
still open the files, and they can double-click them in a file browser
just fine. So it doesn't really need to work, and users can still use
it.

> Again, I don't think that Python should do anything special to
> workaround these issues.

I agree, and I'm certainly in favor of keeping the current code base.
Just make sure you understand the reasoning of those opposing.

Regards,
Martin


Re: [Python-Dev] Import and unicode: part two

2011-01-26 Thread Toshio Kuratomi
On Wed, Jan 26, 2011 at 11:12:02AM +0100, "Martin v. Löwis" wrote:
> On 26.01.2011 10:40, Victor Stinner wrote:
> > On Monday 24 January 2011 at 19:26 -0800, Toshio Kuratomi wrote:
> >> Why not locale:
> >> * Relying on locale is simply not portable. (...)
> >> * Mixing of modules from different locales won't work. (...)
> > 
> > I don't understand what you are talking about.
> 
> I think by "portability", he means "moving files from one computer to
> another". He argues that if Python would mandate UTF-8 for all file
> names on Unix, moving files in such a way would support portability,
> whereas using the locale's filename encoding might not (if the locale uses a
> different charset on the target system).
> 
> While this is technically true, I don't think it's a helpful way of
> thinking: by mandating that file names are UTF-8 when accessed from
> Python, we make the actual files inaccessible on both the source and
> the target system.
> 
> > I don't understand the relation between the local filesystem encoding
> > and the portability. I suppose that you are talking about the
> > distribution of a module to other computers. Here the question is how
> > the filenames are stored during the transfer. The user is free to use
> > any tool, and try to find a tool handling Unicode correctly :-) But it's
> > no more the Python problem.
> 
> There are cases where there is no real "transfer", in the sense in which
> you are using the word. For example, with NFS, you can access the very
> same file simultaneously on two systems, with no file name conversion
> (unless you are using NFSv4, and unless your NFSv4 implementations
> support the UTF-8 mandate in NFS well).
> 
> Also, if two users of the same machine have different locale settings,
> the same file name might be interpreted differently.
> 
Thanks Martin, I think that you understand my view even if you don't share
it.

There's one further case that I am worried about that has no real
"transfer".  Since people here seem to think that unicode module names are
the future (for instance, the comments about redefining the C locale to
include utf-8 and the comments about archiving tools needing to support
encoding bits), there are eventually going to be unicode modules that become
dependencies of other modules and programs.  These will need to be installed
on systems.  Linux distributions that ship these will need to choose
a filesystem encoding for the filenames of these.  Likely the sensible thing
for them to do is to use utf-8 since all the ones I can think of default to
utf-8.  But, as Stephen and Victor have pointed out, users change their
locale settings to things that aren't utf-8 and save their modules using
filenames in that encoding.  When they update their OS to a version that has
utf-8 python module names, they will find that they have to make a choice.
They can either change their locale settings to a utf-8 encoding and have
the system installed modules work or they can leave their encoding on their
non-utf-8 encoding and have the modules that they've created on-site work.

This is not a good position to put users of these systems in.

-Toshio




Re: [Python-Dev] Import and unicode: part two

2011-01-26 Thread Neil Hodgson
Toshio Kuratomi:

> When they update their OS to a version that has
> utf-8 python module names, they will find that they have to make a choice.
> They can either change their locale settings to a utf-8 encoding and have
> the system installed modules work or they can leave their encoding on their
> non-utf-8 encoding and have the modules that they've created on-site work.

   When switching to a UTF-8 locale, they can also change the file
names of their modules to be encoded in UTF-8. It would be fairly easy
to write a script that identifies non-ASCII file names in a directory
and offers to transcode their names from their current encoding to
UTF-8.
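Such a script could be as simple as the following sketch, where the source encoding (shift_jis here) is an assumption the operator must supply, and the function name is mine:

```python
import os

def transcode_names(directory, from_enc="shift_jis"):
    # Rename any non-ASCII entries from the old locale encoding to
    # UTF-8.  Byte-string APIs are used throughout so the current
    # locale never gets a chance to mangle the names.
    raw_dir = os.fsencode(directory)
    for raw in os.listdir(raw_dir):
        try:
            raw.decode("ascii")
            continue                      # pure ASCII: nothing to do
        except UnicodeDecodeError:
            pass
        new = raw.decode(from_enc).encode("utf-8")
        os.rename(os.path.join(raw_dir, raw),
                  os.path.join(raw_dir, new))
```

A production version would, as Neil says, *offer* each rename interactively rather than applying it blindly.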

   Neil


Re: [Python-Dev] Import and unicode: part two

2011-01-26 Thread Glenn Linderman

On 1/26/2011 4:47 PM, Toshio Kuratomi wrote:

There's one further case that I am worried about that has no real
"transfer".  Since people here seem to think that unicode module names are
the future (for instance, the comments about redefining the C locale to
include utf-8 and the comments about archiving tools needing to support
encoding bits), there are eventually going to be unicode modules that become
dependencies of other modules and programs.  These will need to be installed
on systems.  Linux distributions that ship these will need to choose
a filesystem encoding for the filenames of these.  Likely the sensible thing
for them to do is to use utf-8 since all the ones I can think of default to
utf-8.  But, as Stephen and Victor have pointed out, users change their
locale settings to things that aren't utf-8 and save their modules using
filenames in that encoding.  When they update their OS to a version that has
utf-8 python module names, they will find that they have to make a choice.
They can either change their locale settings to a utf-8 encoding and have
the system installed modules work or they can leave their encoding on their
non-utf-8 encoding and have the modules that they've created on-site work.

This is not a good position to put users of these systems in.


The way this case should work is that programs that install files 
(installation is a form of transfer) should transform their names from 
the encoding used in the transfer medium to the encoding of the 
filesystem on which they are installed.


Python3 should access the files, transforming the names from the 
encoding of the filesystem on which they are installed to Unicode for 
use by the program.
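Concretely, Python 3.2's os module exposes that transformation as a
round trip; a small sketch (the example bytes are arbitrary):

```python
import os

# Python 3 decodes file names with the filesystem encoding plus the
# surrogateescape error handler (PEP 383), so even bytes that are
# invalid in that encoding survive a round trip.
raw = b"caf\xe9"          # latin-1 bytes, invalid as UTF-8
name = os.fsdecode(raw)   # bytes -> str (undecodable bytes become surrogates)
back = os.fsencode(name)  # str -> bytes, restoring the originals
assert back == raw
```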


I think Python3 is trying to do its part, and Victor is trying to make 
that more robust on more platforms, specifically Windows.


The programs that install files, which may include programs that 
install Python files (I don't know), may or may not be doing their 
part, but clearly there are cases where they do not.


Systems that have different encodings for names on the same or different 
file systems need to have a way to obtain the encoding for the file 
names, so they can be properly decoded.  If they don't have such a way, 
they are broken.


=
The rest of this is an attempt to describe the problem of Linux and 
other systems which use byte strings instead of character strings as 
file names.  No problem, as long as programs allow byte strings as file 
names.  Python3 does not, for the import statement, thus the problem is 
relevant for discussion here, as has been ongoing.

=

Since file names are defined to be byte strings, there is no way to 
obtain the encoding for file names, so they cannot always be decoded, 
and are sometimes decoded improperly, because no one knows which 
encoding was used to create them, _if any_.


Hence, Linux programs that use character strings as file names 
internally and expect them to match the byte strings in the file system 
are promoting a fiction: that there is a transformation (encoding) from 
character strings to byte strings that will match.


When using ASCII character strings, they can be transformed to bytes 
using a simple transformation: identity... but that isn't necessarily 
correct, if the files were created using EBCDIC (unlikely on Linux 
systems, but not impossible, since Linux files are byte strings).
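For example, Python's cp500 codec (an EBCDIC variant) shows that
identity is only a convention:

```python
# "A" is 0x41 in ASCII but 0xC1 in EBCDIC (here Python's cp500 codec),
# so the identity transformation from characters to bytes is only a
# convention, not a law.
ascii_bytes = "ABC".encode("ascii")
ebcdic_bytes = "ABC".encode("cp500")
assert ascii_bytes == b"ABC"
assert ebcdic_bytes == b"\xc1\xc2\xc3"
# Reading the EBCDIC bytes as if they were an ASCII superset is mojibake:
assert ebcdic_bytes.decode("latin-1") == "\xc1\xc2\xc3"  # "ÁÂÃ"
```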


When using non-ASCII character strings, the fiction promoted is even 
bigger, and the transformation even harder.  Any 8-bit character 
encoding can pretend that identity is the correct transformation, but 
the result is mojibake if it isn't.  Unicode and other multi-byte 
encodings have an even harder job, because there can be 8-bit sequences 
that are not legal for some transformations, but are legal for others.  
This is when the fiction is exposed!
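A small demonstration, using Shift JIS bytes that happen to be illegal
UTF-8 (the example text is arbitrary):

```python
# The same byte string can be legal in one encoding and illegal in
# another, which is what exposes the fiction.
data = "文字".encode("shift_jis")   # b'\x95\xb6\x8e\x9a'
assert data.decode("shift_jis") == "文字"
try:
    data.decode("utf-8")
except UnicodeDecodeError:
    pass  # 0x95 cannot start a UTF-8 sequence
else:
    raise AssertionError("expected a UnicodeDecodeError")
# An 8-bit encoding happily pretends identity works -- mojibake:
mojibake = data.decode("latin-1")
assert len(mojibake) == 4 and "¶" in mojibake
```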


As the recent description of glib points out, when the file names are 
read as bytes and shown to the user for selection, possibly using some 
mojibake-generating transformation to characters, the user has a 
fighting chance to pick the right file; less of a chance if the 
transformation is lossy ('?' substitutions, etc.) and/or the names 
become ambiguous in the characters that survive the transformation.
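A sketch of the difference between a lossy display transformation and a
round-trippable one (the file name bytes are invented):

```python
raw = b"my-\x8e\x9a-file.txt"   # bytes in some unknown encoding

# Lossy: replacement characters destroy the distinguishing bytes, so
# two different names can collapse into the same display string.
lossy = raw.decode("ascii", errors="replace")
assert "\ufffd" in lossy

# Round-trippable: surrogateescape keeps the original bytes recoverable,
# giving the user (and the program) a fighting chance.
kept = raw.decode("ascii", errors="surrogateescape")
assert kept.encode("ascii", errors="surrogateescape") == raw
```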


However, when the specification of the name is in characters (such as 
for Python import, or file names specified as character constants in 
any application system that provides/permits such), and there are large 
numbers of transformations that could be used to convert characters to 
bytes, the problem is harder and more error-prone... programs that want 
to promote the fiction of using characters for filenames must work 
harder.  It seems that Python on Linux is such a program.


One technique is to have conventions agreed on by applications and 
users to limit the number of encodings used on a particular system to 
one (optimal) or a few; the latter requires understanding that files 
created in one encoding may not be accessible by systems that use a 
different encoding.

Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-26 Thread Gregory P. Smith
On Mon, Jan 24, 2011 at 3:20 PM, Antoine Pitrou  wrote:
> Le mardi 25 janvier 2011 à 00:07 +0100, "Martin v. Löwis" a écrit :
>> >> I'd like to propose PEP 393, which takes a different approach,
>> >> addressing both problems simultaneously: by getting a flexible
>> >> representation (one that can be either 1, 2, or 4 bytes), we can
>> >> support the full range of Unicode on all systems, but still use
>> >> only one byte per character for strings that are pure ASCII (which
>> >> will be the majority of strings for the majority of users).
>> >
>> > For this kind of experiment, I think a concrete attempt at implementing
>> > (together with performance/memory savings numbers) would be much more
>> > useful than an abstract proposal.
>>
>> I partially agree. An implementation is certainly needed, but there is
>> nothing wrong (IMO) with designing the change before implementing it.
>> Also, several people have offered to help with the implementation, so
>> we need to agree on a specification first (which is actually cheaper
>> than starting with the implementation only to find out that people
>> misunderstood each other).
>
> I'm not sure it's really cheaper. When implementing you will probably
> find out that it makes more sense to change the meaning of some fields,
> add or remove some, etc. You will also want to try various tweaks since
> the whole point is to lighten the footprint of unicode strings in common
> workloads.

Yep.  This is only a proposal, an implementation will allow all of
that to be experimented with.

I have frequently seen code, even in Python 2.x, that suffers
greatly from unicode vs. str use (due to APIs in some code that were
returning unicode objects unnecessarily when the data was really all
ascii text).  Python 3.x only increases this, as the default for so
many things passes through unicode even for programs that may not need
it.

>
> So, the only criticism I have, intuitively, is that the unicode
> structure seems to become a bit too large. For example, I'm not sure you
> need a generic (pointer, size) pair in addition to the
> representation-specific ones.

I believe the intent this PEP is aiming at is for the existing
in-memory structure to be compatible with already-compiled binary
extension modules, without having to recompile them or change the APIs
they are using.

Personally I don't care at all about preserving that level of binary
compatibility, it has been convenient in the past but is rarely the
right thing to do.  Of course I'd personally like to see PyObject
nuked and revisited, it is too large and is probably not cache line
efficient.

>
> Incidentally, to slightly reduce the overhead the unicode objects,
> there's this proposal: http://bugs.python.org/issue1943

Interesting.  But that aims more at cpu performance than memory
overhead.  What I see is programs that predominantly process ascii
data yet waste memory on a 2-4x data explosion of the internal
representation.  This PEP aims to address that larger target.
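For illustration, the width selection the PEP describes can be modeled
in a few lines (this is a sketch of the rule, not the C implementation;
the function name is made up):

```python
def pep393_bytes_per_char(s):
    """Width a PEP-393-style build would pick for s: 1, 2, or 4
    bytes per character, chosen from the widest code point present."""
    widest = max(map(ord, s), default=0)
    if widest < 0x100:
        return 1   # Latin-1 range, which includes all pure-ASCII strings
    if widest < 0x10000:
        return 2   # Basic Multilingual Plane
    return 4       # astral code points

assert pep393_bytes_per_char("just ascii") == 1
assert pep393_bytes_per_char("naïve") == 1    # Latin-1 still fits in 1 byte
assert pep393_bytes_per_char("文字") == 2
assert pep393_bytes_per_char("𐍈") == 4        # U+10348, outside the BMP
```

This is exactly the "2-4x data explosion" above: a UCS-2 or UCS-4 build
pays 2 or 4 bytes per character even for the pure-ASCII case in the
first assert.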

-gps