Re: [Python-Dev] Import and unicode: part two
Toshio Kuratomi writes:

> Sure ... but with these systems, neither read-modules-as-locale or
> read-modules-as-utf-8 are a good solution to work, correct?

Good solution, no, but I believe that read-modules-as-locale *should* work to a great extent. AFAIK Python 3 reads Python programs as str (i.e., converting to Unicode -- if it doesn't, it *should*).

> Especially if the OS does get upgraded but the filesystems with
> user data (and user created modules) are migrated as-is, you'll run
> into situations where system installed modules are in utf-8 and
> user created modules are shift-jis and so something will always be
> broken.

I don't know what you mean by "system-installed modules". If you're talking about Python itself, it's not a problem. Python doesn't have any Japanese-named modules in any encoding. On the other hand, *everything* that involves scripting (shell scripts, make, etc.) related to those filesystems will be broken *unless* the system, after upgrade but before going live, is converted to have an appropriate locale encoding. So I don't really see a problem here. The problem is portability across systems, and that is a problem that only the third-party transports can really deal with. tar and unzip need to be taught how to change file names to the locale, etc.

> The only way to make sure that modules work is to restrict them to
> ASCII-only on the filesystem. But because unicode module names are seen
> as a necessary feature, the question is which way forward is going to
> lead to the least brokenness. Which could be locale... but from the
> python2 locale-related bugs that I get to look at, I doubt.

AFAICS this is going to be site-specific. End of story. Or, if you prefer, "maru-nage". IMHO, Python 2 locale bugs are unlikely to be a good guide to Python 3 locale bugs, because in Python 2 most people just ignore locale and use "native" strings (~= bytes in Python 3), and that typically "just works".
In Python 3 that just *doesn't* work any more because you get a UnicodeError on import, etc, etc. IMHO, YMMV, and all that. I know *of* such systems (there remain quite a few here used by student and research labs), but the ones I maintain were easy to convert to UTF-8 because I don't export file systems (except my private files for my own use); everything is mediated by Apache and Zope, and browsers are happy to cope if I change from EUC-JP to UTF-8 and then flip the Apache switch to change default encodings. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Import and unicode: part two
On Monday, January 24, 2011 at 19:26 -0800, Toshio Kuratomi wrote:

> Why not locale:
> * Relying on locale is simply not portable. (...)
> * Mixing of modules from different locales won't work. (...)

I don't understand what you are talking about. When you import a module, the module name becomes a filename. On Windows, you can reuse the Unicode name directly as a filename. On the other OSes, you have to encode the name to the filesystem encoding. During Python 3.2 development, we tried to be able to use a filesystem encoding different from the locale encoding (the PYTHONFSENCODING environment variable), but it doesn't work, simply because Python is not alone in the OS. Apart from Python, all programs speak the same "language": the locale encoding. To give you an example: if you create a module with a name encoded in UTF-8, your file browser will display mojibake.

I don't understand the relation between the local filesystem encoding and portability. I suppose that you are talking about the distribution of a module to other computers. Here the question is how the filenames are stored during the transfer. The user is free to use any tool, and to try to find a tool handling Unicode correctly :-) But it's no longer Python's problem. Each computer uses a different locale encoding. You have to use it to cooperate with other programs and avoid mojibake. But I don't understand why you write that "Mixing of modules from different locales won't work". If you use a tool storing filenames in your locale encoding (e.g. the TAR file format... and sometimes the ZIP format), the problem comes from your tool and you should use another tool.

I created http://bugs.python.org/issue10972 to work around ZIP tools that assume ZIP files use the locale encoding instead of cp437: this issue adds an option to force the usage of the Unicode flag (and so store filenames as UTF-8). Even if, initially, I created the issue to work around a bootstrap issue (#10955).
Victor
Re: [Python-Dev] [Python-checkins] r88197 - python/branches/py3k/Lib/email/generator.py
Hi,

On Tuesday, January 25, 2011 at 18:07 -0800, Brett Cannon wrote:

> This broke the buildbots (R. David Murray thinks you may have
> forgotten to call super() in the 'payload is None' branch). Are you
> getting code reviews and fully running the test suite before
> committing? We are in RC.
> (...)
> > -        if _has_surrogates(msg._payload):
> > -            self.write(msg._payload)
> > +        payload = msg.get_payload()
> > +        if payload is None:
> > +            return
> > +        if _has_surrogates(payload):
> > +            self.write(payload)

I didn't realize that such a minor change could do anything harmful: the parent method (Generator._handle_text) has exactly the same test. If msg._payload is None, calling the parent method with None does nothing. But _has_surrogates() doesn't support None.

The problem is not the test for None, but replacing msg._payload with msg.get_payload(). I thought that get_payload() was a dummy getter reading self._payload, but I was completely wrong :-)

I was stupid not to run at least test_email, sorry. And no, I didn't ask for a review, because I thought that such a minor change could not be harmful.

FYI the commit is related indirectly to #9124 (Mailbox module should use binary I/O, not text I/O).

Victor
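The pitfall under discussion — an override whose `payload is None` branch returns early instead of delegating, silently skipping the parent's handling — can be sketched like this. The class and method names below are illustrative only, not the actual Lib/email/generator.py code:

```python
# Simplified sketch of the override pattern discussed in this thread.
# BaseGenerator/BytesGenerator are hypothetical stand-ins for
# email.generator.Generator and its subclass.

class BaseGenerator:
    def __init__(self):
        self.out = []

    def handle_text(self, msg):
        payload = msg.get("payload")
        if payload is None:
            self.out.append("<empty>")   # parent has None-handling of its own
            return
        self.out.append(payload)

class BytesGenerator(BaseGenerator):
    def handle_text(self, msg):
        payload = msg.get("payload")
        if payload is None:
            # The bug pattern: a bare `return` here would skip the
            # parent's None-handling entirely; delegating preserves it.
            return super().handle_text(msg)
        if payload.startswith("raw:"):
            self.out.append(payload[4:])  # subclass-specific path
            return
        super().handle_text(msg)

g = BytesGenerator()
g.handle_text({"payload": None})
g.handle_text({"payload": "raw:hello"})
print(g.out)  # → ['<empty>', 'hello']
```

With a bare `return` in the None branch, the parent's `<empty>` marker would never be written — the kind of behavior change only the test suite catches.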
Re: [Python-Dev] Import and unicode: part two
On 26.01.2011 10:40, Victor Stinner wrote:

> On Monday, January 24, 2011 at 19:26 -0800, Toshio Kuratomi wrote:
>> Why not locale:
>> * Relying on locale is simply not portable. (...)
>> * Mixing of modules from different locales won't work. (...)
>
> I don't understand what you are talking about.

I think by "portability", he means "moving files from one computer to another". He argues that if Python would mandate UTF-8 for all file names on Unix, moving files in such a way would support portability, whereas using the locale's filename might not (if the locale uses a different charset on the target system).

While this is technically true, I don't think it's a helpful way of thinking: by mandating that file names are UTF-8 when accessed from Python, we make the actual files inaccessible on both the source and the target system.

> I don't understand the relation between the local filesystem encoding
> and the portability. I suppose that you are talking about the
> distribution of a module to other computers. Here the question is how
> the filenames are stored during the transfer. The user is free to use
> any tool, and try to find a tool handling Unicode correctly :-) But it's
> no more the Python problem.

There are cases where there is no real "transfer", in the sense in which you are using the word. For example, with NFS, you can access the very same file simultaneously on two systems, with no file name conversion (unless you are using NFSv4, and unless your NFSv4 implementations support the UTF-8 mandate in NFS well).

Also, if two users of the same machine have different locale settings, the same file name might be interpreted differently.

Regards,
Martin
Re: [Python-Dev] Import and unicode: part two
On Wed, Jan 26, 2011 at 11:12:02AM +0100, "Martin v. Löwis" wrote:

> There are cases where there is no real "transfer", in the sense in which
> you are using the word. For example, with NFS, you can access the very
> same file simultaneously on two systems, with no file name conversion
> (unless you are using NFSv4, and unless your NFSv4 implementations
> support the UTF-8 mandate in NFS well).
>
> Also, if two users of the same machine have different locale settings,
> the same file name might be interpreted differently.

I have a solution for all these problems, with a price, of course. Let's use utf8+base64. Base64 uses a very restricted subset of ASCII, and filenames will never be misinterpreted whatever the filesystem encoding may be. The price is that users lose standard OS tools like ls and find. I am partially joking, of course, but only partially.

Oleg.
--
Oleg Broytman  http://phdru.name/  p...@phdru.name
Programmers don't die, they just GOSUB without RETURN.
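Oleg's half-joking scheme is short to sketch: URL-safe Base64 keeps every stored name inside a filesystem-neutral ASCII subset (no `/`, no bytes above 127), at the cost of human readability. The helper names below are hypothetical:

```python
import base64

def encode_name(name):
    """Map an arbitrary Unicode filename to a pure-ASCII one."""
    return base64.urlsafe_b64encode(name.encode("utf-8")).decode("ascii")

def decode_name(stored):
    """Recover the original Unicode filename."""
    return base64.urlsafe_b64decode(stored.encode("ascii")).decode("utf-8")

stored = encode_name("módulo.py")
print(stored)                 # ASCII-only, but unreadable -- Oleg's "price"
print(decode_name(stored))    # → módulo.py
```

The round trip is lossless for any Unicode name, which is exactly why the idea works and exactly why `ls` output becomes useless.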
Re: [Python-Dev] Import and unicode: part two
On Wednesday, January 26, 2011 at 11:12 +0100, "Martin v. Löwis" wrote:

> There are cases where there is no real "transfer", in the sense in which
> you are using the word. For example, with NFS, you can access the very
> same file simultaneously on two systems, with no file name conversion
> (unless you are using NFSv4, and unless your NFSv4 implementations
> support the UTF-8 mandate in NFS well).

Python encodes the module name to the locale encoding to create a filename. If the locale encoding is not the encoding used on the NFS server, it doesn't work, but I don't think that Python has to work around this issue. If a user plays with non-ASCII module names, (s)he has to understand that (s)he will have to fight against badly configured systems and tools unable to handle Unicode correctly. We might warn him/her in the documentation.

If NFSv3 doesn't re-encode filenames for each client and the clients don't re-encode filenames, all clients have to use the same locale encoding as the server. Otherwise, I don't see how it can work.

> Also, if two users of the same machine have different locale settings,
> the same file name might be interpreted differently.

Except for Mac OS X and Windows, no kernel supports Unicode, and so all users of the same computer have to use the same locale encoding, or they will not be able to share non-ASCII filenames.

--

Again, I don't think that Python should do anything special to work around these issues. (Hardcoding the module filename encoding to UTF-8 doesn't work, for all the reasons explained in other emails.)

Victor
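The encoding step Victor describes can be sketched in a couple of lines — a str module name must become a byte file name via the filesystem encoding (derived from the locale on most Unix systems). `module_filename` is a hypothetical helper, not the real import machinery:

```python
import sys

def module_filename(modname):
    # Rough sketch of the import step discussed above: the on-disk name is
    # the module name plus ".py", encoded with the filesystem encoding.
    # surrogateescape mirrors how CPython round-trips undecodable bytes.
    return (modname + ".py").encode(sys.getfilesystemencoding(),
                                    "surrogateescape")

print(module_filename("spam"))  # → b'spam.py' under any ASCII-compatible locale
```

For a non-ASCII name like "日本語", the resulting bytes differ between a UTF-8 and an EUC-JP locale — which is the whole portability problem in this thread.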
Re: [Python-Dev] PEP 393: Flexible String Representation
On Wed, Jan 26, 2011 at 11:50 AM, Dj Gilcrease wrote:

> On Tue, Jan 25, 2011 at 5:43 PM, M.-A. Lemburg wrote:
>> I also don't see how this could save a lot of memory. As an example
>> take a French text with say 10 million code points. This would end up
>> appearing in memory as 3 copies on Windows: one copy stored as UCS2 (20MB),
>> one as Latin-1 (10MB) and one as UTF-8 (probably around 15MB, depending
>> on how many accents are used). That's a saving of -10MB compared to
>> today's implementation :-)
>
> If I am reading the pep right, which I may not be as I am no expert on
> unicode, the new implementation would actually give a 10MB saving
> since the wchar field is optional, so only the str (Latin-1) and utf8
> fields would need to be stored. How it decides not to store one field
> or another would need to be clarified in the pep if I am right.

The PEP actually does define that already: PyUnicode_AsUTF8 populates the utf8 field of the existing string, while PyUnicode_AsUTF8String creates a *new* string with that field populated. PyUnicode_AsUnicode will populate the wstr field (but doing so generally shouldn't be necessary).

For a UCS4 build, my reading of the PEP puts the memory savings for a 100 code point string as follows:

Current size: 400 bytes (regardless of max code point)
New initial size (max code point < 256): 100 bytes (75% saving)
New initial size (max code point < 65536): 200 bytes (50% saving)
New initial size (max code point >= 65536): 400 bytes (no saving)

For each of the "new" strings, they may consume additional storage if the utf8 or wstr fields get populated. The maximum possible size would be a UCS4 string (max code point >= 65536) on a sizeof(wchar_t) == 2 system with the utf8 string populated. In such cases, you would consume at least 700 bytes, plus whatever additional memory is needed to encode the non-BMP characters into UTF-8 and UTF-16.

Cheers,
Nick.
-- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
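Nick's table follows directly from the PEP's rule that the character payload uses 1, 2, or 4 bytes per code point depending on the largest code point in the string. A sketch of that arithmetic (payload only — it ignores the fixed struct header and any lazily populated utf8/wstr copies, and the function name is mine, not the PEP's):

```python
def pep393_payload(text):
    """Bytes for the character payload of `text` under PEP 393's
    width rule: 1 byte if all code points fit in latin-1, 2 if they
    fit in the BMP, 4 otherwise."""
    highest = max(map(ord, text), default=0)
    width = 1 if highest < 256 else 2 if highest < 65536 else 4
    return len(text) * width

print(pep393_payload("a" * 100))          # → 100
print(pep393_payload("\u0100" * 100))     # → 200
print(pep393_payload("\U00010000" * 100)) # → 400
```

These match the 100/200/400-byte figures in the message above, versus a flat 400 bytes on a current UCS4 build.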
Re: [Python-Dev] [Python-checkins] r88197 - python/branches/py3k/Lib/email/generator.py
On Wed, Jan 26, 2011 at 7:57 PM, Victor Stinner wrote:

> I was stupid to not run at least test_email, sorry. And no, I didn't ask
> for a review, because I thought that such minor change cannot be
> harmful.

During the RC period, *everything* that touches the code base should be reviewed by a second committer before checkin, and sanctioned by the RM as well. This applies even for apparently trivial changes. Docs checkins are slightly less strict (especially Raymond finishing off the What's New), but even there it's preferable to be cautious in the run up to a final release.

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
Re: [Python-Dev] PEP 393: Flexible String Representation
On 26 January 2011 12:30, Nick Coghlan wrote:

> The PEP actually does define that already:
>
> PyUnicode_AsUTF8 populates the utf8 field of the existing string,
> while PyUnicode_AsUTF8String creates a *new* string with that field
> populated.
>
> PyUnicode_AsUnicode will populate the wstr field (but doing so
> generally shouldn't be necessary).

AIUI, another point is that the PEP deprecates the use of the calls that populate the utf8 and wstr fields, in favour of the calls that expect the caller to manage the extra memory (PyUnicode_AsUTF8String rather than PyUnicode_AsUTF8, ??? rather than PyUnicode_AsUnicode). So in the long term, the extra fields should never be populated - although this could take some time as extensions have to be recoded. Ultimately, the extra fields and older APIs could even be removed. So any space cost (which I concede could be non-trivial in some cases) is expected to be short-term.

Paul.
Re: [Python-Dev] Import and unicode: part two
On Jan 26, 2011, at 4:40 AM, Victor Stinner wrote:

> During Python 3.2 development, we tried to be able to use a filesystem
> encoding different than the locale encoding (PYTHONFSENCODING
> environment variable): but it doesn't work simply because Python is not
> alone in the OS. Except Python, all programs speak the same "language":
> the locale encoding. Let's try to give you an example: if create a
> module with a name encoded to UTF-8, your file browser will display
> mojibake.

Is that really true? I'm pretty sure GTK+ treats all filenames as UTF-8 no matter what the locale says. (over-rideable by G_FILENAME_ENCODING or G_BROKEN_FILENAMES)

James
Re: [Python-Dev] Import and unicode: part two
On Wednesday, January 26, 2011 at 08:24 -0500, James Y Knight wrote:

> On Jan 26, 2011, at 4:40 AM, Victor Stinner wrote:
>> During Python 3.2 development, we tried to be able to use a filesystem
>> encoding different than the locale encoding (PYTHONFSENCODING
>> environment variable): but it doesn't work simply because Python is not
>> alone in the OS. Except Python, all programs speak the same "language":
>> the locale encoding. Let's try to give you an example: if create a
>> module with a name encoded to UTF-8, your file browser will display
>> mojibake.
>
> Is that really true? I'm pretty sure GTK+ treats all filenames as
> UTF-8 no matter what the locale says. (over-rideable by
> G_FILENAME_ENCODING or G_BROKEN_FILENAMES)

Not exactly. Gtk+ uses the glib library, and to encode/decode filenames, the glib library uses:

- UTF-8 on Windows
- the G_FILENAME_ENCODING environment variable if set (a comma-separated list of encodings)
- UTF-8 if the G_BROKEN_FILENAMES env var is set
- or the locale encoding

glib has no type to store a filename: a filename is a raw byte string (char*). It has a nice function to work around mojibake issues: g_filename_display_name(). This function tries to decode the filename from each encoding of the filename encoding list; if all decodings fail, it uses UTF-8 and escapes the undecodable bytes.

So yes, if you set G_FILENAME_ENCODING you can fix mojibake issues. But you have to pass the raw byte filenames to other libraries and programs. The problem with PYTHONFSENCODING is that sys.getfilesystemencoding() is not only used for filenames, but also for the command line arguments and the environment variables.
For more information about glib, see g_filename_to_utf8(), g_filename_display_name() and g_get_filename_charsets() documentation: http://library.gnome.org/devel/glib/2.26/glib-Character-Set-Conversion.html

Victor
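The lookup order documented for g_get_filename_charsets() (and quoted later in this thread) can be rendered as a small Python function. This is only an approximation of GLib's C code, written from the documented behavior, with an `environ` parameter added so it can be exercised without touching the real environment:

```python
import locale
import os
import sys

def filename_encodings(environ=None):
    """Approximate GLib's filename-encoding lookup order, per the
    g_get_filename_charsets() documentation. Not the real GLib code."""
    env = os.environ if environ is None else environ
    if sys.platform == "win32":
        return ["UTF-8"]                      # always UTF-8 on Windows
    loc = locale.getpreferredencoding(False)  # stand-in for the locale charset
    if "G_FILENAME_ENCODING" in env:
        # Comma-separated list; the "@locale" token means the locale charset.
        return [loc if cs == "@locale" else cs
                for cs in env["G_FILENAME_ENCODING"].split(",")]
    if "G_BROKEN_FILENAMES" in env:
        return [loc]
    return ["UTF-8", loc]  # UTF-8 primary, locale kept as a display fallback
```

Note that the "neither variable set" branch returns UTF-8 *first* — the point James makes in his follow-up, and the detail Victor's four-item summary glosses over.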
Re: [Python-Dev] [Python-checkins] r88197 - python/branches/py3k/Lib/email/generator.py
On Wed, Jan 26, 2011 at 04:34, Nick Coghlan wrote:

> On Wed, Jan 26, 2011 at 7:57 PM, Victor Stinner wrote:
>> I was stupid to not run at least test_email, sorry. And no, I didn't ask
>> for a review, because I thought that such minor change cannot be
>> harmful.
>
> During the RC period, *everything* that touches the code base should
> be reviewed by a second committer before checkin, and sanctioned by
> the RM as well. This applies even for apparently trivial changes.

Especially as this is not the first slip-up; Raymond had a copy-and-paste slip that broke the buildbots. Luckily he was in #python-dev when it happened and it was noticed fast enough that he fixed it in under a minute. So yes, even stuff we would all consider minor **must** have a review. Time to update the devguide, I think.

-Brett
Re: [Python-Dev] [Python-checkins] r88197 - python/branches/py3k/Lib/email/generator.py
On 26.01.2011 10:57, Victor Stinner wrote:

> Hi,
>
> On Tuesday, January 25, 2011 at 18:07 -0800, Brett Cannon wrote:
>> This broke the buildbots (R. David Murray thinks you may have
>> forgotten to call super() in the 'payload is None' branch). Are you
>> getting code reviews and fully running the test suite before
>> committing? We are in RC.
>> (...)
>>> -        if _has_surrogates(msg._payload):
>>> -            self.write(msg._payload)
>>> +        payload = msg.get_payload()
>>> +        if payload is None:
>>> +            return
>>> +        if _has_surrogates(payload):
>>> +            self.write(payload)
>
> I didn't realize that such a minor change could do anything harmful:

That's why the rule is that *every change needs to be reviewed*, not *every change that doesn't look harmful needs to be reviewed*.

(This is true only for code changes, of course. Doc changes rarely have hidden bugs, nor are they embarrassing when a bug slips into the release. And I get the "test suite" (building the docs) results twice a day and can fix problems myself.)

> the parent method (Generator._handle_text) has exactly the same test. If
> msg._payload is None, calling the parent method with None does nothing.
> But _has_surrogates() doesn't support None.
>
> The problem is not the test for None, but replacing msg._payload with
> msg.get_payload(). I thought that get_payload() was a dummy getter
> reading self._payload, but I was completely wrong :-)
>
> I was stupid not to run at least test_email, sorry. And no, I didn't ask
> for a review, because I thought that such a minor change could not be
> harmful.

I hope you know better now :) *Always* run the test suite *before* even asking for review.

Georg
Re: [Python-Dev] Import and unicode: part two
On Jan 26, 2011, at 11:47 AM, Victor Stinner wrote:

> Not exactly. Gtk+ uses the glib library, and to encode/decode filenames,
> the glib library uses:
>
> - UTF-8 on Windows
> - G_FILENAME_ENCODING environment variable if set (comma-separated list
>   of encodings)
> - UTF-8 if G_BROKEN_FILENAMES env var is set
> - or the locale encoding

But the documentation says:

> On Unix, the character sets are determined by consulting the environment
> variables G_FILENAME_ENCODING and G_BROKEN_FILENAMES. On Windows, the
> character set used in the GLib API is always UTF-8 and said environment
> variables have no effect.
>
> G_FILENAME_ENCODING may be set to a comma-separated list of character set
> names. The special token "@locale" is taken to mean the character set for
> the current locale. If G_FILENAME_ENCODING is not set, but
> G_BROKEN_FILENAMES is, the character set of the current locale is taken
> as the filename encoding. If neither environment variable is set, UTF-8
> is taken as the filename encoding, but the character set of the current
> locale is also put in the list of encodings.

Which indicates to me that (unless you override the behavior with env vars) it encodes filenames in UTF-8 regardless of the locale, and attempts decoding in UTF-8 primarily. And that only when the filename doesn't make sense in UTF-8, it will also try decoding it in the locale encoding.

James
Re: [Python-Dev] r88178 - python/branches/py3k/Lib/test/crashers/underlying_dict.py
> I gets to a dict of class circumventing dictproxy. It's yet unclear
> why it segfaults.

The crash, as well as the output "1", are both caused by the fact that updating the class dictionary directly doesn't invalidate the method cache. When the new value for "f" is assigned to the dict, the old "f" gets garbage collected (because the method cache uses borrowed references), but there is still an entry in the cache for the (now garbage-collected) function. When "a.f" is executed next, the entry of the cache is used and a new method is created. When that method gets called, it returns "1", and when the interpreter tries to garbage collect the new method on interpreter finalization, it segfaults because the referenced "f" is already collected.

Regards,
Andreas
Re: [Python-Dev] Import and unicode: part two
> If NFSv3 doesn't reencode filenames for each client and the clients
> don't reencode filenames, all clients have to use the same locale
> encoding than the server. Otherwise, I don't see how it can work.

In practice, users accept that they get mojibake - their editors can still open the files, and they can double-click them in a file browser just fine. So it doesn't really need to work, and users can still use it.

> Again, I don't think that Python should do anything special to
> workaround these issues.

I agree, and I'm certainly in favor of keeping the current code base. Just make sure you understand the reasoning of those opposing.

Regards,
Martin
Re: [Python-Dev] Import and unicode: part two
On Wed, Jan 26, 2011 at 11:12:02AM +0100, "Martin v. Löwis" wrote:

> Am 26.01.2011 10:40, schrieb Victor Stinner:
>> Le lundi 24 janvier 2011 à 19:26 -0800, Toshio Kuratomi a écrit :
>>> Why not locale:
>>> * Relying on locale is simply not portable. (...)
>>> * Mixing of modules from different locales won't work. (...)
>>
>> I don't understand what you are talking about.
>
> I think by "portability", he means "moving files from one computer to
> another". He argues that if Python would mandate UTF-8 for all file
> names on Unix, moving files in such a way would support portability,
> whereas using the locale's filename might not (if the locale use a
> different charset on the target system).
>
> While this is technically true, I don't think it's a helpful way of
> thinking: by mandating that file names are UTF-8 when accessed from
> Python, we make the actual files inaccessible on both the source and
> the target system.
>
>> I don't understand the relation between the local filesystem encoding
>> and the portability. I suppose that you are talking about the
>> distribution of a module to other computers. Here the question is how
>> the filenames are stored during the transfer. The user is free to use
>> any tool, and try to find a tool handling Unicode correctly :-) But it's
>> no more the Python problem.
>
> There are cases where there is no real "transfer", in the sense in which
> you are using the word. For example, with NFS, you can access the very
> same file simultaneously on two systems, with no file name conversion
> (unless you are using NFSv4, and unless your NFSv4 implementations
> support the UTF-8 mandate in NFS well).
>
> Also, if two users of the same machine have different locale settings,
> the same file name might be interpreted differently.

Thanks Martin, I think that you understand my view even if you don't share it. There's one further case that I am worried about that has no real "transfer".
Since people here seem to think that unicode module names are the future (for instance, the comments about redefining the C locale to include utf-8 and the comments about archiving tools needing to support encoding bits), there are eventually going to be unicode modules that become dependencies of other modules and programs. These will need to be installed on systems. Linux distributions that ship these will need to choose a filesystem encoding for the filenames of these. Likely the sensible thing for them to do is to use utf-8 since all the ones I can think of default to utf-8. But, as Stephen and Victor have pointed out, users change their locale settings to things that aren't utf-8 and save their modules using filenames in that encoding. When they update their OS to a version that has utf-8 python module names, they will find that they have to make a choice. They can either change their locale settings to a utf-8 encoding and have the system installed modules work or they can leave their encoding on their non-utf-8 encoding and have the modules that they've created on-site work. This is not a good position to put users of these systems in.

-Toshio
Re: [Python-Dev] Import and unicode: part two
Toshio Kuratomi:

> When they update their OS to a version that has utf-8 python module
> names, they will find that they have to make a choice. They can either
> change their locale settings to a utf-8 encoding and have the system
> installed modules work or they can leave their encoding on their
> non-utf-8 encoding and have the modules that they've created on-site
> work.

When switching to a UTF-8 locale, they can also change the file names of their modules to be encoded in UTF-8. It would be fairly easy to write a script that identifies non-ASCII file names in a directory and offers to transcode their names from their current encoding to UTF-8.

Neil
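The script Neil describes is indeed short to sketch. The function below is hypothetical: it walks a directory at the byte level, decodes each non-ASCII name under the old (claimed) encoding, and renames it to the UTF-8 spelling of the same text:

```python
import os

def transcode_names(directory, old_encoding, dry_run=True):
    """Find non-ASCII byte file names in `directory` and rename them
    from `old_encoding` to UTF-8. Returns (old, new) byte-name pairs."""
    dirpath = os.fsencode(directory)
    renamed = []
    for name in os.listdir(dirpath):
        if all(b < 128 for b in name):
            continue  # pure ASCII names are identical in both encodings
        new_name = name.decode(old_encoding).encode("utf-8")
        if new_name != name:
            renamed.append((name, new_name))
            if not dry_run:
                os.rename(os.path.join(dirpath, name),
                          os.path.join(dirpath, new_name))
    return renamed
```

A real tool would also want a confirmation prompt ("offers to transcode") and error handling for names that do not actually decode under the claimed old encoding — the hard part, as this thread shows, is knowing which encoding that is.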
Re: [Python-Dev] Import and unicode: part two
On 1/26/2011 4:47 PM, Toshio Kuratomi wrote:

> There's one further case that I am worried about that has no real
> "transfer". Since people here seem to think that unicode module names
> are the future (for instance, the comments about redefining the C locale
> to include utf-8 and the comments about archiving tools needing to
> support encoding bits), there are eventually going to be unicode modules
> that become dependencies of other modules and programs. These will need
> to be installed on systems. Linux distributions that ship these will
> need to choose a filesystem encoding for the filenames of these. Likely
> the sensible thing for them to do is to use utf-8 since all the ones
> I can think of default to utf-8. But, as Stephen and Victor have pointed
> out, users change their locale settings to things that aren't utf-8 and
> save their modules using filenames in that encoding. When they update
> their OS to a version that has utf-8 python module names, they will find
> that they have to make a choice. They can either change their locale
> settings to a utf-8 encoding and have the system installed modules work
> or they can leave their encoding on their non-utf-8 encoding and have
> the modules that they've created on-site work. This is not a good
> position to put users of these systems in.

The way this case should work is that programs that install files (installation is a form of transfer) should transform their names from the encoding used in the transfer medium to the encoding of the filesystem on which they are installed. Python3 should access the files, transforming the names from the encoding of the filesystem on which they are installed to Unicode for use by the program. I think Python3 is trying to do its part, and Victor is trying to make that more robust on more platforms, specifically Windows. The programs that install files, which may include programs that install Python files, I don't know, may or may not be doing their part, but clearly there are cases where they do not.
Systems that have different encodings for names on the same or different file systems need to have a way to obtain the encoding for the file names, so they can be properly decoded. If they don't have such a way, they are broken.

=

The rest of this is an attempt to describe the problem of Linux and other systems which use byte strings instead of character strings as file names. No problem, as long as programs allow byte strings as file names. Python3 does not, for the import statement, thus the problem is relevant for discussion here, as has been ongoing.

=

Since file names are defined to be byte strings, there is no way to obtain the encoding for file names, so they cannot always be decoded, and sometimes not properly decoded, because no one knows which encoding was used to create them, _if any_. Hence, Linux programs that use character strings as file names internally and expect them to match the byte strings in the file system are promoting a fiction: that there is a transformation (encoding) from character strings to byte strings that will match.

When using ASCII character strings, they can be transformed to bytes using a simple transformation: identity... but that isn't necessarily correct, if the files were created using EBCDIC (unlikely on Linux systems, but not impossible, since Linux file names are byte strings). When using non-ASCII character strings, the fiction promoted is even bigger, and the transformation even harder. Any 8-bit character encoding can pretend that identity is the correct transformation, but the result is mojibake if it isn't. Unicode and other multi-byte encodings have an even harder job, because there can be 8-bit sequences that are not legal for some transformations, but are legal for others. This is when the fiction is exposed!
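Both halves of that fiction take one line each to demonstrate: an 8-bit "identity-style" misreading silently produces mojibake, while a strict multi-byte decoder refuses the bytes and thereby exposes the mismatch:

```python
# UTF-8 bytes misread as Latin-1: every byte "decodes", wrongly (mojibake).
raw = "café".encode("utf-8")       # bytes on disk: b'caf\xc3\xa9'
print(raw.decode("latin-1"))       # → cafÃ©

# Latin-1 bytes misread as UTF-8: the decoder refuses -- the fiction exposed.
bad = b"caf\xe9"                   # Latin-1 spelling of "café"
try:
    bad.decode("utf-8")
except UnicodeDecodeError:
    print("not valid UTF-8")       # → not valid UTF-8
```

This is exactly why any 8-bit encoding can quietly pretend identity is correct, while UTF-8 and other multi-byte encodings fail loudly on byte sequences that are not legal for them.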
As the recent description of glib points out, when the file names are read as bytes and shown to the user for selection, possibly using some mojibake-generating transformation to characters, the user has a fighting chance to pick the right file — less chance if the transformation is lossy ('?' substitutions, etc.) and/or the names are redundant in their lossless characters.

However, when the specification of the name is in characters (such as for Python import, or file names specified as character constants in any application system that provides/permits such), and there are large numbers of transformations that could be used to convert characters to bytes, the problem is harder and error-prone... programs that want to promote the fiction of using characters for file names must work harder. It seems that Python on Linux is such a program.

One technique is to have conventions agreed on by applications and users that limit the number of encodings used on a particular system to one (optimal) or a few; the latter requires understanding that files created in one encoding may not be accessible by systems that use a different encoding.
Re: [Python-Dev] PEP 393: Flexible String Representation
On Mon, Jan 24, 2011 at 3:20 PM, Antoine Pitrou wrote:
> Le mardi 25 janvier 2011 à 00:07 +0100, "Martin v. Löwis" a écrit :
>> >> I'd like to propose PEP 393, which takes a different approach,
>> >> addressing both problems simultaneously: by getting a flexible
>> >> representation (one that can be either 1, 2, or 4 bytes), we can
>> >> support the full range of Unicode on all systems, but still use
>> >> only one byte per character for strings that are pure ASCII (which
>> >> will be the majority of strings for the majority of users).
>> >
>> > For this kind of experiment, I think a concrete attempt at implementing
>> > (together with performance/memory savings numbers) would be much more
>> > useful than an abstract proposal.
>>
>> I partially agree. An implementation is certainly needed, but there is
>> nothing wrong (IMO) with designing the change before implementing it.
>> Also, several people have offered to help with the implementation, so
>> we need to agree on a specification first (which is actually cheaper
>> than starting with the implementation only to find out that people
>> misunderstood each other).
>
> I'm not sure it's really cheaper. When implementing you will probably
> find out that it makes more sense to change the meaning of some fields,
> add or remove some, etc. You will also want to try various tweaks since
> the whole point is to lighten the footprint of unicode strings in common
> workloads.

Yep. This is only a proposal; an implementation will allow all of that to be experimented with.

I have frequently seen code today, even in Python 2.x, that suffers greatly from unicode vs. str use (due to APIs in some code that were returning unicode objects unnecessarily when the data was really all ASCII text). Python 3.x only increases this, as the default for so many things passes through unicode even for programs that may not need it.
> So, the only criticism I have, intuitively, is that the unicode
> structure seems to become a bit too large. For example, I'm not sure you
> need a generic (pointer, size) pair in addition to the
> representation-specific ones.

I believe the intent this PEP is aiming at is for the existing in-memory structure to be compatible with already-compiled binary extension modules, without having to recompile them or change the APIs they are using. Personally, I don't care at all about preserving that level of binary compatibility; it has been convenient in the past but is rarely the right thing to do. Of course, I'd personally like to see PyObject nuked and revisited; it is too large and is probably not cache-line efficient.

> Incidentally, to slightly reduce the overhead of unicode objects,
> there's this proposal: http://bugs.python.org/issue1943

Interesting. But that aims more at CPU performance than memory overhead. What I see is programs that predominantly process ASCII data yet waste memory on a 2-4x data explosion of the internal representation. This PEP aims to address that larger target.

-gps
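On an interpreter with the flexible representation (CPython 3.3+ implements PEP 393), the 2-4x explosion described above becomes directly measurable: per-character storage is 1, 2, or 4 bytes depending on the widest code point in the string. A quick sketch:

```python
import sys

n = 1000
ascii_s  = "a" * n                  # Latin-1 range: 1 byte per character
bmp_s    = "\u0394" * n             # needs UCS-2:   2 bytes per character
astral_s = "\U0001F600" * n         # needs UCS-4:   4 bytes per character

# On a PEP 393 build, wider characters cost strictly more memory,
# while pure-ASCII text pays only one byte per character.
assert sys.getsizeof(ascii_s) < sys.getsizeof(bmp_s) < sys.getsizeof(astral_s)
```

The exact byte counts vary with the fixed per-object header, but the per-character ratios are what address the ASCII-heavy workloads Gregory describes.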