Re: [Python-Dev] Should ftplib use UTF-8 instead of latin-1 encoding?
Oleg Broytmann wrote: On Fri, Jan 23, 2009 at 02:35:01PM -0500, rdmur...@bitdance.com wrote: Given that a Unix OS can't know what encoding a filename is in (*), I can't see that one could practically implement a Unix FTP server in any other way. Can you believe there is a well-known program that solved the issue?! It is Apache web server! One can configure different directories and different file types to have different encodings. I often do that. One (sysadmin) can even allow users to do the configuration themselves via .htaccess local files. I am pretty sure FTP servers could borrow some ideas from Apache in this area. But they don't. Pity. :( AFAIK, Apache is in the same boat as ftp servers. You're thinking of the encoding inside of the files. The problem is with the file names themselves. -Toshio signature.asc Description: OpenPGP digital signature ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Ext4 data loss
Antoine Pitrou wrote: Steven D'Aprano steve at pearwood.info writes: It depends on what you mean by temporary. Applications like OpenOffice can sometimes recover from an application crash or even a system crash and give you the opportunity to restore the temporary files that were left lying around. For such files, you want deterministic naming in order to find them again, so you won't use the tempfile module... Something that doesn't require deterministically named tempfiles was Ted Ts'o's explanation linked to earlier:

- read data from important file
- modify data
- create tempfile
- write data to tempfile
- *sync tempfile to disk*
- mv tempfile to filename of important file

The sync is necessary to ensure that the data is on disk before the tempfile is renamed over the important file. -Toshio
Re: [Python-Dev] Ext4 data loss
Martin v. Löwis wrote: Something that doesn't require deterministically named tempfiles was Ted Ts'o's explanation linked to earlier:

- read data from important file
- modify data
- create tempfile
- write data to tempfile
- *sync tempfile to disk*
- mv tempfile to filename of important file

The sync is necessary to ensure that the data is on disk before the tempfile is renamed over the important file. You still wouldn't use the tempfile module in that case. Instead, you would create a regular file, with the name based on the name of the important file. Uhm... why? The requirements are:

1) lifetime of the temporary file is in control of the app
2) filename is available to the app so it can move it after data is written
3) temporary file can be created on the same filesystem as the important file

All of those are doable using the tempfile module. -Toshio
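A sketch of how the tempfile module meets all three requirements (again assuming Python 3.3+ for os.replace):

```python
import os
import tempfile

def atomic_write_via_tempfile(path, data):
    target_dir = os.path.dirname(os.path.abspath(path))
    # delete=False: the file's lifetime is under the app's control (1);
    # f.name: the filename is available for the later rename (2);
    # dir=target_dir: created on the same filesystem as the target (3).
    with tempfile.NamedTemporaryFile(dir=target_dir, delete=False) as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())   # sync before the rename, per the recipe
    os.replace(f.name, path)
```

The non-conflicting name tempfile generates is exactly the feature argued for later in this thread.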
Re: [Python-Dev] Ext4 data loss
Martin v. Löwis wrote: The sync is necessary to ensure that the data is on disk before the tempfile is renamed over the important file. You still wouldn't use the tempfile module in that case. Instead, you would create a regular file, with the name based on the name of the important file. Uhm... why? Because it's much easier not to use the tempfile module than to use it, and because the main purpose of the tempfile module is irrelevant to the specific application; the main purpose being the ability to auto-delete the file when it gets closed. Auto-delete is one of the nice features of tempfile. Another feature which is entirely appropriate to this usage, though, is creation of a non-conflicting filename. -Toshio
Re: [Python-Dev] Ext4 data loss
Martin v. Löwis wrote: Auto-delete is one of the nice features of tempfile. Another feature which is entirely appropriate to this usage, though, is creation of a non-conflicting filename. Ok. In that use case, however, it is completely irrelevant whether the tempfile module calls fsync. After it has generated the non-conflicting filename, it's done. If you're saying that it shouldn't call fsync automatically, I'll agree to that. The message thread I was replying to seemed to say that tempfiles didn't need to support fsync because they will be useless after a system crash. I'm just refuting that by showing that it is useful to call fsync on tempfiles as one of the steps in preserving the data in another file. -Toshio
Re: [Python-Dev] Integrate BeautifulSoup into stdlib?
Stephen J. Turnbull wrote: Chris Withers writes: - debian has an outdated and/or broken version of your package. True, but just as for the package system you are advocating, it's quite easy to set up your apt to use third-party repositories of Debian-style packages. The question is whether those repositories exist. Introducing yet another, domain-specific package manager will make it less likely that they do, and it will cause more work for downstream distributors like Debian and RH. I haven't seen this mentioned so -- For many sites (including Fedora, the one I work on), the site maintains a local yum/apt repository of packages that are necessary for getting certain applications to run. This way we are able to install a system with a distribution that is maintained by other people and have local additions that add more recent versions only where necessary. This has the following advantages:

1) We're able to track our changes to the base OS.
2) If the OS vendor releases an update that includes our fixes, we're able to consume it without figuring out on which boxes we have to delete what type of locally installed file (egg, jar, gem, /usr/local/bin/program, etc).
3) We're using the OS vendor package management system for everything so junior system admins can bootstrap a new machine with only familiarity with that OS. We don't have to teach them about rpm + eggs + gems + where to find our custom repositories of each.
4) If we choose to, we can separate out different repositories for different sets of machines. Currently we have the main local repo and one repo that only the builders pull from.

-Toshio
Re: [Python-Dev] Integrate BeautifulSoup into stdlib?
Steve Holden wrote: Seems to me that while all this is fine for developers and Python users it's completely unsatisfactory for people who just want to use Python applications. For them it's much easier if each application comes with all dependencies including the interpreter. This may seem wasteful, but it removes many of the version compatibility issues that otherwise bog things down. The upfront cost of bundling is lower but the maintenance cost is higher. For instance, OS vendors have developed many ways of being notified of and dealing with security issues. If there's a security issue with gtkmozdev and the python bindings to it have to be recompiled, OS vendors will be alerted to it and have the opportunity to release updates on zero day, the day that the security announcement goes out. Bundled applications suffer in several ways here:

1) The developers of the applications are unlikely to be on vendor-sec and so the opportunity for zero day fixes is lower.
2) The developer becomes responsible for fixing problems with the libraries, something that they often do not do. This is especially true when developers start depending, not only on newer features of some libraries, but on older versions of others (for API changes). It's not clear to many developers that requiring a newer version of a library is at least supported by upstream, whereas requiring an older version leaves them as the sole responsible party.
3) Over time, bundled libraries tend to become forked versions. And worse, privately forked versions. If three python apps all use slightly different older versions of libfoo-python and have backported fixes, added new features, etc, it is a nightmare for a system administrator or packager to get them running with a single version from the system library or to forward port them. And because they're private forks, the developers lose out on collaborating on security, bugfixes, etc, because they are doing their work in isolation from the other forks.
-Toshio
Re: [Python-Dev] Integrate BeautifulSoup into stdlib?
David Cournapeau wrote: 2009/3/24 Toshio Kuratomi a.bad...@gmail.com: Steve Holden wrote: Seems to me that while all this is fine for developers and Python users it's completely unsatisfactory for people who just want to use Python applications. For them it's much easier if each application comes with all dependencies including the interpreter. This may seem wasteful, but it removes many of the version compatibility issues that otherwise bog things down. The upfront cost of bundling is lower but the maintenance cost is higher. For instance, OS vendors have developed many ways of being notified of and dealing with security issues. If there's a security issue with gtkmozdev and the python bindings to it have to be recompiled, OS vendors will be alerted to it and have the opportunity to release updates on zero day, the day that the security announcement goes out. I don't think bundling should be compared to depending on the system libraries, but as a lesser evil compared to requiring multiple, system-wide installed libraries. Well.. I'm not so sure it's even a win there. If the libraries are installed system-wide, at least the consumer of the application knows:

1) Where to find all the libraries to audit the versions when a security issue is announced.
2) That the library is unforked from upstream.
3) That all the consumers of the library version have a central location to collaborate on announcing fixes to the library.

With my distribution packager hat on, I can say I dislike both multiple versions and bundling but I definitely dislike bundling more. 3) Over time, bundled libraries tend to become forked versions. And worse, privately forked versions. If three python apps all use slightly different older versions of libfoo-python and have backported fixes, added new features, etc it is a nightmare for a system administrator or packager to get them running with a single version from the system library or forward port them.
And because they're private forks the developers lose out on collaborating on security, bugfixes, etc because they are doing their work in isolation from the other forks. This is a purely technical problem, and can be handled by good source control systems, no? No. This is a social problem. Good source control only helps if I am tracking upstream's trunk so I'm aware of the direction that their changes are headed. But there's a wide range of reasons that application developers who bundle libraries don't do that:

1) Not enough time in a day. I'm working full-time on making my application better. Plus I have to update all these bundled libraries from time to time, testing that the updates don't break anything. I don't have time to track trunk for all these libraries -- I barely have time to track releases.
2) My release schedule doesn't mesh with all of the upstream libraries I'm bundling. When I want to release Foo-1.0, I want to have some assurance that the libraries I'm bundling with will do the right thing. Since releases see more testing than trunk, tracking trunk for twenty bundled libraries is a lot less attractive than tracking release branches.
3) This doesn't help with the fact that my bundled version of the library and your bundled version of the library are being developed in isolation from each other. This needs central coordination, which people who believe in bundling libraries are very unlikely to pursue.

-Toshio
Re: [Python-Dev] Integrate BeautifulSoup into stdlib?
Tres Seaver wrote: David Cournapeau wrote: I am afraid that distutils, and setuptools, are not really the answer to the problem, since while they may (as intended) guarantee that Python applications can be installed uniformly across different platforms they also more or less guarantee that Python applications are installed differently from all other applications on the platform. I think they should be part of the solution, in the sense that they should allow easier packaging for the different platforms (linux, windows, mac os x and so on). For now, they make things much harder than they should (difficult to follow the FHS, etc...). FHS is something which packagers / distributors care about: I strongly doubt that the end users will ever notice, particularly for silliness like 'bin' vs. 'sbin', or architecture-specific vs. 'noarch' rules. That's because you're thinking of a different class of end-user than FHS is targeting. Someone who wants to install a web application on a limited number of machines (one, in the home-desktop scenario), or someone who makes their living helping people to install the software they've written, has a whole different view on things than someone who's trying to install and maintain the software in fifteen computer labs on a campus, or the person who is trying to write software that is portable to tens of different platforms in their spare time, for whom every bit of time spent answering end users' questions, tracking other upstreams for security bugs, etc, is time taken away from coding. Following FHS means that the software will work both for end-users who don't care about the nitty-gritty of the FHS and for system administrators of large sites. Disregarding the FHS because it is silliness means that system administrators are going to have to special-case your application, decide not to install it at all, or pay someone else to support it. Note that those things do make sense sometimes.
For instance, when an application is not intended to be distributed to a large number of outside entities (facebook, flickr, etc) or when your revenue stream is making money from installing and administering a piece of software for other companies. -Toshio
Re: [Python-Dev] Integrate BeautifulSoup into stdlib?
David Cournapeau wrote: On Wed, Mar 25, 2009 at 1:45 AM, Toshio Kuratomi a.bad...@gmail.com wrote: David Cournapeau wrote: 2009/3/24 Toshio Kuratomi a.bad...@gmail.com: Steve Holden wrote: Seems to me that while all this is fine for developers and Python users it's completely unsatisfactory for people who just want to use Python applications. For them it's much easier if each application comes with all dependencies including the interpreter. This may seem wasteful, but it removes many of the version compatibility issues that otherwise bog things down. The upfront cost of bundling is lower but the maintenance cost is higher. For instance, OS vendors have developed many ways of being notified of and dealing with security issues. If there's a security issue with gtkmozdev and the python bindings to it have to be recompiled, OS vendors will be alerted to it and have the opportunity to release updates on zero day, the day that the security announcement goes out. I don't think bundling should be compared to depending on the system libraries, but as a lesser evil compared to requiring multiple, system-wide installed libraries. Well.. I'm not so sure it's even a win there. If the libraries are installed system-wide, at least the consumer of the application knows:

1) Where to find all the libraries to audit the versions when a security issue is announced.
2) That the library is unforked from upstream.
3) That all the consumers of the library version have a central location to collaborate on announcing fixes to the library.

Yes, those are problems, but installing multiple libraries has a lot of problems too:

- quickly, by enabling multiple installed versions, people become very sloppy about handling the versions of their dependencies, and this greatly increases the number of libraries installed
- so the advantages above for system-wide installation become intractable quite quickly

This is somewhat true. Sloppiness and increased libraries are bad. But there are checks on this sloppiness.
Distributions, for instance, are quite active about porting software to use only a subset of versions. So in the open source world, there's a large number of players interested in keeping the number of versions down. Using multiple libraries will point people at where work needs to be done, whereas bundling hides it behind the monolithic bundle. - bundling also supports a real use case which cannot be solved by rpm/deb AFAIK: installation without administration privileges. This is only sort of true. You can install rpms into a local directory without root privileges with a command-line switch. But rpm/deb are optimized for system administrators, so the documentation on doing this is not well done. There can also be code issues with doing things this way, but those issues can affect bundled apps as well. And finally, since rpm's primary use is installing systems, the toolset around it builds systems. So it's a lot easier to build a private root filesystem than it is to cherry-pick a single package. It should be possible to create a tool that merges a system rpmdb and a user's local rpmdb using the existing API, but I'm not aware of any applications built to do that yet. - multi-version installation gives very fragile systems. That's actually my number one complaint in python: setuptools has caused me numerous headaches, and I got many bug reports because you often do not know why one version was loaded instead of another one. I won't argue for setuptools' implementation of multi-version. It sucks. But multi-version can be done well. Sonames in C libraries are a simple system that does this better. So I am not so convinced multiple-version is better than bundling - I can see how it sometimes can be, but I am not sure those are that important in practice. Bundling is always harmful. Whether multiple versioning is any better is certainly debatable :-) No. This is a social problem.
Good source control only helps if I am tracking upstream's trunk so I'm aware of the direction that their changes are headed. But there's a wide range of reasons that application developers that bundle libraries don't do that: 1) not enough time in a day. I'm working full-time on making my application better. Plus I have to update all these bundled libraries from time to time, testing that the updates don't break anything. I don't have time to track trunk for all these libraries -- I barely have time to track releases. Yes, but in that case, there is nothing you can do. Putting everything in one project is always easier than splitting into modules, coding and deployment-wise. That's just one side of the speed of development vs maintenance issue IMHO. 3) This doesn't help with the fact that my bundled version of the library and your bundled version of the library are being developed in isolation from each other. This needs central coordination which people who believe bundling
Re: [Python-Dev] Integrate BeautifulSoup into stdlib?
Barry Warsaw wrote: Tools like setuptools, zc.buildout, etc. seem great for developers but not very good for distributions. At last year's Pycon I think there was agreement from the Linux distributors that distutils, etc. just wasn't very useful for them. It's decent for modules but has limitations that we run up against somewhat frequently. It's a horror for applications. -Toshio
Re: [Python-Dev] Integrate BeautifulSoup into stdlib?
David Cournapeau wrote: I won't argue for setuptools' implementation of multi-version. It sucks. But multi-version can be done well. Sonames in C libraries are a simple system that does this better. I would say simplistic instead of simple :) what works for C won't necessarily work for python - and even in C, library versioning is not used that often except for a few core libraries. Library versioning works in C because C's model is very simple. It already breaks for C++. I'm not sure what you're talking about here. Library versioning is used for practically every library on a Linux system. My limited exposure to the BSDs and Solaris was the same. (If you're only talking about Windows, well; does Windows even have sonames?) I can name only one library that isn't versioned in Fedora right now and may have heard of five total. Perhaps you are thinking of versioned library symbols? If so, there are only a few libraries that are using that. But specifying backwards compatibility via soname is well known and ubiquitous. More high-level languages like C# already have a more complicated scheme (the GAC) - and my impression is that it did not work that well. The SxS scheme for DLLs on recent Windows to handle multiple versions is a nightmare too in my (limited) experience. Looking at C#/Mono/.net for examples is perfectly horrid. They've taken inferior library versioning and bad development practices and added technology (the GAC) as the solution. If you want an idea of what python should avoid at all costs, look to that arena for your answer. * Note that setuptools' multi-version implementation shares some things in common with the GAC. For instance, using directories to separate versions instead of filenames. setuptools' implementation could be made better by studying the GAC and taking things like caching of lookups from it, but I don't encourage this... I think the design itself is flawed.
-Toshio
Re: [Python-Dev] setuptools has divided the Python community
Guido van Rossum wrote: On Wed, Mar 25, 2009 at 9:40 PM, Tarek Ziadé ziade.ta...@gmail.com wrote: I think Distutils (and therefore Setuptools) should provide some APIs to play with special files (like resources) and to mark them as being special, no matter where they end up in the target system. So the code inside the package can use these files seamlessly no matter what the system is and no matter where the files have been placed by the packager. This has been discussed already but not clearly defined. Yes, this should be done. PEP 302 has some hooks but they are optional and not available for the default case. A simple wrapper to access a resource file relative to a given module or package would be easy to add. It should probably support four APIs:

- Open as a binary stream
- Open as a text stream
- Get contents as a binary string
- Get contents as a text string

Depending on the definition of a resource, there's additional information that could be needed. For instance, if resource includes message catalogs, then being able to get the base directory that the catalogs reside in is needed for passing to gettext. I'd be very happy if resource didn't encompass that type of thing, though... then we could have a separate interface that addressed the issues with them. I'll be at PyCon (flying in late tonight, though, and leaving Sunday) if Tarek and others want to get ahold of me to discuss possible ways to address what's a resource, what's not, and what we would need to handle the different cases. -Toshio
Re: [Python-Dev] setuptools has divided the Python community
Guido van Rossum wrote: 2009/3/26 Toshio Kuratomi a.bad...@gmail.com: Guido van Rossum wrote: On Wed, Mar 25, 2009 at 9:40 PM, Tarek Ziadé ziade.ta...@gmail.com wrote: I think Distutils (and therefore Setuptools) should provide some APIs to play with special files (like resources) and to mark them as being special, no matter where they end up in the target system. So the code inside the package can use these files seamlessly no matter what the system is and no matter where the files have been placed by the packager. This has been discussed already but not clearly defined. Yes, this should be done. PEP 302 has some hooks but they are optional and not available for the default case. A simple wrapper to access a resource file relative to a given module or package would be easy to add. It should probably support four APIs:

- Open as a binary stream
- Open as a text stream
- Get contents as a binary string
- Get contents as a text string

Depending on the definition of a resource, there's additional information that could be needed. For instance, if resource includes message catalogs, then being able to get the base directory that the catalogs reside in is needed for passing to gettext. Well, the whole point is that for certain loaders (e.g. zip files) there *is* no base directory. If you do need directories you won't be able to use PEP-302 loaders, and you can just use os.path.dirname(some_module.__file__). Yep. Having no base directory isn't sufficient in all cases. So one way to fix this is to define resources so that these cases fall outside of that. Current setuptools works around this by having API in pkg_resources that unzips when it's necessary to use a filename rather than just retrieving the data from the file. So a second option is to have other API methods that allow this.
-Toshio
Re: [Python-Dev] Rethinking intern() and its data structure
Robert Collins wrote: Certainly, import time is part of it:

robe...@lifeless-64:~$ python -m timeit -s 'import sys; import bzrlib.errors' "del sys.modules['bzrlib.errors']; import bzrlib.errors"
10 loops, best of 3: 18.7 msec per loop

(errors.py is 3027 lines long with 347 exception classes). We've also looked lower - python does a lot of stat operations searching for imports and determining if the pyc is up to date; these appear to only really matter on cold-cache imports (but they matter a lot then); in hot-cache situations they are insignificant. Tarek, Georg, and I talked about a way to do both multi-version and a speedup of this exact problem with import in the future at PyCon. I had to leave before the hackfest got started, though, so I don't know where the idea went from there. Tarek, did this idea progress any? -Toshio
Re: [Python-Dev] #!/usr/bin/env python -- python3 where applicable
Greg Ewing wrote: Steven Bethard wrote: That's an unfortunate decision. When the 2.X line stops being maintained (after 2.7 maybe?) we're going to be stuck with the 3 suffix forever for the real Python. I don't see why we have to be stuck with it forever. When 2.x has faded into the sunset, we can start aliasing 'python' to 'python3' if we want, can't we? You could, but it's not my favorite idea. Gets people used to the idea of python == python2 and python3 == python3 as something they can count on. Then says, Oops, that was just an implementation detail, we're changing that now. Much better to either make a clean break and call the new language dialect python3 from now and forever or force people to come up with solutions to whether /usr/bin/python == python2 or python3 right now while it's fresh and relevant in their minds. -Toshio
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Glenn Linderman wrote: On approximately 4/24/2009 11:40 AM, came the following characters from And so my encoding (1) doesn't alter the data stream for any valid Windows file name, and where the naivest of users reside (2) doesn't alter the data stream for any Posix file name that was encoded as UTF-8 sequences and doesn't contain ? characters in the file name [I perceive the use of ? in file names to be rare on Posix, because of experience, and because of the other problems caused by such use] (3) doesn't introduce data puns within applications that are correctly coded to know the encoding occurs. The encoding technique in the PEP not only can produce data puns, thus not being reversible, it provides no reliable mechanism to know that this has occurred. Uhm... Not arguing with your goals, but '?' is unfortunately reasonably easy to get into a filename. For instance, I've had to download a lot of scratch-built packages from our buildsystem recently. Scratch builds have URLs with query strings in them, so::

wget 'http://koji.fedoraproject.org/koji/getfile?taskID=1318059&name=monodevelop-debugger-gdb-2.0-1.1.i586.rpm'

Which results in the filename: getfile?taskID=1318059&name=monodevelop-debugger-gdb-2.0-1.1.i586.rpm -Toshio
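The point is easy to verify: a POSIX filename can contain any byte except '/' and NUL, so '?' (and '&', '*', even newline) are all perfectly legal. A quick check, assuming a POSIX filesystem:

```python
import os
import tempfile

d = tempfile.mkdtemp()
# The kind of name wget produces when given a URL with a query string:
name = "getfile?taskID=1318059&name=monodevelop-debugger-gdb-2.0-1.1.i586.rpm"
with open(os.path.join(d, name), "w") as f:
    f.write("rpm payload")
assert name in os.listdir(d)   # '?' and '&' are ordinary filename characters
```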
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Terry Reedy wrote: Is NUL \0 allowed in POSIX file names? If not, could that be used as an escape char. If it is not legal, then custom translated strings that escape in the wild would raise a red flag as soon as something else tried to use them. AFAIK NUL should be okay but I haven't read a specification to reach that conclusion. Is that a proposal? Should I go find someone who has read the relevant standards to find out? -Toshio
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Zooko O'Whielacronx wrote: On Apr 28, 2009, at 6:46 AM, Hrvoje Niksic wrote: If you switch to iso8859-15 only in the presence of undecodable UTF-8, then you have the same round-trip problem as the PEP: both b'\xff' and b'\xc3\xbf' will be converted to u'\u00ff' without a way to unambiguously recover the original file name. Why do you say that? It seems to work as I expected here::

    >>> '\xff'.decode('iso-8859-15')
    u'\xff'
    >>> '\xc3\xbf'.decode('iso-8859-15')
    u'\xc3\xbf'
    >>> '\xff'.decode('cp1252')
    u'\xff'
    >>> '\xc3\xbf'.decode('cp1252')
    u'\xc3\xbf'

You're not showing that this is a fallback path. What won't work is first trying a local encoding (in the following example, utf-8) and then, if that doesn't work, trying a one-byte encoding like iso8859-15::

    try:
        file1 = '\xff'.decode('utf-8')
    except UnicodeDecodeError:
        file1 = '\xff'.decode('iso-8859-15')
    print repr(file1)

    try:
        file2 = '\xc3\xbf'.decode('utf-8')
    except UnicodeDecodeError:
        file2 = '\xc3\xbf'.decode('iso-8859-15')
    print repr(file2)

That prints::

    u'\xff'
    u'\xff'

The two encodings can map different bytes to the same unicode code point, so you can't do this type of thing without recording what encoding was used in the translation.

-Toshio
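The same experiment restated in Python 3 terms (bytes in, str out) — a minimal sketch, not from the original thread:

```python
def decode_with_fallback(raw: bytes) -> str:
    """Try the 'local' encoding first, then fall back to a one-byte encoding."""
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        return raw.decode('iso-8859-15')

# Two different byte strings collapse to the same text, so the round trip
# is lost unless you also record which encoding was actually used.
print(decode_with_fallback(b'\xff') == decode_with_fallback(b'\xc3\xbf'))  # → True
```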
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Martin v. Löwis wrote: Since the serialization of the Unicode string is likely to use UTF-8, and the string for such a file will include half surrogates, the application may raise an exception when encoding the names for a configuration file. These encoding exceptions will be as rare as the unusual names (which the careful I18N-aware developer has probably eradicated from his system), and thus will appear late. There are trade-offs to any solution; if there was a solution without trade-offs, it would be implemented already. The Python UTF-8 codec will happily encode half surrogates; people argue that it is a bug that it does so, however, it would help in this specific case.

Can we use this encoding scheme for writing into files as well? We've turned the filename with undecodable bytes into a string with half surrogates. Putting that string into a file has to turn them into bytes at some level. Can we use the python-escape error handler to achieve that somehow?

-Toshio
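For reference, the error handler PEP 383 drafts called "python-escape" shipped in Python 3 as 'surrogateescape', and it provides exactly the byte-level round trip being asked about here — a sketch:

```python
# Undecodable bytes become lone half surrogates on decode, and the same
# error handler turns them back into the original bytes on encode.
raw = b'ab\xff\x83cd'                                   # not valid UTF-8
name = raw.decode('utf-8', 'surrogateescape')
print('\udcff' in name)                                 # → True (half surrogate)
print(name.encode('utf-8', 'surrogateescape') == raw)   # → True
```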
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Thomas Breuel wrote: Not for me (I am using Python 2.6.2)::

    >>> f = open(chr(255), 'w')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    IOError: [Errno 22] invalid mode ('w') or filename: '\xff'

You can get the same error on Linux::

    $ python
    Python 2.6.2 (release26-maint, Apr 19 2009, 01:56:41)
    [GCC 4.3.3] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> f = open(chr(255), 'w')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    IOError: [Errno 22] invalid mode ('w') or filename: '\xff'

(Some file system drivers do not enforce valid utf8 yet, but I suspect they will in the future.) Do you suspect that from discussing the issue with kernel developers or reading a thread on lkml? If not, then your suspicion seems to be pretty groundless.

The fact that VFAT enforces an encoding does not lend itself to your argument, for two reasons:

1) VFAT is not a Unix filesystem. It's a filesystem that's compatible with Windows/DOS. If Windows and DOS have filesystem encodings, then it makes sense for that driver to enforce them as well. Filesystems intended to be used natively on Linux/Unix do not necessarily make this design decision.

2) The encoding is specified when mounting the filesystem. This means that you can still mix encodings in a number of ways. If you mount with an encoding that has full byte coverage, for instance, each user can put filenames from different encodings on there. If you mount with utf8 on a system which uses euc-jp as the default encoding, you can have full paths that contain a mix of utf-8 and euc-jp. Etc.

-Toshio
Re: [Python-Dev] Promoting Python 3 [was: PyPy 1.7 - widening the sweet spot]
On Wed, Nov 23, 2011 at 01:41:46AM +0900, Stephen J. Turnbull wrote: Barry Warsaw writes: Hopefully, we're going to be making a dent in that in the next version of Ubuntu. This is still a big mess in Gentoo and MacPorts, though. MacPorts hasn't done anything about creating a transition infrastructure AFAICT. Gentoo has its "eselect python set VERSION" stuff, but it's very dangerous to set to a Python 3 version, as many things go permanently wonky once you do. (So far I've been able to work around problems this creates, but it's not much fun.) I have no experience with this in Debian, Red Hat (and derivatives) or *BSD, but I have to suspect they're no better. (Well, maybe Red Hat has learned from its 1.5.2 experience! :-)

For Fedora (and currently, Red Hat is based on Fedora -- a little more about that later, though), we have parallel python2 and python3 stacks. As time goes on we've slowly brought more python3-compatible modules onto the python3 stack (I believe someone had the goal, a year and a half ago, of getting a complete pylons web development stack running on python3 on Fedora, which brought a lot of packages forward). Unlike Barry's work with Ubuntu, though, we're mostly chiselling around the edges; we're working at the level where there's a module that someone needs to run something (or run some optional features of something) on python3.

I don't have any connections to the distros, so can't really offer to help directly. I think it might be a good idea for users to lobby (politely!) their distros to work on the transition. Where distros aren't working on parallel stacks, there definitely needs to be some transition plan. From my experience with parallel stacks, the best help there is to 1) help upstreams port to py3k (if someone can get PIL's py3k support finished and into a released package, that would free up a few things); 2) open bugs or help with creating python3 packages of modules when the upstream support is there.
Depending on what software Barry's talking about porting to python3, that could be a big incentive as well. Just like with the push in Fedora to have pylons run on python3, I think that having certain applications that run on python3, and therefore need stacks of modules that support it, is one of the prime ways that distros become motivated to provide python3 packages and support. This is basically the "killer app" idea in a new venue :-)

-Toshio
Re: [Python-Dev] Python 3.4 Release Manager
On Tue, Nov 22, 2011 at 08:27:24PM -0800, Raymond Hettinger wrote: On Nov 22, 2011, at 7:50 PM, Larry Hastings wrote: But look! I'm already practicing: NO YOU CAN'T CHECK THAT IN. How's that? Needs work? You could try a more positive leadership style: THAT LOOKS GREAT, I'M SURE THE RM FOR PYTHON 3.5 WILL LOVE IT ;-)

Wow! My release engineering team needs to take classes from you guys!

-Toshio
Re: [Python-Dev] Fwd: Anyone still using Python 2.5?
On Thu, Dec 22, 2011 at 02:49:06AM +0100, Victor Stinner wrote: Do people still have to use this in commercial environments or is everyone on 2.6+ nowadays? At work, we are still using Python 2.5. Six months ago, we started a project to upgrade to 2.7, but we now have more urgent tasks, so the upgrade is delayed until later. Even if we upgrade new clients to 2.7, we will have to continue to support 2.5 for some more months (or years?).

At my work, I'm on RHEL5 and RHEL6, so I'm currently supporting python-2.4 and python-2.6. We're up to 75% RHEL6 (though not the machines where most of our deployed, custom-written apps are running), so I shouldn't have to support python-2.4 for much longer.

In a personal project (the IPy library), I dropped support of Python 2.5 in February 2011. Recently, I got a mail asking me where the previous version of my library (supporting Python 2.4) can be downloaded! Someone is still using Python 2.4:

I'm stuck with python-2.4 in my work environment. As part of work, I package for EPEL5 (addon packages for RHEL5). Sometimes we need a new version of a package, or a new package, for RHEL5 and thus need python-2.4-compatible versions of the package and any of its dependencies. When I no longer need to maintain python-2.4 stuff for work, I'm hoping not to have to do quite so much of this, but I know I'll still sometimes get requests to update an existing package to fix a bug or add a feature, and that will require updates of dependent libraries. I'll still be stuck looking for python-2.4-compatible versions of all of these :-(

What do people feel? For a new project, try to support Python 2.5, especially if you would like to write a portable library. For a new application working on Mac OS X, Windows and Linux, you can only support Python 2.6.

I agree that libraries need to go farther back than applications. I have one library that I support on python-2.3 (for RHEL4... I'm counting down the months on that one :-).
For every other library I maintain, I make sure it supports at least python-2.4. Application-wise, I currently have to support python-2.4+, but given that the Linux distros all seem to have some version out that supports at least python-2.6, I don't think I'll be developing any applications that intentionally support less than that once I get moved away from RHEL-5 at my workplace.

-Toshio
Re: [Python-Dev] Hash collision security issue (now public)
On Thu, Jan 05, 2012 at 08:35:57PM +, Paul Moore wrote: On 5 January 2012 19:33, David Malcolm dmalc...@redhat.com wrote: We have similar issues in RHEL, with the Python versions going much further back (e.g. 2.3). When backporting the fix to ancient python versions, I'm inclined to turn the change *off* by default, requiring the change to be enabled via an environment variable: I want to avoid breaking existing code, even if such code is technically relying on non-guaranteed behavior. But we could potentially tweak mod_python/mod_wsgi so that it defaults to *on*. That way /usr/bin/python would default to the old behavior, but web apps would have some protection. Any such logic here also suggests the need for an attribute in the sys module so that you can verify the behavior. Uh, surely no-one is suggesting backporting to ancient versions? I couldn't find the statement quickly on the python.org website (so this is via google), but isn't it true that 2.6 is in security-only mode and 2.5 and earlier will never get the fix?

I think when dmalcolm says "backporting" he means that he'll have to backport the fix from modern, supported-by-python.org Python to the ancient Pythons that he's supporting as part of the Linux distributions where he's the python package maintainer. I'm thinking he's mentioning it here mainly to see whether anyone can point out a reason not to diverge from upstream in that manner for those distributions.

Having a source-only release for 2.6 means the fix is off by default in the sense that you can choose not to build it. Or add a #ifdef to the source if it really matters. I don't think that this would satisfy dmalcolm's needs. What he's talking about sounds more like a runtime switch (possibly only when initializing, though, not on-the-fly).
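For what it's worth, the fix that eventually shipped upstream took exactly the shape sketched here: an environment variable (PYTHONHASHSEED) to control the behavior, plus a sys attribute to verify it — a quick check as it exists in today's Python 3:

```python
import sys

# Hash randomization is on by default in Python 3.3+; it can be pinned
# or disabled with the PYTHONHASHSEED environment variable, and the
# effective setting is exposed for inspection:
print(sys.flags.hash_randomization)   # nonzero when randomization is enabled
```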
-Toshio
Re: [Python-Dev] PEP 411: Provisional packages in the Python standard library
On Sat, Feb 11, 2012 at 04:32:56PM +1000, Nick Coghlan wrote: This would then be seen by pydoc and help(), as well as being amenable to programmatic inspection. Would using::

    warnings.warn('This is a provisional API and may change radically from'
                  ' release to release', ProvisionalWarning)

where ProvisionalWarning is a new exception/warning category (a subclass of FutureWarning?) be considered too intrusive?

-Toshio
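A minimal sketch of what that could look like (ProvisionalWarning is hypothetical here, not an existing stdlib name):

```python
import warnings

class ProvisionalWarning(FutureWarning):
    """Hypothetical category: the API that raised this may change."""

def fancy_new_api():
    # a provisional API announces itself on every call
    warnings.warn('provisional API; may change radically between releases',
                  ProvisionalWarning, stacklevel=2)
    return 42

# callers (or test suites) can filter or assert on the category
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    fancy_new_api()
print(caught[0].category is ProvisionalWarning)  # → True
```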
Re: [Python-Dev] #12982: Should -O be required to *read* .pyo files?
On Wed, Jun 13, 2012 at 01:58:10PM -0400, R. David Murray wrote: OK, but you didn't answer the question :). If I understand correctly, everything you said applies to *writing* the bytecode, not reading it. So, is there any reason not to use the .pyo file (if that's all that is around) when -O is not specified? The only technical reason I can see why -O should be required for a .pyo file to be used (*if* it is the only thing around) is if it won't *run* without the -O switch. Is there any expectation that that will ever be the case?

Yes. For instance, if I create a .pyo with -OO it won't have docstrings. Another piece of code can legally import it and try to use a docstring for something. This would fail if only the .pyo was present. Of course, it would also fail under the present behaviour, since no .py or .pyc was present to be imported. The error that's displayed might be clearer if we fail when attempting to read a .py/.pyc rather than failing when the docstring is found to be missing, though.

-Toshio
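The docstring-stripping behaviour of -OO is easy to demonstrate by spawning a child interpreter (a sketch; the inline snippet is made up for illustration):

```python
import subprocess
import sys

code = 'def f():\n    "a docstring another module might rely on"\nprint(f.__doc__)'

# normally the docstring survives compilation...
normal = subprocess.run([sys.executable, '-c', code],
                        capture_output=True, text=True).stdout.strip()
# ...but under -OO it is stripped, so f.__doc__ is None
stripped = subprocess.run([sys.executable, '-OO', '-c', code],
                          capture_output=True, text=True).stdout.strip()
print(normal)    # → a docstring another module might rely on
print(stripped)  # → None
```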
[Python-Dev] Python-3.0, unicode, and os.environ
I opened up bug http://bugs.python.org/issue4006 a while ago and it was suggested in the report that it's not a bug but a feature, and so I should come here to see about getting the feature changed :-) I have a specific problem with os.environ and a somewhat less important architectural issue with the unicode/bytes handling in certain os.* modules. I'll start with the important one:

Currently in python3 there's no way to get at environment variables that are not encoded in the system default encoding. My understanding is that this isn't a problem on Windows systems, but on *nix this is a huge problem. Environment variables on *nix are a sequence of non-null bytes. These bytes are almost always characters, but they do not have to be. Further, there is nothing that requires that the characters be in the same encoding; some of the characters could be in the UTF-8 character set while others are in latin-1, shift-jis, or big-5. These mixed encodings can occur for a variety of reasons. Here's an example that isn't too contrived :-)

Swallow is a multi-user shell server hosted at a university in Japan. The OS installed is Fedora 10, where the encoding of all filenames provided by the OS is UTF-8. The administrator of the OS has kept this convention and, among other things, has created a directory on which to mount an NFS directory from another computer. He calls it ネットワーク ("network" in Japanese). Since it's utf-8, that gets put on the filesystem as::

    '\xe3\x83\x8d\xe3\x83\x83\xe3\x83\x88\xe3\x83\xaf\xe3\x83\xbc\xe3\x82\xaf'

Now, the administrators of the fileserver have been maintaining it since before Unicode was invented. Furthermore, they don't want to suffer the space loss of using utf-8 to encode Japanese, so they use shift-jis everywhere. They have a directory on the nfs share for programs that are useful for people on the shell server to access.
It's called プログラム ("programs" in Japanese). Since they're using shift-jis, the bytes on the filesystem are::

    '\x83v\x83\x8d\x83O\x83\x89\x83\x80'

The system administrator of the shell server adds the directory of programs to all his users' default PATH variables, so then they have this::

    PATH=/bin:/usr/bin:/usr/local/bin:/mnt/\xe3\x83\x8d\xe3\x83\x83\xe3\x83\x88\xe3\x83\xaf\xe3\x83\xbc\xe3\x82\xaf/\x83v\x83\x8d\x83O\x83\x89\x83\x80

(Note: python syntax; in the unix shell you'd likely have octal instead of hex.) Now comes the problematic part. One of the users on the system wants to write a python3 program that needs to determine if a needed program is in the user's PATH. He tries to code it like this::

    #!/usr/bin/python3.0
    import os
    for directory in os.environ['PATH'].split(':'):
        programs = os.listdir(directory)

That code raises a KeyError because python3 has silently discarded the PATH due to the shift-jis encoded path elements. Much more importantly, there's no way the programmer can handle the KeyError and actually get the PATH from within python. In the bug report I opened, I listed four ways to fix this along with the pros and cons:

1) Return mixed unicode and byte types in os.environ and os.getenv.
   - I think this one is a bad idea. It's the easiest for simple code to deal with, but it's repeating the major problem with python2's Unicode handling: mixing unicode and byte types unpredictably.

2) Return only byte types in os.environ.
   - This is conceptually correct but the most annoying option. Technically we're receiving bytes from the C libraries and the C libraries expect bytes in return. But in the common case we will be dealing with things in one encoding, so this causes needless effort for the application programmer in the common case.

3) Silently ignore the non-decodable value when accessing os.environ['PATH'], as we do now, but allow access to the full information via os.environ[b'PATH'] and os.getenvb().
   - This mirrors the practice of os.listdir('.') vs os.listdir(b'.') and os.getcwd() vs os.getcwdb().

4) Raise an exception when non-decodable values are *accessed* and continue as in #3. This means that os.environ wouldn't be a simple dict, as it would need to decode the values when keys are accessed (although it could cache the values).
   - This mirrors the practice of open(), which is to decode the value for the common case but throw an exception and allow the programmer to decide what to do if a value is not decodable.

Either #3 or #4 will solve the major problem, and both have precedent in python3's current implementation. The difference between them is whether to throw an exception when a non-decodable value is encountered. Here's why I think that's appropriate: one of the things I enjoy about python is the informative tracebacks that make debugging easy. I think that the ease of debugging is lost when we silently ignore an error. If we look at the difference in coding and debugging for problems with files that aren't encoded in the default encoding (where a traceback is issued) and os.listdir() when filenames aren't in the default
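Option 3 is essentially what later Python 3 releases shipped: os.environb (added in 3.2, available where the native environment is bytes, i.e. not Windows) exposes the raw byte environment alongside os.environ. A sketch of the failing PATH lookup done at the byte level (the variable name here is illustrative):

```python
import os

# simulate an environment variable holding mixed/undecodable bytes
os.environb[b'DEMO_PATH'] = b'/bin:/mnt/\x83v\x83\x8d\x83O\x83\x89\x83\x80'

# byte-level access round-trips exactly: no KeyError, no data loss
elements = os.environb[b'DEMO_PATH'].split(b':')
print(elements[1])  # → b'/mnt/\x83v\x83\x8d\x83O\x83\x89\x83\x80'
```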
Re: [Python-Dev] Python-3.0, unicode, and os.environ
Adam Olsen wrote: On Thu, Dec 4, 2008 at 1:02 PM, Toshio Kuratomi [EMAIL PROTECTED] wrote: [the original post: python3 provides no way to get at environment variables not encoded in the system default encoding; on *nix they are bytes and need not all be in the same encoding]

Multiple encoding environments are best described as batshit insane. It's impossible to handle any of it correctly *as text*, which is why UTF-8 is becoming a universal standard. For everybody's sanity python should continue to push it.

Amen, brother! However, some pragmatism is also possible. Unfortunately, this is exactly what I'm talking about :-)

Many uses of PATH may allow it to be treated as black-box bytes, rather than text. The minimal solution I see is to make os.getenv() and os.putenv() switch to byte modes when given byte arguments, as os.listdir() does. This use case doesn't require the ability to iterate over all environment variables, as os.environb would allow.

This would be a partial implementation of my option #3. It allows the programmer to work around problems, but it does allow subtle bugs to creep in unawares.
For instance:

I do wonder if controlling the environment given to a subprocess requires os.environb, but it may be too obscure to really matter.

If you wanted to change one variable before passing the environment on to a subprocess, this could lead to head-scratcher bugs. Here's a contrived example: say I have an app that talks to multiple cvs repositories. It copies os.environ and modifies CVSROOT and CVS_RSH, then calls subprocess with env=temp_env. If the PATH variable contains non-decodable elements on some machines, this could lead to mysterious failures. This is particularly bad because we aren't directly modifying PATH anywhere in our code, so there won't be an obvious reason in the code why this is failing.

-Toshio
Re: [Python-Dev] Python-3.0, unicode, and os.environ
Adam Olsen wrote: On Thu, Dec 4, 2008 at 2:09 PM, André Malo [EMAIL PROTECTED] wrote: * Adam Olsen wrote: [the earlier exchange about mixed-encoding environment variables, quoted in full]

Here's an example which will become popular soon, I guess: CGI scripts and, of course, WSGI applications. All those get their environment in an unknown encoding. In the worst case one can blow up the application by simply sending strange header lines over the wire. But there's more: consider running the server in C locale; then probably even a single 8-bit char might break something (?).

I think that's an argument that the framework should reencode all input text into the correct system encoding before passing it on to the CGI script or WSGI app.
If the framework doesn't have a clear way to determine the client's encoding then it's all just gibberish anyway. A HTTP 400 or 500 range error code is appropriate here.

The framework can't always encode input bytes into the system encoding for text. Sometimes the framework really is dealing with bytes. For instance, if the framework is being asked to reference an actual file on a *nix filesystem, the bytes have to match up with the bytes in the filename whether or not those bytes agree with the system encoding.

However, some pragmatism is also possible. Many uses of PATH may allow it to be treated as black-box bytes, rather than text. The minimal solution I see is to make os.getenv() and os.putenv() switch to byte modes when given byte arguments, as os.listdir() does. This use case doesn't require the ability to iterate over all environment variables, as os.environb would allow. I do wonder if controlling the environment given to a subprocess requires os.environb, but it may be too obscure to really matter.

IMHO, environment variables are not text. They are bytes by definition and should be treated as such. I know, there's windows having unicode-enabled env vars on demand, but there's only trouble with those over there in apache's httpd (when passing them to CGI scripts, oh well...).

Environment variables have textual names, are set via text, frequently contain textual file names or paths, and my shell (bash in gnome-terminal on ubuntu) lets me put unicode text in just fine. The underlying APIs may use bytes, but they're *intended* to be encoded text.

The example I've started using recently is this: text files on my system contain character data and I expect them to be read into a string type when I open them in python3. However, if a text file contains text that is not encoded in the system default encoding, I should still be able to get at the data and perform my own conversion. So I agree with the default of treating environment variables as text.
We just need to be able to treat them as bytes when these corner cases come up.

-Toshio
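Python 3's eventual answer to "text by default, bytes when needed" is the os.fsencode()/os.fsdecode() pair, which round-trips undecodable bytes through surrogate escapes — a sketch (POSIX behaviour assumed):

```python
import os

raw = b'/mnt/\x83v\x83\x8d'          # bytes that aren't valid UTF-8
as_text = os.fsdecode(raw)           # lone surrogates stand in for bad bytes
print(os.fsencode(as_text) == raw)   # → True: nothing was lost
```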
Re: [Python-Dev] Python-3.0, unicode, and os.environ
Terry Reedy wrote: Toshio Kuratomi wrote: I opened up bug http://bugs.python.org/issue4006 a while ago and it was suggested in the report that it's not a bug but a feature and so I should come here to see about getting the feature changed :-)

It does you no good (and will irritate others) to conflate 'design decision I do not agree with' with 'mistaken documentation or implementation of a design decision'. The former is opinion, the latter is usually fact (with occasional border cases). The latter is what core developers mean by 'bug'.

Noted. However, there's also a difference between "prevents us from doing useful things" and "allows doing a useful thing in a non-trivial manner". The latter I would call a difference in design decision; the former I would call a bug in the design.

[environment variables on *nix are bytes and need not all be in the same encoding]

To me, mixing encodings within a string is at least slightly insane. If by design, maybe even a 'design bug' ;-).

As an application-level developer I echo your sentiment :-) I recognize, though, that *nix filesystem semantics were designed many years before Unicode, and the decision to treat filenames, environment variables, and so much else as bytes follows naturally from the C definition of a char. It's up to a higher level than the OS to decide how to display the bytes.
[shell server and fileserver result in this insane PATH] PATH=/bin:/usr/bin:/usr/local/bin:/mnt/\xe3\x83\x8d\xe3\x83\x83\xe3\x83\x88\xe3\x83\xaf\xe3\x83\xbc\xe3\x82\xaf/\x83v\x83\x8d\x83O\x83\x89\x83\x80

I would think life would be ultimately easier if either the file server or the shell server automatically translated file names from jis and utf8 and back, so that the PATH on the *nix shell server is entirely utf8.

This is not possible, because no part of the computer knows what the encoding is. To the computer it's just a sequence of bytes. Unlike xml or the windows filesystem (winfs? ntfs?), where the encoding is specified as part of the document/filesystem, there's nothing to tell what encoding the filenames are in.

How would you ever display a mixture to users?

This is up to the application. My recommendation would be to keep the raw bytes (to access the file on the filesystem) and display the results of str(filename, errors='replace') to the user.

What if there were an ambiguous component that could be legally decoded more than one way?

The ambiguity is the reason that the fileserver and shell server can't automatically translate the filename (many encodings use all of the 2^8 byte values available in a C char, which makes a given byte decodable in any one of those encodings). In the application, using only the raw bytes to access the file also prevents ambiguity, because the raw bytes reference only one file.

[the user's python3 program iterates over os.environ['PATH'], raises a KeyError, and there's no way to handle the KeyError and actually get the PATH from within python]
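The keep-bytes-internally, replace-for-display approach can be sketched like this (Python 3 spelling; the bytes are the Shift-JIS プログラム example from earlier):

```python
raw_name = b'\x83v\x83\x8d\x83O\x83\x89\x83\x80'   # Shift-JIS bytes

# use raw_name itself for filesystem access; decode only for display,
# letting undecodable bytes become U+FFFD replacement markers
shown = raw_name.decode('utf-8', errors='replace')
print('\ufffd' in shown)   # → True
```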
Have you tried os.system or os.popen or the subprocess module to use and get a response from a native *nix command? On Windows

Sure, you can subprocess your way out of a lot of sticky situations, since you're essentially delegating the task to a C routine. But there are drawbacks:

* You become dependent on an external program being available. What happens if your code is run in a chroot, for instance?
* Do we want anyone writing programs that access the environment on *nix to have to discover this pattern themselves and implement it?

As for wrapping this up in os.*, that isn't necessary -- the python3 interpreter already knows about the byte-oriented environment; it just isn't making it available to people programming in python.

-Toshio
Re: [Python-Dev] Python-3.0, unicode, and os.environ
Adam Olsen wrote: On Thu, Dec 4, 2008 at 2:19 PM, Nick Coghlan [EMAIL PROTECTED] wrote: Toshio Kuratomi wrote: The bug report I opened suggests creating a PEP to address this issue. I think that's a good idea for whether os.listdir() and friends should be changed to raise an exception but not having any way to get at some environment variables seems like it's just a bug that needs to be addressed. What do other people think on both these issues? I'm pretty sure the discussion on this topic a while back decided that where necessary Python 3 would grow parallel bytes versions of APIs affected by environmental encoding issues (such as os.environb, os.listdirb, os.getcwdb), but that we were OK with the idea of deferring addition of those APIs until 3.1. It looks like most of them got into 3.0. http://docs.python.org/3.0/library/os.html says All functions accepting path or file names accept both bytes and string objects, and result in an object of the same type, if a path or file name is returned. nod I'm very glad this is coming along. Just want to make sure the environment is also handled in 3.1. That is, this was an acknowledged limitation with a fairly straightforward agreed solution, but it wasn't considered a common enough issue to delay the release of 3.0 until all of those parallel APIs had been implemented Aye. IMO it's fairly clear that os.getenv()/os.putenv() should follow suit in 3.1. I'm not so sure about adding os.environb (and making subprocess use it), unless the OP can demonstrate they really need it. Note: subprocess currently uses the real environment (the raw environment as given to the python interpreter) when it is started without the `env` parameter. So the question would be what people overriding the env parameter on their own need to do. To be non-surprising I'd think they'd want to have a way to override just a few variables from the raw environment. 
Otherwise you have to know which variables the program you're calling relies on and make sure that those are set or call os.getenvb() to retrieve the byte version and add it to your copy of os.environ before passing that to subprocess. One example of something that would be even harder to implement without access to the os.environb dictionary would be writing a program that wraps make. Since make takes all the variables from the environment and transforms them into make variables you need to pass everything from the environment that you are not modifying into the command. -Toshio signature.asc Description: OpenPGP digital signature ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
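The make-wrapper case above can be sketched as follows, assuming the proposed os.environb is available: copy the raw byte environment wholesale and override only the variables being changed, so undecodable variables still reach make untouched (make_environment is a hypothetical helper name):

```python
import os

def make_environment(overrides):
    # Start from the raw byte environment so every variable, decodable
    # or not, passes through to the child process unchanged.
    env = dict(os.environb)
    for key, value in overrides.items():
        env[key] = value  # bytes keys and bytes values
    return env
```

The result can then be handed to subprocess, e.g. subprocess.Popen(['make'], env=make_environment({b'CC': b'gcc'})).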
Re: [Python-Dev] Python-3.0, unicode, and os.environ
Terry Reedy wrote: Toshio Kuratomi wrote: I would think life would be ultimately easier if either the file server or the shell server automatically translated file names from jis and utf8 and back, so that the PATH on the *nix shell server is entirely utf8. This is not possible because no part of the computer knows what the encoding is. To the computer, it's just a sequence of bytes. Unlike xml or the windows filesystem (winfs? ntfs?) where the encoding is specified as part of the document/filesystem there's nothing to tell what encoding the filenames are in. I thought you said that the file server keeps all filenames in shift-jis, and the shell server all in utf-8. Yes. But this is part of the setup of the example to keep things simple. The fileserver or shell server could themselves be of mixed encodings (for instance, if it were serving home directories to users all over the world, each user might be using a different encoding.) If so, then the shell server could know if it were told so. Where are you going to store that information? In order for python to run without errors, will it have to be configured on each system it's installed on to know the encoding of each filename? Or are we going to try to talk each *NIX vendor into creating new filesystems that record that information, and after a five-year span of time declare that python will not run on other filesystems in corner cases? I think that this way does not hold a reasonable expectation of keeping python a portable language. -Toshio signature.asc Description: OpenPGP digital signature ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Python-3.0, unicode, and os.environ
Victor Stinner wrote: Hi, On Thursday 04 December 2008 21:02:19, Toshio Kuratomi wrote: These mixed encodings can occur for a variety of reasons. Here's an example that isn't too contrived :-) (...) Furthermore, they don't want to suffer from the space loss of using utf-8 to encode Japanese so they use shift-jis everywhere. space loss? Really? If you configure your server correctly, you should get UTF-8 even if the file system is Shift-JIS. But it would be much easier to use UTF-8 everywhere. Hum... I don't think that the discussion is about one specific server, but the lack of bytes environment variables in Python3 :-) Yep. I can't change the logicalness of the policies of a different organization, only code my application to deal with it :-) 1) return mixed unicode and byte types in ... NO! It's nice that we agree... but I would prefer if you leave enough context so that others can see that we agree as well :-) 2) return only byte types in os.environ Hum... Most users have UTF-8 everywhere (eg. all Windows users ;-)), and Python3 already uses Unicode everywhere (input(), open(), filenames, ...). We're also in agreement here. 3) silently ignore non-decodable values when accessing os.environ['PATH'] as we do now, but allow access to the full information via os.environ[b'PATH'] and os.getenvb() I don't like os.environ[b'PATH']. I prefer to always get the same result type... But os.listdir() doesn't respect that :-( os.listdir(str) -> list of str; os.listdir(bytes) -> list of bytes. I would prefer a similar API for easier migration from Python2/Python3 (unicode). os.environb sounds like the best choice for me. nod. After thinking about how it would be used in subprocess calls I agree. os.environb would allow us to retrieve the full dict as bytes. os.environ[b''] only works on individual keys. Also os.getenv serves the same purpose as os.environ[b''] would, whereas os.environb would have its own uses.
But they are open questions (already asked in the bug tracker): I answered these in the bug tracker. Here are the answers for the mailing list: (a) Should os.environ be updated if os.environb is changed? If yes, how? os.environb[b'PATH'] = b'\xff' (or any value that is invalid in the system default encoding) => os.environ['PATH'] = ??? The underlying environment that both variables reflect should be updated, but what is displayed by os.environ should continue to follow the same rules. So if we follow option #3::

    os.environb[b'PATH'] = b'\xff'
    os.environ['PATH']  # raises KeyError because PATH is not a key in
                        # the unicode-decoded environment

(option #4 would issue a UnicodeDecodeError instead of a KeyError). Similarly, if you start with a variable in os.environb that can only be represented as bytes and your program transforms it into something that is decodable, it should then show up in os.environ. (b) Should os.environb be updated if os.environ is changed? If yes, how? The problem comes with non-Unicode locales (eg. latin-1 or ASCII): most charsets are unable to encode the whole Unicode repertoire (eg. latin-1 cannot encode code points above 255). os.environ['PATH'] = chr(0x100) => os.environb[b'PATH'] = ??? Ah, this is a good question. I misunderstood what you were getting at when you posted this to the bug report. I see several options, but the one that seems the most sane is to raise UnicodeEncodeError when setting the value. With that, proper code to set an environment variable might look like this::

    $ LANG=C python3.0
    >>> variable = chr(0x100)
    >>> try:
    ...     # Unicode-aware locales
    ...     os.environ['MYVAR'] = variable
    ... except UnicodeEncodeError:
    ...     # Non-Unicode locales
    ...     os.environb[b'MYVAR'] = bytes(variable, encoding='utf8')

(c) Same question when a key is deleted (del os.environ['PATH']). Update the underlying env so both os.environ and os.environb reflect the change. Deleting should not hold the problems that updating does.
If Python 3.1 will have os.environ and os.environb, I'm quite sure that some modules will use os.environ and others will prefer os.environb. If the two environments are different, the two sets of modules will work differently :-/ Exactly. So making sure they hold the same information is a priority. It would be maybe easier if os.environ supports bytes and unicode keys. But we have to keep these assertions: os.environ[bytes] -> bytes; os.environ[str] -> str. I think the same choices have to be made here. If LANG=C, we still have to decide what to do when os.environ[str] is set to a non-ASCII string. Additionally, the subprocess question makes keying off the type of the key undesirable compared with having a separate os.environb that accesses the same underlying data. 4) raise an exception when non-decodable values are *accessed* and continue as in #3. I like os.listdir() behaviour: just *ignore* non-decodable files. If you really want to access
Re: [Python-Dev] Python-3.0, unicode, and os.environ
Guido van Rossum wrote: On Fri, Dec 5, 2008 at 2:27 AM, Ulrich Eckhardt [EMAIL PROTECTED] wrote: In 99% of all cases, using the default encoding will work and do what people expect, which is why I would make this conversion automatic. In all other cases, it will at least not fail silently (which would lead to garbage and data loss) and allow more sophisticated applications to handle it. I think the always fail noisily approach isn't the best approach. E.g. if I am globbing for *.py, and there's an undecodable .txt file in a directory, its presence shouldn't cause the glob to fail. But why should it make glob() fail? This sounds like an implementation detail of glob. Here's some pseudo-code::

    def glob(pattern):
        string = False
        if isinstance(pattern, str):
            string = True
            if platform == 'POSIX':
                pattern = bytes(pattern, encoding=defaultencoding)
        rawfiles = os.listdir(os.path.dirname(pattern) or pattern)
        if string and platform == 'POSIX':
            return [str(f) for f in rawfiles if match(f, pattern)]
        else:
            return rawfiles

This way the traceback occurs if anything in the result set is undecodable. What am I missing? -Toshio signature.asc Description: OpenPGP digital signature ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
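A runnable rendering of that pseudo-code (the name glob_pattern and the use of fnmatch and os.fsencode are my own fill-ins for the undefined match() and defaultencoding); the decode at the end is exactly where the traceback would surface for an undecodable match:

```python
import fnmatch
import os

def glob_pattern(pattern):
    # Match against raw bytes; decode the results only for str callers,
    # so an undecodable *match* raises instead of silently vanishing,
    # while undecodable non-matches are never touched at all.
    as_str = isinstance(pattern, str)
    if as_str:
        pattern = os.fsencode(pattern)
    directory = os.path.dirname(pattern) or b'.'
    raw = [name for name in os.listdir(directory)
           if fnmatch.fnmatch(name, os.path.basename(pattern))]
    if as_str:
        return [name.decode() for name in raw]  # may raise UnicodeDecodeError
    return raw
```

Like the sketch, this returns bare names rather than full paths.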
Re: [Python-Dev] Python-3.0, unicode, and os.environ
Guido van Rossum wrote: Glob was just an example. Many use cases for directory traversal couldn't care less if they see *all* files. Okay. Makes it harder to prove correct or not if I don't know what the use case is :-) I can't think of a single use case off-hand. Even your example of a ??.txt file making retrieval of *.py files fail is a little broken. If there was a ??.py file that was undecodable the program would most likely want to know that file existed. -Toshio signature.asc Description: OpenPGP digital signature ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Python-3.0, unicode, and os.environ
Guido van Rossum wrote: At the risk of bringing up something that was already rejected, let me propose something that follows the path taken in 3.0 for filenames, rather than doubling back: For os.environ, os.getenv() and os.putenv(), I think a similar approach as used for os.listdir() and os.getcwd() makes sense: let os.environ skip variables whose name or value is undecodable, and have a separate os.environb() which contains bytes; let os.getenv() and os.putenv() do the right thing when the arguments passed in are bytes. I prefer the method used by file.read() where an error is thrown when accessing undecodable data. I think in time python programmers will consider not throwing an exception a wart in python3. However, this is enough to allow programmers to do the right thing once an error is reported by users and the cause has been tracked down so it doesn't block fixing errors as the current code does. And it's not like anyone expected python3 to be wart-free just because the python2 warts were fixed ;-) For sys.argv, because it's positional, you can't skip undecodable values, so I propose to use error=replace for the decoding; again, we can add sys.argvb that contains the raw bytes values. The various os.exec*() and os.spawn*() calls (as well as os.system(), os.popen() and the subprocess module) should all accept bytes as well as strings. This also seems sane with the same comment about throwing errors. -Toshio signature.asc Description: OpenPGP digital signature ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Python-3.0, unicode, and os.environ
Victor Stinner wrote: It would be maybe easier if os.environ supports bytes and unicode keys. But we have to keep these assertions: os.environ[bytes] - bytes os.environ[str] - str I think the same choices have to be made here. If LANG=C, we still have to decide what to do when os.environ[str] is set to a non-ASCii string. If the charset is US-ASCII, os.environ will drop non-ASCII values. But most variables are ASCII only. Examples with my shell: Yes. But you still have the question of what to do when: os.environ[str] = chr(0x1) So I don't think it makes things simpler than having separate os.environ and os.environb that update the same data behind the scenes. Additionally, the subprocess question makes using the key value undesirable compared with having a separate os.environb that accesses the same underlying data. The user should be able to choose bytes or unicode. Examples: the subprocess question was posed further up the thread as basically -- does the user need to access os.environb in order to override things in the environment when calling subprocess? I think the answer to that is yes since you might want to start with your environment and modify it slightly when you call programs via subprocess. If you just try to copy os.environ and os.environ only iterates through the decodable env vars, that doesn't work. If you have an os.environb to copy it becomes possible. - subprocess.Popen('ls') = use unicode environment (os.environ) - subprocess.Popen(b'ls') = use bytes environment (os.environb) That's... not expected to me :-( If I never touch os.environ and invoke subprocess the normal way, I'd still expect the whole environment to be passed on to the program being called. This is how invoking programs manually, shell scripting, invoking programs from perl, python2, etc work. Also, it's not really a good fit with the other things that key off of the initial argument. os.listdir(b'.') changes the output to bytes. 
subprocess.Popen(b'ls') would change what environment gets input into the call. Here's my problem with it, though. With these semantics any program that works on arbitrary files and runs on *NIX has to check os.listdir(b'') and do the conversion manually. Only programs that have to support strange environment like yours (mixing Shift-JIS and UTF-8) :-) Most programs don't have to support these charset mixture. Any program that is intended to be distributed, accesses arbitrary files, and works on *nix platforms needs to take this into account. Just because the environment inside of my organization is sane doesn't mean that when we release the code to customers, clients, or the free software community that the places it runs will be as strict about these things. Are most programs specific to one organization or are they distributed to other people? I can't answer that... everything I work on (except passwords:-) is distributed -- from sys admin cronjobs to web applications since I'm lucky that my whole job is devoted to working on free software. -Toshio signature.asc Description: OpenPGP digital signature ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
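What "override just a few variables" might look like, assuming os.environb and a POSIX subprocess that accepts a bytes environment (the LC_ALL override is only an example):

```python
import os
import subprocess

# Copy the raw byte environment so undecodable variables survive,
# then override a single entry before launching the child process.
child_env = dict(os.environb)
child_env[b'LC_ALL'] = b'C'

# The child sees the override plus everything else unchanged.
result = subprocess.run(['sh', '-c', 'printf %s "$LC_ALL"'],
                        env=child_env, capture_output=True)
```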
Re: [Python-Dev] Python-3.0, unicode, and os.environ
Nick Coghlan wrote: Toshio Kuratomi wrote: Are most programs specific to one organization or are they distributed to other people? The former. That's pretty well documented in assorted IT literature ('shrink-wrap' and open source commodity software are still relatively new players on the scene that started to shift the balance the other way, but now the server side elements of web services are shifting it back again). Cool. So it's only people writing code to be shared with the larger community or written for multiple customers that are affected by bugs like this. :-/ -Toshio signature.asc Description: OpenPGP digital signature ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Python-3.0, unicode, and os.environ
Nick Coghlan wrote: Toshio Kuratomi wrote: Guido van Rossum wrote: Glob was just an example. Many use cases for directory traversal couldn't care less if they see *all* files. Okay. Makes it harder to prove correct or not if I don't know what the use case is :-) I can't think of a single use case off-hand. Even your example of a ??.txt file making retrieval of *.py files fail is a little broken. If there was a ??.py file that was undecodable the program would most likely want to know that file existed. Why? Most programs won't be able to do anything with it. And if the program *can* do something with it... that's what the bytes version of the APIs are for. Nonsense. A program can do tons of things with a non-decodable filename. Where it's limited is non-decodable filedata. For instance, if you have a graphical text editor, you need to let the user select files to load. To do that you need to list all the files in a directory, even the ones that aren't decodable. The ones that aren't decodable need to substitute something like: str(filename, errors='replace') + ' (Filename not encoded in UTF8)' in the file listing that the user sees. When the file is loaded, it needs to access the actual raw filename. The file can then be loaded, operated upon, and even saved back to disk using the raw, undecodable filename. If you have a file manager, you need to code something that lets the user move the file around. Once again, the program loads the raw filenames. It transforms each name into something representable to the user. It displays that. The user selects a file and asks that it be moved to another location. Then the program uses the raw filename to move it from one location to another. If you have a backup program, you need to list all the files in a directory. Then you need to copy those files to another location. Once again you have to retrieve the byte version of any non-decodable filenames.
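The listing convention described above reduces to a small helper; the raw bytes remain the handle used for actual file access, and only this lossy string is ever shown (the helper name and label text are illustrative):

```python
def display_name(raw: bytes, encoding: str = 'utf-8') -> str:
    # Lossy, display-only rendering of a filename; never feed the
    # result back to the filesystem -- keep the raw bytes for that.
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return (raw.decode(encoding, errors='replace')
                + ' (Filename not encoded in UTF8)')
```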
-Toshio signature.asc Description: OpenPGP digital signature ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Python-3.0, unicode, and os.environ
Nick Coghlan wrote: Toshio Kuratomi wrote: Nonsense. A program can do tons of things with a non-decodable filename. Where it's limited is non-decodable filedata. You can't display a non-decodable filename to the user, hence the user will have no idea what they're working on. Non-filesystem related apps have no business trying to deal with insane filenames. This is where we disagree. There are many ways to display the non-decodable filename to the user because the user is not a machine. The computer must know the unique sequence of bytes in order to access a file. The user, OTOH, usually only needs to know that the file exists. In most GUI-based end-user oriented desktop apps, it's enough to do str(filename, errors='replace'). For instance, the GNOME file manager displays: ? (Invalid encoding) and Konqueror, the KDE file manager, just displays: ? The file can still be displayed this way, accessed via the raw bytes that the program keeps internally, and operated upon by applications. For applications in which the user needs more information to differentiate the files, the program has the option to display the raw byte sequences as if they were the filename. The *NIX shell and command line tools have this ability::

    $ LANG=en_US.utf8 ls -b
    á  í
    $ LANG=C ls -b
    .  ..  \303\241  \303\255
    $ mv $'\303\241' $'\303\263'
    $ LANG=C ls -b
    \303\255  \303\263
    $ LANG=en_US.utf8 ls -b
    í  ó

Linux is moving towards a standard of UTF-8 for filenames, and once we get to the point where the idea of encoding filenames and environment variables any other way is seen as crazy, then the Python 3 approach will work seamlessly. nod With the caveat that I haven't seen movement by Linux and other Unix variants to enforce UTF-8. What I have seen are statements by kernel programmers that having the filesystem use bytes and not know about encoding is the correct thing to do.
This means that utf-8 will be a convention rather than a necessity for a very long time and consequently programs will need to worry about the problems of mixed encoding systems for an equally long time. (Remember, encoding is something that can be changed per user and per file. So on a multiuser OS, mixed encodings can be out of the control of the system administrator for perfectly valid reasons.) In the meantime, raw bytes APIs will provide an alternative for those that disagree with that philosophy. Oh I agree with the UTF-8 everywhere philosophy. I just know that there's tons of real-world systems out there that don't conform to my expectations for sanity and my code has to account for those :-) -Toshio signature.asc Description: OpenPGP digital signature ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Python-3.0, unicode, and os.environ
Bugbee, Larry wrote: There has been some discussion here that users should use the str or byte function variant based on what is relevant to their system, for example when getting a list of file names or opening a file. That thought process really doesn't do much for those of us that write code that needs to run on any platform type, without alteration or the addition of complex if-statements and/or exceptions. Whatever the resolution here, and those of you addressing this thorny issue have my admiration, the solution should be such that it gives consistent behavior regardless of platform type and doesn't require the programmer to know of all the minute details of each possible target platform. I've been thinking about this and I can only see one option. I don't think that it really makes less work for the programmer, though -- it just shifts the problem and makes it more apparent what your code is doing. To avoid exceptions and if-then's in program code when accessing filenames, environment variables, etc, you would need to access each of these resources via the byte API. Then, to avoid having to keep track of what's a string and what's a byte in your other code, you probably want to convert those bytes to strings. This is where the burden gets shifted. You'll have your own routine(s) to do the conversion and have to have exception handling code to deal with undecodable filenames. Note 1: your particular app might be able to get away without doing the conversion from bytes to string -- it depends on what you're planning on doing with the filename/environment data. Note 2: If there isn't a parallel API on all platforms, for instance, Guido's proposal to not have os.environb on Windows, then you'll still have to have a platform specific check. (Likely you should try to access os.evironb in this instance and if it doesn't exist, use os.environ instead... 
and remember that you need to either change os.environ's data into str type or change os.environb's data into byte type.) -Toshio signature.asc Description: OpenPGP digital signature ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
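The platform check in Note 2 can at least be confined to one place. A sketch with a hypothetical helper name, assuming os.environb exists on POSIX but not on Windows:

```python
import os

def getenv_bytes(name: bytes, default=None):
    # POSIX: read the byte environment directly.
    environb = getattr(os, 'environb', None)
    if environb is not None:
        return environb.get(name, default)
    # Elsewhere the environment is already str; encode a copy.
    # (The utf-8 choice here is an assumption, not a standard.)
    value = os.environ.get(name.decode('ascii'))
    return default if value is None else value.encode('utf-8')
```

Callers then deal only in bytes and never see the platform difference.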
Re: [Python-Dev] Python-3.0, unicode, and os.environ
[EMAIL PROTECTED] wrote: On 06:07 am, [EMAIL PROTECTED] wrote: Most apps aren't file managers or ftp clients but when they interact with files (for instance, a file selection dialog) they need to be able to show the user all the relevant files. So on an app-by-app basis the need for this is high. While I tend to agree emphatically with this, the *real* solution here is a path-abstraction library. Why don't you send me some information offlist. I'm not sure I agree that a path-abstraction library can work correctly but if it can it would be nice to have that at a level higher than the file-dialog libraries that I was envisioning. [snip] ... but that still doesn't help me identify when someone would expect that asking python for a list of all files in a directory or a specific set of files in a directory should, without warning, return only a subset of them. In what situations is this appropriate behaviour? If you say listdir(unicode) on a POSIX OS, your program is saying I only know how to deal with unicode results from this function, so please only give me those.. No. (explained below) If your program is smart enough to deal with bytes, then you would have asked for bytes, no? Yes (explained below) Returning only filenames which can be properly decoded makes sense. Otherwise everyone needs to learn about this highly confusing issue, even for the simplest scripts. os.listdir(unicode) (currently) means that the *programmer* is asking that the stdlib return the decodable filenames from this directory. The question is whether the programmer understood that this is what they were asking for and whether it is what they most likely want. I would make the following statements WRT to this: 1) The programmer most likely does not want decodable filenames and only decodable filename. If they were, we'd see a lot of python2.x code that turns pathnames into unicode and discards everything that wasn't decodable. 
No one has given a use case for finding only the *decodable* subset of files. If I request to see all *.py files in a directory, I want to see all of the *.py files in the directory, decodable or not. If you can show how programmers intend 90% of their calls to os.listdir()/glob.glob('*.txt') to show only the decodable subset of the results, then the foundation of my arguments is gone. So please, give examples to prove this wrong. - If this is true, a definition of os.listdir(type 'str') that would better meet programmer expectation would be: Give me all files in a directory with the output as str type. The definition of os.listdir(type 'bytes') would be Give me all files in a directory with the output as bytes type. Raising an exception when the filenames are undecodable is perfectly reasonable in this situation. 2) For the programmer to understand the difference between os.listdir(type 'bytes') and os.listdir(type 'str') they have to understand the highly confusing issue and what it means for their code. So the current method is forcing programmers to understand it even for the simplest scripts if their environment is not uniform with no clue from the interpreter that there is an issue. - Similarly, raising an exception on undecodable values means that the programmer can ignore the issue in any scripts in sane environments and will be told that they need to deal with it (via an exception) when their script runs in a non-sane environment. 3) The usage of unicode vs bytes is easy to miss for someone starting with py2.x or windows and moving to a multi-platform or unix project. Even simple testing won't reveal the problem unless the programmer knows that they have to test what happens when encodings are mixed. Once again, this is requiring the programmer to understand the encoding issue without help from the interpreter. Skipping undecodable values is good enough that it will work 90% of the time. 
You and Guido have now made this claim to defend not raising an exception but I still don't have a use case. Here are use cases that I see: * Bill is coding an application for use inside his company. His company only uses utf-8. His code naively uses os.listdir(type 'str'). - The code does not throw an exception whether we use the current os.listdir() or one that could throw an exception because the system admins have sanitised the environment. Bill did not need to understand the implications of encoding for his code to work in this script whether simple or complex. * Mary is coding an application for use inside her company. It finds all html files on a system and updates her company's copyright, privacy policy, and other legal boilerplate. Her expectation is that after her program runs every file will have been updated. Her environment is a mixture of different filename encodings due to having many legacy documents for users in different locales. Mary's code also naively uses os.listdir(type 'str'). Her test case checks that the code does
Re: [Python-Dev] Python-3.0, unicode, and os.environ
Guido van Rossum wrote: On Mon, Dec 8, 2008 at 12:07 PM, [EMAIL PROTECTED] wrote: On Mon, 8 Dec 2008 at 11:25, Guido van Rossum wrote: But I'm happy with just issuing a warning by default. That would mean it doesn't fail silently, but neither does it crash. Seems like the best compromise with the broken nature of the real-world IT environment. OK, I can live with that too. Same here. This lets the application specify globally what should happen (exception, warning, or ignore, via the warnings filters) and should give enough context that it doesn't become a mysterious error in the program. The per-method addition of an errors argument, so that this is overridable locally as well as globally, is also a nice touch but can be done separately from this step. -Toshio signature.asc Description: OpenPGP digital signature ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Python-3.0, unicode, and os.environ
James Y Knight wrote: On Dec 9, 2008, at 6:04 AM, Anders J. Munch wrote: The typical application will just obliviously use os.listdir(dir) and get the default elide-and-warn behaviour for un-decodable names. That rare special application I guess this is a new definition of rare special application: an application which deals with user-specified files. This is the problem I see in having two parallel APIs: people keep saying most applications can just go ahead and use the [broken] unicode string API. If there were a unicode API and a bytes API, but everyone was clear that always use the bytes API is the right thing to do, that'd be okay... But, since even python-dev members are saying that only a rare special app needs to care about working with users' existing files, I'm rather worried this API design will cause most programs written in python to be broken. Which seems a shame. I agree with you, which is part of why I raised this subject, but I also think that using the warnings module to issue a warning and ignore the entire problematic entry is a reasonable compromise. Hopefully it will become obvious to people that it's a python3 wart at some point in the future and we'll re-examine the default. But until then, having a printed warning that individual apps can turn into an exception seems less broken than the other alternatives; it's something the rare special application people can live with :-) -Toshio signature.asc Description: OpenPGP digital signature ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Python-3.0, unicode, and os.environ
Adam Olsen wrote: On Thu, Dec 11, 2008 at 6:55 PM, Stephen J. Turnbull step...@xemacs.org wrote: Unfortunately, even programmers experienced in I18N like Martin, and those with intuition-that-has-the-force-of-lawwink like Guido, express deliberate disbelief on this point. They say that filesystem names and environment variable values are text, which is true from the semantic viewpoint but can't be fully supported by any implementation. With all the focus on backup tools and file managers I think we've lost perspective. They're an important use case, but hardly the dominant one. Please, as a user, if your app is creating new files, do NOT use bytes! You have no excuse for creating garbage, and garbage doesn't help the user any. Getting the encoding right, use the unicode APIs, and don't pass the buck on to everything else. Uhmmm That's good advice but doesn't solve any problems :-(. No matter what I create, the filenames will be bytes when the next person reads them in. If my locale is shift-js and the person I'm sharing the file with uses utf-8 things won't work. Even if my locale is utf-8 (since I come from a European nation) and their locale is utf-16 (because they're from an Asian nation) the Unicode API won't work. -Toshio signature.asc Description: OpenPGP digital signature ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
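A quick illustration of the locale mismatch described above (my own toy example; "shift-jis" stands in for the sender's locale and "utf-8" for the reader's):

```python
# A filename created under a Shift JIS locale is just bytes on disk;
# a program assuming UTF-8 cannot decode it.
name_on_disk = "ハロー".encode("shift-jis")

try:
    name_on_disk.decode("utf-8")
    decodable = True
except UnicodeDecodeError:
    decodable = False

print("decodable as utf-8:", decodable)
```

No amount of unicode-only API design on the reader's side fixes this; the bytes simply aren't in the reader's encoding.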
Re: [Python-Dev] Python-3.0, unicode, and os.environ
Adam Olsen wrote: A half-broken setup is still a broken setup. Eventually you have to tell people to stop screwing around and pick one encoding. But it's not a broken setup. It's the way the world is because people share things with each other. I doubt that UTF-16 is used very much (other than on windows). I haven't found any statistics on what distros use, but did find this one of the web itself: http://googleblog.blogspot.com/2008/05/moving-to-unicode-51.html UTF-16 is popular in Asian locales for the same reason that shift-js and big-5 are hanging in there. utf-8 takes many more bytes to encode Asian Unicode characters than utf-16. -Toshio
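The size claim is easy to check (my own quick measurement; the katakana sample text is arbitrary):

```python
# Japanese text takes 3 bytes per character in UTF-8 but only 2 in
# UTF-16 (BOM excluded by using the -le variant).
text = "ハローワールド"  # "hello world" in katakana, 7 characters

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")

print(len(utf8), len(utf16))  # 21 vs 14
```

For ASCII-heavy text the comparison flips, which is why neither side of the argument wins universally.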
Re: [Python-Dev] Python-3.0, unicode, and os.environ
Adam Olsen wrote: As a data point, firefox (when pointed at my home dir) DOES skip over garbage files. That's not true. However, it looks like Firefox is actually broken. Take a look at this screenshot: firefox.png That shows a directory with a folder that's not decodable in my utf-8 locale. What's interesting to note is that I actually have two nondecodable folders there but only one of them showed up. So firefox is inconsistent with its treatment, rendering some non-decodable files and ignoring others. Also interesting, if you point your browser at: http://toshio.fedorapeople.org/u/ You should see two other test files. They're both (one-half)(enyei).html but one's encoded in utf-8 and the other in latin-1. Firefox has some bugs in it related to this. For instance, if you mouseover the two links you'll see that firefox displays the same symbolic names for each of the files (even though they're in two different encodings). Sometimes firefox is able to load both files and sometimes it only loads one of them. Firefox seems to be translating the characters from ASCII percent encoding of bytes into their unicode symbols and back to utf-8 in some circumstances related to whether it has the pages in its cache or not. In this case, it should be leaving things as percent encoded bytes as it's the only way that apache is going to know what to retrieve. -Toshio signature.asc Description: OpenPGP digital signature ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Python-3.0, unicode, and os.environ
Adam Olsen wrote: UTF-8 in percent encodings is becoming a defacto standard. Otherwise the browser has to display the percent escapes in the address bar, rather than the intended text. IOW, inconsistent behaviour is a bug, but translating into UTF-8 is not. ;) I think we should let this tangent drop because it's about bugs in firefox, not in python :-) -Toshio
Re: [Python-Dev] #Python3 ! ? (was Python Library Support in 3.x)
On Mon, Jun 21, 2010 at 09:57:30AM -0400, Barry Warsaw wrote: On Jun 21, 2010, at 09:37 AM, Arc Riley wrote: Also, under where it mentions that most OS's do not include Python 3, it should be noted which have good support for it. Gentoo (for example) has excellent support for Python 3, automatically installing Python packages which have Py3 support for both Py2 and Py3, and the python-based Portage package system runs cleanly on Py2.6, Py3.1 and Py3.2. We're trying to get there for Ubuntu (driven also by Debian). We have Python 3.1.2 in main for Lucid, though we will probably not get 3.2 into Maverick (the October 2010 release). We're currently concentrating on Python 2.7 as a supported version because it'll be released by then, while 3.2 will still be in beta. If you want to help, or have complaints, kudos, suggestions, etc. for Python support on Ubuntu, you can contact me off-list. nod Fedora 14 is about the same. A nice to have thing that goes along with these would be a table that has packages ported to python3 and which distributions have the python3 version of the package. Once most of the important third party packages are ported to python3 and in the distributions, this table will likely become out-dated and probably should be reaped but right now it's a very useful thing to see. -Toshio pgp4ovCkaMeKl.pgp Description: PGP signature ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] email package status in 3.X
On Mon, Jun 21, 2010 at 11:43:07AM -0400, Barry Warsaw wrote: On Jun 21, 2010, at 10:20 PM, Nick Coghlan wrote: Something that may make sense to ease the porting process is for some of these on the boundary I/O related string manipulation functions (such as os.path.join) to grow encoding keyword-only arguments. The recommended approach would be to provide all strings, but bytes could also be accepted if an encoding was specified. (If you want to mix encodings - tough, do the decoding yourself). This is probably a stupid idea, and if so I'll plead Monday morning mindfuzz for it. Would it make sense to have encoding-carrying bytes and str types? Basically, I'm thinking of types (maybe even the current ones) that carry around a .encoding attribute so that they can be automatically encoded and decoded where necessary. This at least would simplify APIs that need to do the conversion. By default, the .encoding attribute would be some marker to indicate "I have no idea, do it explicitly", and if you combine ebytes or estrs that have incompatible encodings, you'd either throw an exception or reset the .encoding to "IAmConfuzzled". But say you had an email header like:

    =?euc-jp?b?pc+l7aG8pe+hvKXrpcmhqg==?=

And code like the following (made less crappy):

-snip snip-
class ebytes(bytes):
    encoding = 'ascii'

    def __str__(self):
        s = estr(self.decode(self.encoding))
        s.encoding = self.encoding
        return s

class estr(str):
    encoding = 'ascii'

s = str(b'\xa5\xcf\xa5\xed\xa1\xbc\xa5\xef\xa1\xbc\xa5\xeb\xa5\xc9\xa1\xaa', 'euc-jp')
b = bytes(s, 'euc-jp')
eb = ebytes(b)
eb.encoding = 'euc-jp'
es = str(eb)
print(repr(eb), es, es.encoding)
-snip snip-

Running this you get:

    b'\xa5\xcf\xa5\xed\xa1\xbc\xa5\xef\xa1\xbc\xa5\xeb\xa5\xc9\xa1\xaa' ハローワールド! euc-jp

Would it be feasible? Dunno. Would it help ease the bytes/str confusion? Dunno. But I think it would help make APIs easier to design and use because it would cut down on the encoding-keyword function signature infection.
I like the idea of having encoding information carried with the data. I don't think that an ebytes type that can *optionally* have an encoding attribute makes the situation less confusing, though. To me the biggest problem with python-2.x's unicode/bytes handling was not that it threw exceptions but that it didn't always throw exceptions. You might test this in python2::

    t = u'cafe'
    function(t)

And say, ah, my code works. Then a user gives it this::

    t = u'café'
    function(t)

And get a unicode error because the function only works with unicode in the ascii range. ebytes seems to have the same pitfall where the code path exercised by your tests could work with::

    eb = ebytes(b)
    eb.encoding = 'euc-jp'
    function(eb)

but the user exercises a code path that does this and fails::

    eb = ebytes(b)
    function(eb)

What do you think of making the encoding attribute a mandatory part of creating an ebyte object? (ex: ``eb = ebytes(b, 'euc-jp')``).

-Toshio
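A minimal sketch of what that mandatory-encoding constructor might look like (my own illustration, not anything proposed verbatim here; the validation strategy of decoding once at construction is an assumption):

```python
class ebytes(bytes):
    """Bytes that carry a mandatory, validated encoding."""

    def __new__(cls, data, encoding):
        # The decode here is the validation step: data that is not legal
        # in the declared encoding fails at creation time, not later in
        # some rarely exercised code path.
        data.decode(encoding)
        self = super().__new__(cls, data)
        self.encoding = encoding
        return self

    def __str__(self):
        return self.decode(self.encoding)

eb = ebytes("ハローワールド".encode("euc-jp"), "euc-jp")
print(str(eb))

try:
    ebytes(b"\xff\xfe", "ascii")
except UnicodeDecodeError:
    print("rejected at construction time")
```

The point of the sketch is where the traceback lands: at the single place the ebytes is built, rather than wherever it happens to be combined with a str.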
Re: [Python-Dev] bytes / unicode
On Tue, Jun 22, 2010 at 01:08:53AM +0900, Stephen J. Turnbull wrote: Lennart Regebro writes: 2010/6/21 Stephen J. Turnbull step...@xemacs.org: IMO, the UI is right. Something like the above ought to work. Right. That said, many times when you want to do urlparse etc they might be binary, and you might want binary. So maybe the methods should work with both? First, a caveat: I'm a Unicode/encodings person, not an experienced web programmer. My opinions on whether this would work well in practice should be taken with a grain of salt. Speaking for myself, I live in a country where the natives have saddled themselves with no less than 4 encodings in common use, and I would never want binary since none of them would display as anything useful in a traceback. Wherever possible, I decode blobs into structured objects, I do it as soon as possible, and if for efficiency reasons I want to do this lazily, I store the blob in a separate .raw_object attribute. If they're textual, I decode them to text. I can't see an efficiency argument for decoding URIs lazily in most applications. In the case of structured text like URIs, I would create a separate class for handling them with string-like operations. Internally, all text would be raw Unicode (ie, not url-encoded); repr(uri) would use some kind of readable quoting convention (not url-encoding) to disambiguate random reserved characters from separators, while str(uri) would produce an url-encoded string. Converting to and from wire format is just .encode and .decode, then, and in this country you need to be flexible about which encoding you use. Agreed, this stuff is really annoying. But I think that just comes with the territory. PJE reports that folks don't like doing encoding and decoding all over the place. I understand that, but if they're doing a lot of that, I have to wonder why. Why not define the one line function and get on with life? The thing is, where I live, it's not going to be a one line function. 
I'm going to be dealing with URLs that are url-encoded representations of UTF-8, Shift-JIS, EUC-JP, and occasionally RFC 2047! So I need an API that explicitly encodes and decodes. And I need an API that presents Japanese as Japanese rather than as line noise. Eg, PJE writes Ugh. I meant: newurl = urljoin(str(base, 'latin-1'), 'subdir').encode('latin-1') Which just goes to the point of how ridiculous it is to have to convert things to strings and back again to use APIs that ought to just handle bytes properly in the first place. But if you need that everywhere, what's so hard about def urljoin_wrapper (base, subdir): return urljoin(str(base, 'latin-1'), subdir).encode('latin-1') Now, note how that pattern fails as soon as you want to use non-ISO-8859-1 languages for subdir names. In Python 3, the code above is just plain buggy, IMHO. The original author probably will never need the generalization. But her name will be cursed unto the nth generation by people who use her code on a different continent. The net result is that bytes are *not* a programmer- or user-friendly way to do this, except for the minority of the world for whom Latin-1 is a good approximation to their daily-use unibyte encoding (eg, it's probably usable for debugging in Dansk, but you won't win any popularity contests in Tel Aviv or Shanghai). One comment here -- you can also have uri's that aren't decodable into their true textual meaning using a single encoding. Apache will happily serve out uris that have utf-8, shift-jis, and euc-jp components inside of their path but the textual representation that was intended will be garbled (or be represented by escaped byte sequences). For that matter, apache will serve requests that have no true textual representation as it is working on the byte level rather than the character level. So a complete solution really should allow the programmer to pass in uris as bytes when the programmer knows that they need it. 
-Toshio
Re: [Python-Dev] email package status in 3.X
On Mon, Jun 21, 2010 at 01:24:10PM -0400, P.J. Eby wrote: At 12:34 PM 6/21/2010 -0400, Toshio Kuratomi wrote: What do you think of making the encoding attribute a mandatory part of creating an ebyte object? (ex: ``eb = ebytes(b, 'euc-jp')``). As long as the coercion rules force str+ebytes (or str % ebytes, ebytes % str, etc.) to result in another ebytes (and fail if the str can't be encoded in the ebytes' encoding), I'm personally fine with it, although I really like the idea of tacking the encoding to bytes objects in the first place. I wouldn't like this. It brings us back to the python2 problem where sometimes you pass an ebyte into a function and it works and other times you pass an ebyte into the function and it issues a traceback. The coercion must end up with a str and no traceback (this assumes that we've checked that the ebyte and the encoding match when we create the ebyte). If you want bytes out the other end, you should either have a different function or explicitly transform the output from str to bytes. So, what's the advantage of using ebytes instead of bytes?

* It keeps together the text and encoding information when you're taking bytes in and want to give bytes back under the same encoding.
* It takes some of the boilerplate that people are supposed to do (checking that bytes are legal in a specific encoding) and writes it into the initialization of the object.

That forces you to think about the issue at two points in the code: when converting into ebytes and when converting out to bytes. For data that's going to be used with both str and bytes, this is the accepted best practice. (For exceptions, the bytes type remains, and you can convert to it when you want to.) -Toshio
Re: [Python-Dev] email package status in 3.X
On Mon, Jun 21, 2010 at 02:46:57PM -0400, P.J. Eby wrote: At 02:58 AM 6/22/2010 +0900, Stephen J. Turnbull wrote: Nick alluded to the The One Obvious Way as a change in architecture. Specifically: Decode all bytes to typed objects (str, images, audio, structured objects) at input. Do no manipulations on bytes ever except decode and encode (both to text, and to special-purpose objects such as images) in a program that does I/O. This ignores the existence of use cases where what you have is text that can't be properly encoded in unicode. I know, it's a hard thing to wrap one's head around, since on the surface it sounds like unicode is the programmer's savior. Unfortunately, real-world text data exists which cannot be safely roundtripped to unicode, and must be handled in bytes with encoding form for certain operations. I personally do not have to deal with this *particular* use case any more -- I haven't been at NTT/Verio for six years now. But I do know it exists for e.g. Asian language email handling, which is where I first encountered it. At the time (this *may* have changed), many popular email clients did not actually support unicode, so you couldn't necessarily just send off an email in UTF-8. It drove us nuts on the project where this was involved (an i18n of an existing Python app), and I think we had to compromise a bit in some fashion (because we couldn't really avoid unicode roundtripping due to database issues), but the use case does actually exist. My current needs are simpler, thank goodness. ;-) However, they *do* involve situations where I'm dealing with *other* encoding-restricted legacy systems, such as software for interfacing with the US Postal Service that only works with a restricted subset of latin1, while receiving mangled ASCII from an ecommerce provider, and storing things in what's effectively a latin-1 database. 
Being able to easily assert what kind of bytes I've got would actually let me catch errors sooner, *if* those assertions were being checked when different kinds of strings or bytes were being combined (i.e., at coercion time). While it's certainly possible that you have a grapheme that has no corresponding unicode codepoint, it doesn't sound like this is the case you're dealing with here. You talk about restricted subset of latin1 but all of latin1's graphemes have unicode codepoints. You also talk about not being able to send off an email in UTF-8 but UTF-8 is an encoding of unicode, not unicode itself. Similarly, the statement that some email clients don't support unicode isn't very clear as to the actual problem. The email client supports displaying graphemes using glyphs present on the computer. As long as the graphemes needed have a unicode codepoint, using unicode inside of your application and then encoding to bytes on the way out works fine. Even in cases where there's no unicode codepoint for the grapheme that you're receiving, unicode gives you a way out. It provides you a private use area where you can map the graphemes to unused codepoints. Your application keeps a mapping from that codepoint to the particular byte sequence that you want. Then write a codec that converts from unicode w/ these private codepoints into your particular encoding (and from bytes into unicode). -Toshio
Re: [Python-Dev] email package status in 3.X
On Mon, Jun 21, 2010 at 04:09:52PM -0400, P.J. Eby wrote: At 03:29 PM 6/21/2010 -0400, Toshio Kuratomi wrote: On Mon, Jun 21, 2010 at 01:24:10PM -0400, P.J. Eby wrote: At 12:34 PM 6/21/2010 -0400, Toshio Kuratomi wrote: What do you think of making the encoding attribute a mandatory part of creating an ebyte object? (ex: ``eb = ebytes(b, 'euc-jp')``). As long as the coercion rules force str+ebytes (or str % ebytes, ebytes % str, etc.) to result in another ebytes (and fail if the str can't be encoded in the ebytes' encoding), I'm personally fine with it, although I really like the idea of tacking the encoding to bytes objects in the first place. I wouldn't like this. It brings us back to the python2 problem where sometimes you pass an ebyte into a function and it works and other times you pass an ebyte into the function and it issues a traceback. For stdlib functions, this isn't going to happen unless your ebytes' encoding is not compatible with the ascii subset of unicode, or the stdlib function is working with dynamic data... in which case you really *do* want to fail early! The ebytes encoding will often be incompatible with the ascii subset. It's the reason that people were so often tempted to change the defaultencoding on python2 to utf8. I don't see this as a repeat of the 2.x situation; rather, it allows you to cause errors to happen much *earlier* than they would otherwise show up if you were using unicode for your encoded-bytes data. For example, if your program's intent is to end up with latin-1 output, then it would be better for an error to show up at the very *first* point where non-latin1 characters are mixed with your data, rather than only showing up at the output boundary! That highly depends on your usage. If you're formatting a comment on a web page, checking at output and replacing with '?' is better than a traceback. 
If you're entering key values into a database, then you likely want to know where the non-latin1 data is entering your program, not where it's mixed with your data or the output boundary. However, if you promoted mixed-type operation results to unicode instead of ebytes, then you: 1) can't preserve data that doesn't have a 1:1 mapping to unicode, and ebytes should be immutable like bytes and str. So you shouldn't lose the data if you keep a reference to it. 2) can't detect an error until your data reaches the output point in your application -- forcing you to defensively insert ebytes calls everywhere (vs. simply wrapping them around a handful of designated inputs), or else have to go right back to tracing down where the unusable data showed up in the first place. Usually, you don't want to know where you are combining two incompatible strings. Instead, you want to know where the incompatible strings are being set in the first place. If function(a, b) tracebacks with certain combinations of a and b, I need to know where a and b are being set, not where function(a, b) is in the source code. So you need to be making input values ebytes() (or str in current python3) no matter what. One thing that seems like a bit of a blind spot for some folks is that having unicode is *not* everybody's goal. Not because we don't believe unicode is generally a good thing or anything like that, but because we have to work with systems that flat out don't *do* unicode, thereby making the presence of (fully-general) unicode an error condition that has to be stamped out! I think that sometimes as well. However, here I think you're in a bit of a blind spot yourself. I'm saying that making ebytes + str coerce to ebytes will only yield a traceback some of the time; which is the python2 behaviour. Having ebytes + str coerce to str will never throw a traceback as long as our implementation checks that the bytes and encoding work together from the start.
Throwing an error in code, only on some input is one of the main reasons that debugging unicode vs byte issues sucks on python2. On my box, with my dataset, everything works. Toss it up on pypi and suddenly I have a user in Japan who reports that he gets a traceback with his dataset that he can't give to me because it's proprietary, overly large, or transient. IOW, if you're producing output that has to go into another system that doesn't take unicode, it doesn't matter how theoretically-correct it would be for your app to process the data in unicode form. In that case, unicode is not a feature: it's a bug. This is not always true. If you read a webpage, chop it up so you get a list of words, create a histogram of word length, and then write the output as utf8 to a database. Should you do all your intermediate string operations on utf8 encoded byte strings? No, you should do them on unicode strings as otherwise you need to know about the details of how utf8 encodes characters. And as it really *is* an error in that case, it should not pass silently
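To make the web-page example above concrete, here's the decode-at-input / encode-at-output pattern as a toy sketch (all the names and the sample text are mine):

```python
# Pretend this arrived off the wire as bytes.
raw = "le café est très bon".encode("utf-8")

# Decode once at the input boundary, then work entirely on str:
# no knowledge of how utf-8 encodes "é" is needed for the word split.
text = raw.decode("utf-8")
histogram = {}
for word in text.split():
    histogram[len(word)] = histogram.get(len(word), 0) + 1

# Encode once on the way out to the (byte-oriented) database.
out = repr(sorted(histogram.items())).encode("utf-8")
print(out)
```

Doing the same split on the raw utf-8 bytes happens to work here, but only because the separator is ASCII; the str version stays correct regardless of what characters the words contain.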
Re: [Python-Dev] email package status in 3.X
On Mon, Jun 21, 2010 at 04:52:08PM -0500, John Arbash Meinel wrote: ... IOW, if you're producing output that has to go into another system that doesn't take unicode, it doesn't matter how theoretically-correct it would be for your app to process the data in unicode form. In that case, unicode is not a feature: it's a bug. This is not always true. If you read a webpage, chop it up so you get a list of words, create a histogram of word length, and then write the output as utf8 to a database. Should you do all your intermediate string operations on utf8 encoded byte strings? No, you should do them on unicode strings as otherwise you need to know about the details of how utf8 encodes characters. You'd still have problems in Unicode given stuff like å =~ å even though u'\xe5' vs u'a\u030a' (those will look the same depending on your Unicode system. IDLE shows them pretty much the same, T-Bird on Windows with my current font shows the second as 2 characters.) I realize this was a toy example, but it does point out that Unicode complicates the idea of 'equality' as well as the idea of 'what is a character'. And just saying decode it to Unicode isn't really sufficient. Ah -- but if you're dealing with unicode objects you can use the unicodedata.normalize() function on them to come out with the right values. If you're using bytes, it's yet another case where you, the programmer, have to know what byte sequences represent combining characters in the particular encoding that you're dealing with. -Toshio
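Concretely, this is just the stdlib unicodedata module applied to the two spellings of "å" mentioned above:

```python
import unicodedata

# Precomposed U+00E5 vs "a" plus combining ring U+030A: visually the
# same, but plain equality sees them as different strings.
precomposed = "\xe5"
decomposed = "a\u030a"
assert precomposed != decomposed

# NFC normalization reconciles the two forms.
assert unicodedata.normalize("NFC", decomposed) == precomposed

print(len(precomposed), len(decomposed))  # 1 2
```

On bytes there is no equivalent one-liner; you'd need to know the combining-character byte sequences for each encoding yourself.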
Re: [Python-Dev] bytes / unicode
On Tue, Jun 22, 2010 at 11:58:57AM +0900, Stephen J. Turnbull wrote: Toshio Kuratomi writes: One comment here -- you can also have uri's that aren't decodable into their true textual meaning using a single encoding. Apache will happily serve out uris that have utf-8, shift-jis, and euc-jp components inside of their path but the textual representation that was intended will be garbled (or be represented by escaped byte sequences). For that matter, apache will serve requests that have no true textual representation as it is working on the byte level rather than the character level. Sure. I've never seen that combination, but I have seen Shift JIS and KOI8-R in the same path. But in that case, just using 'latin-1' as the encoding allows you to use the (unicode) string operations internally, and then spew your mess out into the world for someone else to clean up, just as using bytes would. This is true. I'm giving this as a real-world counter example to the assertion that URIs are text. In fact, I think you're confusing things a little by asserting that the RFC says that URIs are text. I'll address that in two sections down. So a complete solution really should allow the programmer to pass in uris as bytes when the programmer knows that they need it. Other than passing bytes into a constructor, I would argue if a complete solution requires, eg, an interface that allows urljoin(base,subdir) where the types of base and subdir are not required to match, then it doesn't belong in the stdlib. For stdlib usage, that's premature optimization IMO. I'll definitely buy that. Would urljoin(b_base, b_subdir) = bytes and urljoin(u_base, u_subdir) = unicode be acceptable though? (I think, given other options, I'd rather see two separate functions, though. It seems more discoverable and less prone to taking bad input some of the time to have two functions that clearly only take one type of data apiece.) 
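A rough sketch of what the two-function version could look like (hypothetical names and signatures of mine, just to make the suggestion concrete; the latin-1 default in the bytes variant is an assumption, chosen only because it round-trips all byte values):

```python
from urllib.parse import urljoin

def urljoin_str(base, subdir):
    # Accepts exactly one type; rejecting the other up front makes the
    # failure mode obvious and immediate.
    if not (isinstance(base, str) and isinstance(subdir, str)):
        raise TypeError("urljoin_str takes str arguments only")
    return urljoin(base, subdir)

def urljoin_bytes(base, subdir, encoding="latin-1"):
    if not (isinstance(base, bytes) and isinstance(subdir, bytes)):
        raise TypeError("urljoin_bytes takes bytes arguments only")
    # Round-trip through a stated encoding rather than a hidden one.
    return urljoin(base.decode(encoding), subdir.decode(encoding)).encode(encoding)

print(urljoin_str("http://host/a/", "b"))
print(urljoin_bytes(b"http://host/a/", b"b"))
```

Mixed-type calls fail loudly in both functions, which is the discoverability property argued for above.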
The RFC says that URIs are text, and therefore they can (and IMO should) be operated on as text in the stdlib. If I'm reading the RFC correctly, you're actually operating on two different levels here. Here's the section 2 that you quoted earlier, now in its entirety::

    2. Characters

    The URI syntax provides a method of encoding data, presumably for the
    sake of identifying a resource, as a sequence of characters. The URI
    characters are, in turn, frequently encoded as octets for transport or
    presentation. This specification does not mandate any particular
    character encoding for mapping between URI characters and the octets
    used to store or transmit those characters. When a URI appears in a
    protocol element, the character encoding is defined by that protocol;
    without such a definition, a URI is assumed to be in the same character
    encoding as the surrounding text.

    The ABNF notation defines its terminal values to be non-negative
    integers (codepoints) based on the US-ASCII coded character set
    [ASCII]. Because a URI is a sequence of characters, we must invert that
    relation in order to understand the URI syntax. Therefore, the integer
    values used by the ABNF must be mapped back to their corresponding
    characters via US-ASCII in order to complete the syntax rules.

    A URI is composed from a limited set of characters consisting of
    digits, letters, and a few graphic symbols. A reserved subset of those
    characters may be used to delimit syntax components within a URI while
    the remaining characters, including both the unreserved set and those
    reserved characters not acting as delimiters, define each component's
    identifying data.

So here's some data that matches those terms up to actual steps in the process::

    # We start off with some arbitrary data that defines a resource. This is
    # not necessarily text. It's the data from the first sentence:
    data = b"\xff\xf0\xef\xe0"

    # We encode that into text and combine it with the scheme and host to form
    # a complete uri.
    # This is the URI characters mentioned in section 2.
    # It's also the sequence of characters mentioned in 1.1 as it is not
    # until this point that we actually have a URI.
    uri = b"http://host/" + percentencoded(data)

    # Note1: percentencoded() needs to take any bytes or characters outside of
    # the characters listed in section 2.3 (ALPHA / DIGIT / "-" / "." / "_"
    # / "~") and percent encode them. The URI can only consist of characters
    # from this set and the reserved character set (2.2).

    # Note2: in this simplistic example, we're only dealing with one piece of
    # data. With multiple pieces, we'd need to combine them with separators,
    # for instance like this:
    #   uri = b'http://host/' + percentencoded(data1) + b'/'
    #         + percentencoded(data2)

    # Note3: at this point, the uri could be stored as unicode or bytes in
    # python3. It doesn't matter. It will be a subset of ASCII in either
    # case.

    # Then we
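For the curious, the percentencoded() step in the walkthrough above corresponds to what urllib.parse.quote already does in python3 (quote accepts raw bytes directly); a runnable sketch:

```python
from urllib.parse import quote, unquote_to_bytes

# Arbitrary, non-textual resource data, as in the walkthrough.
data = b"\xff\xf0\xef\xe0"

# Percent-encoding turns it into the ASCII-safe character sequence
# that the RFC calls the URI.
uri = "http://host/" + quote(data)
print(uri)  # http://host/%FF%F0%EF%E0

# The inverse recovers the original bytes without guessing an encoding.
assert unquote_to_bytes(uri[len("http://host/"):]) == data
```

Note that quote never needed to know what (if any) text the bytes represented; the encoding question only arises if you want to display the path as characters.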
Re: [Python-Dev] bytes / unicode
On Tue, Jun 22, 2010 at 08:31:13PM +0900, Stephen J. Turnbull wrote: Toshio Kuratomi writes: unicode handling redesign. I'm stating my reading of the RFC not to defend the use case Philip has, but because I think that the outlook that non-text uris (before being percentencoded) are violations of the RFC That's not what I'm saying. What I'm trying to point out is that manipulating a bytes object as an URI sort of presumes a lot about its encoding as text. I think we're more or less in agreement now but here I'm not sure. What manipulations are you thinking about? Which stage of URI construction are you considering? I've just taken a quick look at python3.1's urllib module and I see that there is a bit of confusion there. But it's not about unicode vs bytes but about whether a URI should be operated on at the real URI level or the data-that-makes-a-uri level.

* all functions I looked at take python3 str rather than bytes so there's no confusing stuff here
* urllib.request.urlopen takes a strict uri. That means that you must have a percent encoded uri at this point
* urllib.parse.urljoin takes regular string values
* urllib.parse.urlparse and urllib.parse.urlunparse take regular string values

Since many of the URIs we deal with are more or less textual, why not take advantage of that? Cool, so to summarize what I think we agree on:

* Percent encoded URIs are text according to the RFC.
* The data that is used to construct the URI is not defined as text by the RFC.
* However, it is very often text in an unspecified encoding.
* It is extremely convenient for programmers to be able to treat the data that is used to form a URI as text in nearly all common cases.

-Toshio
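One small illustration of the "text in an unspecified encoding" bullet, using urllib.parse.quote's encoding argument (my own example; the path is arbitrary):

```python
from urllib.parse import quote

# The same characters percent-encode to different URIs depending on
# which encoding the data is assumed to be in.
path = "café"
utf8_uri = quote(path, encoding="utf-8")
latin1_uri = quote(path, encoding="latin-1")

print(utf8_uri, latin1_uri)  # caf%C3%A9 caf%E9
```

Both results are perfectly valid URI text per the RFC; only the out-of-band encoding choice distinguishes them, which is exactly the agreement summarized above.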
Re: [Python-Dev] bytes / unicode
On Wed, Jun 23, 2010 at 09:36:45PM +0200, Antoine Pitrou wrote: On Wed, 23 Jun 2010 14:23:33 -0400 Tres Seaver tsea...@palladion.com wrote: - the slow adoption / porting rate of major web frameworks and libraries to Python 3. Some of the major web frameworks and libraries have a ton of dependencies, which would explain why they really haven't bothered yet. I don't think you can claim, though, that Python 3 makes things significantly harder for these frameworks. The proof is that many of them already give the user unicode strings in Python 2.x. They must have somehow got the decoding right. Note that this assumption seems optimistic to me. I started talking to Graham Dumpleton, author of mod_wsgi, a couple years back because mod_wsgi and paste do decoding of bytes to unicode at different layers, which caused problems for application-level code that should otherwise run fine when being served by mod_wsgi or paste httpserver. That was the beginning of Graham starting to talk about what the wsgi spec really should look like under python3 instead of the broken way that the appendix to the current wsgi spec states. -Toshio
Re: [Python-Dev] bytes / unicode
On Wed, Jun 23, 2010 at 11:35:12PM +0200, Antoine Pitrou wrote: On Wed, 23 Jun 2010 17:30:22 -0400 Toshio Kuratomi a.bad...@gmail.com wrote: Note that this assumption seems optimistic to me. I started talking to Graham Dumpleton, author of mod_wsgi a couple years back because mod_wsgi and paste do decoding of bytes to unicode at different layers which caused problems for application level code that should otherwise run fine when being served by mod_wsgi or paste httpserver. That was the beginning of Graham starting to talk about what the wsgi spec really should look like under python3 instead of the broken way that the appendix to the current wsgi spec states. Ok, but the reason would be that the WSGI spec is broken. Not Python 3 itself. Agreed. Neither python2 nor python3 is broken. It's the wsgi spec and the implementation of that spec where things fall down. From your first post, I thought you were claiming that python3 was broken since web frameworks got decoding right on python2 and I just wanted to defend python3 by showing that python2 wasn't all sunshine and roses. -Toshio pgp8xQXfAPrYT.pgp Description: PGP signature ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Licensing
On Tue, Jul 06, 2010 at 10:10:09AM +0300, Nir Aides wrote: I take "...running off with the good stuff and selling it for profit" to mean creating derivative work and commercializing it as proprietary code, which you can not do with GPL licensed code. Also, while the GPL does not prevent selling copies for profit it does not make it very practical either. Uhmmm http://finance.yahoo.com/q/is?s=RHTannual It is very possible to make money with the GPL. The GPL does, as you say, prevent you from creating derivative works that are proprietary code. It does *not* prevent you from creating derivative works and commercializing them. -Toshio
Re: [Python-Dev] Fixing #7175: a standard location for Python config files
On Fri, Aug 13, 2010 at 07:48:22AM +1000, Nick Coghlan wrote: 2010/8/12 Éric Araujo mer...@netwok.org: Choosing an arbitrary location we think is good on every system is fine and non-risky I think, as long as Python lets the various distributions change those paths through configuration. Don’t you have a bootstrapping problem? How do you know where to look for the sysconfig file that tells where to look for config files? I'd hardcode a list of locations::

    [os.path.join(os.path.dirname(__file__), 'sysconfig.cfg'),
     os.path.join('/etc', 'sysconfig.cfg')]

The distributor has a limited choice of options on where to look. A good alternative would be to make the config file overridable. That way you can have sysconfig.cfg next to sysconfig.py or in a known config directory relative to the python stdlib install, but also let the distributions and individual sites override the defaults by making changes to /etc/python3/sysconfig.cfg, for instance. Personally, I'm not clear on what a separate sysconfig.cfg file offers over clearly separating the directory configuration settings and continuing to have distributions patch sysconfig.py directly. The bootstrapping problem (which would encourage classifying sysconfig.cfg as source code and placing it alongside sysconfig.py) is a major part of that point of view. Here are some advantages, though some of them are of dubious worth:
* Allows users/site-administrators to change paths and not have packaging systems overwrite the changes.
* Makes it conceptually cleaner to make this overridable via user-defined config files, since it's now a matter of parsing several config files instead of having a hardcoded value in the file and overridable values outside of it.
* Allows sites to add additional paths to the config file.
* Makes it clear to distributions that the values in the config file are available for making changes to, rather than having to look for it in code and not know the difference between that or, say, the encoding parameter in python2.
* Documents the format to use for overriding the paths if individual sites can override the defaults that are shipped in the system version of python. -Toshio
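The override cascade proposed above can be sketched with the stdlib configparser, whose read() method accepts a list of candidate files, silently skips missing ones, and lets later files win (file names here are illustrative, not the actual proposal):

```python
import configparser
import os
import tempfile

# Hypothetical demonstration of the proposed cascade: a default config
# shipped next to the stdlib, overridden by a site-wide copy.
with tempfile.TemporaryDirectory() as tmp:
    default = os.path.join(tmp, 'default.cfg')
    site = os.path.join(tmp, 'site.cfg')
    with open(default, 'w') as f:
        f.write('[paths]\nprefix = /usr\n')
    with open(site, 'w') as f:
        f.write('[paths]\nprefix = /usr/local\n')

    cfg = configparser.ConfigParser()
    read = cfg.read([default, site])   # missing files are skipped silently
    prefix = cfg['paths']['prefix']    # the later (site) value wins
```

Because missing files are skipped rather than raising, the same hardcoded list works whether or not a site or distribution has installed an override.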
Re: [Python-Dev] (Not) delaying the 3.2 release
On Thu, Sep 16, 2010 at 09:52:48AM -0400, Barry Warsaw wrote: On Sep 16, 2010, at 11:28 PM, Nick Coghlan wrote: There are some APIs that should be able to handle bytes *or* strings, but the current use of string literals in their implementation means that bytes don't work. This turns out to be a PITA for some networking related code which really wants to be working with raw bytes (e.g. URLs coming off the wire). Note that email has exactly the same problem. A general solution -- even if embodied in *well documented* best-practices and convention -- would really help make the stdlib work consistently, and I bet third party libraries too. I too await a solution with abated breath :-) I've been working on documenting best practices for APIs and Unicode, and for this type of function (take bytes or unicode and output the same type), knowing the encoding seems like a requirement in most cases: http://packages.python.org/kitchen/designing-unicode-apis.html#take-either-bytes-or-unicode-output-the-same-type I'd love to add another strategy there that shows how you can robustly operate on bytes without knowing the encoding, but from writing that, I think that anytime you simplify your API you have to accept limitations on the data you can take in. (For instance, some simplifications can handle anything except ASCII-incompatible encodings.) -Toshio
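One shape the "take either bytes or unicode, output the same type" strategy can take is to do the real work on str and convert at the boundaries (the function name and the UTF-8 default here are illustrative assumptions, not from the linked document):

```python
# Polymorphic function: accepts bytes or str, returns the same type.
# The bytes path decodes with an explicit encoding, operates on str,
# and re-encodes -- which is why knowing the encoding is a requirement.
def normalize_spaces(value, encoding='utf-8'):
    """Collapse runs of whitespace to single spaces."""
    if isinstance(value, bytes):
        text = value.decode(encoding)
        return ' '.join(text.split()).encode(encoding)
    return ' '.join(value.split())
```

Calling `normalize_spaces('a  b')` gives `'a b'` while `normalize_spaces(b'a  b')` gives `b'a b'`, so callers never get surprised by a type change.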
Re: [Python-Dev] (Not) delaying the 3.2 release
On Thu, Sep 16, 2010 at 10:56:56AM -0700, Guido van Rossum wrote: On Thu, Sep 16, 2010 at 10:46 AM, Martin (gzlist) gzl...@googlemail.com wrote: On 16/09/2010, Guido van Rossum gu...@python.org wrote: In all cases I can imagine where such polymorphic functions make sense, the necessary and sufficient assumption should be that the encoding is a superset of 7-bit(*) ASCII. This includes UTF-8, all Latin-N variants, and AFAIK also the popular CJK encodings other than UTF-16. This is the same assumption made by Python's bytes type when you use character-based methods like lower(). Well, depends on what exactly you're doing, it's pretty easy to go wrong:

    Python 3.2a2+ (py3k, Sep 16 2010, 18:43:45) [MSC v.1500 32 bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import os, sys
    >>> os.path.split('C:\\十')
    ('C:\\', '十')
    >>> os.path.split('C:\\十'.encode(sys.getfilesystemencoding()))
    (b'C:\\\x8f', b'')

Similar things can catch out web developers once they step outside the percent encoding. Well, that character is not 7-bit ASCII. Of course things will go wrong there. That's the whole point of what I said, isn't it? You were talking about encodings that were supersets of 7-bit ASCII. I think Martin was demonstrating a byte string in an encoding that is a superset of 7-bit ASCII being fed to a stdlib function which went wrong. -Toshio
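Martin's result can be reproduced on any platform via the ntpath module (this sketch is not part of the original exchange): CP932 encodes '十' as the two bytes 0x8f 0x5c, and the second byte is numerically equal to the backslash separator, so byte-wise splitting cuts inside the character.

```python
import ntpath

# '十' encodes to b'\x8f\x5c' in CP932; 0x5c is '\\', the Windows path
# separator, so byte-oriented splitting cuts mid-character -- matching
# the interpreter session shown above.
encoded = 'C:\\十'.encode('cp932')
broken = ntpath.split(encoded)           # (b'C:\\\x8f', b'')

# UTF-8 never reuses ASCII byte values inside a multibyte sequence,
# so the same byte-wise operation is safe there.
safe = ntpath.split('C:\\十'.encode('utf-8'))
```

This is one reading of the disagreement: the Windows "mbcs" filesystem encoding is not an ASCII superset in the byte-value sense Guido assumed, because multibyte characters can contain bytes that look like ASCII separators.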
Re: [Python-Dev] We should be using a tool for code reviews
On Wed, Sep 29, 2010 at 01:23:24PM -0700, Guido van Rossum wrote: On Wed, Sep 29, 2010 at 1:12 PM, Brett Cannon br...@python.org wrote: On Wed, Sep 29, 2010 at 12:03, Guido van Rossum gu...@python.org wrote: A problem with that is that we regularly make matching improvements to upload.py and the server-side code it talks to. While we tend to be conservative in these changes (because we don't control what version of upload.py people use) it would be a pain to maintain backwards compatibility with a version that was distributed in Misc/ two years ago -- that's kind of outside our horizon. Well, I would assume people are working from a checkout. Patches from an outdated checkout simply would fail and that's fine by me. Ok, but that's an extra barrier for contributions. Lots of people when asked for a patch just modify their distro in place and you can count yourself lucky if they send you a diff from a clean copy. But maybe with Hg it's less of a burden to ask people to use a checkout. How often do we even get patches generated from a downloaded copy of Python? Is it enough to need to worry about this? I used to get these frequently. I don't know what the experience of the current crop of core developers is though, so maybe my gut feelings here are outdated. When helping out on a Linux distribution, dealing with patches against the latest tarball is a fairly frequent occurrence. The question would be whether these patches get filtered through the maintainer of the package before landing in roundup/rietveld and whether the distro maintainer is sufficiently in tune with python development that they're maintaining both patches against the last tarball and a checkout of trunk with the patches applied intelligently there. A few other random thoughts: * hg could be more of a burden in that it may be unfamiliar to the casual python user who happens to have found a fix for a bug and wants to submit it. 
cvs and svn are similar enough that people comfortable with one are usually comfortable with the other, but hg has different semantics.
* The barrier to entry seems to be higher the less well integrated the tools are. I occasionally try to contribute patches to bzr in launchpad and the integration there is horrid. You end up with two separate streams of comments and you don't automatically get subscribed to both. There are several UI elements for associating a branch with a bug, but some of them are buggy (or else are very strict on what input they're expecting) while other ones are hard to find. Since I only contribute a patch two or three times a year, I have to re-figure out the process each time I try to contribute.
* I like the idea of patch complexity being a measure of whether the patch needs to go into a code review tool, in that it keeps simple things simple and gives more advanced tools to more advanced cases. I dislike it in that for someone who's just contributing a patch to fix a problem that they're encountering which happens to be somewhat complex, they end up having to learn a lot about tools that they may never use again.
* It seems like code review will be a great aid to people who submit changes or review changes frequently. The trick will be making it non-intimidating for someone who's just going to contribute changes infrequently.
-Toshio
Re: [Python-Dev] Distutils2 scripts
On Fri, Oct 08, 2010 at 10:26:36AM -0400, Barry Warsaw wrote: On Oct 08, 2010, at 03:22 PM, Tarek Ziadé wrote: Yes, that's what I was thinking about -- I am not too worried about this, since every Linux deals with the 'more than one python installed' case. Kind of. wink but anyway... I'm in favor of adding a top-level setup module that can be invoked using python -m setup There will be three cases: Nice idea! I wouldn't call it setup though, since it does many other things. I can't think of a good name yet, but I'd like such a script to express the idea that it can be used to: I like 'python -m setup' too. It's a small step from the familiar thing (python setup.py) to the new and shiny thing, without being confusing. And you won't have to worry about things like version numbers because the Python executable will already have that baked in. - query pypi - browse what's installed - install/remove projects - create releases and upload them pkg_manager? No underscores, please. :) Actually, a decent wrapper script could just be called 'setup'. My command-not-found on Ubuntu doesn't find a collision, or even close similarities. Simple English names like this are almost never a good idea for commands. A quick google for /usr/bin/setup finds that Fedora-derived distros have a /usr/bin/setup as a wrapper for all the text-mode configuration tools. And there's a derivative of opensolaris that has a /usr/bin/setup for configuring the system the first time. I still like 'egg' as a command too. There are no collisions that I can see. I know this has been thrown around for years, and it's always been rejected because I think setuptools wanted to claim it, but since it still doesn't exist afaict, distutils2 could easily use it.
There's a 2D graphics library that provides a /usr/bin/egg command: http://www.ir.isas.jaxa.jp/~cyamauch/eggx_procall/ Latest Stable Version 0.93r3 (released 2010/4/14) In the larger universe of programs, it might make for more intuitive remembering of the command to use a prefix (either py or python) though. python-setup is a lot like python setup.py pysetup is shorter pyegg is even shorter :-) -Toshio pgpVyH77xDEyw.pgp Description: PGP signature ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Distutils2 scripts
On Fri, Oct 08, 2010 at 05:12:44PM +0200, Antoine Pitrou wrote: On Fri, 8 Oct 2010 11:04:35 -0400 Toshio Kuratomi a.bad...@gmail.com wrote: In the larger universe of programs, it might make for more intuitive remembering of the command to use a prefix (either py or python) though. python-setup is a lot like python setup.py pysetup is shorter pyegg is even shorter :-) Wouldn't quiche be a better alternative for pyegg? I won't bikeshed as long as we stay away from conflicting names. -Toshio pgpk9LAmigC2q.pgp Description: PGP signature ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] My work on Python3 and non-ascii paths is done
On Thu, Oct 21, 2010 at 12:00:40PM -0400, Barry Warsaw wrote: On Oct 20, 2010, at 02:11 AM, Victor Stinner wrote: I plan to fix Python documentation: specify the encoding used to decode all byte string arguments of the C API. I already wrote a draft patch: issue #9738. This lack of documentation was a big problem for me, because I had to follow the function calls to get the encoding. This will be truly excellent! That's exactly what I was looking for! Thanks. I think you've learned a huge amount of good information that's difficult to find, so writing it up in a more permanent and easy to find location will really help future Python developers! One further thing I'd be interested in is if you could document any best practices from this experience. Things like "surrogateescape is a good/bad default in these cases", or "when are parallel functions for bytes and str better than a single polymorphic function?" That way when other modules are added to the stdlib, things can be more consistent. -Toshio
Re: [Python-Dev] Continuing 2.x
On Fri, Oct 29, 2010 at 11:12:28AM -0700, geremy condra wrote: On Thu, Oct 28, 2010 at 11:55 PM, Glyph Lefkowitz Let's take PyPI numbers as a proxy. There are ~8000 packages with a Programming Language::Python classifier. There are ~250 with Programming Language::Python::3. Roughly speaking, we can say that is 3% of Python code which has been ported so far. Python 3.0 was released at the end of 2008, so people have had roughly 2 years to port, which comes up with 1.5% per year. Just my two cents: Just one further informational note about using pypi in this way for statistics... In the porting work we've done within Fedora, I've noticed that a lot of packages are python3-ready or even officially support python3 but the language classifier on pypi does not reflect this. Here's just a few since I looked them up when working on the python porting wiki pages: http://pypi.python.org/pypi/Beaker/ http://pypi.python.org/pypi/pycairo http://pypi.python.org/pypi/docutils -Toshio
Re: [Python-Dev] Breaking undocumented API
On Tue, Nov 09, 2010 at 11:46:59AM +1100, Ben Finney wrote: Ron Adam r...@ronadam.com writes: def _publicly_documented_private_api(): Not sure why you would want to do this instead of using comments. ... Because the docstring is available at the interpreter via ‘help()’, and because it's automatically available to ‘doctest’, and most of the other good reasons for docstrings. The _publicly_documented_private_api() is a problem because people *will* use it even though it has a leading underscore. Especially those who are new to python. That isn't an argument against docstrings, since the problem you describe isn't dependent on the presence or absence of docstrings. Just wanted to expand a bit here: as a general practice, you may be involved in a project where _private_api() is not intended to be used by people outside of the project but is intended to be used in multiple places within the project. If you have different people working on those different areas, it can be very useful for them to be able to use help(_private_api) on the other functions from within the interpreter shell. -Toshio
Re: [Python-Dev] Breaking undocumented API
On Tue, Nov 09, 2010 at 01:49:01PM -0500, Tres Seaver wrote: On 11/08/2010 06:26 PM, Bobby Impollonia wrote: This does hurt because anyone who was relying on import * to get a name which is now omitted from __all__ is going to upgrade and find their program failing with NameErrors. This is a backwards-incompatible change and shouldn't happen without a deprecation warning first. Outside an interactive prompt, anyone using from foo import * has set themselves and their users up to lose anyway. That syntax is the single worst misfeature in all of Python. It impairs readability and discoverability for *no* benefit beyond one-time typing convenience. Module writers who compound the error by expecting to be imported this way, thereby bogarting the global namespace for their own purposes, should be fish-slapped. ;) I think there's a valid case for bogarting the namespace in this instance, but let me know if there's a better way to do it: a method to use system libraries if available, otherwise use a bundled copy, aka make both system packagers and developers happy. Relevant directories and files for this module::

    foo/
    +- __init__.py
    +- compat/
       +- __init__.py
       +- bar/
          +- __init__.py
          +- _bar.py

foo/compat/bar/_bar.py is a bundled module. foo/compat/bar/__init__.py has::

    try:
        from bar import *
        from bar import __all__
    except ImportError:
        from foo.compat.bar._bar import *
        from foo.compat.bar._bar import __all__

-Toshio
Re: [Python-Dev] Porting Ideas
On Wed, Dec 01, 2010 at 10:06:24PM -0500, Alexander Belopolsky wrote: On Wed, Dec 1, 2010 at 9:53 PM, Terry Reedy tjre...@udel.edu wrote: .. Does Sphinx run on PY3 yet? It does, but see issue10224 for details. http://bugs.python.org/issue10224 Also, docutils has an unported module. /me needs to write a bug report for that as he really doesn't have the time he thought he did to perform the port. -Toshio pgplgIh22rxh1.pgp Description: PGP signature ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 384 accepted
On Fri, Dec 03, 2010 at 11:52:41PM +0100, Martin v. Löwis wrote: Am 03.12.2010 23:48, schrieb Éric Araujo: But I'm not interested at all in having it in distutils2. I want the Python build itself to use it, and alas, I can't because of the freeze. You can’t in 3.2, true. Neither can you in 3.1, or any previous version. If you implement it in distutils2, you have very good chances to get it for 3.3. Isn’t that a win? It is, unfortunately, a very weak promise. Until distutils2 is integrated in Python, I probably won't spend any time on it. At the language summit it was proposed and seemed generally accepted (maybe I took silence as consent... it's been almost a year now) that bold new modules (and bold rewrites of existing modules since it fell out of the distutils/2 discussion) should get implemented in a module on pypi before being merged into the python stdlib. If you wouldn't want to work on any of those modules until they were actually integrated into Python, it sounds like you disagree with that as a general practice? -Toshio pgpBIM4lN9FET.pgp Description: PGP signature ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Import and unicode: part two
On Wed, Jan 19, 2011 at 04:40:24PM -0500, Terry Reedy wrote: On 1/19/2011 4:05 PM, Simon Cross wrote: I have no problem with non-ASCII module identifiers being valid syntax. It's a question of whether attempting to translate a non-ASCII If the names are the same, ie, produced with the same sequence of keystrokes in the save-as box and importing box, then there is no translation, at least from the user's view. module name into a file name (so the file can be imported) is a good idea and whether these sorts of files can be safely transferred among diverse filesystems. I believe we now have the situation that a package that works on *nix could fail on Windows, whereas I believe that patch would *improve* portability. I'm not so sure about this. You may have something that works on Windows and on *NIX under certain circumstances, but it seems likely to fail when moving files between them (for instance, as packages downloaded from pypi). Additionally, many unix filesystems don't specify a filesystem encoding for filenames; they deal in legal and illegal bytes, which could lead to trouble. This problem of which encoding to use is a problem that can be seen on UNIX systems even now. Try this:

    echo 'print("hi")' > café.py
    convmv -f utf-8 -t latin1 café.py
    python3 -c 'import café'

ASCII seems very sensible to me when faced with these ambiguities. Other options I can brainstorm that could be explored:
* Specify an encoding per platform and stick to that. (So, for instance, all module names on posix platforms would have to be utf-8.) Force translation between encodings when installing packages. (But that doesn't help for people that are creating their modules using their own build scripts rather than distutils, copying the files using raw tar, etc.)
* Change import semantics to allow specifying the encoding of the module on the filesystem (seems really icky).
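The failure in the convmv example comes down to the same bytes meaning different things under different encodings; a small illustration of why the renamed file becomes unimportable under a UTF-8 locale (a sketch, not part of the original message):

```python
# The bytes convmv leaves on disk: 'café.py' encoded as latin-1.
name = 'café.py'.encode('latin-1')       # b'caf\xe9.py'

# A UTF-8 locale cannot decode the lone \xe9 byte as text...
try:
    name.decode('utf-8')
    decodable = True
except UnicodeDecodeError:
    decodable = False

# ...and os-style decoding falls back to surrogate escapes, yielding
# a module name no 'import' statement can spell.
escaped = name.decode('utf-8', 'surrogateescape')
```

Here `decodable` is False and `escaped` is `'caf\udce9.py'`, which matches no source-level identifier, so `import café` cannot find the file.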
-Toshio
Re: [Python-Dev] Import and unicode: part two
On Wed, Jan 19, 2011 at 07:11:52PM -0500, James Y Knight wrote: On Jan 19, 2011, at 6:44 PM, Toshio Kuratomi wrote: This problem of which encoding to use is a problem that can be seen on UNIX systems even now. Try this:

    echo 'print("hi")' > café.py
    convmv -f utf-8 -t latin1 café.py
    python3 -c 'import café'

ASCII seems very sensible to me when faced with these ambiguities. Other options I can brainstorm that could be explored: * Specify an encoding per platform and stick to that. (So, for instance, all module names on posix platforms would have to be utf-8.) Force translation between encodings when installing packages. (But that doesn't help for people that are creating their modules using their own build scripts rather than distutils, copying the files using raw tar, etc.) * Change import semantics to allow specifying the encoding of the module on the filesystem (seems really icky). None of this is unique to import -- the same exact issue occurs with open(u'café'). I don't see any reason why import café should be thought of as more of a problem, or treated any differently. It's unique in several ways: 1) With open, you can specify a byte string::

    open(b'caf\xe9.py').read()

I don't know of any way to do that with import. This is needed when the filename is not compatible with your current locale. 2) import assigns a name to the module that it imports, whereas open lets the programmer assign the name. So even if you can specify what to use as a byte string for this filename on this particular filesystem, you'd still end up with some ugly pseudo-representation of bytes when attempting to access it in code::

    import caf\xe9
    caf\xe9.do_something()

-Toshio
Re: [Python-Dev] Import and unicode: part two
On Thu, Jan 20, 2011 at 01:26:01AM +0100, Victor Stinner wrote: On Wednesday, 19 January 2011 at 15:44 -0800, Toshio Kuratomi wrote: Additionally, many unix filesystems don't specify a filesystem encoding for filenames; they deal in legal and illegal bytes, which could lead to trouble. This problem of which encoding to use is a problem that can be seen on UNIX systems even now. If the system is not correctly configured, it is not a bug in Python, but a bug in the system config. Python relies on the locale to choose the filesystem encoding (sys.getfilesystemencoding()). Python uses this encoding to decode and encode all filenames. Saying that multiple encodings on a single system is a misconfiguration every time it comes up does not make it true. There have been multiple examples of how you can end up with multiple encodings of filenames on a single system listed in past threads: multiple users with different encodings for their locales, mounting remote filesystems, downloading a file. To the existing list I'd add getting a package from pypi -- neither tar nor zip files contain encoding information about the filenames. Therefore if I create an sdist of a python module using non-ascii filenames using a locale of latin1 and then upload to pypi, people downloading that on a utf-8-using locale will end up not being able to use the module. * Specify an encoding per platform and stick to that. It doesn't work: on UNIX/BSD, the user chooses their own encoding and all programs will use it. The proposal is that you ignore that when talking about loading and creating python modules. (I mentioned distutils because my thought was that distutils could grow the ability to translate from the system locale to a chosen neutral encoding when running any of the setup.py dist commands, but that doesn't address the issue when testing a module that you've just written, so perhaps that's not necessary.) Python modules would have a set of defined filesystem encodings per system.
This prevents getting a mixture of encodings of modules and having things work in one location but fail when used somewhere else. Instead, you get an upfront failure until you correct the encoding. Anyway, I don't see why it is a problem to have different encodings on different systems. Each system can use its own encoding. The bug that I'm trying to solve is a Python bug, not an OS bug. There is no OS bug here. There is perhaps an OS design flaw, but it's not a flaw that will be going away soon (in part, because the present OS designers do not see it as an OS flaw... to them it's a bug in code that attempts to build a simpler interface on top of it.) * Change import semantics to allow specifying the encoding of the module on the filesystem (seems really icky). This is a very bad idea. I introduced the PYTHONFSENCODING environment variable in Python 3.2, but then quickly removed it, because it introduced a lot of inconsistencies. Thanks for getting rid of that; PYTHONFSENCODING is a bad idea because it doesn't solve the underlying issues. However, when I say specifying the encoding of the module on the filesystem, I don't mean something global like PYTHONFSENCODING -- I mean something at the python code level::

    import café encoded_as('latin1')

After thinking about this one, though, I don't think it will work either. This takes care of importing modules where the fs encoding of the module is known, but it doesn't where the fs encoding may be translated between platforms. I believe that this could arise when untarring a module on windows using winzip or similar that gives you the option of translating from utf-8 bytes into bytes that have meaning as characters on that platform, for instance. Do you have a solution to the problem? I haven't looked at your patch, so perhaps you have an ingenious method of translating from the unicode representation of the module in the import statement to the bytes in arbitrary encodings on the filesystem that I haven't thought of.
If you don't, however, then really - ASCII-only seems like the sanest of the three solutions I can think of. -Toshio pgpxKdCbo8dSk.pgp Description: PGP signature ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
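To make the pypi/tar scenario from the message above concrete, here is a small sketch (the filename is illustrative): an archive stores filenames as raw bytes with no encoding metadata, so a name written under a latin1 locale cannot be recovered under a utf-8 one.

```python
# A filename as it would be stored in an archive created under a latin1
# locale: raw bytes, with no record of which encoding produced them.
name = "café"
stored = name.encode("latin-1")          # b'caf\xe9'

# A system using a utf-8 locale cannot decode those bytes:
try:
    stored.decode("utf-8")
except UnicodeDecodeError:
    print("filename is undecodable under a utf-8 locale")

# Decoding with surrogateescape (PEP 383) keeps the bytes, but the
# resulting name no longer matches the module name in the source:
recovered = stored.decode("utf-8", "surrogateescape")
print(recovered == name)   # False
```

Either way, `import café` on the downloader's system cannot find the file.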
Re: [Python-Dev] Import and unicode: part two
On Thu, Jan 20, 2011 at 03:51:05AM +0100, Victor Stinner wrote:
> For a lesson at school, it is nice to write examples in the mother
> language, instead of using raw english with ASCII identifiers and
> filenames.

Then use this::

    import cafe as café

When you do things this way you do not have to translate between unknown encodings into unicode. Everything is within python source, where you have a defined encoding. Teaching students to write non-portable code (relying on filesystem encoding, where your solution is "don't upload to pypi anything that has non-ascii filenames") seems like the exact opposite of how you'd want to shape a young student's understanding of good programming practices.

> In a school, you can use the same configuration (encoding) on all
> computers.

In a school computer lab, perhaps. But not on all the students' and professors' machines. How many professors will be cursing python when they discover that the example code they wrote on their Linux workstation doesn't work when the students try to use it in their windows computer lab? How many students will be upset when the code they turn in runs on their professor's test machine if the lab computers were booted into the Linux partition but not if they were booted into Windows?

> > * Specify an encoding per platform and stick to that.
>
> It doesn't work: on UNIX/BSD, the user chooses their own encoding and all
> programs will use it. (...)
>
> > This prevents getting a mixture of encodings of modules (...)
>
> If you have an issue with encodings, you have to fix it when you create a
> module (on disk), not when you load a module (it is too late).

It's not too late to throw a clear error about what's wrong.

> > I haven't looked at your patch so perhaps you have an ingenious method
> > of translating from the unicode representation of the module in the
> > import statement to the bytes in arbitrary encodings on the filesystem
> > that I haven't thought of.
>
> On Windows, my patch tries to avoid any conversion: it uses unicode
> everywhere.
> On other OSes, it uses the Python filesystem encoding to encode a module
> name (as is done for any other operation on the filesystem with a unicode
> filename).

The other interfaces are somewhat of a red herring here. As I wrote in another email, importing modules has ramifications that open(), for instance, does not. Additionally, those other filesystem operations have been growing the ability to take byte values and encoding parameters because unicode translation via a single filesystem encoding is a good default but not a complete solution. I think that this problem demands a complete solution, however, and it seems to me that limiting the scope of the problem is the most pleasant method to accomplish this. Your solution creates modules which aren't portable. One of my proposals creates python code which isn't portable. The other one suffers some of the same disadvantages as your solution in portability, but allows for tools that could automatically correct modules.

> Python 3 supports bytes filenames to be able to read/copy/delete
> undecodable filenames, filenames stored in an encoding different than the
> system encoding, and broken filenames. It is also possible to access
> these files using PEP 383 (with surrogate characters). This is useful to
> use Python on an old system.
>
> > If you don't, however, then really - ASCII-only seems like the sanest
> > of the three solutions I can think of.
>
> But a (Python 3) module is not supposed to have a broken filename. If it
> is the case, you had better fix its name, instead of trying to fix the
> problem later (in Python).

We agree that there should not be broken module names. However, it seems we very hotly disagree about the definition of that. You think that if a module is named appropriately on one system but is not portable to another system, that's fine. I think that portability between systems is very important, and sacrificing it so that someone can locally use a module with non-ASCII characters doesn't have a justifiable reward.
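The PEP 383 access mentioned in the quote above can be sketched like this (the bytes are illustrative): an undecodable byte survives the decode as a surrogate character and encodes back to the original byte, which is what makes read/copy/delete of broken filenames possible.

```python
# An undecodable filename: latin-1 bytes seen by a system whose locale
# says utf-8.
raw = b"caf\xe9"

# PEP 383: smuggle the bad byte through the decode as a lone surrogate...
name = raw.decode("utf-8", "surrogateescape")
print(name)        # 'caf\udce9'

# ...and get the exact original bytes back when re-encoding, so the file
# can still be opened, copied, or deleted by name.
assert name.encode("utf-8", "surrogateescape") == raw
```

Note that the round-tripped name is not the name the author intended, which is why this helps file management but not `import`.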
> With UTF-8 filesystem encoding (eg. on Mac OS X, and most Linux setups),
> it is already possible to use non-ASCII module names.

Tangent: this is not true of Linux. UTF-8 is an interpretation of the filesystem bytes that the user specifies by setting their system locale. Setting the system locale to ASCII for use in system-wide scripts is quite common, as is changing locale settings in other parts of the world (as I can tell you from the bug reports colleagues CC me on, to fix problems with unicode support in their python2 programs). Allowing module names incompatible with ascii without specifying an encoding will just lead to bug reports down the line. Relatively few programmers understand the difference between the python unicode abstraction and the byte representations possible for those strings. Allowing non-ascii characters in module filenames without specifying an encoding invites exactly those bug reports.
Re: [Python-Dev] Import and unicode: part two
On Wed, Jan 19, 2011 at 09:02:17PM -0800, Glenn Linderman wrote:
> On 1/19/2011 8:39 PM, Toshio Kuratomi wrote:
> > use this::
> >
> >     import cafe as café
> >
> > When you do things this way you do not have to translate between
> > unknown encodings into unicode. Everything is within python source
> > where you have a defined encoding.
>
> This is a great way of converting non-portable module names, if the
> module ever leaves the bounds of its computer, and runs into problems
> there.

You're missing a piece here. If you mandate ascii, you can convert to a unicode name using import as, because python knows that it has ascii text from the filesystem when it converts it to the abstract unicode string that you've specified in the program text. You cannot go the other way, because python lacks the information (the encoding of the filename on the filesystem) to do the transformation.

> Your demonstration of such an easy solution to the concerns you raise
> convinces me more than ever that it is acceptable to allow non-ASCII
> module names. For those programmers in a single locale environment, it'll
> just work. And for those not in a single locale environment, there is
> your above simple solution to achieve portability without changing large
> numbers of lines of code.

Does my demonstration that you can't do that mean that it's no longer acceptable? :-)

/me guesses that the relative merits of being forced to write portable code vs. the convenience of writing a module name in your native script still balance differently in your mind than in mine, thus the smiley :-)

-Toshio
Re: [Python-Dev] Import and unicode: part two
On Thu, Jan 20, 2011 at 12:51:29PM +0100, Victor Stinner wrote:
> Le mercredi 19 janvier 2011 à 20:39 -0800, Toshio Kuratomi a écrit :
> > Teaching students to write non-portable code (relying on filesystem
> > encoding, where your solution is "don't upload to pypi anything that
> > has non-ascii filenames") seems like the exact opposite of how you'd
> > want to shape a young student's understanding of good programming
> > practices.
>
> That was already discussed before: see PEP 3131.
> http://www.python.org/dev/peps/pep-3131/#common-objections
> If the teacher chooses to use non-ASCII, (s)he is responsible for
> explaining the consequences to his/her students :-)

It's not discussed in that PEP section. The PEP section says this:

    People claim that they will not be able to use a library if to do so
    they have to use characters they cannot type on their keyboards.

Whether you can type it at your keyboard or not is not the problem here. The problem is portability. The students and professors are sharing code with each other. But because of a mixture of operating systems (let alone locale settings), the code written by one partner is unable to run on the computer of the other. If non-ascii filenames without a defined encoding are considered a feature, python cannot even issue a descriptive error when this occurs. It can only say that it could not find the module, but not why. A restriction of module names to ascii only, on the other hand, could actually state that module names are not allowed to be non-ASCII when it encounters the import line.

> > > In a school, you can use the same configuration (encoding) on all
> > > computers.
> >
> > In a school computer lab perhaps. But not on all the students' and
> > professors' machines. How many professors will be cursing python when
> > they discover that the example code that they wrote on their Linux
> > workstation doesn't work when the students try to use it in their
> > windows computer lab?
>
> Because some students use a stupid or misconfigured OS, Python should
> only accept ASCII names?
Just a note -- you'll get much farther if you refrain from calling names. It just makes me think that you aren't reading and understanding the issue I'm raising. My examples that you're replying to involve two properly configured OSes. The Linux workstations are configured with a UTF-8 locale. The Windows machines use wide-character unicode. The problem occurs because the code that one of the parties develops (either the students or the professors) is developed on one of those OSes and then used on the other.

> So, why does Python 3 support non-ASCII filenames: it is very well known
> that non-ASCII filenames are the root of many troubles! Should we simply
> drop unicode support for all filenames? And maybe restrict bytes
> filenames to bytes in [0; 127]? Or better, restrict to [32; 126] (U+007f
> causes some troubles in some terminals).

If you want to argue from the fact that python3 supports non-ascii filenames in other code, then the logical extension is that the import mechanism should support importing module names defined by byte sequences. I happen to think that import has a lot of differences from other filename handling, as I've said three times now.

> I think that in 2011, non-ASCII filenames are well supported on all
> (modern) operating systems. Issues with non-ASCII filenames are OS
> specific and should be fixed by the user (the admin of the computer).
>
> > Additionally, those other filesystem operations have been growing the
> > ability to take byte values and encoding parameters because unicode
> > translation via a single filesystem encoding is a good default but not
> > a complete solution.
>
> If you are unable to configure your system correctly to decode/encode
> filenames, you should just avoid non-ASCII characters in the module
> names.

This seems like an argument for only having unicode versions of all filesystem operations.
Since you've been spearheading the effort to have bytes versions of things that access filenames, environment variables, etc., I don't think that you seriously mean that. Perhaps there is a language issue here.

> You only give theoretical arguments: did you at least try to use
> non-ASCII module names on your system with Python 3.2? I suppose that it
> will just work and you will never notice that the unicode module name (in
> import café) is encoded to bytes.

Yes I did, and I got it to fail in a corner case, as I showed twice with the same example in other posts. However, I want to make clear here that the issue is not that I can create a non-ascii filename and then import it. The issue is that I can create a non-ascii filename, then try to share it with the usual tools, and it won't work on the recipient's system. (A tangent is whether the recipient's system is physically distinct from mine or only has a different environment on the same physical host.)

> It fails on OSes using filesystem encodings other than UTF-8 (eg.
> Windows)... because of a Python bug, and I just asked if I have
Re: [Python-Dev] Import and unicode: part two
On Thu, Jan 20, 2011 at 01:43:03PM -0500, Alexander Belopolsky wrote:
> On Thu, Jan 20, 2011 at 12:44 PM, Toshio Kuratomi a.bad...@gmail.com wrote:
> ..
> > My examples that you're replying to involve two properly configured
> > OS's. The Linux workstations are configured with a UTF-8 locale. The
> > Windows OS's use wide character unicode. The problem occurs in that the
> > code that one of the parties develops (either the students or the
> > professors) is developed on one of those OS's and then used on the
> > other OS.
>
> I re-read your posts on this thread, but could not find the examples that
> you refer to.

Examples might be a bad word in this context. Victor was commenting on the two brainstorm ideas for alternatives to ascii-only that I had. One was:

* Mandate that every python module on a platform has a specific encoding
  (rather than the value of the locale)

The other was:

* Allow using byte strings for import

I think that both ideas are inferior to mandating that every python module filename is ascii. What I'm getting from Victor's posts is that he, at least, considers the portability problems ignorable, because dealing with ambiguous filename encodings is something that he'd like to force third-party tools to deal with.

-Toshio
Re: [Python-Dev] Import and unicode: part two
On Thu, Jan 20, 2011 at 03:27:08PM -0500, Glyph Lefkowitz wrote:
> On Jan 20, 2011, at 11:46 AM, Guido van Rossum wrote:
> > Same here. *Most* code will never be shared, or will only be shared
> > between users in the same community. When it goes wrong it's also a
> > learning opportunity. :-)
>
> Despite my usual proclivity for being contrarian, I find myself in
> agreement here. Linux users with locales that don't specify UTF-8 frankly
> _should_ have to deal with all kinds of nastiness until they can
> transcode their filesystems. MacOS and Windows both have a right answer
> here and your third-party tools shouldn't create mojibake in your
> filenames.

However, if this is the consensus, it makes a lot more sense to pick utf-8 as *the* encoding for python module filenames on Linux.

Why UTF-8:

* UTF-8 can cover the whole range of unicode, whereas most (all?) other
  locale-friendly encodings cannot.
* UTF-8 is becoming a standard for Linux distributions, whether or not
  Linux users are adopting it.
* Third-party tools are gaining support for UTF-8 even when they aren't
  gaining support for generic encodings (if I read the zip spec correctly,
  this is actually what's happening there).

Why not locale:

* Relying on locale is simply not portable. If nothing prevents people
  from distributing a unicode filename, then they will go ahead and do so.
  If the result works (say, because it's utf-8 and 80% of the Linux
  userbase is using utf-8), then it will get packaged and distributed, and
  people won't know that it's a problem until someone with a non-utf-8
  locale decides to use it.
* Mixing of modules from different locales won't work. Suppose that the
  system python installs the previous module. The local site has other
  modules that it has installed using a different filename encoding. The
  users at the site will find that one or the other of the two modules
  won't work.
* Because of the portability problems, you have no choice but to tell
  people not to distribute python modules with non-ASCII names. This makes
  the use of unicode names second class indefinitely (until the kernel
  devs decide that they're wrong not to enforce a filesystem encoding, or
  Linux becomes irrelevant as a platform).
* If you can pick a set of encodings that are valid (utf-8 for Linux and
  MacOS, wide unicode for Windows [I get the feeling from other parts of
  the conversation that Windows won't be so lucky, though]), tools to
  convert python names become easier to write. If you restrict it far
  enough, you could even write tools/importers that automatically do the
  detection.

PS: Sorry for not replying immediately; the team I'm on is dealing with an issue at work and I'm also preparing for a conference later this week.

-Toshio
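The automatic detection suggested in the last bullet could look something like this sketch (the function name and categories are my own, not any proposed API): given the raw bytes of a filename, decide whether they fall inside a restricted set of valid encodings.

```python
def classify(raw):
    """Guess which of the permitted encodings a raw filename could be in."""
    try:
        raw.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Some locale-specific encoding; a tool would need to transcode it.
        return "unknown"

print(classify(b"cafe"))                    # ascii
print(classify("café".encode("utf-8")))     # utf-8
print(classify("café".encode("latin-1")))   # unknown
```

This only works because valid UTF-8 is self-describing enough to detect; with arbitrary locale encodings permitted, no such classification is possible.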
Re: [Python-Dev] Import and unicode: part two
On Tue, Jan 25, 2011 at 10:22:41AM +0100, Xavier Morel wrote:
> On 2011-01-25, at 04:26 , Toshio Kuratomi wrote:
> > * If you can pick a set of encodings that are valid (utf-8 for Linux
> >   and MacOS
>
> HFS+ uses UTF-16 in NFD (actually in an Apple-specific variant of NFD).
> Right here you've already broken Python modules on OSX.

Others have been saying that Mac OSX's HFS+ uses UTF-8. But the question is not whether UTF-16 or UTF-8 is used by HFS+. It's whether you can sensibly decide on an encoding from the type of system that is being run on. This could be querying the filesystem, or a check on sys.platform, or some other method. I don't know what detection the current code does. On Linux there's no defined encoding that will work; file names are just bytes to the Linux kernel, so based on people's argument that the convention is and should be that filenames are utf-8 and anything else is a misconfigured system -- python should mandate that its module filenames on Linux are utf-8 rather than using the user's locale settings.

> And as far as I know, Linux software/FS generally use NFC (I've already
> seen this issue cause trouble)

Linux FSes are bytes with a small blacklist (so you can't use the NULL byte in a filename, for instance). Linux software would be free to use any normal form that it wants. If one piece of software used NFC and another used NFD, the FS would record two separate files with two separate filenames. Other programs might or might not display this correctly. Example:

    zsh$ touch cafe
    zsh$ python
    Python 2.7 (r27:82500, Sep 16 2010, 18:02:00)
    >>> import os
    >>> import unicodedata
    >>> a = u'café'
    >>> b = unicodedata.normalize('NFC', a)
    >>> c = unicodedata.normalize('NFD', a)
    >>> open(b.encode('utf8'), 'w').close()
    >>> open(c.encode('utf8'), 'w').close()
    >>> os.listdir(u'.')
    [u'people-etc-changes.txt', u'cafe\u0301', u'cafe',
     u'people-etc-changes.sha256sum', u'caf\xe9']
    >>> os.listdir('.')
    ['people-etc-changes.txt', 'cafe\xcc\x81', 'cafe',
     'people-etc-changes.sha256sum', 'caf\xc3\xa9']
    >>> ^D
    zsh$ ls -al .
    drwxrwxr-x.  2 badger badger 4096 Jan 25 07:46 .
    drwxr-xr-x. 17 badger badger 4096 Jan 24 18:27 ..
    -rw-rw-r--.  1 badger badger    0 Jan 25 07:45 cafe
    -rw-rw-r--.  1 badger badger    0 Jan 25 07:46 cafe
    -rw-rw-r--.  1 badger badger    0 Jan 25 07:46 café
    zsh$ ls -al cafe
    -rw-rw-r--. 1 badger badger 0 Jan 25 07:45 cafe
    zsh$ ls -al cafe?
    -rw-rw-r--. 1 badger badger 0 Jan 25 07:46 cafe

Now in this case, the decomposed form of the filename is being displayed incorrectly, and the shell treats the decomposed character as two characters instead of one. However, when you view these files in dolphin (the KDE file manager) you properly see café repeated twice.

-Toshio
Re: [Python-Dev] Import and unicode: part two
On Wed, Jan 26, 2011 at 11:24:54AM +0900, Stephen J. Turnbull wrote:
> Toshio Kuratomi writes:
> > On Linux there's no defined encoding that will work; file names are
> > just bytes to the Linux kernel so based on people's argument that the
> > convention is and should be that filenames are utf-8 and anything else
> > is a misconfigured system -- python should mandate that its module
> > filenames on Linux are utf-8 rather than using the user's locale
> > settings.
>
> This isn't going to work where I live (Tsukuba). At the national
> university alone there are hundreds of pre-existing *nix systems whose
> filesystems were often configured a decade or more ago. Even if the
> hardware and OS have been upgraded, the filesystems are usually migrated
> as-is, with OS configuration tweaks to accommodate them. Many of them use
> EUC-JP (and servers often Shift JIS). That means that you won't be able
> to read module names with ls, and that will make Python unacceptable for
> this purpose. I imagine that in Russia the same is true for the various
> Cyrillic encodings.

Sure... but on these systems, neither read-modules-as-locale nor read-modules-as-utf-8 is going to work well, correct? Especially if the OS does get upgraded but the filesystems with user data (and user-created modules) are migrated as-is, you'll run into situations where system-installed modules are in utf-8 and user-created modules are in shift-jis, and so something will always be broken. The only way to make sure that modules work is to restrict them to ASCII-only on the filesystem. But because unicode module names are seen as a necessary feature, the question is which way forward is going to lead to the least brokenness. Which could be locale... but from the python2 locale-related bugs that I get to look at, I doubt it.

> I really don't think there is anything that can be done here except to
> warn people that "Kids, these stunts are performed by highly-trained
> professionals. Don't try this at home!" Of course they will anyway, but
> at least they will have been warned in sufficiently strong terms that
> they might pay attention and be able to recover when they run into
> bizarre import exceptions.

So on the subject of warnings... I think a reason it's better to pick an encoding for the platform/filesystem rather than to use locale is that people will get an error or a warning at the appropriate time -- the first time they attempt to create and import a module with a filename that's not encoded in the correct encoding for the platform. It's all very well to say "We wrote in the documentation on http://docs.python.org/distutils/introduction.html#Choosing-a-name that only ASCII names should be used when distributing python modules", but if the interpreter doesn't complain when people use a non-ASCII filename, we all know that they aren't going to look in the documentation; they'll try it, and if it works they'll learn that habit.

-Toshio
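A sketch of the kind of upfront complaint meant here -- a hypothetical import hook (not anything that exists in Python) that rejects non-ASCII module names with a descriptive error instead of a bare "module not found":

```python
import sys

class AsciiOnlyFinder:
    """Hypothetical: refuse non-ASCII module names with a clear message."""
    def find_spec(self, fullname, path=None, target=None):
        try:
            fullname.encode("ascii")
        except UnicodeEncodeError:
            raise ImportError(
                "module names are restricted to ASCII: %r" % fullname)
        return None  # ASCII name: defer to the normal import machinery

sys.meta_path.insert(0, AsciiOnlyFinder())

try:
    __import__("caf\u00e9")
except ImportError as exc:
    print(exc)
```

Installed at the front of sys.meta_path, the finder runs before any path search, so the user sees the policy violation rather than a confusing lookup failure.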
Re: [Python-Dev] Import and unicode: part two
On Wed, Jan 26, 2011 at 11:12:02AM +0100, Martin v. Löwis wrote:
> Am 26.01.2011 10:40, schrieb Victor Stinner:
> > Le lundi 24 janvier 2011 à 19:26 -0800, Toshio Kuratomi a écrit :
> > > Why not locale:
> > > * Relying on locale is simply not portable. (...)
> > > * Mixing of modules from different locales won't work. (...)
> >
> > I don't understand what you are talking about.
>
> I think by portability, he means moving files from one computer to
> another. He argues that if Python would mandate UTF-8 for all file names
> on Unix, moving files in such a way would support portability, whereas
> using the locale's filename might not (if the locale uses a different
> charset on the target system). While this is technically true, I don't
> think it's a helpful way of thinking: by mandating that file names are
> UTF-8 when accessed from Python, we make the actual files inaccessible on
> both the source and the target system.
>
> > I don't understand the relation between the local filesystem encoding
> > and the portability. I suppose that you are talking about the
> > distribution of a module to other computers. Here the question is how
> > the filenames are stored during the transfer. The user is free to use
> > any tool, and try to find a tool handling Unicode correctly :-) But
> > it's no more the Python problem.
>
> There are cases where there is no real transfer, in the sense in which
> you are using the word. For example, with NFS, you can access the very
> same file simultaneously on two systems, with no file name conversion
> (unless you are using NFSv4, and unless your NFSv4 implementations
> support the UTF-8 mandate in NFS well). Also, if two users of the same
> machine have different locale settings, the same file name might be
> interpreted differently.

Thanks Martin; I think that you understand my view even if you don't share it. There's one further case that I am worried about that has no real transfer. Since people here seem to think that unicode module names are the future (for instance, the comments about redefining the C locale to include utf-8 and the comments about archiving tools needing to support encoding bits), there are eventually going to be unicode modules that become dependencies of other modules and programs. These will need to be installed on systems. Linux distributions that ship them will need to choose a filesystem encoding for their filenames. Likely the sensible thing for them to do is to use utf-8, since all the ones I can think of default to utf-8. But, as Stephen and Victor have pointed out, users change their locale settings to things that aren't utf-8 and save their modules using filenames in that encoding. When they update their OS to a version that has utf-8 python module names, they will find that they have to make a choice. They can either change their locale settings to a utf-8 encoding and have the system-installed modules work, or they can leave their locale on the non-utf-8 encoding and have the modules that they've created on-site work. This is not a good position to put users of these systems in.

-Toshio
Re: [Python-Dev] Support the /usr/bin/python2 symlink upstream
On Wed, Mar 02, 2011 at 01:14:32AM +0100, Martin v. Löwis wrote:
> > I think a PEP would help, but in this case I would request that before
> > the PEP gets written (it can be a really short one!) somebody actually
> > go out and get consensus from a number of important distros.
>
> Besides Barry, do we have any representatives of distros here? Matthias
> Klose represents Debian, Dave Malcolm represents Redhat, and Dirkjan
> Ochtman represents Gentoo.

I'm here from Fedora.

-Toshio
Re: [Python-Dev] Support the /usr/bin/python2 symlink upstream
On Thu, Mar 03, 2011 at 09:55:25AM +0100, Piotr Ożarowski wrote:
> [Guido van Rossum, 2011-03-02]
> > On Wed, Mar 2, 2011 at 4:56 AM, Piotr Ożarowski pi...@debian.org wrote:
> > > [Sandro Tosi, 2011-03-02]
> > > > On Wed, Mar 2, 2011 at 10:01, Piotr Ożarowski pi...@debian.org wrote:
> > > > > I co-maintain with Matthias a package that provides the
> > > > > /usr/bin/python symlink in Debian and I can confirm that it will
> > > > > always point to Python 2.X. We also do not plan to add a
> > > > > /usr/bin/python2 symlink (and I guess only an accepted PEP can
> > > > > change that)
> > > >
> > > > Can you please explain why you NACK this proposed change?
> > >
> > > it encourages people to change the /usr/bin/python symlink to point
> > > to python3.X, which I'm strongly against (how can I tell that the
> > > upstream author meant python3.X and not python2.X without checking
> > > the code?)
> >
> > But the same is already true for python2.X vs. python2.Y. Explicit is
> > better than implicit etc. Plus, 5 years from now everybody is going to
> > be annoyed that python still refers to some ancient unused version of
> > Python.
>
> I don't really mind adding a /usr/bin/python2 symlink just to clean up
> the Arch mess, but I do mind changing /usr/bin/python to point to python3
> (and I can use the same argument - Explicit is better than implicit - if
> you need Python 3, say so in the shebang, right?). What I'm afraid of is
> that when we add /usr/bin/python2, we'll start getting a lot of scripts
> that will have to be checked manually every time a new upstream version
> is released, because we cannot assume what the upstream author is using
> at a given point. If /usr/bin/python is disallowed in shebangs on the
> other hand (and all scripts use /usr/bin/python2, /usr/bin/python3,
> /usr/bin/python4 or /usr/bin/python2.6 etc.) I don't see a problem with
> letting administrators choose /usr/bin/python (right now not only will
> changing it from python2.X to python3.X break the system, but changing it
> from /usr/bin/python2.X to /usr/bin/python2.Y will break it too, and
> believe me, I know what I'm talking about (one of the guys at work did
> something like this once))
>
> [all IMHO, dunno if other Debian python-defaults maintainers agree with
> me]

Thinking outside of the box, I can think of something that would satisfy your requirements, but I don't know how appropriate it is for upstream python to ship with. Stop shipping /usr/bin/python. Ship python in an alternate location like $LIBEXECDIR/python2.7/bin (I think this would be /usr/lib/python2.7/bin on Debian and /usr/libexec/python2.7/bin on Fedora, which would both be appropriate), then configure which python version is invoked by the user typing python by configuring PATH (a shell alias might also work). You could configure this with environment-modules[1]_ if Debian supports using that in packaging. Coupled with a PEP that recommends against using /usr/bin/python in scripts and instead using /usr/bin/python$MAJOR, this might be sufficient. OTOH, my cynical side doubts that script authors read PEPs, so it'll take either upstream python shipping without /usr/bin/python, or consensus among the distros to ship without /usr/bin/python, to reach the point where script authors realize that they need to use /usr/bin/python{2,3} instead of /usr/bin/python.

.. _[1]: http://modules.sourceforge.net/

-Toshio
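The PATH-based selection could work roughly as in this sketch (the layout under a temp directory is purely illustrative -- the real proposal used $LIBEXECDIR): each interpreter lives in its own versioned bin directory, and whichever directory PATH puts first wins.

```python
# Sketch: which "python" runs is decided purely by PATH order.
# The stand-in interpreters here just print their version.
import os
import subprocess
import tempfile

root = tempfile.mkdtemp()
for ver in ("2.7", "3.2"):
    bindir = os.path.join(root, "python%s" % ver, "bin")
    os.makedirs(bindir)
    exe = os.path.join(bindir, "python")
    with open(exe, "w") as f:
        f.write("#!/bin/sh\necho %s\n" % ver)
    os.chmod(exe, 0o755)

def run_default_python(ver):
    """Run bare "python" with the given versioned bindir first in PATH."""
    env = dict(os.environ)
    env["PATH"] = (os.path.join(root, "python%s" % ver, "bin")
                   + os.pathsep + env.get("PATH", ""))
    return subprocess.check_output(["python"], env=env).decode().strip()

print(run_default_python("2.7"))   # 2.7
print(run_default_python("3.2"))   # 3.2
```

An environment-modules setup (or a shell alias) would just be a convenient way for the admin or user to flip that PATH entry.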
Re: [Python-Dev] Support the /usr/bin/python2 symlink upstream
On Thu, Mar 03, 2011 at 09:11:40PM -0500, Barry Warsaw wrote: On Mar 03, 2011, at 02:17 PM, David Malcolm wrote: On a related note, we have a number of scripts packaged across the distributions with a shebang line that reads: #!/usr/bin/env python which AIUI follows upstream recommendations. Actually, I think this is *not* a good idea for distro provided scripts. For any Python scripts released by the distro, you know exactly which Python it should run on, so it's better to hard code it. That way, if someone installs Python from source, or installs an experimental version of a new distro Python, it won't break their system. Yes, this has happened to me. Also, note that distutils/setuptools/distribute rewrite the shebang line when they install scripts. There was a proposal to change these when packaging them to hardcode the specific python binary: https://fedoraproject.org/wiki/Features/SystemPythonExecutablesUseSystemPython on the grounds that a packaged system script is expecting (and has been tested against) a specific python build. That proposal has not yet been carried out. Ideally if we did this, we'd implement it as a postprocessing phase within rpmbuild, rather than manually patching hundreds of files. Note that this would only cover shebang lines at the tops of scripts. JFDI! FWIW, a quick grep reveals about two dozen such scripts in /usr/bin on Ubuntu. We should fix these. ;) Note, we were unable to pass Guideline changes to do this in Fedora. Gory details of the FPC meeting are at 16:15:03 (abadger1999 == me): http://meetbot.fedoraproject.org/fedora-meeting/2009-08-19/fedora-meeting.2009-08-19-16.01.log.html The mailing list thread where this was discussed is here: http://lists.fedoraproject.org/pipermail/packaging/2009-July/006248.html Note to dmalcolm: IIRC, that also means that the Feature page you point to isn't going to happen either. 
Barry -- if other distros adopted stronger policies, then that might justify me taking this back to the Packaging Committee.

-Toshio
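The rpmbuild postprocessing idea above can be sketched in a few lines of Python. This is a hypothetical helper (the function name and the pinned interpreter path are invented for illustration), not the actual Fedora or Debian implementation:

```python
import re

def pin_shebang(script_text, interpreter="/usr/bin/python2.7"):
    """Rewrite a '#!/usr/bin/env python' shebang to a fixed interpreter.

    Only the first line is touched; versioned shebangs such as
    '#!/usr/bin/env python3' are deliberately left alone.
    """
    lines = script_text.splitlines(True)
    if lines and re.match(r"#!\s*/usr/bin/env\s+python\s*$", lines[0]):
        lines[0] = "#!%s\n" % interpreter
    return "".join(lines)
```

A build system would run something like this over every installed script, which is exactly the "only shebang lines at the tops of scripts" limitation noted above.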
Re: [Python-Dev] Support the /usr/bin/python2 symlink upstream
On Thu, Mar 03, 2011 at 09:46:23PM -0500, Barry Warsaw wrote:

On Mar 03, 2011, at 09:08 AM, Toshio Kuratomi wrote: Thinking outside of the box, I can think of something that would satisfy your requirements but I don't know how appropriate it is for upstream python to ship with. Stop shipping /usr/bin/python. Ship python in an alternate location like $LIBEXECDIR/python2.7/bin (I think this would be /usr/lib/python2.7/bin on Debian and /usr/libexec/python2.7/bin on Fedora which would both be appropriate) then configure which python version is invoked by the user typing python by configuring PATH (a shell alias might also work). You could configure this with environment-modules[1]_ if Debian supports using that in packaging.

I wonder if Debian's alternatives system would be appropriate for this? http://wiki.debian.org/DebianAlternatives

No, alternatives is really only useful for a very small class of problems [1]_ and [2]_. For this discussion there's an additional problem, which is that alternatives works by creating symlinks. Piotr Ożarowski wants to make /usr/bin/python not exist so that scripts would have to use either /usr/bin/python3 or /usr/bin/python2. If alternatives places a symlink there, it defeats the purpose of avoiding that path in the package itself. I will note, though, that scripts that have /usr/bin/env and take the route of setting the PATH would still fall victim to this. I think that environment-modules can also set up aliases. If so, that would be better than setting PATH for finding and removing python without a version in scripts.

One further note on this since one of the other messages here had a reference to this that kinda rains on this parade: http://refspecs.linux-foundation.org/LSB_4.1.0/LSB-Languages/LSB-Languages/pylocation.html

The LSB is a standard that Linux distributions may or may not follow -- unlike the FHS, the LSB goes beyond encoding what most distros already do to things that they think people should do.
For instance, Debian derivatives might find the software installation section of LSB[3]_ to be a bit... hard to swallow. Fedora provides a package which aims to make a fedora system lsb compliant but doesn't install it by default since it drags in gobs of packages that are otherwise not necessary on many systems. However, it does specify /usr/bin/python, so getting rid of /usr/bin/python at the Linux distribution level might not reach universal acclaim. A united front from upstream python through the python package maintainers on the Linux distros would probably be needed to get people thinking about making this change... and we still would likely have the ability to add /usr/bin/python back onto a system (for instance, as part of that lsb package I mentioned earlier.)

.. [1]: https://fedoraproject.org/wiki/Packaging:EnvironmentModules#Introduction
.. [2]: http://fedoraproject.org/wiki/Packaging:Alternatives#Recommended_usage
.. [3]: http://refspecs.linux-foundation.org/LSB_4.1.0/LSB-Core-generic/LSB-Core-generic/swinstall.html

-Toshio
Re: [Python-Dev] Support the /usr/bin/python2 symlink upstream
On Fri, Mar 04, 2011 at 01:56:39PM -0500, Barry Warsaw wrote:

I don't agree that /usr/bin/python should not be installed. The draft PEP language hits the right tone IMHO, and I would favor /usr/bin/python pointing to /usr/bin/python2 on Debian, but primarily used only for the interactive interpreter. Or IOW, I still want users to be able to type 'python' at a shell prompt and get the interpreter.

Actually, my post was saying that these two can be decoupled. ie: It's possible to not have /usr/bin/python while still allowing users to type python at a shell prompt and get the interpreter. This is done by either redefining the PATH to include the directory that the interpreter named python is in or by creating an alias for python to the proper interpreter. Using the environment-modules tools is one solution that operates in this way. It also, incidentally, would let each user of a system choose whether python invoked python2 or python3 (and on Debian, which sub-version of those). A more hardcoded approach is to have the python package drop some configuration into /etc/profile.d/ style directories where the distribution places files that are run by default by the user's shell with the default startup files.

-Toshio
Re: [Python-Dev] PEP 395: Module Aliasing
On Fri, Mar 04, 2011 at 12:56:16PM -0500, Fred Drake wrote: On Fri, Mar 4, 2011 at 12:35 PM, Michael Foord fuzzy...@voidspace.org.uk wrote: That (below) is not distutils it is setuptools. distutils just uses `scripts=[...]`, which annoyingly *doesn't* work with setuptools. Right; distutils scripts are just sad. OTOH, entry-point based scripts are something setuptools got very, very right. Probably not perfect, but... I've not yet needed anything different in practice.

Some of them can be annoying as hell when dealing with a system that also installs multiple versions of a module. But one could argue that's the fault of setuptools' version handling rather than the entry-points handling.

-Toshio
Re: [Python-Dev] Support the /usr/bin/python2 symlink upstream
On Tue, Mar 08, 2011 at 08:25:50AM +1000, Nick Coghlan wrote:

On Tue, Mar 8, 2011 at 1:30 AM, Barry Warsaw ba...@python.org wrote: On Mar 04, 2011, at 12:00 PM, Toshio Kuratomi wrote: Actually, my post was saying that these two can be decoupled. ie: It's possible to not have /usr/bin/python while still allowing users to type python at a shell prompt and get the interpreter. This is done by either redefining the PATH to include the directory that the interpreter named python is in or by creating an alias for python to the proper interpreter. I personally would prefer aliasing rather than $PATH manipulation.

Toshio's suggestion wouldn't work anyway - the /usr/bin/env python idiom will pick up a python alias no matter where it lives on $PATH.

I thought I pointed out that env wouldn't work with PATH but I guess I just thought that silently in my head. Pointing that out was going to live in the same paragraph as saying that it does work with an alias::

    $ sudo mv /usr/bin/python /usr/bin/python.bak
    $ alias python='/usr/bin/python2.7'
    $ python --version
    Python 2.7
    $ cat test.py
    #! /bin/env python
    print 'hi'
    $ ./test.py
    /bin/env: python: No such file or directory
    $ mv /usr/bin/python.bak /usr/bin/python
    $ ./test.py
    hi

-Toshio
Re: [Python-Dev] [PEPs] Support the /usr/bin/python2 symlink upstream
On Tue, Mar 08, 2011 at 06:43:19PM -0800, Glenn Linderman wrote:

On 3/8/2011 12:02 PM, Terry Reedy wrote: On 3/7/2011 9:31 PM, Reliable Domains wrote: The launcher need not be called python.exe, and maybe it would be better called #@launcher.exe (or similar, depending on its exact function details). I do not know what the '#@' part is about, but pygo would be short and expressive.

If my proposal to make a line starting with #@ to be used instead of the Unix #! (#@ could be on the first or second line, to allow cross-platform scripts to use both, and Windows only scripts to not have #!

You'd need to allow for it to be on the third line as well. pep-0263 has already taken the second line if it's in a script that has a Unix shebang.

), then #@launcher.exe (and # @launcherw.exe I suppose) would reflect the functionality of the launcher, which need not be tightly tied to Python, if it uses a separate line. But the launcher should probably not be the thing invoked from the command line, only implicitly when running scripts by naming them as the first thing on the command line. I'm of the opinion that attempting to parse a Unix #! line, and intuit what would be the equivalent on Windows is unnecessarily complex and error prone, and assumes that the variant systems are configured using the same guidelines (which the Python community may espouse, but may not be followed by all distributions, sysadmins, or users).

I do not have a Windows system so I don't have a horse in this race but if the argument is to avoid complexity, be careful that your proposed solution isn't more complex than what you're avoiding. ie::

Now that I've had this idea, one might want to create other 2nd character codes after the Unix #! line... one could have:

    #! Unix command processor
    #@ Windows command processor
    #$ OS/2 command processor
    #% Alternate Windows command processor.
One could even port it to Unix::

    #!/usr/bin/#@launcher
    #@c:\python2.6\python.exe
    #^/usr/bin/python2.5
    #/usr/bin/mono/IronPython2.6 for .NET 4.0/ipy.exe
    # I made up the line above, having no knowledge of Mono, but I think you get the idea

Choice of command line would be an environment variable, I suppose, that the launcher would look at, or if none, then a system-specific default. It would have to search forward in the file until it finds the appropriate prefix or a line not starting with #, or starting with # or ##, at which point it would give up.

-Toshio
Re: [Python-Dev] Module version variable
On Fri, Mar 18, 2011 at 07:40:43PM -0700, Guido van Rossum wrote:

On Fri, Mar 18, 2011 at 7:28 PM, Greg Ewing greg.ew...@canterbury.ac.nz wrote: Tres Seaver wrote: I'm not even sure why you would want __version__ in 99% of modules: in the ordinary cases, a module's version should be either the Python version (for a module shipped in the stdlib), or the release of the distribution which shipped it.

It's useful to be able to find out the version of a module you're using at run time so you can cope with API changes. I had a case just recently where the behaviour of something in pywin32 changed between one release and the next. I looked for an attribute called 'version' or something similar to test, but couldn't find anything. +1 on having a standard place to look for version info.

I believe __version__ *is* the standard (like __author__). IIRC it was proposed by Ping. I think this convention is so old that there isn't a PEP for it. So yes, we might as well write it down. But it's really nothing new.

There is a section in PEP8 about __version__ but it serves a slightly different purpose there:

    Version Bookkeeping

    If you have to have Subversion, CVS, or RCS crud in your source file, do it as follows.

        __version__ = "$Revision: 88433 $"
        # $Source$

    These lines should be included after the module's docstring, before any other code, separated by a blank line above and below.

Personally, I've never found a need to access the repository revision programmatically from my python applications but I have needed to access the API version so it would make sense to me to change the meaning of __version__.

-Toshio
Re: [Python-Dev] Security implications of pep 383
On Tue, Mar 29, 2011 at 07:23:25PM +0100, Michael Foord wrote:

Hey all, Not sure how real the security risk is here: http://blog.omega-prime.co.uk/?p=107 Basically he is saying that if you store a list of blacklisted files with names encoded in big-5 (or some other non-utf8 compatible encoding) if those names are passed at the command line, or otherwise read in and decoded from an assumed-utf8 source with surrogate escaping, the surrogate escape decoded names will not match the properly decoded blacklisted names.

The example is correct. The security risk is real. However, there's a flaw in the program, and the question of whether there's also a flaw in python is not so certain. Here's the line I'd say is contentious::

    blacklist = open("blacklist.big5", encoding='big5').read().split()

The blacklist file contains a list of filenames. However, this code treats it as a list of strings. This is a logic error in the program, and he should really be doing this::

    blacklist = open("blacklist.big5", 'rb').read().split()

Then, when comparing it against the values of sys.argv, either sys.argv gets converted into bytes (using the system locale since that's what was used to encode to unicode) or the items in blacklist get converted to unicode with surrogateescape.

The possible flaw in python is this: Code like the blog poster wrote passes python3 without an error or a warning. This gives the programmer no feedback that they're doing something wrong until it actually bites them in the foot in deployed code.

-Toshio
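One way to implement the bytes-side comparison described above is with os.fsencode(), which re-applies the filesystem encoding (with surrogateescape) that was used to decode the name in the first place. A minimal sketch, with the helper name and blacklist contents invented for illustration:

```python
import os

def is_blacklisted(name, blacklist):
    """Compare a decoded (str) file name against a blacklist of raw bytes.

    os.fsencode() round-trips losslessly through surrogateescape, so
    even names whose bytes were not decodable still compare correctly.
    """
    return os.fsencode(name) in blacklist

# Raw bytes, as they would come out of open('blacklist.big5', 'rb');
# the first entry is an undecodable-as-utf-8 big5 name.
blacklist = [b'\xa4\xaf\xa4G.txt', b'evil.txt']
```

The same approach works for `sys.argv`: encode each argument with os.fsencode() before the membership test instead of comparing decoded strings.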
Re: [Python-Dev] Security implications of pep 383
On Tue, Mar 29, 2011 at 10:55:47PM +0200, Victor Stinner wrote:

On Tuesday, 29 March 2011 at 22:40 +0200, Lennart Regebro wrote: The lesson here seems to be if you have to use blacklists, and you use unicode strings for those blacklists, also make sure the string you compare with doesn't have surrogates.

No. '\u4f60\u597d'.encode('big5').decode('latin1') gives '§A¦n' which doesn't contain any surrogate character. The lesson is: if you compare Unicode filenames on UNIX, make sure that your system is correctly configured (the locale encoding must be the filesystem encoding).

You're both wrong :-) Lennart is missing that you just need to use the same encoding + surrogateescape (or stick with bytes) for decoding the byte strings that you are comparing. You're missing that on UNIX there is no filesystem encoding so the idea of locale and filesystem encoding matching is false (and unnecessary -- the encodings that you use within python just need to be the same. They don't even need to match up to the reality of what's used on the filesystem or the user's locale.)

-Toshio
Re: [Python-Dev] Security implications of pep 383
On Wed, Mar 30, 2011 at 08:36:43AM +0200, Lennart Regebro wrote:

On Wed, Mar 30, 2011 at 07:54, Toshio Kuratomi a.bad...@gmail.com wrote: Lennart is missing that you just need to use the same encoding + surrogateescape (or stick with bytes) for decoding the byte strings that you are comparing.

You lost me here. I need to do this for what? The lesson here seems to be if you have to use blacklists, and you use unicode strings for those blacklists, also make sure the string you compare with doesn't have surrogates.

Really, surrogates are a red herring to this whole issue. The issue is that the original code was trying to compare two different transformations of byte sequences and expecting them to be equal. Let's say that you have the following byte value::

    b_test_value = b'\xa4\xaf'

This is something that's stored in a file or the filename of something on a unix filesystem or stored in a database or any number of other things. Now you want to compare that to another piece of data that you've read in from somewhere outside of python. You'd expect any of the following to work::

    b_test_value == b_other_byte_value
    b_test_value.decode('utf-8', 'surrogateescape') == b_other_byte_value.decode('utf-8', 'surrogateescape')
    b_test_value.decode('latin-1') == b_other_byte_value.decode('latin-1')
    b_test_value.decode('euc_jp') == b_other_byte_value.decode('euc_jp')

You wouldn't expect this to work::

    b_test_value.decode('latin-1') == b_other_byte_value.decode('euc_jp')

Once you see that, you realize that the following is only a specific case of the former; surrogateescape doesn't really matter::

    b_test_value.decode('utf-8', 'surrogateescape') == b_other_byte_value.decode('euc_jp')

-Toshio
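The expectations above can be checked directly; a small sketch using the same byte value:

```python
b_test_value = b'\xa4\xaf'

# Decoding both sides with the same codec keeps the comparison sound;
# surrogateescape merely makes utf-8 total over arbitrary bytes, and
# the round trip back to bytes is lossless.
s = b_test_value.decode('utf-8', 'surrogateescape')
assert s.encode('utf-8', 'surrogateescape') == b_test_value

# Decoding the two sides with different codecs silently produces
# unequal strings even though the underlying bytes were identical.
assert b_test_value.decode('latin-1') != s
```

The undecodable bytes come back as lone surrogates ('\udca4\udcaf' here), which is why they never compare equal to any properly decoded text.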
Re: [Python-Dev] PEP 396, Module Version Numbers
On Wed, Apr 06, 2011 at 11:04:08AM +0200, John Arbash Meinel wrote:

... #. ``__version_info__`` SHOULD be of the format returned by PEP 386's ``parse_version()`` function.

The only reference to parse_version in PEP 386 I could find was the setuptools implementation which is pretty odd:

    In other words, parse_version will return a tuple for each version string, that is compatible with StrictVersion but also accept arbitrary version and deal with them so they can be compared::

        >>> from pkg_resources import parse_version as V
        >>> V('1.2')
        ('0001', '0002', '*final')
        >>> V('1.2b2')
        ('0001', '0002', '*b', '0002', '*final')
        >>> V('FunkyVersion')
        ('*funkyversion', '*final')

Barry -- I think we want to talk about NormalizedVersion.from_parts() rather than parse_version().

bzrlib has certainly used 'version_info' as a tuple indication such as::

    version_info = (2, 4, 0, 'dev', 2)
    version_info = (2, 4, 0, 'beta', 1)
    version_info = (2, 3, 1, 'final', 0)

etc. This is mapping what we could sort out from Python's sys.version_info. The *really* nice bit is that you can do::

    if sys.version_info >= (2, 6):
        # do stuff for python 2.6(.0) and beyond

nod People like to compare versions and the tuple forms allow that. Note that the tuples you give don't compare correctly. This is the order that they sort::

    (2, 4, 0)
    (2, 4, 0, 'beta', 1)
    (2, 4, 0, 'dev', 2)
    (2, 4, 0, 'final', 0)

So that means, snapshot releases will always sort after the alpha and beta releases (and release candidate if you use 'c' to mean release candidate). Since the simple (2, 4, 0) tuple sorts before everything else, a comparison that doesn't work with the 2.4.0-alpha (or beta or arbitrary dev snapshots) would need to specify something like::

    (2, 4, 0, 'z')

NormalizedVersion.from_parts() uses nested tuples to handle this better. But I think that even with nested tuples a naive comparison fails since most of the suffixes are prerelease strings.
ie::

    ((2, 4, 0),)
    ((2, 4, 0), ('beta', 1))

So you can't escape needing a function to compare versions. (NormalizedVersion does this by letting you compare NormalizedVersions together).

Barry if this is correct, maybe __version_info__ is useless and I shouldn't have brought it up at pycon?

-Toshio
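The sort order described above is easy to verify with plain tuple comparison; a small sketch:

```python
versions = [
    (2, 4, 0, 'final', 0),
    (2, 4, 0, 'dev', 2),
    (2, 4, 0),
    (2, 4, 0, 'beta', 1),
]

# Tuples compare element by element: a shorter tuple sorts before any
# longer tuple it is a prefix of, and the string parts compare
# alphabetically ('beta' < 'dev' < 'final'), so dev snapshots land
# after betas and the bare release tuple sorts first.
ordered = sorted(versions)
```

This is exactly the problem: alphabetical ordering of the suffix strings does not match release ordering, so naive tuple comparison misranks dev snapshots.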
Re: [Python-Dev] open(): set the default encoding to 'utf-8' in Python 3.3?
On Tue, Jun 28, 2011 at 03:46:12PM +0100, Paul Moore wrote:

On 28 June 2011 14:43, Victor Stinner victor.stin...@haypocalc.com wrote: As discussed before on this list, I propose to set the default encoding of open() to UTF-8 in Python 3.3, and add a warning in Python 3.2 if open() is called without an explicit encoding and if the locale encoding is not UTF-8. Using the warning, you will quickly notice the potential problem (using Python 3.2.2 and -Werror) on Windows or by using a different locale encoding (e.g. using LANG=C).

-1. This will make things harder for simple scripts which are not intended to be cross-platform. I use Windows, and come from the UK, so 99% of my text files are ASCII. So the majority of my code will be unaffected. But in the occasional situation where I use a £ sign, I'll get encoding errors, where currently things will just work. And the failures will be data dependent, and hence intermittent (the worst type of problem). I'll write a quick script, use it once and it'll be fine, then use it later on some different data and get an error. :-(

I don't think this change would make things harder. It will just move where the pain occurs. Right now, the failures are intermittent on A) computers other than the one that you're using, or B) intermittent when run under a different user than yourself. Sys admins where I'm at are constantly writing ad hoc scripts in python that break because you stick something in a cron job and the locale settings suddenly become C and therefore the script suddenly only deals with ASCII characters. I don't know that Victor's proposed solution is the best (I personally would like it a whole lot more than the current guessing but I never develop on Windows so I can certainly see that your environment can lead to the opposite assumption :-) but something should change here.
Issuing a warning like "open used without explicit encoding may lead to errors" if open() is used without an explicit encoding would help a little (at least, people who get errors would then have an inkling that the culprit might be an open() call). If I read Victor's previous email correctly, though, he said this was previously rejected.

Another brainstorming solution would be to use different default encodings on different platforms. For instance, for writing files, utf-8 on *nix systems (including macosX) and utf-16 on windows. For reading files, check for a utf-16 BOM; if not present, operate as utf-8. That would seem to address your issue with detection by vim, etc but I'm not sure about getting £ in your input stream. I don't know where your input is coming from and how Windows' equivalent of locale plays into that.

-Toshio
Re: [Python-Dev] [PEPs] Rebooting PEP 394 (aka Support the /usr/bin/python2 symlink upstream)
On Fri, Aug 12, 2011 at 12:19:23PM -0400, Barry Warsaw wrote:

On Aug 12, 2011, at 01:10 PM, Nick Coghlan wrote: 1. Accept the reality of that situation, and propose a mechanism that minimises the impact of the resulting ambiguity on end users of Python by allowing developers to be explicit about their target language. This is the approach advocated in PEP 394. 2. Tell the Arch developers (and anyone else inclined to point the python name at python3) that they're wrong, and the python symlink should, now and forever, always refer to a version of Python 2.x.

FWIW, although I generally support the PEP, I also think that distros themselves have a responsibility to ensure their #! lines are correct, for scripts they install. Meaning, if it requires rewriting the #! line on OS package install, so be it.

+1 with the one caveat... it's nice to upstream fixes. If there's a simple thing like python == python-2 and python3 == python-3 everywhere, this is possible. If there's something like python2 == python-2 and python-3 == python3 everywhere, this is also possible. The problem is that: the latter is not the case (python from python.org itself doesn't produce a python2 symlink on install) and historically the former was the case but since python-dev rejected the notion that python == python-2 that is no longer true.

As long as it's just Arch, there's still time to go with #2. #1 is not a complete solution (especially because /usr/bin/python2 will never exist on some historical systems [not ones I run though, so someone else will need to beat that horse :-)]) but is better than where we are now where there is no guidance on what's right and wrong at all.

-Toshio
Re: [Python-Dev] Using PEP384 Stable ABI for the lzma extension module
On Wed, Oct 05, 2011 at 06:14:08PM +0200, Antoine Pitrou wrote:

On Wednesday, 5 October 2011 at 18:12 +0200, Martin v. Löwis wrote: Not sure what you are using it for. If you need to extend the buffer in case it is too small, there is absolutely no way this could work without copies in the general case because of how computers use address space. Even _PyBytes_Resize will copy the data.

That's not a given. Depending on the memory allocator, a copy can be avoided. That's why the str += str hack is much more efficient under Linux than Windows, AFAIK.

Even Linux will have to copy a block on realloc in certain cases, no?

Probably so. How often is totally unknown to me :)

http://www.gnu.org/software/libc/manual/html_node/Changing-Block-Size.html It depends on whether there's enough free memory after the buffer you currently have allocated. I suppose that this becomes a question of what people consider the general case :-)

-Toshio