Re: [Python-Dev] Should ftplib use UTF-8 instead of latin-1 encoding?

2009-01-23 Thread Toshio Kuratomi
Oleg Broytmann wrote:
 On Fri, Jan 23, 2009 at 02:35:01PM -0500, rdmur...@bitdance.com wrote:
 Given that a Unix OS can't know what encoding a filename is in (*),
 I can't see that one could practically implement a Unix FTP server
 in any other way.
 
Can you believe there is a well-known program that solved the issue?! It
 is Apache web server! One can configure different directories and different
 file types to have different encodings. I often do that. One (sysadmin) can
 even allow users to do the configuration themselves via .htaccess local files.
I am pretty sure FTP servers could borrow some ideas from Apache in this
 area. But they don't. Pity. :(
 
AFAIK, Apache is in the same boat as FTP servers.  You're thinking of
the encoding of the files' contents.  The problem here is with the file
names themselves.

-Toshio





Re: [Python-Dev] Ext4 data loss

2009-03-12 Thread Toshio Kuratomi
Antoine Pitrou wrote:
 Steven D'Aprano steve at pearwood.info writes:
 It depends on what you mean by temporary.

 Applications like OpenOffice can sometimes recover from an application 
 crash or even a systems crash and give you the opportunity to restore 
 the temporary files that were left lying around.
 
 For such files, you want deterministic naming in order to find them again, so
 you won't use the tempfile module...
 
Something that doesn't require deterministically named tempfiles is the
sequence from Ted Ts'o's explanation linked to earlier:

read data from important file
modify data
create tempfile
write data to tempfile
*sync tempfile to disk*
mv tempfile to filename of important file

The sync is necessary to ensure that the data has reached the disk
before the rename replaces the old file with the new one.
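
In Python that sequence looks roughly like this (a minimal sketch; real
code would also want error handling and to clean up the tempfile on
failure)::

import os
import tempfile

def safe_rewrite(path, new_data):
    # create the tempfile in the same directory so the rename stays on one filesystem
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or '.')
    os.write(fd, new_data)        # new_data is bytes
    os.fsync(fd)                  # *sync tempfile to disk*
    os.close(fd)
    os.rename(tmp_path, path)     # atomically replace the important file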

-Toshio





Re: [Python-Dev] Ext4 data loss

2009-03-12 Thread Toshio Kuratomi
Martin v. Löwis wrote:
 Something that doesn't require deterministically named tempfiles is the
 sequence from Ted Ts'o's explanation linked to earlier:

 read data from important file
 modify data
 create tempfile
 write data to tempfile
 *sync tempfile to disk*
 mv tempfile to filename of important file

 The sync is necessary to ensure that the data has reached the disk
 before the rename replaces the old file with the new one.
 
 You still wouldn't use the tempfile module in that case. Instead, you
 would create a regular file, with the name based on the name of the
 important file.
 
Uhm... why?  The requirements are:

1) the lifetime of the temporary file is under the app's control
2) the filename is available to the app so it can move it after the data is
written
3) the temporary file can be created on the same filesystem as the important
file.

All of those are doable using the tempfile module.
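
For instance (a sketch; the target filename is made up)::

import os
import tempfile

important = '/etc/myapp.conf'     # hypothetical file being replaced
# mkstemp: the app controls the lifetime (1), gets the generated name back (2),
# and dir= keeps the tempfile on the same filesystem as the target (3).
fd, tmp_name = tempfile.mkstemp(dir=os.path.dirname(important))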

-Toshio





Re: [Python-Dev] Ext4 data loss

2009-03-12 Thread Toshio Kuratomi
Martin v. Löwis wrote:
 The sync is necessary to ensure that the data has reached the disk
 before the rename replaces the old file with the new one.
 You still wouldn't use the tempfile module in that case. Instead, you
 would create a regular file, with the name based on the name of the
 important file.

 Uhm... why?
 
 Because it's much easier not to use the tempfile module, than to use it,
 and because the main purpose of the tempfile module is irrelevant to
 the specific application; the main purpose being the ability to
 auto-delete the file when it gets closed.
 
auto-delete is one of the nice features of tempfile.  Another feature,
though, that is entirely appropriate to this usage is creation of a
non-conflicting filename.

-Toshio





Re: [Python-Dev] Ext4 data loss

2009-03-12 Thread Toshio Kuratomi
Martin v. Löwis wrote:
 auto-delete is one of the nice features of tempfile.  Another feature,
 though, that is entirely appropriate to this usage is creation of a
 non-conflicting filename.
 
 Ok. In that use case, however, it is completely irrelevant whether the
 tempfile module calls fsync. After it has generated the non-conflicting
 filename, it's done.

If you're saying that it shouldn't call fsync automatically I'll agree
to that.  The message thread I was replying to seemed to say that
tempfiles didn't need to support fsync because they will be useless
after a system crash.  I'm just refuting that by showing that it is
useful to call fsync on tempfiles as one of the steps in preserving the
data in another file.

-Toshio





Re: [Python-Dev] Integrate BeautifulSoup into stdlib?

2009-03-24 Thread Toshio Kuratomi
Stephen J. Turnbull wrote:
 Chris Withers writes:

   - debian has an outdated and/or broken version of your package.

 True, but just as for the package system you are advocating, it's
 quite easy to set up your apt to use third-party repositories of
 Debian-style packages.  The question is whether those repositories
 exist.  Introducing yet another, domain-specific package manager will
 make it less likely that they do, and it will cause more work for
 downstream distributors like Debian and RH.

I haven't seen this mentioned so --

For many sites (including Fedora, the one I work on), the site maintains
a local yum/apt repository of packages that are necessary for getting
certain applications to run.  This way we are able to install a system
with a distribution that is maintained by other people and have local
additions that add more recent versions only where necessary.  This has
the following advantages:

1) We're able to track our changes to the base OS.
2) If the OS vendor releases an update that includes our fixes, we're
able to consume it without figuring out on which boxes we have to delete
what type of locally installed file (egg, jar, gem,
/usr/local/bin/program, etc).
3) We're using the OS vendor package management system for everything so
junior system admins can bootstrap a new machine with only familiarity
with that OS.  We don't have to teach them about rpm + eggs + gems +
where to find our custom repositories of each.
4) If we choose to, we can separate out different repositories for
different sets of machines.  Currently we have the main local repo and
one repo that only the builders pull from.

-Toshio





Re: [Python-Dev] Integrate BeautifulSoup into stdlib?

2009-03-24 Thread Toshio Kuratomi
Steve Holden wrote:

 Seems to me that while all this is fine for developers and Python users
 it's completely unsatisfactory for people who just want to use Python
 applications. For them it's much easier if each application comes with
 all dependencies including the interpreter.
 
 This may seem wasteful, but it removes many of the version compatibility
 issues that otherwise bog things down.
 
The upfront cost of bundling is lower but the maintenance cost is
higher.  For instance, OS vendors have developed many ways of being
notified of and dealing with security issues.  If there's a security
issue with gtkmozdev and the python bindings to it have to be
recompiled, OS vendors will be alerted to it and have the opportunity to
release updates on zero day, the day that the security announcement goes
out.

Bundled applications suffer in several ways here:
1) the developers of the applications are unlikely to be on vendor-sec
and so the opportunity for zero day fixes is lower.

2) the developer becomes responsible for fixing problems with the
libraries, something that they often do not do.  This is especially true
when developers start depending not only on newer features of some
libraries but also on older versions of others (because of API changes).
Many developers don't realize that requiring a newer version of a library
is at least supported by upstream, whereas requiring an older version
leaves them as the sole responsible party.

3) Over time, bundled libraries tend to become forked versions.  And
worse, privately forked versions.  If three python apps all use slightly
different older versions of libfoo-python and have backported fixes,
added new features, etc., it is a nightmare for a system administrator or
packager to get them running with a single version from the system
library or to forward-port them.  And because they're private forks, the
developers lose out on collaborating on security, bugfixes, etc., because
they are doing their work in isolation from the other forks.

-Toshio





Re: [Python-Dev] Integrate BeautifulSoup into stdlib?

2009-03-24 Thread Toshio Kuratomi
David Cournapeau wrote:
 2009/3/24 Toshio Kuratomi a.bad...@gmail.com:
 Steve Holden wrote:

 Seems to me that while all this is fine for developers and Python users
 it's completely unsatisfactory for people who just want to use Python
 applications. For them it's much easier if each application comes with
 all dependencies including the interpreter.

 This may seem wasteful, but it removes many of the version compatibility
 issues that otherwise bog things down.

 The upfront cost of bundling is lower but the maintenance cost is
 higher.  For instance, OS vendors have developed many ways of being
 notified of and dealing with security issues.  If there's a security
 issue with gtkmozdev and the python bindings to it have to be
 recompiled, OS vendors will be alerted to it and have the opportunity to
 release updates on zero day, the day that the security announcement goes
 out.
 
 I don't think bundling should be compared to depending on the system
 libraries, but as a lesser evil compared to requiring multiple,
 system-wide installed libraries.
 
Well.. I'm not so sure it's even a win there.  If the libraries are
installed system-wide, at least the consumer of the application knows:

1) Where to find all the libraries to audit the versions when a security
issue is announced.
2) That the library is unforked from upstream.
3) That all the consumers of the library version have a central location
to collaborate on announcing fixes to the library.

With my distribution packager hat on, I can say I dislike both multiple
versions and bundling but I definitely dislike bundling more.

 3) Over time, bundled libraries tend to become forked versions.  And
 worse, privately forked versions.  If three python apps all use slightly
 different older versions of libfoo-python and have backported fixes,
 added new features, etc it is a nightmare for a system administrator or
 packager to get them running with a single version from the system
 library or forward port them.  And because they're private forks the
 developers lose out on collaborating on security, bugfixes, etc because
 they are doing their work in isolation from the other forks.
 
 This is a purely technical problem, and can be handled by good source
 control systems, no ?
 
No.  This is a social problem.  Good source control only helps if I am
tracking upstream's trunk so I'm aware of the direction that their
changes are headed.  But there's a wide range of reasons why
application developers who bundle libraries don't do that:

1) not enough time in a day.  I'm working full-time on making my
application better.  Plus I have to update all these bundled libraries
from time to time, testing that the updates don't break anything.  I
don't have time to track trunk for all these libraries -- I barely have
time to track releases.

2) My release schedule doesn't mesh with all of the upstream libraries
I'm bundling.  When I want to release Foo-1.0, I want to have some
assurance that the libraries I'm bundling will do the right thing.
Since releases see more testing than trunk, tracking trunk for twenty
bundled libraries is a lot less attractive than tracking release branches.

3) This doesn't help with the fact that my bundled version of the
library and your bundled version of the library are being developed in
isolation from each other.  This needs central coordination, which people
who believe in bundling libraries are very unlikely to pursue.

-Toshio





Re: [Python-Dev] Integrate BeautifulSoup into stdlib?

2009-03-24 Thread Toshio Kuratomi
Tres Seaver wrote:
 David Cournapeau wrote:
 I am afraid that distutils, and
 setuptools, are not really the answer to the problem, since while they
 may (as intended) guarantee that Python applications can be installed
 uniformly across different platforms they also more or less guarantee
 that Python applications are installed differently from all other
 applications on the platform.
 I think they should be part of the solution, in the sense that they
 should allow easier packaging for the different platforms (linux,
 windows, mac os x and so on). For now, they make things much harder
 than they should (difficult to follow the FHS, etc...).
 
 FHS is something which packagers / distributors care about:  I strongly
 doubt that the end users will ever notice, particularly for silliness
 like 'bin' vs. 'sbin', or architecture-specific vs. 'noarch' rules.
 
That's because you're thinking of a different class of end-user than the
FHS is targeting.  Someone who wants to install a web application on a
limited number of machines (one, in the home-desktop scenario), or someone
who makes their living helping people install the software they've
written, has a whole different view on things than someone who's trying
to install and maintain the software in fifteen computer labs on a
campus, or the person writing software in their spare time that has to be
portable to tens of different platforms, for whom every bit of answering
end users' questions, tracking other upstreams for security bugs, etc.,
is time taken away from coding.

Following FHS means that the software will work for both end-users who
don't care about the nitty-gritty of the FHS and system administrators
of large sites.  Disregarding the FHS because it is silliness means
that system administrators are going to have to special-case your
application, decide not to install it at all, or pay someone else to
support it.

Note that those things do make sense sometimes.  For instance, when an
application is not intended to be distributed to a large number of
outside entities (Facebook, Flickr, etc.) or when your revenue stream
comes from installing and administering a piece of software for
other companies.

-Toshio





Re: [Python-Dev] Integrate BeautifulSoup into stdlib?

2009-03-24 Thread Toshio Kuratomi
David Cournapeau wrote:
 On Wed, Mar 25, 2009 at 1:45 AM, Toshio Kuratomi a.bad...@gmail.com wrote:
 David Cournapeau wrote:
 2009/3/24 Toshio Kuratomi a.bad...@gmail.com:
 Steve Holden wrote:

 Seems to me that while all this is fine for developers and Python users
 it's completely unsatisfactory for people who just want to use Python
 applications. For them it's much easier if each application comes with
 all dependencies including the interpreter.

 This may seem wasteful, but it removes many of the version compatibility
 issues that otherwise bog things down.

 The upfront cost of bundling is lower but the maintenance cost is
 higher.  For instance, OS vendors have developed many ways of being
 notified of and dealing with security issues.  If there's a security
 issue with gtkmozdev and the python bindings to it have to be
 recompiled, OS vendors will be alerted to it and have the opportunity to
 release updates on zero day, the day that the security announcement goes
 out.
 I don't think bundling should be compared to depending on the system
 libraries, but as a lesser evil compared to requiring multiple,
 system-wide installed libraries.

 Well.. I'm not so sure it's even a win there.  If the libraries are
 installed system-wide, at least the consumer of the application knows:

 1) Where to find all the libraries to audit the versions when a security
 issue is announced.
 2) That the library is unforked from upstream.
 3) That all the consumers of the library version have a central location
 to collaborate on announcing fixes to the library.

 Yes, those are problems, but installing multiple libraries has a lot of
 problems too:
  - quickly, by enabling multiple installed versions, people become very
 sloppy about handling versions of the dependencies, and this increases
 the number of installed libraries a lot - so the advantages above for
 system-wide installation become intractable quite quickly

This is somewhat true.  Sloppiness and a growing number of libraries are
bad.  But there are checks on this sloppiness.  Distributions, for
instance, are quite active about porting software to use only a subset of
versions.  So in the open source world, there's a large number of players
interested in keeping the number of versions down.  Using multiple
library versions will at least point people at where work needs to be done,
whereas bundling hides it behind the monolithic bundle.

  - bundling also supports a real use case which cannot be solved by
 rpm/deb AFAIK: installation without administration privileges.

This is only sort of true.  You can install rpms into a local directory
without root privileges using a command-line switch.  But rpm/deb are
optimized for system administrators, so the documentation on doing this
is not well done.  There can also be code issues with doing things this
way, but those issues can affect bundled apps as well.  And finally, since
rpm's primary use is installing systems, the toolset around it builds
systems.  So it's a lot easier to build a private root filesystem than
it is to cherry-pick a single package.  It should be possible to create a
tool that merges a system rpmdb and a user's local rpmdb using the
existing API, but I'm not aware of any applications built to do that yet.

  - multi-version installation gives very fragile systems. That's
 actually my number one complaint in python: setuptools has caused me
 numerous headaches, and I got many bug reports because you often do not
 know why one version was loaded instead of another one.

I won't argue for setuptools' implementation of multi-version.  It
sucks.  But multi-version can be done well.  Sonames in C libraries are
a simple system that does this better.

 So I am not so convinced multiple-version is better than bundling - I
 can see how it sometimes can be, but I am not sure those are that
 important in practice.

Bundling is always harmful.  Whether multiple versioning is any better
is certainly debatable :-)

 No.  This is a social problem.  Good source control only helps if I am
 tracking upstream's trunk so I'm aware of the direction that their
 changes are headed.  But there's a wide range of reasons that
 application developers that bundle libraries don't do that:

 1) not enough time in a day.  I'm working full-time on making my
 application better.  Plus I have to update all these bundled libraries
 from time to time, testing that the updates don't break anything.  I
 don't have time to track trunk for all these libraries -- I barely have
 time to track releases.

 Yes, but in that case, there is nothing you can do. Putting everything
 in one project is always easier than splitting into modules, coding
 and deployment-wise. That's just one side of the speed of development
 vs maintenance issue IMHO.

 3) This doesn't help with the fact that my bundled version of the
 library and your bundled version of the library are being developed in
 isolation from each other.  This needs central coordination which people
 who believe bundling

Re: [Python-Dev] Integrate BeautifulSoup into stdlib?

2009-03-25 Thread Toshio Kuratomi
Barry Warsaw wrote:

 Tools like setuptools, zc.buildout, etc. seem great for developers but
 not very good for distributions.  At last year's Pycon I think there was
 agreement from the Linux distributors that distutils, etc. just wasn't
 very useful for them.
 
Distutils is decent for modules but has limitations that we run up against
somewhat frequently.  It's a horror for applications.

-Toshio





Re: [Python-Dev] Integrate BeautifulSoup into stdlib?

2009-03-26 Thread Toshio Kuratomi
David Cournapeau wrote:
 I won't argue for setuptools' implementation of multi-version.  It
 sucks.  But multi-version can be done well.  Sonames in C libraries are
 a simple system that does this better.
 
 I would say simplistic instead of simple :) what works for C won't
 necessarily work for python - and even in C, library versioning is not
 used that often except for a few core libraries. Library versioning
 works in C because the C model is very simple. It already breaks for C++.

I'm not sure what you're talking about here.  Library versioning is used
for practically every library on a Linux system.  My limited exposure to
the BSDs and Solaris was the same.  (If you're only talking about Windows,
well, does Windows even have sonames?)  I can name only one library in
Fedora right now that isn't versioned, and may have heard of five total.
Perhaps you are thinking of symbol versioning?  If so, only a few
libraries are using that.  But specifying backwards compatibility via
the soname is well known and ubiquitous.
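
A small illustration (assuming a glibc-based Linux box)::

import ctypes
import ctypes.util

soname = ctypes.util.find_library('c')   # typically 'libc.so.6'; the 6 is the ABI promise
libc = ctypes.CDLL(soname)               # consumers bind to the soname, not to the real libc-2.x.so file
print(libc.abs(-5))                      # prints 5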

 More high-level languages like C# already have a more complicated
 scheme (GAC) - and my impression is that it did not work that well.
 The SxS for dll on recent windows to handle multiple version is a
 nightmare too in my (limited) experience.
 
Looking at C#/Mono/.net for examples is perfectly horrid.  They've taken
inferior library versioning and bad development practices and added
technology (the GAC) as the solution.  If you want an idea of what
python should avoid at all costs, look to that arena for your answer.

* Note that setuptools' multi-version implementation has some things
in common with the GAC -- for instance, using directories to separate
versions instead of filenames.  setuptools' implementation could be made
better by studying the GAC and taking things like caching of lookups
from it, but I don't encourage this... I think the design itself is flawed.

-Toshio





Re: [Python-Dev] setuptools has divided the Python community

2009-03-26 Thread Toshio Kuratomi
Guido van Rossum wrote:
 On Wed, Mar 25, 2009 at 9:40 PM, Tarek Ziadé ziade.ta...@gmail.com wrote:
 I think Distutils (and therefore Setuptools) should provide some APIs
 to play with special files (like resources) and to mark them as being 
 special,
 no matter where they end up in the target system.

 So the code inside the package can use these files seamlessly no matter
 what the system is
 and no matter where the files have been placed by the packager.

 This has been discussed already but not clearly defined.
 
 Yes, this should be done. PEP 302 has some hooks but they are optional
 and not available for the default case. A simple wrapper to access a
 resource file relative to a given module or package would be easy to
 add. It should probably support four APIs:
 
 - Open as a binary stream
 - Open as a text stream
 - Get contents as a binary string
 - Get contents as a text string
 
Depending on the definition of a resource there's additional
information that could be needed.  For instance, if resource includes
message catalogs, then being able to get the base directory that the
catalogs reside in is needed for passing to gettext.
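
As a rough sketch, the four access modes listed above could be layered on
the existing pkgutil.get_data -- assuming the resource lives inside an
ordinary package, so this doesn't address the gettext/base-directory case::

import io
import pkgutil

def get_bytes(package, name):
    """Get contents as a binary string."""
    data = pkgutil.get_data(package, name)
    if data is None:
        raise IOError('no such resource: %s/%s' % (package, name))
    return data

def get_text(package, name, encoding='utf-8'):
    """Get contents as a text string."""
    return get_bytes(package, name).decode(encoding)

def open_binary(package, name):
    """Open as a binary stream."""
    return io.BytesIO(get_bytes(package, name))

def open_text(package, name, encoding='utf-8'):
    """Open as a text stream."""
    return io.StringIO(get_text(package, name, encoding))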

I'd be very happy if resource didn't encompass that type of thing,
though... then we could have a separate interface that addressed the
issues with them.  I'll be at PyCon (flying in late tonight, though, and
leaving Sunday) if Tarek and others want to get ahold of me to discuss
possible ways to address what's a resource, what's not, and what we
would need to handle the different cases.

-Toshio





Re: [Python-Dev] setuptools has divided the Python community

2009-03-27 Thread Toshio Kuratomi
Guido van Rossum wrote:
 2009/3/26 Toshio Kuratomi a.bad...@gmail.com:
 Guido van Rossum wrote:
 On Wed, Mar 25, 2009 at 9:40 PM, Tarek Ziadé ziade.ta...@gmail.com wrote:
 I think Distutils (and therefore Setuptools) should provide some APIs
 to play with special files (like resources) and to mark them as being 
 special,
 no matter where they end up in the target system.

 So the code inside the package can use these files seamlessly no matter
 what the system is
 and no matter where the files have been placed by the packager.

 This has been discussed already but not clearly defined.
 Yes, this should be done. PEP 302 has some hooks but they are optional
 and not available for the default case. A simple wrapper to access a
 resource file relative to a given module or package would be easy to
 add. It should probably support four APIs:

 - Open as a binary stream
 - Open as a text stream
 - Get contents as a binary string
 - Get contents as a text string

 Depending on the definition of a resource there's additional
 information that could be needed.  For instance, if resource includes
 message catalogs, then being able to get the base directory that the
 catalogs reside in is needed for passing to gettext.
 
 Well the whole point is that for certain loaders (e.g. zip files)
 there *is* no base directory. If you do need directories you won't be
 able to use PEP-302 loaders, and you can just use
 os.path.dirname(some_module.__file__).
 
Yep.  And a loader that has no base directory at all simply can't satisfy
some of these cases.

So one way to fix this is to define resource narrowly enough that these
cases fall outside of it.

Current setuptools works around this by having an API in pkg_resources that
unzips the resource when a real filename is needed rather than just
retrieving the data from the file.  So a second option is to have other API
methods that allow this.
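
The pkg_resources call in question is resource_filename; for example
(package and resource names made up)::

import gettext
import pkg_resources

# For a zipped egg this extracts the resource to a cache directory and returns
# a real path; for an unpacked package it just returns the path inside it.
locale_dir = pkg_resources.resource_filename('mypkg', 'locale')
t = gettext.translation('mypkg', localedir=locale_dir, fallback=True)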

-Toshio





Re: [Python-Dev] Rethinking intern() and its data structure

2009-04-10 Thread Toshio Kuratomi
Robert Collins wrote:

 Certainly, import time is part of it:
 robe...@lifeless-64:~$ python -m timeit -s 'import sys;  import
 bzrlib.errors' del sys.modules['bzrlib.errors']; import bzrlib.errors
 10 loops, best of 3: 18.7 msec per loop
 
 (errors.py is 3027 lines long with 347 exception classes).
 
 We've also looked lower - python does a lot of stat operations search
 for imports and determining if the pyc is up to date; these appear to
 only really matter on cold-cache imports (but they matter a lot then);
 in hot-cache situations they are insignificant.
 
At PyCon, Tarek, Georg, and I talked about a way to handle multi-version
installs and also speed up exactly this import problem in the future.  I had
to leave before the hackfest got started, though, so I don't know where
the idea went from there.  Tarek, did this idea progress any?

-Toshio





Re: [Python-Dev] #!/usr/bin/env python -- python3 where applicable

2009-04-20 Thread Toshio Kuratomi
Greg Ewing wrote:
 Steven Bethard wrote:
 
 That's an unfortunate decision. When the 2.X line stops being
 maintained (after 2.7 maybe?) we're going to be stuck with the 3
 suffix forever for the real Python.
 
 I don't see why we have to be stuck with it forever.
 When 2.x has faded into the sunset, we can start
 aliasing 'python' to 'python3' if we want, can't we?
 
You could, but it's not my favorite idea.  It gets people used to the idea
of python == python2 and python3 == python3 as something they can count
on, and then says, Oops, that was just an implementation detail; we're
changing that now.  Much better to either make a clean break and call
the new language dialect python3 from now and forever, or force people to
come up with answers to whether /usr/bin/python == python2 or python3
right now, while it's fresh and relevant in their minds.

-Toshio





Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-24 Thread Toshio Kuratomi
Glenn Linderman wrote:
 On approximately 4/24/2009 11:40 AM, came the following characters from
 And so my encoding (1) doesn't alter the data stream for any valid
 Windows file name, and where the naivest of users reside (2) doesn't
 alter the data stream for any Posix file name that was encoded as UTF-8
 sequences and doesn't contain ? characters in the file name [I perceive
 the use of ? in file names to be rare on Posix, because of experience,
 and because of the other problems caused by such use] (3) doesn't
 introduce data puns within applications that are correctly coded to know
 the encoding occurs.  The encoding technique in the PEP not only can
 produce data puns, thus not being reversible, it provides no reliable
 mechanism to know that this has occurred.
 
Uhm...  Not arguing with your goals, but '?' is unfortunately reasonably
easy to get into a filename.  For instance, I've had to download a lot
of scratch-built packages from our buildsystem recently.  Scratch builds
have URLs with query strings in them, so::

wget
'http://koji.fedoraproject.org/koji/getfile?taskID=1318059&name=monodevelop-debugger-gdb-2.0-1.1.i586.rpm'

which results in the filename:
  getfile?taskID=1318059&name=monodevelop-debugger-gdb-2.0-1.1.i586.rpm

-Toshio





Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-24 Thread Toshio Kuratomi
Terry Reedy wrote:

 Is NUL \0 allowed in POSIX file names?  If not, could that be used as an
 escape char.  If it is not legal, then custom translated strings that
 escape in the wild would raise a red flag as soon as something else
 tried to use them.
 
AFAIK NUL can't appear in a filename, so it should be okay to use as an
escape char, but I haven't read a specification to reach that conclusion.
Is that a proposal?  Should I go find someone who has read the relevant
standards to find out?

-Toshio





Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Toshio Kuratomi
Zooko O'Whielacronx wrote:
 On Apr 28, 2009, at 6:46 AM, Hrvoje Niksic wrote:
 If you switch to iso8859-15 only in the presence of undecodable UTF-8,
 then you have the same round-trip problem as the PEP: both b'\xff' and
 b'\xc3\xbf' will be converted to u'\u00ff' without a way to
 unambiguously recover the original file name.
 
 Why do you say that?  It seems to work as I expected here:
 
 >>> '\xff'.decode('iso-8859-15')
 u'\xff'
 >>> '\xc3\xbf'.decode('iso-8859-15')
 u'\xc3\xbf'

 >>> '\xff'.decode('cp1252')
 u'\xff'
 >>> '\xc3\xbf'.decode('cp1252')
 u'\xc3\xbf'
 

You're not showing that this is a fallback path.  What won't work is
first trying a local encoding (in the following example, utf-8) and then
if that doesn't work, trying a one-byte encoding like iso8859-15:

try:
    file1 = '\xff'.decode('utf-8')
except UnicodeDecodeError:
    file1 = '\xff'.decode('iso-8859-15')
print repr(file1)

try:
    file2 = '\xc3\xbf'.decode('utf-8')
except UnicodeDecodeError:
    file2 = '\xc3\xbf'.decode('iso-8859-15')
print repr(file2)


That prints:
  u'\xff'
  u'\xff'

The two encodings can map different bytes to the same unicode code point,
so you can't do this type of thing without recording which encoding was
used in the translation.

-Toshio





Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Toshio Kuratomi
Martin v. Löwis wrote:
 Since the serialization of the Unicode string is likely to use UTF-8,
 and the string for  such a file will include half surrogates, the
 application may raise an exception when encoding the names for a
 configuration file. These encoding exceptions will be as rare as the
 unusual names (which the careful I18N aware developer has probably
 eradicated from his system), and thus will appear late.
 
 There are trade-offs to any solution; if there was a solution without
 trade-offs, it would be implemented already.
 
 The Python UTF-8 codec will happily encode half-surrogates; people argue
 that it is a bug that it does so, however, it would help in this
 specific case.

Can we use this encoding scheme for writing into files as well?  We've
turned the filename with undecodable bytes into a string with half
surrogates.  Putting that string into a file has to turn them into bytes
at some level.  Can we use the python-escape error handler to achieve
that somehow?
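
Roughly this kind of round trip is what I mean (written with the
'surrogateescape' spelling that the PEP's error handler ended up using)::

raw = b'caf\xe9.txt'                                  # latin-1 bytes, not valid UTF-8
name = raw.decode('utf-8', 'surrogateescape')         # -> 'caf\udce9.txt', a half surrogate

with open('names.cfg', 'wb') as f:
    f.write(name.encode('utf-8', 'surrogateescape'))  # the original \xe9 byte comes back out

assert open('names.cfg', 'rb').read() == raw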

-Toshio





Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Toshio Kuratomi
Thomas Breuel wrote:
 Not for me (I am using Python 2.6.2).
 
 >>> f = open(chr(255), 'w')
 Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
 IOError: [Errno 22] invalid mode ('w') or filename: '\xff'

 You can get the same error on Linux:

 $ python
 Python 2.6.2 (release26-maint, Apr 19 2009, 01:56:41)
 [GCC 4.3.3] on linux2
 Type "help", "copyright", "credits" or "license" for more information.
 >>> f=open(chr(255),'w')
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
 IOError: [Errno 22] invalid mode ('w') or filename: '\xff'

 
 (Some file system drivers do not enforce valid utf8 yet, but I suspect
 they will in the future.)
 
Do you suspect that from discussing the issue with kernel developers or
from reading a thread on LKML?  If not, then your suspicion seems to be
pretty groundless.

The fact that VFAT enforces an encoding does not lend itself to your
argument for two reasons:

1) VFAT is not a Unix filesystem.  It's a filesystem that's compatible
with Windows/DOS.  If Windows and DOS have filesystem encodings, then it
makes sense for that driver to enforce that as well.  Filesystems
intended to be used natively on Linux/Unix do not necessarily make this
design decision.

2) The encoding is specified when mounting the filesystem.  This means
that you can still mix encodings in a number of ways.  If you mount with
an encoding that has full byte coverage, for instance, each user can put
filenames from different encodings on there.  If you mount with utf8 on
a system which uses euc-jp as the default encoding, you can have full
paths that contain a mix of utf-8 and euc-jp.  Etc.

-Toshio





Re: [Python-Dev] Promoting Python 3 [was: PyPy 1.7 - widening the sweet spot]

2011-11-22 Thread Toshio Kuratomi
On Wed, Nov 23, 2011 at 01:41:46AM +0900, Stephen J. Turnbull wrote:
 Barry Warsaw writes:
 
   Hopefully, we're going to be making a dent in that in the next version of
   Ubuntu.
 
 This is still a big mess in Gentoo and MacPorts, though.  MacPorts
 hasn't done anything about creating a transition infrastructure AFAICT.
 Gentoo has its eselect python set VERSION stuff, but it's very
 dangerous to set to a Python 3 version, as many things go permanently
 wonky once you do.  (So far I've been able to work around problems
 this creates, but it's not much fun.)  I have no experience with this
 in Debian, Red Hat (and derivatives) or *BSD, but I have to suspect
 they're no better.  (Well, maybe Red Hat has learned from its 1.5.2
 experience! :-)
 
For Fedora (and currently, Red Hat is based on Fedora -- a little more about
that later, though), we have parallel python2 and python3 stacks.  As time
goes on we've slowly brought more python-3 compatible modules onto the
python3 stack (I believe someone had the goal a year and a half ago to get
a complete pylons web development stack running on python3 on Fedora which
brought a lot of packages forward).

Unlike Barry's work with Ubuntu, though, we're mostly chiselling around the
edges; we're working at the level where there's a module that someone needs
to run something (or run some optional features of something) that runs on
python3.

 I don't have any connections to the distros, so can't really offer to
 help directly.  I think it might be a good idea for users to lobby
 (politely!)  their distros to work on the transition.
 
Where distros aren't working on parallel stacks, there definitely needs to
be some transition plan.  With my experience with parallel stacks, the best
help there is to 1) help upstreams port to py3k (If someone can get PIL's
py3k support finished and into a released package, that would free up a few
things).  2) open bugs or help with creating python3 packages of modules
when the upstream support is there.

Depending on what software Barry's talking about porting to python3, that
could be a big incentive as well.  Just like with the push in Fedora to have
pylons run on python3, I think that having certain applications that run on
python3 and therefore need to have stacks of modules that support it is one
of the prime ways that distros become motivated to provide python3 packages
and support.  This is basically the killer app idea in a new venue :-)

-Toshio




Re: [Python-Dev] Python 3.4 Release Manager

2011-11-22 Thread Toshio Kuratomi
On Tue, Nov 22, 2011 at 08:27:24PM -0800, Raymond Hettinger wrote:
 
 On Nov 22, 2011, at 7:50 PM, Larry Hastings wrote:
  But look!  I'm already practicing: NO YOU CAN'T CHECK THAT IN.  How's that? 
   Needs work?
 
 You could try a more positive leadership style:  THAT LOOKS GREAT, I'M SURE 
 THE RM FOR PYTHON 3.5 WILL LOVE IT ;-)
 
Wow!  My release engineering team needs to take classes from you guys!

-Toshio




Re: [Python-Dev] Fwd: Anyone still using Python 2.5?

2011-12-21 Thread Toshio Kuratomi
On Thu, Dec 22, 2011 at 02:49:06AM +0100, Victor Stinner wrote:
 
 Do people still have to use this in commercial environments or is
 everyone on 2.6+ nowadays?
 
 At work, we are still using Python 2.5. Six months ago, we started a
 project to upgrade to 2.7, but we have now more urgent tasks, so the
 upgrade is delayed to later. Even if we upgrade new clients to 2.7,
 we will have to continue to support 2.5 for some more months (or
 years?).
 
At my work, I'm on RHEL5 and RHEL6.  So I'm currently supporting python-2.4
and python-2.6.  We're up to 75% RHEL6 (though, not the machines where most
of our deployed, custom written apps are running) so I shouldn't have to
support python-2.4 for much longer.

 In a personal project (the IPy library), I dropped support of Python
 2.5 in february 2011. Recently, I got a mail asking me where the
 previous version of my library (supporting Python 2.4) can be
 downloaded! Someone is still using Python 2.4: I'm stuck with python
 2.4 in my work environment.
 
As part of work, I package for EPEL5 (addon packages for RHEL5).  Sometimes
we need a new version of a package or a new package for RHEL5 and thus need
to have python-2.4 compatible versions of the package and any of its
dependencies.

When I no longer need to maintain python-2.4 stuff for work, I'm hoping not
to have to do quite so much of this, but I know I'll still sometimes get
requests to update an existing package to fix a bug or add a feature, and
that will require updates of dependent libraries.  I'll still be stuck
looking for python-2.4 compatible versions of all of these :-(

 What do people feel?
 
 For a new project, try to support Python 2.5, especially if you would
 like to write a portable library. For a new application working on
 Mac OS X, Windows and Linux, you can only support Python 2.6.
 
I agree that libraries have a need to go farther back than applications.
I have one library that I support on python-2.3 (for RHEL4... I'm counting
down the months on that one :-).  Every other library I maintain, I make sure
I support at least python-2.4.

Application-wise, I currently have to support python-2.4+ but given that
Linux distros seem to all have some version out that supports at least
python-2.6, I don't think I'll be developing any applications that
intentionally support less than that once I get moved away from RHEL-5 at my
workplace.

-Toshio




Re: [Python-Dev] Hash collision security issue (now public)

2012-01-05 Thread Toshio Kuratomi
On Thu, Jan 05, 2012 at 08:35:57PM +, Paul Moore wrote:
 On 5 January 2012 19:33, David Malcolm dmalc...@redhat.com wrote:
  We have similar issues in RHEL, with the Python versions going much
  further back (e.g. 2.3)
 
  When backporting the fix to ancient python versions, I'm inclined to
  turn the change *off* by default, requiring the change to be enabled via
  an environment variable: I want to avoid breaking existing code, even if
  such code is technically relying on non-guaranteed behavior.  But we
  could potentially tweak mod_python/mod_wsgi so that it defaults to *on*.
  That way /usr/bin/python would default to the old behavior, but web apps
  would have some protection.   Any such logic here also suggests the need
  for an attribute in the sys module so that you can verify the behavior.
 
 Uh, surely no-one is suggesting backporting to ancient versions? I
 couldn't find the statement quickly on the python.org website (so this
 is via google), but isn't it true that 2.6 is in security-only mode
 and 2.5 and earlier will never get the fix?

I think when dmalcolm says backporting he means that he'll have to
backport the fix from a modern, supported-by-python.org Python to the ancient
Pythons that he's supporting as part of the Linux distributions where he's
the Python package maintainer.

I'm thinking he's mentioning it here mainly to see if his approach for those
distributions causes anyone to point out a reason not to diverge from
upstream in that manner.

 Having a source-only
 release for 2.6 means the fix is off by default in the sense that
 you can choose not to build it. Or add a #ifdef to the source if it
 really matters.
 
I don't think that this would satisfy dmalcolm's needs.  What he's talking
about sounds more like a runtime switch (possibly only when initializing,
though, not on-the-fly).

-Toshio




Re: [Python-Dev] PEP 411: Provisional packages in the Python standard library

2012-02-11 Thread Toshio Kuratomi
On Sat, Feb 11, 2012 at 04:32:56PM +1000, Nick Coghlan wrote:
 
 This would then be seen by pydoc and help(), as well as being amenable
 to programmatic inspection.
 
Would using
warnings.warn('This is a provisional API and may change radically from'
' release to release', ProvisionalWarning)

where ProvisionalWarning is a new exception/warning category (a subclass of
FutureWarning?) be considered too intrusive?
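
Something like this, for concreteness (ProvisionalWarning here is
hypothetical, not an existing category)::

import warnings

class ProvisionalWarning(FutureWarning):
    """The imported API is provisional and may change between releases."""

# emitted at import time by the provisional module itself
warnings.warn('This is a provisional API and may change radically from '
              'release to release', ProvisionalWarning, stacklevel=2)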

-Toshio




Re: [Python-Dev] #12982: Should -O be required to *read* .pyo files?

2012-06-13 Thread Toshio Kuratomi
On Wed, Jun 13, 2012 at 01:58:10PM -0400, R. David Murray wrote:
 
 OK, but you didn't answer the question :).  If I understand correctly,
 everything you said applies to *writing* the bytecode, not reading it.
 
 So, is there any reason to not use the .pyo file (if that's all that is
 around) when -O is not specified?
 
 The only technical reason I can see why -O should be required for a .pyo
 file to be used (*if* it is the only thing around) is if it won't *run*
 without the -O switch.  Is there any expectation that that will ever be
 the case?
 
Yes.  For instance, if I create a .pyo with -OO, it won't have docstrings.
Another piece of code can legally import that and try to use the docstring
for something.  This would fail if only the .pyo was present.

Of course, it would also fail under the present behaviour since no .py or
.pyc was present to be imported.  The error that's displayed might be
clearer if we fail when attempting to read a .py/.pyc rather than failing
when the docstring is found to be missing, though.
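
For instance, something like this would break (a contrived sketch; mymodule
and its main function are made up, and we assume the interpreter loaded
mymodule.pyo without -O)::

import mymodule

# A perfectly legal use of a docstring by a consumer; if the .pyo was built
# with -OO, __doc__ is None and this raises AttributeError.
summary = mymodule.main.__doc__.splitlines()[0]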

-Toshio




[Python-Dev] Python-3.0, unicode, and os.environ

2008-12-04 Thread Toshio Kuratomi
I opened up bug http://bugs.python.org/issue4006 a while ago and it was
suggested in the report that it's not a bug but a feature and so I
should come here to see about getting the feature changed :-)

I have a specific problem with os.environ and a somewhat less important
architectural issue with the unicode/bytes handling in certain os.*
modules.  I'll start with the important one:

Currently in python3 there's no way to get at environment variables that
are not encoded in the system default encoding.  My understanding is
that this isn't a problem on Windows systems but on *nix this is a huge
problem.  environment variables on *nix are a sequence of non-null
bytes.  These bytes are almost always characters but they do not have
to be.  Further, there is nothing that requires that the characters be
in the same encoding; some of the characters could be in the UTF-8
character set while others are in latin-1, shift-jis, or big-5.

These mixed encodings can occur for a variety of reasons.  Here's an
example that isn't too contrived :-)

Swallow is a multi-user shell server hosted at a university in Japan.
The OS installed is Fedora 10, where the encoding of all filenames
provided by the OS is UTF-8.  The administrator of the OS has kept this
convention and, among other things, has created a directory on which to mount
an NFS directory from another computer.  He calls that ネットワーク
(network in Japanese).  Since it's utf-8, that gets put on the
filesystem as
'\xe3\x83\x8d\xe3\x83\x83\xe3\x83\x88\xe3\x83\xaf\xe3\x83\xbc\xe3\x82\xaf'

Now the administrators of the fileserver have been maintaining it since
before Unicode was invented.  Furthermore, they don't want to suffer
from the space loss of using utf-8 to encode Japanese so they use
shift-jis everywhere.  They have a directory on the nfs share for
programs that are useful for people on the shell server to access.  It's
called プログラム (programs in Japanese).  Since they're using
shift-jis, the bytes on the filesystem are:
'\x83v\x83\x8d\x83O\x83\x89\x83\x80'

The system administrator of the shell server adds the directory of
programs to all his users' default PATH variables, so they end up with this:

PATH=/bin:/usr/bin:/usr/local/bin:/mnt/\xe3\x83\x8d\xe3\x83\x83\xe3\x83\x88\xe3\x83\xaf\xe3\x83\xbc\xe3\x82\xaf/\x83v\x83\x8d\x83O\x83\x89\x83\x80

(Note: python syntax, In the unix shell you'd likely have octal instead
of hex)

Now comes the problematic part.  One of the users on the system wants
to write a python3 program that needs to determine if a needed program
is in the user's PATH.  He tries to code it like this::

#!/usr/bin/python3.0

import os

for directory in os.environ['PATH'].split(os.pathsep):
    programs = os.listdir(directory)

That code raises a KeyError because python3 has silently discarded the
PATH due to the shift-jis encoded path elements.  Much more importantly,
there's no way the programmer can handle the KeyError and actually get
the PATH from within python.

In the bug report I opened, I listed four ways to fix this along with
the pros and cons:

1) return mixed unicode and byte types in os.environ and os.getenv
   - I think this one is a bad idea.  It's the easiest for simple code
to deal with but it's repeating the major problem with python2's Unicode
handling: mixing unicode and byte types unpredictably.

2) return only byte types in os.environ
  - This is conceptually correct but the most annoying option.
Technically we're receiving bytes from the C libraries and the C
libraries expect bytes in return.  But in the common case we will be
dealing with things in one encoding so this causes needless effort to
the application programmer in the common case.

3) silently ignore non-decodable values when accessing os.environ['PATH']
as we do now but allow access to the full information via
os.environ[b'PATH'] and os.getenvb() (see the sketch after this list).
  - This mirrors the practice of os.listdir('.') vs os.listdir(b'.') and
os.getcwd() vs os.getcwdb().

4) raise an exception when non-decodable values are *accessed* and
continue as in #3.  This means that os.environ wouldn't be a simple dict
as it would need to decode the values when keys are accessed (although
it could cache the values).
  - This mirrors the practice of open() which is to decode the value for
the common case but throw an exception and allow the programmer to
decide what to do if all values are not decodable.
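
For illustration, option #3 from the calling side would read something like
this (using the proposed getenvb spelling)::

import os

path = os.environ['PATH']              # decoded text; may be incomplete or missing if undecodable
raw_path = os.getenvb(b'PATH')         # the exact bytes handed to us by the C environment
for directory in raw_path.split(b':'):
    programs = os.listdir(directory)   # bytes in, bytes out, regardless of encoding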

Either #3 or #4 will solve the major problem and both have precedent in
python3's current implementation.  The difference between them is
whether to throw an exception when a non-decodable value is encountered.
 Here's why I think that's appropriate:

One of the things I enjoy about python is the informative tracebacks
that make debugging easy.  I think that the ease of debugging is lost
when we silently ignore an error.  If we look at the difference in
coding and debugging for problems with files that aren't encoded in the
default encoding (where a traceback is issued) and os.listdir() when
filenames aren't in the default 

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-04 Thread Toshio Kuratomi
Adam Olsen wrote:
 On Thu, Dec 4, 2008 at 1:02 PM, Toshio Kuratomi [EMAIL PROTECTED] wrote:
 I opened up bug http://bugs.python.org/issue4006 a while ago and it was
 suggested in the report that it's not a bug but a feature and so I
 should come here to see about getting the feature changed :-)

 I have a specific problem with os.environ and a somewhat less important
 architectural issue with the unicode/bytes handling in certain os.*
 modules.  I'll start with the important one:

 Currently in python3 there's no way to get at environment variables that
 are not encoded in the system default encoding.  My understanding is
 that this isn't a problem on Windows systems but on *nix this is a huge
 problem.  environment variables on *nix are a sequence of non-null
 bytes.  These bytes are almost always characters but they do not have
 to be.  Further, there is nothing that requires that the characters be
 in the same encoding; some of the characters could be in the UTF-8
 character set while others are in latin-1, shift-jis, or big-5.
 
 Multiple encoding environments are best described as batshit insane.
  It's impossible to handle any of it correctly *as text*, which is why
 UTF-8 is becoming a universal standard.  For everybody's sanity python
 should continue to push it.
 
Amen brother!

 However, some pragmatism is also possible.

Unfortunately, this is exactly what I'm talking about :-)

  Many uses of PATH may
 allow it to be treated as black-box bytes, rather than text.  The
 minimal solution I see is to make os.getenv() and os.putenv() switch
 to byte modes when given byte arguments, as os.listdir() does.  This
 use case doesn't require the ability to iterate over all environment
 variables, as os.environb would allow.
 
This would be a partial implementation of my option #3.  It allows the
programmer to work around problems but does allow subtle bugs to creep in
unawares.  For instance::

 I do wonder if controlling the environment given to a subprocess
 requires os.environb, but it may be too obscure to really matter.
 
If you wanted to change one variable before passing it on to the
subprocess this could lead to head-scratcher bugs.  Here's a contrived
example:  Say I have an app that talks to multiple cvs repositories.  It
copies os.environ and modifies CVSROOT and CVS_RSH then calls subprocess
with env=temp_env.  If the PATH variable contains non-decodable elements
on some machines, this could lead to mysterious failures.  This is
particularly bad because we aren't directly modifying PATH anywhere in
our code so there won't be an obvious reason in the code that this is
failing.
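
Roughly this, with made-up repository settings::

import os
import subprocess

temp_env = dict(os.environ)            # PATH may already have lost its undecodable elements here
temp_env['CVSROOT'] = ':ext:cvs.example.org:/cvsroot/project'
temp_env['CVS_RSH'] = 'ssh'
subprocess.call(['cvs', 'update'], env=temp_env)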

-Toshio





Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-04 Thread Toshio Kuratomi
Adam Olsen wrote:
 On Thu, Dec 4, 2008 at 2:09 PM, André Malo [EMAIL PROTECTED] wrote:
 * Adam Olsen wrote:
 On Thu, Dec 4, 2008 at 1:02 PM, Toshio Kuratomi [EMAIL PROTECTED]
 wrote:
 I opened up bug http://bugs.python.org/issue4006 a while ago and it was
 suggested in the report that it's not a bug but a feature and so I
 should come here to see about getting the feature changed :-)

 I have a specific problem with os.environ and a somewhat less important
 architectural issue with the unicode/bytes handling in certain os.*
 modules.  I'll start with the important one:

 Currently in python3 there's no way to get at environment variables
 that are not encoded in the system default encoding.  My understanding
 is that this isn't a problem on Windows systems but on *nix this is a
 huge problem.  environment variables on *nix are a sequence of non-null
 bytes.  These bytes are almost always characters but they do not have
 to be.  Further, there is nothing that requires that the characters be
 in the same encoding; some of the characters could be in the UTF-8
 character set while others are in latin-1, shift-jis, or big-5.
 Multiple encoding environments are best described as batshit insane.
  It's impossible to handle any of it correctly *as text*, which is why
 UTF-8 is becoming a universal standard.  For everybody's sanity python
 should continue to push it.
 Here's an example which will become popular soon, I guess: CGI scripts and,
 of course WSGI applications. All those get their environment in an unknown
 encoding. In the worst case one can blow up the application by simply
 sending strange header lines over the wire. But there's more: consider
 running the server in C locale, then probably even a single 8 bit char
 might break something (?).
 
 I think that's an argument that the framework should reencode all
 input text into the correct system encoding before passing it on to
 the CGI script or WSGI app.  If the framework doesn't have a clear way
 to determine the client's encoding then it's all just gibberish
 anyway.  A HTTP 400 or 500 range error code is appropriate here.
 
The framework can't always encode input bytes into the system encoding
for text.  Sometimes the framework can be dealing with actual bytes.
For instance, if the framework is being asked to reference an actual
file on a *NIX filesystem the bytes have to match up with the bytes in
the filename whether or not those bytes agree with the system encoding.

 
 However, some pragmatism is also possible.  Many uses of PATH may
 allow it to be treated as black-box bytes, rather than text.  The
 minimal solution I see is to make os.getenv() and os.putenv() switch
 to byte modes when given byte arguments, as os.listdir() does.  This
 use case doesn't require the ability to iterate over all environment
 variables, as os.environb would allow.

 I do wonder if controlling the environment given to a subprocess
 requires os.environb, but it may be too obscure to really matter.
 IMHO, environment variables are no text. They are bytes by definition and
 should be treated as such.
 I know, there's windows having unicode enabled env vars on demand, but
 there's only trouble with those over there in apache's httpd (when passing
 them to CGI scripts, oh well...).
 
 Environment variables have textual names, are set via text, frequently
 contain textual file names or paths, and my shell (bash in
 gnome-terminal on ubuntu) lets me put unicode text in just fine.  The
 underlying APIs may use bytes, but they're *intended* to be encoded
 text.
 
The example I've started using recently is this: text files on my system
contain character data and I expect them to be read into a string type
when I open them in python3.  However, if a text file contains text that
is not encoded in the system default encoding I should still be able to
get at the data and perform my own conversion.  So I agree with the
default of treating environment variables as text.  We just need to be
able to treat them as bytes when these corner cases come up.
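
As a rough sketch of that fallback (the filename and the shift-jis guess
are only placeholders)::

    try:
        with open('notes.txt') as f:        # default (locale) encoding
            text = f.read()
    except UnicodeDecodeError:
        with open('notes.txt', 'rb') as f:  # get at the raw bytes anyway
            raw = f.read()
        text = raw.decode('shift-jis')      # perform my own conversion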

-Toshio



signature.asc
Description: OpenPGP digital signature
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-04 Thread Toshio Kuratomi
Terry Reedy wrote:
 Toshio Kuratomi wrote:
 I opened up bug http://bugs.python.org/issue4006 a while ago and it was
 suggested in the report that it's not a bug but a feature and so I
 should come here to see about getting the feature changed :-)
 
 It does you no good and (and will irritate others) to conflate 'design
 decision I do not agree with' with 'mistaken documentation or
 implementation of a design decision'.  The former is opinion, the latter
 is usually fact (with occasional border cases).  The latter is what core
 developers mean by 'bug'.
 
Noted.  However, there's also a difference between "prevents us from
doing useful things" and "allows doing a useful thing in a non-trivial
manner".  The latter I would call a difference in design decision and
the former I would call a bug in the design.

 Currently in python3 there's no way to get at environment variables that
 are not encoded in the system default encoding.  My understanding is
 that this isn't a problem on Windows systems but on *nix this is a huge
 problem.  environment variables on *nix are a sequence of non-null
 bytes.  These bytes are almost always characters but they do not have
 to be.  Further, there is nothing that requires that the characters be
 in the same encoding; some of the characters could be in the UTF-8
 character set while others are in latin-1, shift-jis, or big-5.
 
 To me, mixing encodings within a string is at least slightly insane.  If
 by design, maybe even a 'design bug' ;-).
 
As an application level developer I echo your sentiment :-)  I
recognize, though, that *nix filesystem semantics were designed many
years before unicode and the decision to treat filenames, environment
variables, and so much else as bytes follows naturally from the C
definition of a char.  It's up to a higher level than the OS to decide
how to display the bytes.

[shell server and fileserver result in this insane PATH]
 PATH=/bin:/usr/bin:/usr/local/bin:/mnt/\xe3\x83\x8d\xe3\x83\x83\xe3\x83\x88\xe3\x83\xaf\xe3\x83\xbc\xe3\x82\xaf/\x83v\x83\x8d\x83O\x83\x89\x83\x80

 
 I would think life would be ultimately easier if either the file server
 or the shell server automatically translated file names from jis and
 utf8 and back, so that the PATH on the *nix shell server is entirely
 utf8.

This is not possible because no part of the computer knows what the
encoding is.  To the computer, it's just a sequence of bytes.  Unlike
xml or the windows filesystem (winfs? ntfs?) where the encoding is
specified as part of the document/filesystem there's nothing to tell
what encoding the filenames are in.

  How would you ever display a mixture to users?

This is up to the application.  My recommendation would be to keep the
raw bytes (to access the file on the filesystem) and display the results
of str(filename, errors='replace') to the user.
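
A minimal sketch of that recommendation (using the filesystem encoding
reported by the interpreter is an assumption here)::

    import os
    import sys

    enc = sys.getfilesystemencoding()
    for raw_name in os.listdir(b'.'):               # bytes in, bytes out
        label = raw_name.decode(enc, errors='replace')
        print(label)                                # what the user sees
        # raw_name keeps the exact bytes for open(), rename(), stat(), ...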

  What if there
 were an ambiguous component that could be legally decoded more than one
 way?
 
The ambiguity is the reason that the fileserver and shell server can't
automatically translate the filename (many encodings merely use all of
the 2^8 byte combinations available in a C char type.  This makes the
byte decodable in any one of those encodings).  In the application, only
using the raw bytes to access the file also prevents ambiguity because
the raw bytes only references one file.

 Now comes the problematic part.  One of the user's on the system wants
 to write a python3 program that needs to determine if a needed program
 is in the user's PATH.  He tries to code it like this::

 #!/usr/bin/python3.0

 import os

 for directory in os.environ['PATH']:
 programs = os.listdir(directory)

 That code raises a KeyError because python3 has silently discarded the
 PATH due to the shift-jis encoded path elements.  Much more importantly,
 there's no way the programmer can handle the KeyError and actually get
 the PATH from within python.
 
 Have you tried os.system or os.popen or the subprocess module to use and
 get a response from a native *nix command?  On Windows
 
Sure, you can subprocess your way out of a lot of sticky situations
since you're essentially delegating the task to a C routine.  But there
are drawbacks:

* You become dependent on an external program being available.  What
happens if your code is run in a chroot, for instance?
* Do we want anyone writing programs that access the environment on *NIX
to have to discover this pattern themselves and implement it?

As for wrapping this up in os.*, that isn't necessary -- the python3
interpreter already knows about the byte-oriented environment; it just
isn't making it available to people programming in python.

-Toshio



signature.asc
Description: OpenPGP digital signature
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-04 Thread Toshio Kuratomi
Adam Olsen wrote:
 On Thu, Dec 4, 2008 at 2:19 PM, Nick Coghlan [EMAIL PROTECTED] wrote:
 Toshio Kuratomi wrote:
 The bug report I opened suggests creating a PEP to address this issue.
 I think that's a good idea for whether os.listdir() and friends should
 be changed to raise an exception but not having any way to get at some
 environment variables seems like it's just a bug that needs to be
 addressed.  What do other people think on both these issues?
 I'm pretty sure the discussion on this topic a while back decided that
 where necessary Python 3 would grow parallel bytes versions of APIs
 affected by environmental encoding issues (such as os.environb,
 os.listdirb, os.getcwdb), but that we were OK with the idea of deferring
 addition of those APIs until 3.1.
 
 It looks like most of them got into 3.0.
 http://docs.python.org/3.0/library/os.html says All functions
 accepting path or file names accept both bytes and string objects, and
 result in an object of the same type, if a path or file name is
 returned.
 
nod  I'm very glad this is coming along.  Just want to make sure the
environment is also handled in 3.1.
 
 That is, this was an acknowledged limitation with a fairly
 straightforward agreed solution, but it wasn't considered a common
 enough issue to delay the release of 3.0 until all of those parallel
 APIs had been implemented
 
 Aye.  IMO it's fairly clear that os.getenv()/os.putenv() should follow
 suit in 3.1.  I'm not so sure about adding os.environb (and making
 subprocess use it), unless the OP can demonstrate they really need it.
 
Note: subprocess currently uses the real environment (the raw
environment as given to the python interpreter) when it is started
without the `env` parameter.  So the question would be what people
overriding the env parameter on their own need to do.

To be non-surprising I'd think they'd want to have a way to override
just a few variables from the raw environment.  Otherwise you have to
know which variables the program you're calling relies on and make sure
that those are set or call os.getenvb() to retrieve the byte version and
add it to your copy of os.environ before passing that to subprocess.

One example of something that would be even harder to implement without
access to the os.environb dictionary would be writing a program that
wraps make.  Since make takes all the variables from the environment and
transforms them into make variables you need to pass everything from the
environment that you are not modifying into the command.

-Toshio



signature.asc
Description: OpenPGP digital signature
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-05 Thread Toshio Kuratomi
Terry Reedy wrote:
 Toshio Kuratomi wrote:

 I would think life would be ultimately easier if either the file server
 or the shell server automatically translated file names from jis and
 utf8 and back, so that the PATH on the *nix shell server is entirely
 utf8.

 This is not possible because no part of the computer knows what the
 encoding is.  To the computer, it's just a sequence of bytes.  Unlike
 xml or the windows filesystem (winfs? ntfs?) where the encoding is
 specified as part of the document/filesystem there's nothing to tell
 what encoding the filenames are in.
 
 I thought you said that the file server keep all filenames in shift-jis,
 and the shell server all in utf-8.

Yes.  But this is part of the setup of the example to keep things
simple.  The fileserver or shell server could themselves be of mixed
encodings (for instance, if it was serving home directories to users all
over the world each user might be using a different encoding.)

  If so, then the shell server could
 know if it were told so.
 

Where are you going to store that information?  In order for python to
run without errors, will it have to be configured on each system it's
installed on to know the encoding of each filename?  Or are we going to
try to talk each *NIX vendor into creating new filesystems that record
that information and after a five year span of time declare that python
will not run on other filesystems in corner cases?

I don't think this approach offers a reasonable expectation of keeping
python a portable language.

-Toshio



signature.asc
Description: OpenPGP digital signature
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-05 Thread Toshio Kuratomi
Victor Stinner wrote:
 Hi,
 
 Le Thursday 04 December 2008 21:02:19 Toshio Kuratomi, vous avez écrit :
 
 These mixed encodings can occur for a variety of reasons.  Here's an
 example that isn't too contrived :-)
 (...)
 Furthermore, they don't want to suffer from the space loss of using 
 utf-8 to encode Japanese so they use shift-jis everywhere.
 
 space loss? Really? If you configure your server correctly, you should get 
 UTF-8 even if the file system is Shift-JIS. But it would be much easier to 
 use UTF-8 everywhere.
 
 Hum... I don't think that the discussion is about one specific server, but 
 the 
 lack of bytes environment variables in Python3 :-)

Yep.  I can't change the logicalness of the policies of a different
organization, only code my application to deal with it :-)

 1) return mixed unicode and byte types in ...
 
 NO!
 
It's nice that we agree... but I would prefer if you leave enough
context so that others can see that we agree as well :-)

 2) return only byte types in os.environ
 
 Hum... Most users have UTF-8 everywhere (eg. all Windows users ;-)), and 
 Python3 already use Unicode everywhere (input(), open(), filenames, ...).

We're also in agreement here.

 3) silently ignore non-decodable value when accessing os.environ['PATH']
 as we do now but allow access to the full information via
 os.environ[b'PATH'] and os.getenvb()
 
 I don't like os.environ[b'PATH']. I prefer to always get the same result 
 type... But os.listdir() doesn't respect that :-(
 
os.listdir(str) -> list of str
os.listdir(bytes) -> list of bytes
 
 I would prefer a similar API for easier migration from Python2/Python3
 (unicode). os.environb sounds like the best choice for me.
 
nod.  After thinking about how it would be used in subprocess calls I
agree.  os.environb would allow us to retrieve the full dict as bytes.
os.environ[b''] only works on individual keys.  Also os.getenv serves
the same purpose as os.environ[b''] would whereas os.environb would have
 its own uses.

 
 But they are open questions (already asked in the bug tracker):
 
I answered these in the bug tracker.  Here are the answers for the
mailing list:

 (a) Should os.environ be updated if os.environb is changed? If yes, how?
os.environb['PATH'] = '\xff' (or any invalid string in the system 
  default encoding)
=> os.environ['PATH'] = ???
 
The underlying environment that both variables reflect should be updated
but what is displayed by os.environ should continue to follow the same
rules.  So if we follow option #3::
    os.environb['PATH'] = b'\xff'
    os.environ['PATH'] => raises KeyError because PATH is not a key in
                          the unicode-decoded environment.

(option #4 would issue a UnicodeDecodeError instead of a KeyError)

Similarly, if you start with a variable in os.environb that can only be
represented as bytes and your program transforms it into something that
is decodable it should then show up in os.environ.

 (b) Should os.environb be updated if os.environ is changed? If yes, how?
 
 The problem comes with non-Unicode locale (eg. latin-1 or ASCII): most 
 charset 
 are unable to encode the whole Unicode charset (eg. codes = 65535).
 
os.environ['PATH'] = chr(0x1)
=> os.environb['PATH'] = ???

Ah, this is a good question.  I misunderstood what you were getting at
when you posted this to the bug report.  I see several options but the
one that seems the most sane is to raise UnicodeEncodeError when setting
the value.  With that, proper code to set an environment variable might
look like this::

LANG=C python3.0
variable = chr(0x1)
try:
    # Unicode aware locales
    os.environ['MYVAR'] = variable
except UnicodeEncodeError:
    # Non-Unicode locales
    os.environb['MYVAR'] = bytes(variable, encoding='utf8')

 (c) Same question when a key is deleted (del os.environ['PATH']).
 
Update the underlying env so both os.environ and os.environb reflect the
change.  Deleting should not hold the problems that updating does.

 If Python 3.1 will have os.environ and os.environb, I'm quite sure that some 
 modules will user os.environ and other will prefer os.environb. If both 
 environments are differents, the two modules set will work differently :-/
 
Exactly.  So making sure they hold the same information is a priority.

 It would be maybe easier if os.environ supports bytes and unicode keys. But 
 we 
 have to keep these assertions:
os.environ[bytes] -> bytes
os.environ[str] -> str
 
I think the same choices have to be made here.  If LANG=C, we still have
to decide what to do when os.environ[str] is set to a non-ASCii string.

Additionally, the subprocess question makes using the key value
undesirable compared with having a separate os.environb that accesses
the same underlying data.

 4) raise an exception when non-decodable values are *accessed* and
 continue as in #3.
 
 I like os.listdir() behaviour: just *ignore* non-decodable files. If you 
 really want to access

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-05 Thread Toshio Kuratomi
Guido van Rossum wrote:
 On Fri, Dec 5, 2008 at 2:27 AM, Ulrich Eckhardt [EMAIL PROTECTED] wrote:
 In 99% of all cases, using the default encoding will work and do what people
 expect, which is why I would make this conversion automatic. In all other
 cases, it will at least not fail silently (which would lead to garbage and
 data loss) and allow more sophisticated applications to handle it.
 
 I think the always fail noisily approach isn't the best approach.
 E.g. if I am globbing for *.py, and there's an undecodable .txt file
 in a directory, its presence shouldn't cause the glob to fail.
 
But why should it make glob() fail?  This sounds like an implementation
detail of glob.  Here's some pseudo-code::

def glob(pattern):
    string = False
    if isinstance(pattern, str):
        string = True
        if platform == 'POSIX':
            # work on raw bytes internally so no entry is dropped
            pattern = bytes(pattern, encoding=defaultencoding)
    rawfiles = os.listdir(os.path.dirname(pattern) or pattern)
    if string and platform == 'POSIX':
        # decoding back to str raises if a matching name is undecodable
        return [str(f, defaultencoding) for f in rawfiles if match(f, pattern)]
    else:
        return rawfiles

This way the traceback occurs if anything in the result set is
undecodable.  What am I missing?

-Toshio



signature.asc
Description: OpenPGP digital signature
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-05 Thread Toshio Kuratomi
Guido van Rossum wrote:
 Glob was just an example. Many use cases for directory traversal
 couldn't care less if they see *all* files.
 
Okay.  Makes it harder to prove correct or not if I don't know what the
use case is :-)  I can't think of a single use case off-hand.

Even your example of a ??.txt file making retrieval of *.py files fail
is a little broken.  If there was a ??.py file that was undecodable the
program would most likely want to know that file existed.

-Toshio



signature.asc
Description: OpenPGP digital signature
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-05 Thread Toshio Kuratomi
Guido van Rossum wrote:
 At the risk of bringing up something that was already rejected, let me
 propose something that follows the path taken in 3.0 for filenames,
 rather than doubling back:
 
 For os.environ, os.getenv() and os.putenv(), I think a similar
 approach as used for os.listdir() and os.getcwd() makes sense: let
 os.environ skip variables whose name or value is undecodable, and have
 a separate os.environb() which contains bytes; let os.getenv() and
 os.putenv() do the right thing when the arguments passed in are bytes.
 
I prefer the method used by file.read() where an error is thrown when
accessing undecodable data.  I think in time python programmers will
consider not throwing an exception a wart in python3.  However, this is
enough to allow programmers to do the right thing once an error is
reported by users and the cause has been tracked down so it doesn't
block fixing errors as the current code does.

And it's not like anyone expected python3 to be wart-free just because
the python2 warts were fixed ;-)

 For sys.argv, because it's positional, you can't skip undecodable
 values, so I propose to use error=replace for the decoding; again, we
 can add sys.argvb that contains the raw bytes values. The various
 os.exec*() and os.spawn*() calls (as well as os.system(), os.popen()
 and the subprocess module) should all accept bytes as well as strings.
 
This also seems sane with the same comment about throwing errors.

-Toshio



signature.asc
Description: OpenPGP digital signature
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-05 Thread Toshio Kuratomi
Victor Stinner wrote:
 It would be maybe easier if os.environ supports bytes and unicode keys.
 But we have to keep these assertions:
os.environ[bytes] -> bytes
os.environ[str] -> str
 I think the same choices have to be made here.  If LANG=C, we still have
 to decide what to do when os.environ[str] is set to a non-ASCii string.
 
 If the charset is US-ASCII, os.environ will drop non-ASCII values. But most 
 variables are ASCII only. Examples with my shell:
 
Yes.  But you still have the question of what to do when:
os.environ[str] = chr(0x1)

So I don't think it makes things simpler than having separate os.environ
and os.environb that update the same data behind the scenes.

 Additionally, the subprocess question makes using the key value
 undesirable compared with having a separate os.environb that accesses
 the same underlying data.
 
 The user should be able to choose bytes or unicode. Examples:

the subprocess question was posed further up the thread as basically --
does the user need to access os.environb in order to override things in
the environment when calling subprocess?  I think the answer to that is
yes since you might want to start with your environment and modify it
slightly when you call programs via subprocess.  If you just try to copy
os.environ and os.environ only iterates through the decodable env vars,
that doesn't work.  If you have an os.environb to copy it becomes possible.
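
A rough sketch, assuming the proposed os.environb exists (the variable
being overridden is only an example)::

    import os
    import subprocess

    # Copy the full byte environment so nothing undecodable is dropped,
    # then override a single variable before launching the child process.
    child_env = dict(os.environb)
    child_env[b'CVS_RSH'] = b'ssh'
    subprocess.call(['cvs', 'update'], env=child_env)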

  - subprocess.Popen('ls') = use unicode environment (os.environ)
  - subprocess.Popen(b'ls') = use bytes environment (os.environb)
 
That's... not expected to me :-(

If I never touch os.environ and invoke subprocess the normal way, I'd
still expect the whole environment to be passed on to the program being
called.  This is how invoking programs manually, shell scripting,
invoking programs from perl, python2, etc work.

Also, it's not really a good fit with the other things that key off of
the initial argument.  os.listdir(b'.') changes the output to bytes.
subprocess.Popen(b'ls') would change what environment gets input into
the call.

 Here's my problem with it, though.  With these semantics any program
 that works on arbitrary files and runs on *NIX has to check
 os.listdir(b'') and do the conversion manually.
 
 Only programs that have to support strange environment like yours (mixing 
 Shift-JIS and UTF-8) :-) Most programs don't have to support these charset 
 mixture.
 
Any program that is intended to be distributed, accesses arbitrary
files, and works on *nix platforms needs to take this into account.
Just because the environment inside of my organization is sane doesn't
mean that when we release the code to customers, clients, or the free
software community that the places it runs will be as strict about these
things.

Are most programs specific to one organization or are they distributed
to other people?  I can't answer that... everything I work on (except
passwords:-) is distributed -- from sys admin cronjobs to web
applications since I'm lucky that my whole job is devoted to working on
free software.

-Toshio



signature.asc
Description: OpenPGP digital signature
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-05 Thread Toshio Kuratomi
Nick Coghlan wrote:
 Toshio Kuratomi wrote:
 Are most programs specific to one organization or are they distributed
 to other people?
 
 The former. That's pretty well documented in assorted IT literature
 ('shrink-wrap' and open source commodity software are still relatively
 new players on the scene that started to shift the balance the other
 way, but now the server side elements of web services are shifting it
 back again).
 
Cool.  So it's only people writing code to be shared with the larger
community or written for multiple customers that are affected by bugs
like this. :-/

-Toshio



signature.asc
Description: OpenPGP digital signature
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-05 Thread Toshio Kuratomi
Nick Coghlan wrote:
 Toshio Kuratomi wrote:
 Guido van Rossum wrote:
 Glob was just an example. Many use cases for directory traversal
 couldn't care less if they see *all* files.

 Okay.  Makes it harder to prove correct or not if I don't know what the
 use case is :-)  I can't think of a single use case off-hand.

 Even your example of a ??.txt file making retrieval of *.py files fail
 is a little broken.  If there was a ??.py file that was undecodable the
 program would most likely want to know that file existed.
 
 Why? Most programs won't be able to do anything with it. And if the
 program *can* do something with it... that's what the bytes version of
 the APIs are for.
 
Nonsense.  A program can do tons of things with a non-decodable
filename.  Where it's limited is non-decodable filedata.

For instance, if you have a graphical text editor, you need to let the
user select files to load.  To do that you need to list all the files in
a directory, even the ones that aren't decodable.  The ones that aren't
decodable need to substitute something like:
  str(filename, errors='replace') + '(Filename not encoded in UTF8)'
in the file listing that the user sees.  When the file is loaded, it
needs to access the actual raw filename.  The file can then be loaded
and operated upon and even saved back to disk using the raw, undecodable
filename.

If you have a file manager, you need to code something that lets the
user move the file around.  Once again, the program loads the raw
filenames.  It transforms the name into something representable to the
user.  It displays that.  The user selects it and asks that it be moved
to another location.  Then the program uses the raw filename to move
from one location to another.

If you have a backup program, you need to list all the files in a
directory.  Then you need to copy those files to another location.  Once
again you have to retrieve the byte version of any non-decodable filenames.
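
A small sketch of the backup case, working entirely on raw filenames and
decoding (with replacement) only for messages; the paths are placeholders::

    import os
    import shutil

    def backup_directory(src: bytes, dst: bytes):
        os.makedirs(dst, exist_ok=True)
        for raw_name in os.listdir(src):              # bytes in, bytes out
            src_path = os.path.join(src, raw_name)
            if not os.path.isfile(src_path):          # regular files only
                continue
            label = raw_name.decode('utf-8', errors='replace')
            print('backing up', label)                # user-facing only
            shutil.copy2(src_path, os.path.join(dst, raw_name))

    backup_directory(b'/home/user/docs', b'/mnt/backup/docs')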

-Toshio



signature.asc
Description: OpenPGP digital signature
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-06 Thread Toshio Kuratomi
Nick Coghlan wrote:
 Toshio Kuratomi wrote:

 Nonsense.  A program can do tons of things with a non-decodable
 filename.  Where it's limited is non-decodable filedata.
 
 You can't display a non-decodable filename to the user, hence the user
 will have no idea what they're working on. Non-filesystem related apps
 have no business trying to deal with insane filenames.
 
This is where we disagree.  There are many ways to display the
non-decodable filename to the user because the user is not a machine.
The computer must know the unique sequence of bytes in order to access a
file. The user, OTOH, usually only needs to know that the file exists.
In most GUI-based end-user oriented desktop apps, it's enough to do
str(filename, errors='replace').  For instance, the GNOME file manager
displays:
  ? (Invalid encoding)
and Konqueror, the KDE file manager just displays:
  ?

The file can still be displayed this way, accessed via the raw bytes
that the program keeps internally, and operated upon by applications.

For applications in which the user needs more information to
differentiate the files the program has the option to display the raw
byte sequences as if they were the filename.  The *NIX shell and command
line tools have this ability.

$ LANG=en_US.utf8 ls -b
á
í
$ LANG=C ls -b
.
..
\303\241
\303\255
$ mv $'\303\241' $'\303\263'
$ LANG=C ls -b
\303\255
\303\263
$ LANG=en_US.utf8 ls -b
í
ó

 Linux is moving towards a standard of UTF-8 for filenames, and once we
 get to the point where the idea of encoding filenames and environment
 variables any other way is seen as crazy, then the Python 3 approach
 will work seamlessly.
 
nod  With the caveat that I haven't seen movement by Linux and other
Unix variants to enforce UTF-8.  What I have seen are statements by
kernel programmers that having the filesystem use bytes and not know
about encoding is the correct thing to do.

This means that utf-8 will be a convention rather than a necessity for a
very long time and consequently programs will need to worry about the
problems of mixed encoding systems for an equally long time.  (Remember,
encoding is something that can be changed per user and per file.  So on
a multiuser OS, mixed encodings can be out of the control of the system
administrator for perfectly valid reasons.)

 In the meantime, raw bytes APIs will provide an alternative for those
 that disagree with that philosophy.
 
Oh I agree with the UTF-8 everywhere philosophy.  I just know that
there's tons of real-world systems out there that don't conform to my
expectations for sanity and my code has to account for those :-)

-Toshio



signature.asc
Description: OpenPGP digital signature
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-06 Thread Toshio Kuratomi
Bugbee, Larry wrote:
 There has been some discussion here that users should use the str or
 byte function variant based on what is relevant to their system, for
 example when getting a list of file names or opening a file.  That
 thought process really doesn't do much for those of us that write code
 that needs to run on any platform type, without alteration or the
 addition of complex if-statements and/or exceptions.
 
 Whatever the resolution here, and those of you addressing this thorny
 issue have my admiration, the solution should be such that it gives
 consistent behavior regardless of platform type and doesn't require the
 programmer to know of all the minute details of each possible target
 platform.  
 
I've been thinking about this and I can only see one option.  I don't
think that it really makes less work for the programmer, though -- it
just shifts the problem and makes it more apparent what your code is doing.

To avoid exceptions and if-then's in program code when accessing
filenames, environment variables, etc, you would need to access each of
these resources via the byte API.  Then, to avoid having to keep track
of what's a string and what's a byte in your other code, you probably
want to convert those bytes to strings.  This is where the burden gets
shifted.  You'll have your own routine(s) to do the conversion and have
to have exception handling code to deal with undecodable filenames.

Note 1: your particular app might be able to get away without doing the
conversion from bytes to string -- it depends on what you're planning on
doing with the filename/environment data.

Note 2: If there isn't a parallel API on all platforms, for instance,
Guido's proposal to not have os.environb on Windows, then you'll still
have to have a platform specific check. (Likely you should try to access
os.environb in this instance and if it doesn't exist, use os.environ
instead... and remember that you need to either change os.environ's data
into str type or change os.environb's data into byte type.)
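
For illustration, a sketch of shifting that burden into a single routine
(the fallback to os.environ and the keep-the-raw-bytes policy for
undecodable entries are just one possible choice)::

    import os

    def get_path_entries():
        try:
            raw = os.environb.get(b'PATH', b'')   # byte API where it exists
        except AttributeError:
            return os.environ.get('PATH', '').split(os.pathsep)
        entries = []
        for element in raw.split(b':'):
            try:
                entries.append(element.decode())
            except UnicodeDecodeError:
                entries.append(element)           # keep the raw bytes
        return entries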

-Toshio



signature.asc
Description: OpenPGP digital signature
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-07 Thread Toshio Kuratomi
[EMAIL PROTECTED] wrote:
 
 On 06:07 am, [EMAIL PROTECTED] wrote:
 Most apps aren't file managers or ftp clients but when they interact
 with files (for instance, a file selection dialog) they need to be able
 to show the user all the relevant files.  So on an app-by-app basis the
 need for this is high.
 
 While I tend to agree emphatically with this, the *real* solution here
 is a path-abstraction library.

Why don't you send me some information offlist.  I'm not sure I agree
that a path-abstraction library can work correctly but if it can it
would be nice to have that at a level higher than the file-dialog
libraries that I was envisioning.

[snip]

 ... but that still
 doesn't help me identify when someone would expect that asking python
 for a list of all files in a directory or a specific set of files in a
 directory should, without warning, return only a subset of them.  In
 what situations is this appropriate behaviour?
 
 If you say listdir(unicode) on a POSIX OS, your program is saying I
 only know how to deal with unicode results from this function, so please
 only give me those..

No.  (explained below)

  If your program is smart enough to deal with
 bytes, then you would have asked for bytes, no?

Yes (explained below)

  Returning only
 filenames which can be properly decoded makes sense.  Otherwise everyone
 needs to learn about this highly confusing issue, even for the simplest
 scripts.

os.listdir(unicode) (currently) means that the *programmer* is asking
that the stdlib return the decodable filenames from this directory.  The
question is whether the programmer understood that this is what they
were asking for and whether it is what they most likely want.  I would
make the following statements WRT to this:

1) The programmer most likely does not want decodable filenames and only
decodable filenames.  If they did, we'd see a lot of python2.x code that
turns pathnames into unicode and discards everything that wasn't
decodable.  No one has given a use case for finding only the *decodable*
subset of files.  If I request to see all *.py files in a directory, I
want to see all of the *.py files in the directory, decodable or not.
If you can show how programmers intend 90% of their calls to
os.listdir()/glob.glob('*.txt') to show only the decodable subset of the
results, then the foundation of my arguments is gone.  So please, give
examples to prove this wrong.

  - If this is true, a definition of os.listdir(type 'str') that would
better meet programmer expectation would be: Give me all files in a
directory with the output as str type.  The definition of
os.listdir(type 'bytes') would be Give me all files in a directory
with the output as bytes type.  Raising an exception when the filenames
are undecodable is perfectly reasonable in this situation.

2) For the programmer to understand the difference between
os.listdir(type 'bytes') and os.listdir(type 'str') they have to
understand the highly confusing issue and what it means for their
code.  So the current method is forcing programmers to understand it
even for the simplest scripts if their environment is not uniform with
no clue from the interpreter that there is an issue.

  - Similarly, raising an exception on undecodable values means that the
programmer can ignore the issue in any scripts in sane environments and
will be told that they need to deal with it (via an exception) when
their script runs in a non-sane environment.

3) The usage of unicode vs bytes is easy to miss for someone starting
with py2.x or windows and moving to a multi-platform or unix project.
Even simple testing won't reveal the problem unless the programmer knows
that they have to test what happens when encodings are mixed.  Once
again, this is requiring the programmer to understand the encoding issue
 without help from the interpreter.

 Skipping undecodable values is good enough that it will work 90% of the
 time.

You and Guido have now made this claim to defend not raising an
exception but I still don't have a use case.

Here are use cases that I see:

* Bill is coding an application for use inside his company.  His company
only uses utf-8.  His code naively uses os.listdir(type 'str').

  - The code does not throw an exception whether we use the current
os.listdir() or one that could throw an exception because the system
admins have sanitised the environment.  Bill did not need to understand
the implications of encoding for his code to work in this script whether
simple or complex.

* Mary is coding an application for use inside her company.  It finds
all html files on a system and updates her company's copyright, privacy
policy, and other legal boilerplate.  Her expectation is that after her
program runs every file will have been updated.  Her environment is a
mixture of different filename encodings due to having many legacy
documents for users in different locales.  Mary's code also naively uses
os.listdir(type 'str').  Her test case checks that the code does 

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-08 Thread Toshio Kuratomi
Guido van Rossum wrote:
 On Mon, Dec 8, 2008 at 12:07 PM,  [EMAIL PROTECTED] wrote:
 On Mon, 8 Dec 2008 at 11:25, Guido van Rossum wrote:
 But I'm happy with just issuing a warning by default.  That would mean
 it doesn't fail silently, but neither does it crash.  Seems like the
 best compromise with the broken nature of the real world IT
 environment.
 
 OK, I can live with that too.
 
Same here.  This lets the application specify globally what should
happen (exception, warning, ignore via the warnings filters) and should
give enough context that it doesn't become a mysterious error in the
program.

The per-method addition of an errors argument so that this is overridable
locally as well as globally is also a nice touch but can be done
separately from this step.

-Toshio



signature.asc
Description: OpenPGP digital signature
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-09 Thread Toshio Kuratomi
James Y Knight wrote:
 On Dec 9, 2008, at 6:04 AM, Anders J. Munch wrote:
 The typical application will just obliviously use os.listdir(dir) and
 get the default elide-and-warn behaviour for un-decodable names. That
 rare special application
 
 I guess this is a new definition of rare special application: an
 application which deals with user-specified files.
 
 This is the problem I see in having two parallel APIs: people keep
 saying most applications can just go ahead and use the [broken] unicode
 string API. If there was a unicode API and a bytes API, but everyone
 was clear that always use the bytes API is the right thing to do,
 that'd be okay... But, since even python-dev members are saying that
 only a rare special app needs to care about working with users' existing
 files, I'm rather worried this API design will cause most programs
 written in python to be broken. Which seems a shame.
 
I agree with you which was part of why I raised this subject but I also
think that using the warnings module to issue a warning and ignore the
entire problematic entry is a reasonable compromise.  Hopefully it will
become obvious to people that it's a python3 wart at some point in the
future and we'll re-examine the default.  But until then, having a
printed warning that individual apps can turn into an exception seems
like it is less broken than the other alternatives the rare special
application people can live with :-)

-Toshio



signature.asc
Description: OpenPGP digital signature
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-11 Thread Toshio Kuratomi
Adam Olsen wrote:
 On Thu, Dec 11, 2008 at 6:55 PM, Stephen J. Turnbull step...@xemacs.org 
 wrote:
 Unfortunately, even programmers experienced in I18N like Martin, and
 those with intuition-that-has-the-force-of-lawwink like Guido,
 express deliberate disbelief on this point.  They say that filesystem
 names and environment variable values are text, which is true from the
 semantic viewpoint but can't be fully supported by any implementation.
 
 With all the focus on backup tools and file managers I think we've
 lost perspective.  They're an important use case, but hardly the
 dominant one.
 
 Please, as a user, if your app is creating new files, do NOT use
 bytes!  You have no excuse for creating garbage, and garbage doesn't
 help the user any.  Getting the encoding right, use the unicode APIs,
 and don't pass the buck on to everything else.
 
Uhmmm That's good advice but doesn't solve any problems :-(.  No
matter what I create, the filenames will be bytes when the next person
reads them in.  If my locale is shift-js and the person I'm sharing the
file with uses utf-8 things won't work.  Even if my locale is utf-8
(since I come from a European nation) and their locale is utf-16
(because they're from an Asian nation) the Unicode API won't work.

-Toshio



signature.asc
Description: OpenPGP digital signature
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-11 Thread Toshio Kuratomi
Adam Olsen wrote:

 A half-broken setup is still a broken setup.  Eventually you have to
 tell people to stop screwing around and pick one encoding.
 
But it's not a broken setup.  It's the way the world is because people
share things with each other.

 I doubt that UTF-16 is used very much (other than on windows).  I
 haven't found any statistics on what distros use, but did find this
 one of the web itself:
 http://googleblog.blogspot.com/2008/05/moving-to-unicode-51.html
 
UTF-16 is popular in Asian locales for the same reason that shift-js and
big-5 are hanging in there.  utf-8 takes many more bytes to encode Asian
Unicode characters than utf-16.

-Toshio



signature.asc
Description: OpenPGP digital signature
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-11 Thread Toshio Kuratomi
Adam Olsen wrote:
 As a data point, firefox (when pointed at my home dir) DOES skip over
 garbage files.
 
 
That's not true.  However, it looks like Firefox is actually broken.
Take a look at this screenshot:
  firefox.png

That shows a directory with a folder that's not decodable in my utf-8
locale.  What's interesting to note is that I actually have two
nondecodable folders there but only one of them showed up.  So firefox
is inconsistent with its treatment, rendering some non-decodable files
and ignoring others.

Also interesting, if you point your browser at:
  http://toshio.fedorapeople.org/u/

You should see two other test files.  They're both
(one-half)(enyei).html but one's encoded in utf-8 and the other in
latin-1.  Firefox has some bugs in it related to this.  For instance, if
you mouseover the two links you'll see that firefox displays the same
symbolic names for each of the files (even though they're in two
different encodings).  Sometimes firefox is able to load both files and
sometimes it only loads one of them.  Firefox seems to be translating
the characters from ASCII percent encoding of bytes into their unicode
symbols and back to utf-8 in some circumstances related to whether it
has the pages in its cache or not.  In this case, it should be leaving
things as percent encoded bytes as it's the only way that apache is
going to know what to retrieve.

-Toshio



signature.asc
Description: OpenPGP digital signature
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-12 Thread Toshio Kuratomi
Adam Olsen wrote:
 UTF-8 in percent encodings is becoming a defacto standard.  Otherwise
 the browser has to display the percent escapes in the address bar,
 rather than the intended text.
 
 IOW, inconsistent behaviour is a bug, but translating into UTF-8 is not. ;)
 
 
I think we should let this tangent drop because it's about bugs in
firefox, not in python :-)

-Toshio



signature.asc
Description: OpenPGP digital signature
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] #Python3 ! ? (was Python Library Support in 3.x)

2010-06-21 Thread Toshio Kuratomi
On Mon, Jun 21, 2010 at 09:57:30AM -0400, Barry Warsaw wrote:
 On Jun 21, 2010, at 09:37 AM, Arc Riley wrote:
 
 Also, under where it mentions that most OS's do not include Python 3, it
 should be noted which have good support for it.  Gentoo (for example) has
 excellent support for Python 3, automatically installing Python packages
 which have Py3 support for both Py2 and Py3, and the python-based Portage
 package system runs cleanly on Py2.6, Py3.1 and Py3.2.
 
 We're trying to get there for Ubuntu (driven also by Debian).  We have Python
 3.1.2 in main for Lucid, though we will probably not get 3.2 into Maverick
 (the October 2010 release).  We're currently concentrating on Python 2.7 as a
 supported version because it'll be released by then, while 3.2 will still be
 in beta.
 
 If you want to help, or have complaints, kudos, suggestions, etc. for Python
 support on Ubuntu, you can contact me off-list.
 
nod Fedora 14 is about the same.  A nice-to-have thing that goes along
with these would be a table that lists packages ported to python3 and which
distributions have the python3 version of the package.

Once most of the important third party packages are ported to python3 and in
the distributions, this table will likely become out-dated and probably
should be reaped but right now it's a very useful thing to see.

-Toshio


pgp4ovCkaMeKl.pgp
Description: PGP signature
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] email package status in 3.X

2010-06-21 Thread Toshio Kuratomi
On Mon, Jun 21, 2010 at 11:43:07AM -0400, Barry Warsaw wrote:
 On Jun 21, 2010, at 10:20 PM, Nick Coghlan wrote:
 
 Something that may make sense to ease the porting process is for some
 of these on the boundary I/O related string manipulation functions
 (such as os.path.join) to grow encoding keyword-only arguments. The
 recommended approach would be to provide all strings, but bytes could
 also be accepted if an encoding was specified. (If you want to mix
 encodings - tough, do the decoding yourself).
 
 This is probably a stupid idea, and if so I'll plead Monday morning mindfuzz
 for it.
 
 Would it make sense to have encoding-carrying bytes and str types?
 Basically, I'm thinking of types (maybe even the current ones) that carry
 around a .encoding attribute so that they can be automatically encoded and
 decoded where necessary.  This at least would simplify APIs that need to do
 the conversion.
 
 By default, the .encoding attribute would be some marker to indicated I have
 no idea, do it explicitly and if you combine ebytes or estrs that have
 incompatible encodings, you'd either throw an exception or reset the .encoding
 to IAmConfuzzled.  But say you had an email header like:
 
 =?euc-jp?b?pc+l7aG8pe+hvKXrpcmhqg==?=
 
 And code like the following (made less crappy):
 
 -snip snip-
 class ebytes(bytes):
     encoding = 'ascii'

     def __str__(self):
         s = estr(self.decode(self.encoding))
         s.encoding = self.encoding
         return s


 class estr(str):
     encoding = 'ascii'


 s = str(b'\xa5\xcf\xa5\xed\xa1\xbc\xa5\xef\xa1\xbc\xa5\xeb\xa5\xc9\xa1\xaa',
         'euc-jp')
 b = bytes(s, 'euc-jp')

 eb = ebytes(b)
 eb.encoding = 'euc-jp'
 es = str(eb)
 print(repr(eb), es, es.encoding)
 -snip snip-
 
 Running this you get:
 
 b'\xa5\xcf\xa5\xed\xa1\xbc\xa5\xef\xa1\xbc\xa5\xeb\xa5\xc9\xa1\xaa' ハローワールド! 
 euc-jp
 
 Would it be feasible?  Dunno.  Would it help ease the bytes/str confusion?
 Dunno.  But I think it would help make APIs easier to design and use because
 it would cut down on the encoding-keyword function signature infection.
 
I like the idea of having encoding information carried with the data.
I don't think that an ebytes type that can *optionally* have an encoding
attribute makes the situation less confusing, though.  To me the biggest
problem with python-2.x's unicode/bytes handling was not that it threw
exceptions but that it didn't always throw exceptions.  You might test this
in python2::
t = u'cafe'
function(t)

And say, ah my code works.  Then a user gives it this::
t = u'café'
function(t)

And get a unicode error because the function only works with unicode in the
ascii range.

ebytes seems to have the same pitfall where the code path exercised by your
tests could work with::
eb = ebytes(b)
eb.encoding = 'euc-jp'
function(eb)

but the user exercises a code path that does this and fails::
eb = ebytes(b)
function(eb)

What do you think of making the encoding attribute a mandatory part of
creating an ebyte object?  (ex: ``eb = ebytes(b, 'euc-jp')``).
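
To make that concrete, a toy sketch of an ebytes whose encoding is
mandatory and checked up front (this type is purely hypothetical)::

    class ebytes(bytes):
        def __new__(cls, data, encoding):
            data.decode(encoding)   # raise UnicodeDecodeError here, not later
            self = super().__new__(cls, data)
            self.encoding = encoding
            return self

        def __str__(self):
            return self.decode(self.encoding)

    eb = ebytes(b'\xa5\xcf\xa5\xed\xa1\xbc', 'euc-jp')  # fails fast on a mismatch
    print(str(eb), eb.encoding)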

-Toshio


pgpc4qEcxzofr.pgp
Description: PGP signature
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] bytes / unicode

2010-06-21 Thread Toshio Kuratomi
On Tue, Jun 22, 2010 at 01:08:53AM +0900, Stephen J. Turnbull wrote:
 Lennart Regebro writes:
 
   2010/6/21 Stephen J. Turnbull step...@xemacs.org:
IMO, the UI is right.  Something like the above ought to work.
   
   Right. That said, many times when you want to do urlparse etc they
   might be binary, and you might want binary. So maybe the methods
   should work with both?
 
 First, a caveat: I'm a Unicode/encodings person, not an experienced
 web programmer.  My opinions on whether this would work well in
 practice should be taken with a grain of salt.
 
 Speaking for myself, I live in a country where the natives have
 saddled themselves with no less than 4 encodings in common use, and I
 would never want binary since none of them would display as anything
 useful in a traceback.  Wherever possible, I decode blobs into
 structured objects, I do it as soon as possible, and if for efficiency
 reasons I want to do this lazily, I store the blob in a separate
 .raw_object attribute.  If they're textual, I decode them to text.  I
 can't see an efficiency argument for decoding URIs lazily in most
 applications.
 
 In the case of structured text like URIs, I would create a separate
 class for handling them with string-like operations.  Internally, all
 text would be raw Unicode (ie, not url-encoded); repr(uri) would use
 some kind of readable quoting convention (not url-encoding) to
 disambiguate random reserved characters from separators, while
 str(uri) would produce an url-encoded string.  Converting to and from
 wire format is just .encode and .decode, then, and in this country you
 need to be flexible about which encoding you use.
 
 Agreed, this stuff is really annoying.  But I think that just comes
 with the territory.  PJE reports that folks don't like doing encoding
 and decoding all over the place.  I understand that, but if they're
 doing a lot of that, I have to wonder why.  Why not define the one
 line function and get on with life?
 
 The thing is, where I live, it's not going to be a one line function.
 I'm going to be dealing with URLs that are url-encoded representations
 of UTF-8, Shift-JIS, EUC-JP, and occasionally RFC 2047!  So I need an
 API that explicitly encodes and decodes.  And I need an API that
 presents Japanese as Japanese rather than as line noise.
 
 Eg, PJE writes
 
 Ugh.  I meant: 
 
 newurl = urljoin(str(base, 'latin-1'), 'subdir').encode('latin-1')
 
 Which just goes to the point of how ridiculous it is to have to  
 convert things to strings and back again to use APIs that ought to  
 just handle bytes properly in the first place. 
 
 But if you need that everywhere, what's so hard about
 
 def urljoin_wrapper (base, subdir):
 return urljoin(str(base, 'latin-1'), subdir).encode('latin-1')
 
 Now, note how that pattern fails as soon as you want to use
 non-ISO-8859-1 languages for subdir names.  In Python 3, the code
 above is just plain buggy, IMHO.  The original author probably will
 never need the generalization.  But her name will be cursed unto the
 nth generation by people who use her code on a different continent.
 
 The net result is that bytes are *not* a programmer- or user-friendly
 way to do this, except for the minority of the world for whom Latin-1
 is a good approximation to their daily-use unibyte encoding (eg, it's
 probably usable for debugging in Dansk, but you won't win any
 popularity contests in Tel Aviv or Shanghai).
 
One comment here -- you can also have uri's that aren't decodable into their
true textual meaning using a single encoding.

Apache will happily serve out uris that have utf-8, shift-jis, and euc-jp
components inside of their path but the textual representation that was intended
will be garbled (or be represented by escaped byte sequences).  For that
matter, apache will serve requests that have no true textual representation
as it is working on the byte level rather than the character level.

So a complete solution really should allow the programmer to pass in uris as
bytes when the programmer knows that they need it.

-Toshio




Re: [Python-Dev] email package status in 3.X

2010-06-21 Thread Toshio Kuratomi
On Mon, Jun 21, 2010 at 01:24:10PM -0400, P.J. Eby wrote:
 At 12:34 PM 6/21/2010 -0400, Toshio Kuratomi wrote:
 What do you think of making the encoding attribute a mandatory part of
 creating an ebyte object?  (ex: ``eb = ebytes(b, 'euc-jp')``).
 
 As long as the coercion rules force str+ebytes (or str % ebytes,
 ebytes % str, etc.) to result in another ebytes (and fail if the str
 can't be encoded in the ebytes' encoding), I'm personally fine with
 it, although I really like the idea of tacking the encoding to bytes
 objects in the first place.
 
I wouldn't like this.  It brings us back to the python2 problem where
sometimes you pass an ebyte into a function and it works and other times you
pass an ebyte into the function and it issues a traceback.  The coercion
must end up with a str and no traceback (this assumes that we've checked
that the ebyte and the encoding match when we create the ebyte).

If you want bytes out the other end, you should either have a different
function or explicitly transform the output from str to bytes.

So, what's the advantage of using ebytes instead of bytes?

* It keeps together the text and encoding information when you're taking
  bytes in and want to give bytes back under the same encoding.
* It takes some of the boilerplate that people are supposed to do (checking
  that bytes are legal in a specific encoding) and writes it into the
  initialization of the object.  That forces you to think about the issue
  at two points in the code:  when converting into ebytes and when
  converting out to bytes.  For data that's going to be used with both
  str and bytes, this is the accepted best practice.  (For exceptions, the
  byte type remains which you can do conversion on when you want to).
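
A rough sketch of what such a type might look like (ebytes here is purely
hypothetical, following the semantics discussed in this thread: the encoding is
checked at creation time, and mixing with str always yields str rather than
a traceback)::

    class ebytes(bytes):
        def __new__(cls, data, encoding):
            self = super().__new__(cls, data)
            data.decode(encoding)   # fail now if the bytes and encoding disagree
            self.encoding = encoding
            return self

        def __str__(self):
            return self.decode(self.encoding)

        def __add__(self, other):
            if isinstance(other, str):
                return str(self) + other     # mixed operation coerces to str
            return ebytes(bytes(self) + bytes(other), self.encoding)

    eb = ebytes('café'.encode('utf-8'), 'utf-8')
    eb + ' au lait'    # -> 'café au lait' (a str), never a traceback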

-Toshio




Re: [Python-Dev] email package status in 3.X

2010-06-21 Thread Toshio Kuratomi
On Mon, Jun 21, 2010 at 02:46:57PM -0400, P.J. Eby wrote:
 At 02:58 AM 6/22/2010 +0900, Stephen J. Turnbull wrote:
 Nick alluded to the The One Obvious Way as a change in architecture.
 
 Specifically: Decode all bytes to typed objects (str, images, audio,
 structured objects) at input.  Do no manipulations on bytes ever
 except decode and encode (both to text, and to special-purpose objects
 such as images) in a program that does I/O.
 
 This ignores the existence of use cases where what you have is text
 that can't be properly encoded in unicode.  I know, it's a hard thing
 to wrap one's head around, since on the surface it sounds like
 unicode is the programmer's savior.  Unfortunately, real-world text
 data exists which cannot be safely roundtripped to unicode, and must
 be handled in bytes with encoding form for certain operations.
 
 I personally do not have to deal with this *particular* use case any
 more -- I haven't been at NTT/Verio for six years now.  But I do know
 it exists for e.g. Asian language email handling, which is where I
 first encountered it.  At the time (this *may* have changed), many
 popular email clients did not actually support unicode, so you
 couldn't necessarily just send off an email in UTF-8.  It drove us
 nuts on the project where this was involved (an i18n of an existing
 Python app), and I think we had to compromise a bit in some fashion
 (because we couldn't really avoid unicode roundtripping due to
 database issues), but the use case does actually exist.
 
 My current needs are simpler, thank goodness.  ;-)  However, they
 *do* involve situations where I'm dealing with *other*
 encoding-restricted legacy systems, such as software for interfacing
 with the US Postal Service that only works with a restricted subset
 of latin1, while receiving mangled ASCII from an ecommerce provider,
 and storing things in what's effectively a latin-1 database.  Being
 able to easily assert what kind of bytes I've got would actually let
 me catch errors sooner, *if* those assertions were being checked when
 different kinds of strings or bytes were being combined.  i.e., at
 coercion time).
 
While it's certainly possible that you have a grapheme that has no
corresponding unicode codepoint, it doesn't sound like this is the case
you're dealing with here.  You talk about a restricted subset of latin1
but all of latin1's graphemes have unicode codepoints.  You also talk about
not being able to send off an email in UTF-8 but UTF-8 is an encoding of
unicode, not unicode itself.  Similarly, the statement that some email
clients don't support unicode isn't very clear as to the actual problem.  The
email client supports displaying graphemes using glyphs present on the
computer.  As long as the graphemes needed have a unicode codepoint, using
unicode inside of your application and then encoding to bytes on the way out
works fine.

Even in cases where there's no unicode codepoint for the grapheme that
you're receiving unicode gives you a way out.  It provides you a private use
area where you can map the graphemes to unused codepoints.  Your
application keeps a mapping from that codepoint to the particular byte
sequence that you want.  Then write a codec that converts from unicode with
these private codepoints into your particular encoding (and from bytes into
unicode).
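
A minimal sketch of that approach (the codepoint range and helper names are
made up; a real implementation would register a proper codec with the codecs
module, and a multibyte fallback encoding would need more care)::

    PUA_BASE = 0xE000   # park unmappable byte values at U+E000..U+E0FF

    def decode_with_pua(raw, encoding):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            # Fall back byte by byte, stashing undecodable bytes in the PUA.
            return ''.join(
                bytes([b]).decode(encoding) if b < 0x80 else chr(PUA_BASE + b)
                for b in raw)

    def encode_with_pua(text, encoding):
        out = bytearray()
        for ch in text:
            cp = ord(ch)
            if PUA_BASE <= cp <= PUA_BASE + 0xFF:
                out.append(cp - PUA_BASE)
            else:
                out.extend(ch.encode(encoding))
        return bytes(out)

    raw = b'label: \x8f\xa1'               # not decodable in the chosen encoding
    text = decode_with_pua(raw, 'ascii')   # safe to handle as str internally
    assert encode_with_pua(text, 'ascii') == raw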

-Toshio




Re: [Python-Dev] email package status in 3.X

2010-06-21 Thread Toshio Kuratomi
On Mon, Jun 21, 2010 at 04:09:52PM -0400, P.J. Eby wrote:
 At 03:29 PM 6/21/2010 -0400, Toshio Kuratomi wrote:
 On Mon, Jun 21, 2010 at 01:24:10PM -0400, P.J. Eby wrote:
  At 12:34 PM 6/21/2010 -0400, Toshio Kuratomi wrote:
  What do you think of making the encoding attribute a mandatory part of
  creating an ebyte object?  (ex: ``eb = ebytes(b, 'euc-jp')``).
 
  As long as the coercion rules force str+ebytes (or str % ebytes,
  ebytes % str, etc.) to result in another ebytes (and fail if the str
  can't be encoded in the ebytes' encoding), I'm personally fine with
  it, although I really like the idea of tacking the encoding to bytes
  objects in the first place.
 
 I wouldn't like this.  It brings us back to the python2 problem where
 sometimes you pass an ebyte into a function and it works and other times you
 pass an ebyte into the function and it issues a traceback.
 
 For stdlib functions, this isn't going to happen unless your ebytes'
 encoding is not compatible with the ascii subset of unicode, or the
 stdlib function is working with dynamic data...  in which case you
 really *do* want to fail early!
 
The ebytes encoding will often be incompatible with the ascii subset.
It's the reason that people were so often tempted to change the
defaultencoding on python2 to utf8.

 I don't see this as a repeat of the 2.x situation; rather, it allows
 you to cause errors to happen much *earlier* than they would
 otherwise show up if you were using unicode for your encoded-bytes
 data.
 
 For example, if your program's intent is to end up with latin-1
 output, then it would be better for an error to show up at the very
 *first* point where non-latin1 characters are mixed with your data,
 rather than only showing up at the output boundary!
 
That highly depends on your usage.  If you're formatting a comment on a web
page, checking at output and replacing with '?' is better than a traceback.
If you're entering key values into a database, then you likely want to know
where the non-latin1 data is entering your program, not where it's mixed
with your data or the output boundary.
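
For instance, the comment-on-a-web-page case can be handled at the output
boundary with the 'replace' error handler::

    comment = 'smile \u263a'
    comment.encode('latin-1', errors='replace')   # b'smile ?' instead of a traceback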

 However, if you promoted mixed-type operation results to unicode
 instead of ebytes, then you:
 
 1) can't preserve data that doesn't have a 1:1 mapping to unicode, and
 
ebytes should be immutable like bytes and str.  So you shouldn't lose the
data if you keep a reference to it.

 2) can't detect an error until your data reaches the output point in
 your application -- forcing you to defensively insert ebytes calls
 everywhere (vs. simply wrapping them around a handful of designated
 inputs), or else have to go right back to tracing down where the
 unusable data showed up in the first place.
 
Usually, you don't want to know where you are combining two incompatible
strings.  Instead, you want to know where the incompatible strings are being
set in the first place.  If function(a, b) tracebacks with certain
combinations of a and b I need to know where a and b are being set, not
where function(a, b) is in the source code.  So you need to be making input
values ebytes() (or str in current python3) no matter what.

 One thing that seems like a bit of a blind spot for some folks is
 that having unicode is *not* everybody's goal.  Not because we don't
 believe unicode is generally a good thing or anything like that, but
 because we have to work with systems that flat out don't *do*
 unicode, thereby making the presence of (fully-general) unicode an
 error condition that has to be stamped out!
 
I think that sometimes as well.  However, here I think you're in a bit of
a blind spot yourself.  I'm saying that making ebytes + str coerce to ebytes
will only yield a traceback some of the time; which is the python2
behaviour.  Having ebytes + str coerce to str will never throw a traceback
as long as our implementation checks that the bytes and encoding work
together from the start.

Throwing an error in code, only on some input is one of the main reasons
that debugging unicode vs byte issues sucks on python2.  On my box, with my
dataset, everything works.  Toss it up on pypi and suddenly I have a user in
Japan who reports that he gets a traceback with his dataset that he can't
give to me because it's proprietary, overly large, or transient.



 IOW, if you're producing output that has to go into another system
 that doesn't take unicode, it doesn't matter how
 theoretically-correct it would be for your app to process the data in
 unicode form.  In that case, unicode is not a feature: it's a bug.
 
This is not always true.  Suppose you read a webpage, chop it up to get
a list of words, create a histogram of word length, and then write the output as
utf8 to a database.  Should you do all your intermediate string operations
on utf8 encoded byte strings?  No, you should do them on unicode strings as
otherwise you need to know about the details of how utf8 encodes characters.
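
A sketch of that example (the page content is a stand-in literal here)::

    import re
    from collections import Counter

    raw = 'Ça roule, très bien!'.encode('utf-8')   # stand-in for the fetched page
    text = raw.decode('utf-8')                     # decode once, at the boundary
    histogram = Counter(len(word) for word in re.findall(r'\w+', text))
    # Word lengths are counted in characters; counting the utf-8 bytes instead
    # would make 'Ça' and 'très' each come out one too long.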

 And as it really *is* an error in that case, it should not pass
 silently

Re: [Python-Dev] email package status in 3.X

2010-06-21 Thread Toshio Kuratomi
On Mon, Jun 21, 2010 at 04:52:08PM -0500, John Arbash Meinel wrote:
 
 ...
  IOW, if you're producing output that has to go into another system
  that doesn't take unicode, it doesn't matter how
  theoretically-correct it would be for your app to process the data in
  unicode form.  In that case, unicode is not a feature: it's a bug.
 
  This is not always true.  If you read a webpage, chop it up so you get
  a list of words, create a histogram of word length, and then write the 
  output as
  utf8 to a database.  Should you do all your intermediate string operations
  on utf8 encoded byte strings?  No, you should do them on unicode strings as
  otherwise you need to know about the details of how utf8 encodes characters.
  
 
 You'd still have problems in Unicode given stuff like å =~ å even though
 u'\xe5' vs u'a\u030a' (those will look the same depending on your
 Unicode system. IDLE shows them pretty much the same, T-Bird on Windosw
 with my current font shows the second as 2 characters.)
 
 I realize this was a toy example, but it does point out that Unicode
 complicates the idea of 'equality' as well as the idea of 'what is a
 character'. And just saying decode it to Unicode isn't really sufficient.
 
Ah -- but if you're dealing with unicode objects you can use the
unicodedata.normalize() function on them to come out with the right values.
If you're using bytes, it's yet another case where you, the programmer, have
to know what byte sequences represent combining characters in the particular
encoding that you're dealing with.
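
For example::

    import unicodedata

    precomposed = '\xe5'       # 'å' as a single codepoint
    decomposed = 'a\u030a'     # 'a' followed by COMBINING RING ABOVE

    precomposed == decomposed                                  # False
    unicodedata.normalize('NFC', decomposed) == precomposed    # True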

-Toshio




Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Toshio Kuratomi
On Tue, Jun 22, 2010 at 11:58:57AM +0900, Stephen J. Turnbull wrote:
 Toshio Kuratomi writes:
 
   One comment here -- you can also have uri's that aren't decodable into 
 their
   true textual meaning using a single encoding.
   
   Apache will happily serve out uris that have utf-8, shift-jis, and
   euc-jp components inside of their path but the textual
   representation that was intended will be garbled (or be represented
   by escaped byte sequences).  For that matter, apache will serve
   requests that have no true textual representation as it is working
   on the byte level rather than the character level.
 
 Sure.  I've never seen that combination, but I have seen Shift JIS and
 KOI8-R in the same path.
 
 But in that case, just using 'latin-1' as the encoding allows you to
 use the (unicode) string operations internally, and then spew your
 mess out into the world for someone else to clean up, just as using
 bytes would.
 
This is true.  I'm giving this as a real-world counter example to the
assertion that URIs are text.  In fact, I think you're confusing things
a little by asserting that the RFC says that URIs are text.  I'll address
that in two sections down.

   So a complete solution really should allow the programmer to pass
   in uris as bytes when the programmer knows that they need it.
 
 Other than passing bytes into a constructor, I would argue if a
 complete solution requires, eg, an interface that allows
 urljoin(base,subdir) where the types of base and subdir are not
 required to match, then it doesn't belong in the stdlib.  For stdlib
 usage, that's premature optimization IMO.
 
I'll definitely buy that.  Would urljoin(b_base, b_subdir) = bytes and
urljoin(u_base, u_subdir) = unicode be acceptable though?  (I think, given
other options, I'd rather see two separate functions, though.  It seems more
discoverable and less prone to taking bad input some of the time to have two
functions that clearly only take one type of data apiece.)

 The RFC says that URIs are text, and therefore they can (and IMO
 should) be operated on as text in the stdlib.

If I'm reading the RFC correctly, you're actually operating on two different
levels here.  Here's the section 2 that you quoted earlier, now in its
entirety::
2.  Characters

   The URI syntax provides a method of encoding data, presumably for the
   sake of identifying a resource, as a sequence of characters.  The URI
   characters are, in turn, frequently encoded as octets for transport or
   presentation.  This specification does not mandate any particular
   character encoding for mapping between URI characters and the octets used
   to store or transmit those characters.  When a URI appears in a protocol
   element, the character encoding is defined by that protocol; without such
   a definition, a URI is assumed to be in the same character encoding as
   the surrounding text.

   The ABNF notation defines its terminal values to be non-negative integers
   (codepoints) based on the US-ASCII coded character set [ASCII].  Because
   a URI is a sequence of characters, we must invert that relation in order
   to understand the URI syntax.  Therefore, the integer values used by the
   ABNF must be mapped back to their corresponding characters via US-ASCII
   in order to complete the syntax rules.

   A URI is composed from a limited set of characters consisting of digits,
   letters, and a few graphic symbols.  A reserved subset of those
   characters may be used to delimit syntax components within a URI while
   the remaining characters, including both the unreserved set and those
   reserved characters not acting as delimiters, define each component's
   identifying data.

So here's some data that matches those terms up to actual steps in the
process::

  # We start off with some arbitrary data that defines a resource.  This is
  # not necessarily text.  It's the data from the first sentence:
  data = b'\xff\xf0\xef\xe0'

  # We encode that into text and combine it with the scheme and host to form
  # a complete uri.  This is the URI characters mentioned in section #2.
  # It's also the sequence of characters mentioned in 1.1 as it is not
  # until this point that we actually have a URI.
  uri = b'http://host/' + percentencoded(data)
  # 
  # Note1: percentencoded() needs to take any bytes or characters outside of
  # the characters listed in section 2.3 (ALPHA / DIGIT / - / . / _
  # / ~) and percent encode them.  The URI can only consist of characters
  # from this set and the reserved character set (2.2).
  #
  # Note2: in this simplistic example, we're only dealing with one piece of
  # data.  With multiple pieces, we'd need to combine them with separators,
  # for instance like this:
  # uri = b'http://host/' + percentencoded(data1) + b'/'
  # + percentencoded(data2)
  #
  # Note3: at this point, the uri could be stored as unicode or bytes in
  # python3.  It doesn't matter.  It will be a subset of ASCII in either
  # case.

  # Then we

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Toshio Kuratomi
On Tue, Jun 22, 2010 at 08:31:13PM +0900, Stephen J. Turnbull wrote:
 Toshio Kuratomi writes:
   unicode handling redesign.  I'm stating my reading of the RFC not to defend
   the use case Philip has, but because I think that the outlook that non-text
   uris (before being percentencoded) are violations of the RFC
 
 That's not what I'm saying.  What I'm trying to point out is that
 manipulating a bytes object as an URI sort of presumes a lot about its
 encoding as text.

I think we're more or less in agreement now but here I'm not sure.  What
manipulations are you thinking about?  Which stage of URI construction are
you considering?

I've just taken a quick look at python3.1's urllib module and I see that
there is a bit of confusion there.  But it's not about unicode vs bytes but
about whether a URI should be operated on at the real URI level or the
data-that-makes-a-uri level.

* all functions I looked at take python3 str rather than bytes so there's no
  confusing stuff here
* urllib.request.urlopen takes a strict uri.  That means that you must have
  a percent encoded uri at this point
* urllib.parse.urljoin takes regular string values
* urllib.parse.urlparse and urllib.parse.urlunparse take regular string values

 Since many of the URIs we deal with are more or
 less textual, why not take advantage of that?

Cool, so to summarize what I think we agree on:

* Percent encoded URIs are text according to the RFC.
* The data that is used to construct the URI is not defined as text by the
  RFC.
* However, it is very often text in an unspecified encoding
* It is extremely convenient for programmers to be able to treat the data
  that is used to form a URI as text in nearly all common cases.

-Toshio




Re: [Python-Dev] bytes / unicode

2010-06-23 Thread Toshio Kuratomi
On Wed, Jun 23, 2010 at 09:36:45PM +0200, Antoine Pitrou wrote:
 On Wed, 23 Jun 2010 14:23:33 -0400
 Tres Seaver tsea...@palladion.com wrote:
  - - the slow adoption / porting rate of major web frameworks and libraries
to Python 3.
 
 Some of the major web frameworks and libraries have a ton of
 dependencies, which would explain why they really haven't bothered yet.
 
 I don't think you can't claim, though, that Python 3 makes things
 significantly harder for these frameworks. The proof is that many of
 them already give the user unicode strings in Python 2.x. They must
 have somehow got the decoding right.
 
Note that this assumption seems optimistic to me.  I started talking to Graham
Dumpleton, author of mod_wsgi a couple years back because mod_wsgi and paste
do decoding of bytes to unicode at different layers which caused problems
for application level code that should otherwise run fine when being served
by mod_wsgi or paste httpserver.  That was the beginning of Graham starting
to talk about what the wsgi spec really should look like under python3
instead of the broken way that the appendix to the current wsgi spec states.

-Toshio




Re: [Python-Dev] bytes / unicode

2010-06-23 Thread Toshio Kuratomi
On Wed, Jun 23, 2010 at 11:35:12PM +0200, Antoine Pitrou wrote:
 On Wed, 23 Jun 2010 17:30:22 -0400
 Toshio Kuratomi a.bad...@gmail.com wrote:
  Note that this assumption seems optimistic to me.  I started talking to 
  Graham
  Dumpleton, author of mod_wsgi a couple years back because mod_wsgi and paste
  do decoding of bytes to unicode at different layers which caused problems
  for application level code that should otherwise run fine when being served
  by mod_wsgi or paste httpserver.  That was the beginning of Graham starting
  to talk about what the wsgi spec really should look like under python3
  instead of the broken way that the appendix to the current wsgi spec states.
 
 Ok, but the reason would be that the WSGI spec is broken. Not Python 3
 itself.
 
Agreed.  Neither python2 nor python3 is broken.  It's the wsgi spec and the
implementation of that spec where things fall down.  From your first post,
I thought you were claiming that python3 was broken since web frameworks got
decoding right on python2 and I just wanted to defend python3 by showing
that python2 wasn't all sunshine and roses.

-Toshio




Re: [Python-Dev] Licensing

2010-07-06 Thread Toshio Kuratomi
On Tue, Jul 06, 2010 at 10:10:09AM +0300, Nir Aides wrote:
 I take ...running off with the good stuff and selling it for profit to mean
 creating derivative work and commercializing it as proprietary code which 
 you
 can not do with GPL licensed code. Also, while the GPL does not prevent 
 selling
 copies for profit it does not make it very practical either.
 
Uhmmm http://finance.yahoo.com/q/is?s=RHTannual

It is very possible to make money with the GPL.  The GPL does, as you say,
prevents you from creating derivative works that are proprietary code.  It
does *not* prevent you from creating derivative works and commercializing
it.

-Toshio




Re: [Python-Dev] Fixing #7175: a standard location for Python config files

2010-08-12 Thread Toshio Kuratomi
On Fri, Aug 13, 2010 at 07:48:22AM +1000, Nick Coghlan wrote:
 2010/8/12 Éric Araujo mer...@netwok.org:
  Choosing an arbitrary location we think is good on every system is fine
  and non risky I think, as long as Python let the various distribution
  change those paths though configuration.
 
  Don’t you have a bootstrapping problem? How do you know where to look at
  the sysconfig file that tells where to look at config files?

I'd hardcode a list of locations.
  [os.path.join(os.path.dirname(__file__), 'sysconfig.cfg'),
   os.path.join('/etc', 'sysconfig.cfg')]

The distributor has a limited choice of options on where to look.

A good alternative would be to make the config file overridable.  That way
you can have sysconfig.cfg next to sysconfig.py or in a known config
directory relative to the python stdlib install but also let the
distributions and individual sites override the defaults by making changes
to /etc/python3/sysconfig.cfg for instance.
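
A minimal sketch of that lookup order, assuming configparser and the file names
used above (later files win, so a site can override the defaults shipped next
to the stdlib)::

    import os
    from configparser import ConfigParser

    candidates = [
        os.path.join(os.path.dirname(__file__), 'sysconfig.cfg'),
        os.path.join('/etc', 'python3', 'sysconfig.cfg'),
    ]

    cfg = ConfigParser()
    cfg.read(candidates)   # read() silently skips files that don't exist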

 
 Personally, I'm not clear on what a separate syconfig.cfg file offers
 over clearly separating the directory configuration settings and
 continuing to have distributions patch sysconfig.py directly. The
 bootstrapping problem (which would encourage classifying synconfig.cfg
 as source code and placing it alongside syscongig.py) is a major part
 of that point of view.
 
Here's some advantages but some of them are of dubious worth:

* Allows users/site-administrators to change paths and not have packaging
  systems overwrite the changes.
* Makes it conceptually cleaner to make this overridable via user defined
  config files since  it's now a matter of parsing several config files
  instead of having a hardcoded value in the file and overridable values
  outside of it.
* Allows sites to add additional paths to the config file.
* Makes it clear to distributions that the values in the config file are
  available for them to change, rather than having to look for a value in code
  and not know the difference between that and, say, the encoding parameter
  in python2.
* Documents the format to use for overriding the paths if individual sites
  can override the defaults that are shipped in the system version of
  python.

-Toshio




Re: [Python-Dev] (Not) delaying the 3.2 release

2010-09-16 Thread Toshio Kuratomi
On Thu, Sep 16, 2010 at 09:52:48AM -0400, Barry Warsaw wrote:
 On Sep 16, 2010, at 11:28 PM, Nick Coghlan wrote:
 
 There are some APIs that should be able to handle bytes *or* strings,
 but the current use of string literals in their implementation means
 that bytes don't work. This turns out to be a PITA for some networking
 related code which really wants to be working with raw bytes (e.g.
 URLs coming off the wire).
 
 Note that email has exactly the same problem.  A general solution -- even if
 embodied in *well documented* best-practices and convention -- would really
 help make the stdlib work consistently, and I bet third party libraries too.
 
I too await a solution with bated breath :-) I've been working on
documenting best practices for APIs and Unicode, and for this type of
function (take bytes or unicode and output the same type), knowing the
encoding seems like a requirement in most cases:

http://packages.python.org/kitchen/designing-unicode-apis.html#take-either-bytes-or-unicode-output-the-same-type

I'd love to add another strategy there that shows how you can robustly
operate on bytes without knowing the encoding but from writing that, I think
that anytime you simplify your API you have to accept limitations on the
data you can take in.  (For instance, some simplifications can handle
anything except ASCII-incompatible encodings).
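
As a tiny sketch of that pattern (the function and its default encoding are
made up for illustration)::

    def normalize_spaces(value, encoding='utf-8'):
        """Collapse runs of whitespace; bytes in, bytes out; str in, str out."""
        was_bytes = isinstance(value, bytes)
        text = value.decode(encoding) if was_bytes else value
        text = ' '.join(text.split())
        return text.encode(encoding) if was_bytes else text

    normalize_spaces('a  b')     # -> 'a b'
    normalize_spaces(b'a  b')    # -> b'a b'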

-Toshio




Re: [Python-Dev] (Not) delaying the 3.2 release

2010-09-16 Thread Toshio Kuratomi
On Thu, Sep 16, 2010 at 10:56:56AM -0700, Guido van Rossum wrote:
 On Thu, Sep 16, 2010 at 10:46 AM, Martin (gzlist) gzl...@googlemail.com 
 wrote:
  On 16/09/2010, Guido van Rossum gu...@python.org wrote:
 
  In all cases I can imagine where such polymorphic functions make
  sense, the necessary and sufficient assumption should be that the
  encoding is a superset of 7-bit(*) ASCII. This includes UTF-8, all
  Latin-N variant, and AFAIK also the popular CJK encodings other than
  UTF-16. This is the same assumption made by Python's byte type when
  you use character-based methods like lower().
 
  Well, depends on what exactly you're doing, it's pretty easy to go wrong:
 
  Python 3.2a2+ (py3k, Sep 16 2010, 18:43:45) [MSC v.1500 32 bit (Intel)] on 
  win32
  Type "help", "copyright", "credits" or "license" for more information.
  >>> import os, sys
  >>> os.path.split("C:\\十")
  ('C:\\', '十')
  >>> os.path.split("C:\\十".encode(sys.getfilesystemencoding()))
  (b'C:\\\x8f', b'')
 
  Similar things can catch out web developers once they step outside the
  percent encoding.
 
 Well, that character is not 7-bit ASCII. Of course things will go
 wrong there. That's the whole point of what I said, isn't it?
 
You were talking about encodings that were supersets of 7-bit ASCII.
I think Martin was demonstrating a byte string in an encoding that is a superset
of 7-bit ASCII being fed to a stdlib function which went wrong.

-Toshio




Re: [Python-Dev] We should be using a tool for code reviews

2010-09-30 Thread Toshio Kuratomi
On Wed, Sep 29, 2010 at 01:23:24PM -0700, Guido van Rossum wrote:
 On Wed, Sep 29, 2010 at 1:12 PM, Brett Cannon br...@python.org wrote:
  On Wed, Sep 29, 2010 at 12:03, Guido van Rossum gu...@python.org wrote:
  A problem with that is that we regularly make matching improvements to
  upload.py and the server-side code it talks to. While we tend to be
  conservative in these changes (because we don't control what version
  of upload.py people use) it would be a pain to maintain backwards
  compatibility with a version that was distributed in Misc/ two years
  ago -- that's kind of outside our horizon.
 
  Well, I would assume people are working from a checkout. Patches from
  an outdated checkout simply would fail and that's fine by me.
 
 Ok, but that's an extra barrier for contributions. Lots of people when
 asked for a patch just modify their distro in place and you can count
 yourself lucky if they send you a diff from a clean copy.
 
 But maybe with Hg it's less of a burden to ask people to use a checkout.
 
  How often do we even get patches generated from a downloaded copy of
  Python? Is it enough to need to worry about this?
 
 I used to get these frequently. I don't know what the experience of
 the current crop of core developers is though, so maybe my gut
 feelings here are outdated.
 
When helping out on a Linux distribution, dealing with patches against the
latest tarball is a fairly frequent occurrence.  The question would be
whether these patches get filtered through the maintainer of the package
before landing in roundup/rietveld and whether the distro maintainer is
sufficiently in tune with python development that they're maintaining both
patches against the last tarball and a checkout of trunk with the patches
applied intelligently there.

A few other random thoughts:

* hg could be more of a burden in that it may be unfamiliar to the casual
  python user who happens to have found a fix for a bug and wants to submit
  it.  cvs and svn are similar enough that people comfortable with one are
  usually comfortable with the other but hg has different semantics.
* The barrier to entry seems to be higher the less well integrated the tools
  are.  I occasionally try to contribute patches to bzr in launchpad and
  the integration there is horrid.  You end up with two separate streams of
  comments and you don't automatically get subscribed to both.  There's
  several UI elements for associating a branch with a bug but some of them
  are buggy (or else are very strict on what input they're expecting) while
  other ones are hard to find.  Since I only contribute a patch two or three
  times a year, I have to re-figure out the process each time I try to
  contribute.
* I like the idea of patch complexity being a measure of whether the patch
  needs to go into a code review tool in that it keeps simple things simple
  and gives more advanced tools to more advanced cases.  I dislike it in
  that for someone who's just contributing a patch to fix a problem that
  they're encountering which happens to be somewhat complex, they end up
  having to learn a lot about tools that they may never use again.
* It seems like code review will be a great aid to people who submit changes
  or review changes frequently.  The trick will be making it
  non-intimidating for someone who's just going to contribute changes
  infrequently.

-Toshio




Re: [Python-Dev] Distutils2 scripts

2010-10-08 Thread Toshio Kuratomi
On Fri, Oct 08, 2010 at 10:26:36AM -0400, Barry Warsaw wrote:
 On Oct 08, 2010, at 03:22 PM, Tarek Ziadé wrote:
 
 Yes that what I was thinking about -- I am not too worried about this,
 since every Linux  deals with the 'more than one python installed'
 case.
 
 Kind of. wink  but anyway...
 
  I'm in favor of add a top-level setup module that can be invoked using
  python -m setup   There will be three cases:
 
 Nice idea ! I wouldn't call it setup though, since it does many other
 things. I can't think of a good name yet, but I'd like such a script
 to express the idea that it can be used to:
 
 I like 'python -m setup' too.  It's a small step from the familiar thing
 (python setup.py) to the new and shiny thing, without being confusing.  And
 you won't have to worry about things like version numbers because the Python
 executable will already have that baked in.
 
 - query pypi
 - browse what's installed
 - install/remove projects
 - create releases and upload them
 
 pkg_manager ?
 
 No underscores, please. :)
 
 Actually, a decent wrapper script could just be called 'setup'.  My
 command-not-found on Ubuntu doesn't find a collision, or even close
 similarities.
 
Simple English names like this are almost never a good idea for commands.
A quick google for /usr/bin/setup finds that Fedora-derived distros have
a /usr/bin/setup as a wrapper for all the text-mode configuration tools.
And there's a derivative of opensolaris that has a /usr/bin/setup for
configuring the system the first time.

 I still like 'egg' as a command too.  There are no collisions that I can see.
 I know this has been thrown around for years, and it's always been rejected
 because I think setuptools wanted to claim it, but since it still doesn't
 exist afaict, distutils2 could easily use it.
 
There's a 2D graphics library that provides a /usr/bin/egg command:
  http://www.ir.isas.jaxa.jp/~cyamauch/eggx_procall/
Latest Stable Version 0.93r3 (released 2010/4/14)

In the larger universe of programs, it might make for more intuitive
remembering of the command to use a prefix (either py or python) though.

python-setup  is a lot like python setup.py
pysetup is shorter
pyegg is even shorter :-)

-Toshio




Re: [Python-Dev] Distutils2 scripts

2010-10-08 Thread Toshio Kuratomi
On Fri, Oct 08, 2010 at 05:12:44PM +0200, Antoine Pitrou wrote:
 On Fri, 8 Oct 2010 11:04:35 -0400
 Toshio Kuratomi a.bad...@gmail.com wrote:
  
  In the larger universe of programs, it might make for more intuitive
  remembering of the command to use a prefix (either py or python) though.
  
  python-setup  is a lot like python setup.py
  pysetup is shorter
  pyegg is even shorter :-)
 
 Wouldn't quiche be a better alternative for pyegg?
 
I won't bikeshed as long as we stay away from conflicting names.

-Toshio




Re: [Python-Dev] My work on Python3 and non-ascii paths is done

2010-10-21 Thread Toshio Kuratomi
On Thu, Oct 21, 2010 at 12:00:40PM -0400, Barry Warsaw wrote:
 On Oct 20, 2010, at 02:11 AM, Victor Stinner wrote:
 
 I plan to fix Python documentation: specify the encoding used to decode all 
 byte string arguments of the C API. I already wrote a draft patch: issue 
 #9738. This lack of documentation was a big problem for me, because I had to 
 follow the function calls to get the encoding.
 
This will be truly excellent!

 That's exactly what I was looking for!  Thanks.  I think you've learned a huge
 amount of good information that's difficult to find, so writing it up in a
 more permanent and easy to find location will really help future Python
 developers!
 
One further thing I'd be interested in is if you could document any best
practices from this experience.  Things like, surrogateescape is a good/bad
default in these cases,  When is parallel functions for bytes and str
better than a single polymorphic function?  That way when other modules are
added to the stdlib, things can be more consistent.

-Toshio




Re: [Python-Dev] Continuing 2.x

2010-10-29 Thread Toshio Kuratomi
On Fri, Oct 29, 2010 at 11:12:28AM -0700, geremy condra wrote:
 On Thu, Oct 28, 2010 at 11:55 PM, Glyph Lefkowitz
  Let's take PyPI numbers as a proxy.  There are ~8000 packages with a
  Programming Language::Python classifier.  There are ~250 with Programming
  Langauge::Python::3.  Roughly speaking, we can say that is 3% of Python
  code which has been ported so far.  Python 3.0 was released at the end of
  2008, so people have had roughly 2 years to port, which comes up with 1.5%
  per year.
 Just my two cents:
 
Just one further informational note about using pypi in this way for
statistics... In the porting work we've done within Fedora, I've noticed
that a lot of packages are python3 ready or even officially support python3
but the language classifier on pypi does not reflect this.  Here's just
a few since I looked them up when working on the python porting wiki pages:

http://pypi.python.org/pypi/Beaker/
http://pypi.python.org/pypi/pycairo
http://pypi.python.org/pypi/docutils

-Toshio




Re: [Python-Dev] Breaking undocumented API

2010-11-08 Thread Toshio Kuratomi
On Tue, Nov 09, 2010 at 11:46:59AM +1100, Ben Finney wrote:
 Ron Adam r...@ronadam.com writes:
 
  def _publicly_documented_private_api():
Not sure why you would want to do this
   instead of using comments.
  
  ...
 
 Because the docstring is available at the interpreter via ‘help()’, and
 because it's automatically available to ‘doctest’, and most of the other
 good reasons for docstrings.
 
  The _publicly_documented_private_api() is a problem because people
  *will* use it even though it has a leading underscore. Especially
  those who are new to python.
 
 That isn't an argument against docstrings, since the problem you
 describe isn't dependent on the presence or absence of docstrings.
 
Just wanted to expand a bit here:  as a general practice, you may be
involved in a project where the _private_api() is not intended to be used by
people outside of the project but is intended to be used in multiple places within
the project.  If you have different people working on those different areas,
it can be very useful for them to be able to use help(_private_api) on the
other functions from within the interpreter shell.
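
For example (a made-up helper)::

    def _merge_records(old, new):
        """Merge two config records, preferring values from *new*.

        Internal helper shared across this project; not part of the public API.
        """
        merged = dict(old)
        merged.update(new)
        return merged

    # At the interpreter, help(_merge_records) shows the docstring above.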

-Toshio




Re: [Python-Dev] Breaking undocumented API

2010-11-09 Thread Toshio Kuratomi
On Tue, Nov 09, 2010 at 01:49:01PM -0500, Tres Seaver wrote:
 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA1
 
 On 11/08/2010 06:26 PM, Bobby Impollonia wrote:
 
  This does hurt because anyone who was relying on import * to get a
  name which is now omitted from __all__ is going to upgrade and find
  their program failing with NameErrors. This is a backwards compatible
  change and shouldn't happen without a deprecation warning first.
 
 Outside an interactive prompt, anyone using from foo import * has set
 themselves and their users up to lose anyway.
 
 That syntax is the single worst misfeature in all of Python.  It impairs
 readability and discoverability for *no* benefit beyond one-time typing
 convenience.  Module writers who compound the error by expecting to be
 imported this way, thereby bogarting the global namespace for their own
 purposes, should be fish-slapped. ;)
 
I think there's a valid case for bogarting the namespace in this instance,
but let me know if there's a better way to do it::

# Method to use system libraries if available, otherwise use a bundled copy,
# aka: make both system packagers and developers happy::


Relevant directories and files for this module::

+ foo/
+- __init__.py
++ compat/
 +- __init__.py
 ++ bar/
  +- __init__.py
  +- _bar.py

foo/compat/bar/_bar.py is a bundled module.

foo/compat/bar/__init__.py has:

try:
    from bar import *
    from bar import __all__
except ImportError:
    from foo.compat.bar._bar import *
    from foo.compat.bar._bar import __all__
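
Callers elsewhere in foo then stay agnostic about which copy they got
(do_something() is just a stand-in for whatever the real bar exports)::

    from foo.compat import bar
    bar.do_something()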

-Toshio




Re: [Python-Dev] Porting Ideas

2010-12-01 Thread Toshio Kuratomi
On Wed, Dec 01, 2010 at 10:06:24PM -0500, Alexander Belopolsky wrote:
 On Wed, Dec 1, 2010 at 9:53 PM, Terry Reedy tjre...@udel.edu wrote:
 ..
  Does Sphinx run on PY3 yet?
 
 It does, but see issue10224 for details.
 
  http://bugs.python.org/issue10224

Also, docutils has an unported module.

/me needs to write a bug report for that as he really doesn't have the time
he thought he did to perform the port.

-Toshio




Re: [Python-Dev] PEP 384 accepted

2010-12-04 Thread Toshio Kuratomi
On Fri, Dec 03, 2010 at 11:52:41PM +0100, Martin v. Löwis wrote:
 Am 03.12.2010 23:48, schrieb Éric Araujo:
  But I'm not interested at all in having it in distutils2. I want the
  Python build itself to use it, and alas, I can't because of the freeze.
  You can’t in 3.2, true.  Neither can you in 3.1, or any previous
  version.  If you implement it in distutils2, you have very good chances
  to get it for 3.3.  Isn’t that a win?
 
 It is, unfortunately, a very weak promise. Until distutils2 is
 integrated in Python, I probably won't spend any time on it.
 
At the language summit it was proposed and seemed generally accepted (maybe
I took silence as consent... it's been almost a year now) that bold new
modules (and bold rewrites of existing modules since it fell out of the
distutils/2 discussion) should get implemented in a module on pypi before
being merged into the python stdlib.  If you wouldn't want to work on any of
those modules until they were actually integrated into Python, it sounds
like you disagree with that as a general practice?

-Toshio




Re: [Python-Dev] Import and unicode: part two

2011-01-19 Thread Toshio Kuratomi
On Wed, Jan 19, 2011 at 04:40:24PM -0500, Terry Reedy wrote:
 On 1/19/2011 4:05 PM, Simon Cross wrote:
 
 I have no problem with non-ASCII module identifiers being valid
 syntax. It's a question of whether attempting to translate a non-ASCII
 
 If the names are the same, ie, produced with the same sequence of
 keystrokes in the save-as box and importing box, then there is no
 translation, at least from the user's view.
 
 module name into a file name (so the file can be imported) is a good
 idea and whether these sorts of files can be safely transferred among
 diverse filesystems.
 
 I believe we now have the situation that a package that works on *nix
 could fail on Windows, whereas I believe that patch would *improve*
 portability.
 
I'm not so sure about this.  You may have something that works on Windows
and on *NIX under certain circumstances but it seems likely to fail when
moving files between them (for instance, as packages downloaded from pypi).
Additionally, many unix filesystems don't specify a filesystem encoding for
filenames; they deal in legal and illegal bytes, which can lead to
trouble.  This problem of which encoding to use is a problem that can be
seen on UNIX systems even now.  Try this:

  echo 'print("hi")' > café.py
  convmv -f utf-8 -t latin1 café.py
  python3 -c 'import café'

ASCII seems very sensible to me when faced with these ambiguities.

Other options I can brainstorm that could be explored:

* Specify an encoding per platform and stick to that.  (So, for instance,
  all module names on posix platforms would have to be utf-8).  Force
  translation between encoding when installing packages (But that doesn't
  help for people that are creating their modules using their own build
  scripts rather than distutils, copying the files using raw tar, etc.)
* Change import semantics to allow specifying the encoding of the module on
  the filesystem (seems really icky).

-Toshio




Re: [Python-Dev] Import and unicode: part two

2011-01-19 Thread Toshio Kuratomi
On Wed, Jan 19, 2011 at 07:11:52PM -0500, James Y Knight wrote:
 On Jan 19, 2011, at 6:44 PM, Toshio Kuratomi wrote:
  This problem of which encoding to use is a problem that can be
  seen on UNIX systems even now.  Try this:
  
   echo 'print(hi)'  café.py
   convmv -f utf-8 -t latin1 café.py
   python3 -c 'import café'
  
  ASCII seems very sensible to me when faced with these ambiguities.
  
  Other options I can brainstorm that could be explored:
  
  * Specify an encoding per platform and stick to that.  (So, for instance,
   all module names on posix platforms would have to be utf-8).  Force
   translation between encoding when installing packages (But that doesn't
   help for people that are creating their modules using their own build
   scripts rather than distutils, copying the files using raw tar, etc.)
  * Change import semantics to allow specifying the encoding of the module on
   the filesystem (seems really icky).
 
 None of this is unique to import -- the same exact issue occurs with 
 open(u'café'). I don't see any reason why import café should be though of as 
 more of a problem, or treated any differently.
 
It's unique in several ways:

1) With open, you can specify a byte string::
   open(b'caf\xe9.py').read()

   I don't know of any way to do that with import.
   This is needed when the filename is not compatible with your current
   locale.

2) import assigns a name to the module that it imports whereas open lets the
   programmer assign the name.  So even if you can specify what to use as
   a byte string for this filename on this particular filesystem you'd still
   end up with some ugly pseudo-representation of bytes when attempting to
   access it in code::
   import caf\xe9

   caf\xe9.do_something()

-Toshio




Re: [Python-Dev] Import and unicode: part two

2011-01-19 Thread Toshio Kuratomi
On Thu, Jan 20, 2011 at 01:26:01AM +0100, Victor Stinner wrote:
 Le mercredi 19 janvier 2011 à 15:44 -0800, Toshio Kuratomi a écrit : 
  Additionally, many unix filesystem don't specify a filesystem encoding for
  filenames; they deal in legal and illegal bytes which could lead to
  troubles.  This problem of which encoding to use is a problem that can be
  seen on UNIX systems even now.
 
 If the system is not correctly configured, it is not a bug in Python,
 but a bug in the system config. Python relies on the locale to choose
 the filesystem encoding (sys.getfilesystemencoding()). Python uses this
 encoding to decode and encode all filenames.
 
Saying that multiple encodings on a single system is a misconfiguration
every time it comes up does not make it true.  There's been multiple
examples of how you can end up with multiple encodings of filenames on
a single system listed in past threads: multiple users with different
encodings for their locales, mounting remote filesystems, downloading
a file.  To the existing list I'd add getting a package from pypi --
neither tar nor zip files contain encoding information about the filenames.
Therefore if I create an sdist of a python module using non-ascii filenames
using a locale of latin1 and then upload to pypi, people downloading that
on a utf-8 using locale will end up not being able to use the module.

  * Specify an encoding per platform and stick to that.
 
 It doesn't work: on UNIX/BSD, the user chooses its own encoding and all
 programs will use it.
 
The proposal is that you ignore that when talking about loading and creating
(I mentioned distutils because my thought was that distutils could grow the
ability to translate from the system locale to a chosen neutral encoding
when running any of the setup.py dist commands, but that doesn't address the
issue when testing a module that you've just written so perhaps that's not
necessary.) python modules.  Python modules would have a set of defined
filesystem encodings per system.  This prevents getting a mixture of
encodings of modules and having things work in one location but fail when
used somewhere else.  Instead, you get an upfront failure until you correct
the encoding.

 Anyway, I don't see why it is a problem to have different encodings on
 different systems. Each system can use its own encoding. The bug that
 I'm trying to solve is a Python bug, not an OS bug.
 
There is no OS bug here.  There is perhaps an OS design flaw but it's not
a flaw that will be going away soon (in part, because the present OS
designers do not see it as an OS flaw... to them it's a bug in code that
attempts to build a simpler interface on top of it.)

  * Change import semantics to allow specifying the encoding of the module on
the filesystem (seems really icky).
 
 This is a very bad idea. I introduced PYTHONFSENCODING environment
 variable in Python 3.2, but then quickly removed it, because it
 introduced a lot of inconsistencies.
 
Thanks for getting rid of that, PYTHONFSENCODING is a bad idea because it
doesn't solve the underlying issues.  However, when I say specifying the
encoding of the module on the filesystem, I don't mean something global like
PYTHONFSENCODING -- I mean something at the python code level::

   import café encoded_as('latin1')

After thinking about this one, though, I don't think it will work either.
This takes care of importing modules where the fs encoding of the module is
known but it doesn't help where the fs encoding may be translated between
platforms.  I believe that this could arise when untarring a module on
windows using winzip or similar that gives you the option of translating
from utf-8 bytes into bytes that have meaning as characters on that
platform, for instance.

Do you have a solution to the problem?  I haven't looked at your patch so
perhaps you have an ingenous method of translating from the unicode
representation of the module in the import statement to the bytes in
arbitrary encodings on the filesystem that I haven't thought of.  If you
don't, however, then really - ASCII-only seems like the sanest of the three
solutions I can think of.

-Toshio




Re: [Python-Dev] Import and unicode: part two

2011-01-19 Thread Toshio Kuratomi
On Thu, Jan 20, 2011 at 03:51:05AM +0100, Victor Stinner wrote:
 For a lesson at school, it is nice to write examples in the
 mother language, instead of using raw english with ASCII identifiers
 and filenames.

Then use this::
   import cafe as café

When you do things this way you do not have to translate between unknown
encodings into unicode.  Everything is within python source where you have
a defined encoding.

Teaching students to write non-portable code (relying on filesystem encoding
where your solution is, don't upload to pypi anything that has non-ascii
filenames) seems like the exact opposite of how you'd want to shape a young
student's understanding of good programming practices.

 In a school, you can use the same configuration
 (encoding) on all computers.
 
In a school computer lab perhaps.  But not on all the students' and
professors' machines.  How many professors will be cursing python when they
discover that the example code that they wrote on their Linux workstation
doesn't work when the students try to use it in their windows computer lab?
How many students will be upset when the code they turn in runs on their
professor's test machine if the lab computers were booted into the Linux
partition but not if they were booted into Windows?

 
* Specify an encoding per platform and stick to that.
   
   It doesn't work: on UNIX/BSD, the user chooses its own encoding and all
   programs will use it.
   
  (...) This prevents getting a mixture of encodings of modules (...)
 
 If you have an issue with encodings, when have to fix it when you create
 a module (on disk), not when you load a module (it is too late).
 
It's not too late to throw a clear error of what's wrong.

  I haven't looked at your patch so
  perhaps you have an ingenous method of translating from the unicode
  representation of the module in the import statement to the bytes in
  arbitrary encodings on the filesystem that I haven't thought of.
 
 On Windows, My patch tries to avoid any conversion: it uses unicode
 everywhere.
 
 On other OSes, it uses the Python filesystem encoding to encode a module
 name (as it is done for any other operation on the filesystem with an
 unicode filename).
 
The other interfaces are somewhat of a red herring here.  As I wrote in
another email, importing modules has ramifications that open(), for
instance, does not.  Additionally, those other filesystem operations have
been growing the ability to take byte values and encoding parameters because
unicode translation via a single filesystem encoding is a good default but
not a complete solution.

I think that this problem demands a complete solution, however, and it seems
to me that limiting the scope of the problem is the most pleasant method to
accomplish this.  Your solution creates modules which aren't portable.  One
of my proposals creates python code which isn't portable.  The other one
suffers some of the same disadvantages as your solution in portability but
allows for tools that could automatically correct modules.

 --
 
 Python 3 supports bytes filename to be able to read/copy/delete
 undecodable filenames, filenames stored in a encoding different than the
 system encoding, broken filenames. It is also possible to access these
 files using PEP 383 (with surrogate characters). This is useful to use
 Python on an old system.
 
  If you don't, however, then really - ASCII-only seems like the sanest 
  of the three solutions I can think of.
 
 But a (Python 3) module is not supposed to have a broken filename. If it
 is the case, you have better to fix its name, instead of trying to fix
 the problem later (in Python).
 
We agree that there should not be broken module names.  However it seems we
very hotly disagree about the definition of that.  You think that if
a module is named appropriately on one system but is not portable to another
system, that's fine.  I think that portability between systems is very
important and sacrificing that so that someone can locally use a module with
non-ASCII characters doesn't have a justifiable reward.

 With UTF-8 filesystem encoding (eg. on Mac OS X, and most Linux setups),
 it is already possible to use non-ASCII module names.
 
Tangent: This is not true about Linux.  UTF-8 is a matter of the
interpretation of the filesystem bytes that the user specifies by setting
their system locale.  Setting system locale to ASCII for use in system-wide
scripts, is quite common as is changing locale settings in other parts of
the world (as I can tell you from the bug reports colleagues CC me on to fix
for the problems with unicode support in their python2 programs).  Allowing
module names incompatible with ascii without specifying an encoding will
just lead to bug reports down the line.

Relatively few programmers understand the difference between the python
unicode abstraction and the byte representations possible for those strings.
Allowing non-ascii characters in module filenames without specifying an

Re: [Python-Dev] Import and unicode: part two

2011-01-19 Thread Toshio Kuratomi
On Wed, Jan 19, 2011 at 09:02:17PM -0800, Glenn Linderman wrote:
 On 1/19/2011 8:39 PM, Toshio Kuratomi wrote:
 
 use this::
 
import cafe as café
 
 When you do things this way you do not have to translate between unknown
 encodings into unicode.  Everything is within python source where you have
 a defined encoding.
 
 
 This is a great way of converting non-portable module names, if the module 
 ever
 leaves the bounds of its computer, and runs into problems there.
 
You're missing a piece here.  If you mandate ascii you can convert to
a unicode name using import as because python knows that it has ascii text
from the filesystem when it converts it to an abstract unicode string that
you've specified in the program text.  You cannot go the other way because
python lacks the information (the encoding of the filename on the
filesystem) to do the transformation.
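
To make the missing piece concrete, here's a small sketch (the byte string is
just a hypothetical filename as it might be stored on a Linux filesystem, not
anything from the earlier mails).  The same bytes decode to different module
names depending on which encoding you guess::

  name_bytes = b'caf\xc3\xa9'           # what the filesystem actually stores
  print(name_bytes.decode('utf-8'))     # café
  print(name_bytes.decode('latin-1'))   # cafÃ© -- a different, equally "valid" name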

 Your demonstration of such an easy solution to the concerns you raise 
 convinces
 me more than ever that it is acceptable to allow non-ASCII module names.  For
 those programmers in a single locale environment, it'll just work.  And for
 those not in a single locale environment, there is your above simple solution
 to achieve portability without changing large numbers of lines of code.
 
Does my demonstration that you can't do that mean that it's no longer
acceptable?  :-)

/me guesses that the relative merits of being forced to write portable code
vs the convenience of writing a module name in your native script still have
a different balance in your mind than in mine, thus the smiley :-)

-Toshio




Re: [Python-Dev] Import and unicode: part two

2011-01-20 Thread Toshio Kuratomi
On Thu, Jan 20, 2011 at 12:51:29PM +0100, Victor Stinner wrote:
 Le mercredi 19 janvier 2011 à 20:39 -0800, Toshio Kuratomi a écrit :
  Teaching students to write non-portable code (relying on filesystem encoding
  where your solution is, don't upload to pypi anything that has non-ascii
  filenames) seems like the exact opposite of how you'd want to shape a young
  student's understanding of good programming practices.
 
 That was already discuted before: see PEP 3131.
 http://www.python.org/dev/peps/pep-3131/#common-objections
 
 If the teacher choose to use non-ASCII, (s)he is responsible to explain
 the consequences to his/her students :-)
 
It's not discussed in that PEP section.

The PEP section says this: People claim that they will not be able to use
a library if to do so they have to use characters they cannot type on their
keyboards.

Whether you can type it at your keyboard or not is not the problem here.
The problem is portability.  The students and professors are sharing code
with each other.  But because of a mixture of operating systems (let alone
locale settings), the code written by one partner is unable to run on the
computer of the other.

If non-ascii filenames without a defined encoding are considered a feature,
python cannot even issue a descriptive error when this occurs.  It can only
say that it could not find the module, but not why.  With a restriction of
module names to ascii only, python could actually state, when it encounters
the import line, that module names are not allowed to be non-ASCII.

   In a school, you can use the same configuration
   (encoding) on all computers.
   
  In a school computer lab perhaps.  But not on all the students' and
  professors' machines.  How many professors will be cursing python when they
  discover that the example code that they wrote on their Linux workstation
  doesn't work when the students try to use it in their windows computer lab?
 
 Because some students use a stupid or misconfigured OS, Python should
 only accept ASCII names?

Just a note -- you'll get much farther if you refrain from calling names.
It just makes me think that you aren't reading and understanding the issue
I'm raising.  My examples that you're replying to involve two properly
configured OS's.  The Linux workstations are configured with a UTF-8
locale.  The Windows OS's use wide character unicode.  The problem occurs in
that the code that one of the parties develops (either the students or the
professors) is developed on one of those OS's and then used on the other OS.

 So, why do Python 3 support non-ASCII
 filenames: it is very well known that non-ASCII filenames is the root in
 many troubles! Should we simply drop unicode support for all filenames?
 And maybe restrict bytes filenames to bytes in [0; 127]? Or better,
 restrict to [32; 126] (U+007f causes some troubles in some terminals).
 
If you want to argue that because python3 supports non-ascii filenames in
other code, then the logical extension is that the import mechanism should
support importing module names defined by byte sequences.  I happen to think
that import has a lot of differences between it and other filenames as I've
said three times now.

 I think that in 2011, non-ASCII filenames are well supported on all
 (modern) operating systems. Issues with non-ASCII filenames are OS
 specific and should be fixed by the user (the admin of the computer).
 
  Additionally, those other filesystem operations have
  been growing the ability to take byte values and encoding parameters because
  unicode translation via a single filesystem encoding is a good default but
  not a complete solution.
 
 If you are unable to configure correctly your system to decode/encode
 correctly filenames, you should just avoid non-ASCII characters in the
 module names.
 
This seems like an argument to only have unicode versions of all filesystem
operations.  Since you've been spearheading the effort to have bytes
versions of things that access filenames, environment variables, etc,
I don't think that you seriously mean that.  Perhaps there is a language
issue here.

 You only give theorical arguments: did you at least try to use non-ASCII
 module names on your system with Python 3.2? I suppose that it will just
 work and you will never notice that the unicode module name (on import
 café) in encoded to bytes.
 
Yes I did, and I got it to fail in a corner case, as I showed twice with the same
example in other posts.  However, I want to make clear here that the issue
is not that I can create a non-ascii filename and then import it.  The issue
is that I can create a non-ascii filename and then try to share it with the
usual tools and it won't work on the recipient's system.  (A tangent is
whether the recipient's system is physically distinct from mine or only has
a different environment on the same physical host.)

 It fails on on OSes using filesystem encodings other than UTF-8 (eg.
 Windows)... because of a Python bug, and I just asked if I have

Re: [Python-Dev] Import and unicode: part two

2011-01-20 Thread Toshio Kuratomi
On Thu, Jan 20, 2011 at 01:43:03PM -0500, Alexander Belopolsky wrote:
 On Thu, Jan 20, 2011 at 12:44 PM, Toshio Kuratomi a.bad...@gmail.com wrote:
  .. My examples that you're replying to involve two properly
  configured OS's.  The Linux workstations are configured with a UTF-8
  locale.  The Windows OS's use wide character unicode.  The problem occurs in
  that the code that one of the parties develops (either the students or the
  professors) is developed on one of those OS's and then used on the other OS.
 
 
 I re-read your posts on this thread, but could not find the examples
 that you refer to.

Examples might be a bad word in this context.  Victor was commenting on the
two brainstorm ideas for alternatives to ascii-only that I had.  One was:

* Mandate that every python module on a platform has a specific encoding
  (rather than the value of the locale)

The other was:
* allow using byte strings for import

I think that both ideas are inferior to mandating that every python module
filename is ascii.  What I'm getting from Victor's posts is that he, at
least, considers the portability problems to be ignorable because dealing
with ambiguous file name encodings is something that he'd like to force
third party tools to deal with.

-Toshio




Re: [Python-Dev] Import and unicode: part two

2011-01-24 Thread Toshio Kuratomi
On Thu, Jan 20, 2011 at 03:27:08PM -0500, Glyph Lefkowitz wrote:
 
 On Jan 20, 2011, at 11:46 AM, Guido van Rossum wrote:
 Same here. *Most* code will never be shared, or will only be shared
 between users in the same community. When it goes wrong it's also a
 learning opportunity. :-)
 
 
 Despite my usual proclivity for being contrarian, I find myself in agreement
 here.  Linux users with locales that don't specify UTF-8 frankly _should_ have
 to deal with all kinds of nastiness until they can transcode their 
 filesystems.
  MacOS and Windows both have a right answer here and your third-party tools
 shouldn't create mojibake in your filenames.
 
However, if this is the consensus, it makes a lot more sense to pick utf-8
as *the* encoding for python module filenames on Linux.

Why UTF-8:

* UTF-8 can cover the whole range of unicode whereas most (all?) other
  locale friendly encodings cannot.
* UTF-8 is becoming a standard for Linux distributions whether or not Linux
  users are adopting it.
* Third party tools are gaining support for UTF-8 even when they aren't
  gaining support for generic encodings (If I read the spec on zip
  correctly, this is actually what's happening there).

Why not locale:
* Relying on locale is simply not portable.  If nothing prevents people from
  distributing a unicode filename then they will go ahead and do so.  If
  the result works (say, because it's utf-8 and 80% of the Linux userbase is
  using utf-8) then it will get packaged and distributed and people won't
  know that it's a problem until someone with a non-utf-8 locale decides to
  use it.
* Mixing of modules from different locales won't work.  Suppose that the
  system python installs the previous module.  The local site has other
  modules that it has installed using a different filename encoding.
  The users at the site will find that either one or the other of the two
  modules won't work.
* Because of the portability problems you have no choice but to tell people
  not to distribute python modules with non-ASCII names.  This makes the use
  of unicode names second class indefinitely (until the kernel devs decide
  that they're wrong to not enforce a filesystem encoding or Linux becomes
  irrelevant as a platform).
* If you can pick a set of encodings that are valid (utf-8 for Linux and
  MacOS, wide unicode for windows [I get the feeling from other parts of the
  conversation that Windows won't be so lucky, though]) tools to convert
  python names become easier to write.  If you restrict it far enough, you
  could even write tools/importers that automatically do the detection.

PS: Sorry for not replying immediately, the team I'm on is dealing with an
issue at my work and I'm also preparing for a conference later this week.

-Toshio




Re: [Python-Dev] Import and unicode: part two

2011-01-25 Thread Toshio Kuratomi
On Tue, Jan 25, 2011 at 10:22:41AM +0100, Xavier Morel wrote:
 On 2011-01-25, at 04:26 , Toshio Kuratomi wrote:
  
  * If you can pick a set of encodings that are valid (utf-8 for Linux and
   MacOS
 
 HFS+ uses UTF-16 in NFD (actually in an Apple-specific variant of NFD). Right 
 here you've already broken Python modules on OSX.

Others have been saying that Mac OSX's HFS+ uses UTF-8.  But the question is
not whether UTF-16 or UTF-8 is used by HFS+.  It's whether you can sensibly
decide on an encoding from the type of system that is being run on.  This
could be querying the filesystem or a check on sys.platform or some other
method.  I don't know what detection the current code does.

On Linux there's no defined encoding that will work; file names are just
bytes to the Linux kernel so based on people's argument that the convention
is and should be that filenames are utf-8 and anything else is
a misconfigured system -- python should mandate that its module filenames on
Linux are utf-8 rather than using the user's locale settings.
 
 And as far as I know, Linux software/FS generally use NFC (I've already seen 
 this issue cause trouble)
 
Linux FS's treat filenames as bytes with a small blacklist (so you can't use
the NULL byte in a filename, for instance).  Linux software would be free to use any
normal form that they want.  If one software used NFC and another used NFD,
the FS would record two separate files with two separate filenames.  Other
programs might or might not display this correctly.

Example:
zsh$ touch cafe
zsh$ python
Python 2.7 (r27:82500, Sep 16 2010, 18:02:00) 
>>> import os
>>> import unicodedata
>>> a=u'café'
>>> b=unicodedata.normalize('NFC', a)
>>> c=unicodedata.normalize('NFD', a)
>>> open(b.encode('utf8'), 'w').close()
>>> open(c.encode('utf8'), 'w').close()
>>> os.listdir(u'.')
[u'people-etc-changes.txt', u'cafe\u0301', u'cafe', u'people-etc-changes.sha256sum', u'caf\xe9']
>>> os.listdir('.')
['people-etc-changes.txt', 'cafe\xcc\x81', 'cafe', 'people-etc-changes.sha256sum', 'caf\xc3\xa9']
>>> ^D

zsh$ ls -al .
drwxrwxr-x.  2 badger badger  4096 Jan 25 07:46 .
drwxr-xr-x. 17 badger badger  4096 Jan 24 18:27 ..
-rw-rw-r--.  1 badger badger 0 Jan 25 07:45 cafe
-rw-rw-r--.  1 badger badger 0 Jan 25 07:46 cafe
-rw-rw-r--.  1 badger badger 0 Jan 25 07:46 café

zsh$ ls -al cafe
-rw-rw-r--.  1 badger badger 0 Jan 25 07:45 cafe
zsh$ ls -al cafe?
-rw-rw-r--.  1 badger badger 0 Jan 25 07:46 cafe

Now in this case, the decomposed form of the filename is being displayed
incorrectly and the shell treats the decomposed character as two characters
instead of one.  However, when you view these files in dolphin (the KDE file
manager) you properly see café repeated twice.

-Toshio




Re: [Python-Dev] Import and unicode: part two

2011-01-25 Thread Toshio Kuratomi
On Wed, Jan 26, 2011 at 11:24:54AM +0900, Stephen J. Turnbull wrote:
 Toshio Kuratomi writes:
 
   On Linux there's no defined encoding that will work; file names are just
   bytes to the Linux kernel so based on people's argument that the convention
   is and should be that filenames are utf-8 and anything else is
   a misconfigured system -- python should mandate that its module filenames 
 on
   Linux are utf-8 rather than using the user's locale settings.
 
 This isn't going to work where I live (Tsukuba).  At the national
 university alone there are hundreds of pre-existing *nix systems whose
 filesystems were often configured a decade or more ago.  Even if the
 hardware and OS have been upgraded, the filesystems are usually
 migrated as-is, with OS configuration tweaks to accomodate them.  Many
 of them use EUC-JP (and servers often Shift JIS).  That means that you
 won't be able to read module names with ls, and that will make Python
 unacceptable for this purpose.  I imagine that in Russia the same is
 true for the various Cyrillic encodings.
 
Sure ... but with these systems, neither read-modules-as-locale nor
read-modules-as-utf-8 is a good solution, correct?  Especially if
the OS does get upgraded but the filesystems with user data (and user
created modules) are migrated as-is, you'll run into situations where system
installed modules are in utf-8 and user created modules are shift-jis and so
something will always be broken.

The only way to make sure that modules work is to restrict them to ASCII-only
on the filesystem.  But because unicode module names are seen as
a necessary feature, the question is which way forward is going to lead to
the least brokenness.  Which could be locale... but from the python2
locale-related bugs that I get to look at, I doubt.

 I really don't think there is anything that can be done here except to
 warn people that Kids, these stunts are performed by highly-trained
 professionals.  Don't try this at home!  Of course they will anyway,
 but at least they will have been warned in sufficiently strong terms
 that they might pay attention and be able to recover when they run
 into bizarre import exceptions.
 
So on the subject of warnings... I think a reason it's better to pick an
encoding for the platform/filesystem rather than to use locale is because
people will get an error or a warning at the appropriate time if that's the
case -- the first time they attempt to create and import a module with
a filename that's not encoded in the correct encoding for the platform.
It's all very well to say "We wrote in the documentation on
http://docs.python.org/distutils/introduction.html#Choosing-a-name that only
ASCII names should be used when distributing python modules", but if the
interpreter doesn't complain when people use a non-ASCII filename, we all
know that they aren't going to look in the documentation; they'll try it and,
if it works, they'll learn that habit.

-Toshio




Re: [Python-Dev] Import and unicode: part two

2011-01-26 Thread Toshio Kuratomi
On Wed, Jan 26, 2011 at 11:12:02AM +0100, Martin v. Löwis wrote:
 Am 26.01.2011 10:40, schrieb Victor Stinner:
  Le lundi 24 janvier 2011 à 19:26 -0800, Toshio Kuratomi a écrit :
  Why not locale:
  * Relying on locale is simply not portable. (...)
  * Mixing of modules from different locales won't work. (...)
  
  I don't understand what you are talking about.
 
 I think by portability, he means moving files from one computer to
 another. He argues that if Python would mandate UTF-8 for all file
 names on Unix, moving files in such a way would support portability,
 whereas using the locale's filename might not (if the locale use a
 different charset on the target system).
 
 While this is technically true, I don't think it's a helpful way of
 thinking: by mandating that file names are UTF-8 when accessed from
 Python, we make the actual files inaccessible on both the source and
 the target system.
 
  I don't understand the relation between the local filesystem encoding
  and the portability. I suppose that you are talking about the
  distribution of a module to other computers. Here the question is how
  the filenames are stored during the transfer. The user is free to use
  any tool, and try to find a tool handling Unicode correctly :-) But it's
  no more the Python problem.
 
 There are cases where there is no real transfer, in the sense in which
 you are using the word. For example, with NFS, you can access the very
 same file simultaneously on two systems, with no file name conversion
 (unless you are using NFSv4, and unless your NFSv4 implementations
 support the UTF-8 mandate in NFS well).
 
 Also, if two users of the same machine have different locale settings,
 the same file name might be interpreted differently.
 
Thanks Martin, I think that you understand my view even if you don't share
it.

There's one further case that I am worried about that has no real
transfer.  Since people here seem to think that unicode module names are
the future (for instance, the comments about redefining the C locale to
include utf-8 and the comments about archiving tools needing to support
encoding bits), there are eventually going to be unicode modules that become
dependencies of other modules and programs.  These will need to be installed
on systems.  Linux distributions that ship these will need to choose
a filesystem encoding for the filenames of these.  Likely the sensible thing
for them to do is to use utf-8 since all the ones I can think of default to
utf-8.  But, as Stephen and Victor have pointed out, users change their
locale settings to things that aren't utf-8 and save their modules using
filenames in that encoding.  When they update their OS to a version that has
utf-8 python module names, they will find that they have to make a choice.
They can either change their locale settings to a utf-8 encoding and have
the system-installed modules work, or they can keep their non-utf-8 locale
and have the modules that they've created on-site work.

This is not a good position to put users of these systems in.

-Toshio




Re: [Python-Dev] Support the /usr/bin/python2 symlink upstream

2011-03-01 Thread Toshio Kuratomi
On Wed, Mar 02, 2011 at 01:14:32AM +0100, Martin v. Löwis wrote:
  I think a PEP would help, but in this case I would request that before
  the PEP gets written (it can be a really short one!) somebody actually
  go out and get consensus from a number of important distros. Besides
  Barry, do we have any representatives of distros here?
 
 Matthias Klose represents Debian, Dave Malcolm represents Redhat,
 and Dirkjan Ochtman represents Gentoo.
 
I'm here from Fedora.

-Toshio




Re: [Python-Dev] Support the /usr/bin/python2 symlink upstream

2011-03-03 Thread Toshio Kuratomi
On Thu, Mar 03, 2011 at 09:55:25AM +0100, Piotr Ożarowski wrote:
 [Guido van Rossum, 2011-03-02]
  On Wed, Mar 2, 2011 at 4:56 AM, Piotr Ożarowski pi...@debian.org wrote:
   [Sandro Tosi, 2011-03-02]
   On Wed, Mar 2, 2011 at 10:01, Piotr Ożarowski pi...@debian.org wrote:
I co-maintain with Matthias a package that provides /usr/bin/python
symlink in Debian and I can confirm that it will always point to Python
2.X. We also do not plan to add /usr/bin/python2 symlink (and I guess
only accepted PEP can change that)
  
   Can you please explain why you NACK this proposed change?
  
   it encourages people to change /usr/bin/python symlink to point to
   python3.X which I'm strongly against (how can I tell that upstream
   author meant python3.X and not python2.X without checking the code?)
  
  But the same is already true for python2.X vs. python2.Y. Explicit is
  better than implicit etc. Plus, 5 years from now everybody is going to
  be annoyed that python still refers to some ancient unused version
  of Python.
 
 I don't really mind adding /usr/bin/python2 symlink just to clean Arch
 mess, but I do mind changing /usr/bin/python to point to python3 (and I
 can use the same argument - Explicit is better than implicit - if you
 need Python 3, say so in the shebang, right?). What I'm afraid of is
 when we'll add /usr/bin/python2, we'll start getting a lot of scripts
 that will have to be checked manually every time new upstream version is
 released because we cannot assume what upstream author is using at given
 point.
 
 If /usr/bin/python will be disallowed in shebangs on the other hand
 (and all scripts will use /usr/bin/python2, /usr/bin/python3,
 /usr/bin/python4 or /usr/bin/python2.6 etc.) I don't see a problem with
 letting administrators choose /usr/bin/python (right now not only
 changing it from python2.X to python3.X will break the system but also
 changing it from /usr/bin/pytohn2.X to /usr/bin/python2.Y will break it,
 and believe me, I know what I'm talking about (one of the guys at work
 did something like this once))
 
 [all IMHO, dunno if other Debian's python-defaults maintainers agree
 with me]

Thinking outside of the box, I can think of something that would satisfy
your requirements but I don't know how appropriate it is for upstream python
to ship with.  Stop shipping /usr/bin/python.  Ship python in an alternate
location like $LIBEXECDIR/python2.7/bin (I think this would be
/usr/lib/python2.7/bin on Debian and /usr/libexec/python2.7/bin on Fedora
which would both be appropriate) then configure which python version is
invoked by the user typing python by configuring PATH (a shell alias might
also work).  You could configure this with environment-modules[1]_ if Debian
supports using that in packaging.

Coupled with a PEP that recommends against using /usr/bin/python in scripts
and instead using /usr/bin/python$MAJOR, this might be sufficient.  OTOH, my
cynical side doubts that script authors read PEPs so it'll take either
upstream python shipping without /usr/bin/python or consensus among the
distros to ship without /usr/bin/python to reach the point where script
authors realize that they need to use /usr/bin/python{2,3} instead of
/usr/bin/python.

.. _[1]: http://modules.sourceforge.net/

-Toshio




Re: [Python-Dev] Support the /usr/bin/python2 symlink upstream

2011-03-03 Thread Toshio Kuratomi
On Thu, Mar 03, 2011 at 09:11:40PM -0500, Barry Warsaw wrote:
 On Mar 03, 2011, at 02:17 PM, David Malcolm wrote:
 
 On a related note, we have a number of scripts packaged across the
 distributions with a shebang line that reads:
#!/usr/bin/env python
 which AIUI follows upstream recommendations.
 
 Actually, I think this is *not* a good idea for distro provided scripts.  For
 any Python scripts released by the distro, you know exactly which Python it
 should run on, so it's better to hard code it.  That way, if someone installs
 Python from source, or installs an experimental version of a new distro
 Python, it won't break their system.  Yes, this has happened to me.  Also,
 note that distutils/setuptools/distribute rewrite the shebang line when they
 install scripts.
 
 There was a proposal to change these when packaging them to hardcode the
 specific python binary:
 
 https://fedoraproject.org/wiki/Features/SystemPythonExecutablesUseSystemPython
 on the grounds that a packaged system script is expecting (and has been
 tested against) a specific python build.
 
 That proposal has not yet been carried out.  Ideally if we did this,
 we'd implement it as a postprocessing phase within rpmbuild, rather
 than manually patching hundreds of files.
 
 Note that this would only cover shebang lines at the tops of scripts.
 
 JFDI!
 
 FWIW, a quick grep reveals about two dozen such scripts in /usr/bin on
 Ubuntu.  We should fix these. ;)
 
Note, we were unable to pass Guideline changes to do this in Fedora.  Gory
details of the FPC meeting are at 16:15:03 (abadger1999 == me):
http://meetbot.fedoraproject.org/fedora-meeting/2009-08-19/fedora-meeting.2009-08-19-16.01.log.html

The mailing list thread where this was discussed is here:
http://lists.fedoraproject.org/pipermail/packaging/2009-July/006248.html

Note to dmalcolm: IIRC, that also means that the Feature page you point to
isn't going to happen either.  Barry -- if other distros adopted stronger
policies, then that might justify me taking this back to the Packaging
Committee.

-Toshio




Re: [Python-Dev] Support the /usr/bin/python2 symlink upstream

2011-03-03 Thread Toshio Kuratomi
On Thu, Mar 03, 2011 at 09:46:23PM -0500, Barry Warsaw wrote:
 On Mar 03, 2011, at 09:08 AM, Toshio Kuratomi wrote:
 
 Thinking outside of the box, I can think of something that would satisfy
 your requirements but I don't know how appropriate it is for upstream python
 to ship with.  Stop shipping /usr/bin/python.  Ship python in an alternate
 location like $LIBEXECDIR/python2.7/bin (I think this would be
 /usr/lib/python2.7/bin on Debian and /usr/libexec/python2.7/bin on Fedora
 which would both be appropriate) then configure which python version is
 invoked by the user typing python by configuring PATH (a shell alias might
 also work).  You could configure this with environment-modules[1]_ if Debian
 supports using that in packaging.
 
 I wonder if Debian's alternatives system would be appropriate for this?
 
 http://wiki.debian.org/DebianAlternatives
 


No, alternatives is really only useful for a very small class of problems
[1]_ and [2]_.  For this discussion there's an additional problem which is
that alternatives works by creating symlinks.  Piotr Ożarowski wants to make
/usr/bin/python not exist so that scripts would have to use either
/usr/bin/python3 or /usr/bin/python2.  If alternatives places a symlink
there, it defeats the purpose of avoiding that path in the package itself.

I will note, though, that scripts that use /usr/bin/env would still fall
victim to this if we take the route of setting the PATH.  I think that
environment-modules can also set up aliases.  If so, that would be better
than setting PATH for finding and removing uses of python without a version
in scripts.

One further note on this since one of the other messages here had
a reference to this that kinda rains on this parade:
http://refspecs.linux-foundation.org/LSB_4.1.0/LSB-Languages/LSB-Languages/pylocation.html

The LSB is a standard that Linux distributions may or may not follow --
unlike the FHS, the LSB goes beyond encoding what most distros already do to
things that they think people should do.  For instance, Debian derivatives
might find the software installation section of LSB[3]_ to be a bit... hard
to swallow.  Fedora provides a package which aims to make a fedora system
lsb compliant but doesn't install it by default since it drags in gobs of
packages that are otherwise not necessary on many systems.

However, it does specify /usr/bin/python so getting rid of /usr/bin/python
at the Linux distribution level might not reach universal acclaim.  A united
front from upstream python through the python package maintainers on the
Linux distros would probably be needed to get people thinking about making
this change... and we still would likely have the ability to add
/usr/bin/python back onto a system (for instance, as part of that lsb
package I mentioned earlier.)

.. [1]:
https://fedoraproject.org/wiki/Packaging:EnvironmentModules#Introduction
.. [2]:
http://fedoraproject.org/wiki/Packaging:Alternatives#Recommended_usage

.. [3]:
http://refspecs.linux-foundation.org/LSB_4.1.0/LSB-Core-generic/LSB-Core-generic/swinstall.html

-Toshio




Re: [Python-Dev] Support the /usr/bin/python2 symlink upstream

2011-03-04 Thread Toshio Kuratomi
On Fri, Mar 04, 2011 at 01:56:39PM -0500, Barry Warsaw wrote:
 
 I don't agree that /usr/bin/python should not be installed.  The draft PEP
 language hits the right tone IMHO, and I would favor /usr/bin/python pointing
 to /usr/bin/python2 on Debian, but primarily used only for the interactive
 interpreter.
 
 Or IOW, I still want users to be able to type 'python' at a shell prompt and
 get the interpreter.
 
Actually, my post was saying that these two can be decoupled.  ie: It's
possible to not have /usr/bin/python while still allowing users to type
python at a shell prompt and get the interpreter.

This is done by either redefining the PATH to include the directory that the
interpreter named python is in or by creating an alias for python to the
proper interpreter.

Using the environment-modules tools is one solution that operates in this
way.  It also, incidentally, would let each user of a system choose whether
python invoked python2 or python3 (and on Debian, which sub-version of
those).  A more hardcoded approach is to have the python package drop some
configuration into /etc/profile.d/ style directories where the distribution
places files that are run by default by the user's shell with the default
startup files.

-Toshio




Re: [Python-Dev] PEP 395: Module Aliasing

2011-03-05 Thread Toshio Kuratomi
On Fri, Mar 04, 2011 at 12:56:16PM -0500, Fred Drake wrote:
 On Fri, Mar 4, 2011 at 12:35 PM, Michael Foord
 fuzzy...@voidspace.org.uk wrote:
  That (below) is not distutils it is setuptools. distutils just uses
  `scripts=[...]`, which annoyingly *doesn't* work with setuptools.
 
 Right; distutils scripts are just sad.
 
 OTOH, entry-point based scripts are something setuptools got very,
 very right.  Probably not perfect, but... I've not yet needed anything
 different in practice.
 
Some of them can be annoying as hell when dealing with a system that also
installs multiple versions of a module.  But one could argue that's the
fault of setuptools' version handling rather than the entry-points
handling.

-Toshio




Re: [Python-Dev] Support the /usr/bin/python2 symlink upstream

2011-03-07 Thread Toshio Kuratomi
On Tue, Mar 08, 2011 at 08:25:50AM +1000, Nick Coghlan wrote:
 On Tue, Mar 8, 2011 at 1:30 AM, Barry Warsaw ba...@python.org wrote:
  On Mar 04, 2011, at 12:00 PM, Toshio Kuratomi wrote:
 
 Actually, my post was saying that these two can be decoupled.  ie: It's
 possible to not have /usr/bin/python while still allowing users to type
 python at a shell prompt and get the interpreter.
 
 This is done by either redefining the PATH to include the directory that the
 interpreter named python is in or by creating an alias for python to the
 proper interpreter.
 
  I personally would prefer aliasing rather than $PATH manipulation.
 
 Toshio's suggestion wouldn't work anyway - the /usr/bin/env python
 idiom will pick up a python alias no matter where it lives on $PATH.
 
I thought I had pointed out that hiding python from /usr/bin/env scripts
wouldn't work with the PATH approach, but I guess I just thought that
silently in my head.  Pointing that out was going to live in the same
paragraph as saying that it does work with an alias::

$ sudo mv /usr/bin/python /usr/bin/python.bak
$ alias python='/usr/bin/python2.7'
$ python --version
Python 2.7
$ cat test.py
#! /bin/env python
print 'hi'
$ ./test.py
/bin/env: python: No such file or directory
$ mv /usr/bin/python.bak /usr/bin/python
$ ./test.py
hi


-Toshio




Re: [Python-Dev] [PEPs] Support the /usr/bin/python2 symlink upstream

2011-03-08 Thread Toshio Kuratomi
On Tue, Mar 08, 2011 at 06:43:19PM -0800, Glenn Linderman wrote:
 On 3/8/2011 12:02 PM, Terry Reedy wrote:
 
 On 3/7/2011 9:31 PM, Reliable Domains wrote:
 
 
 The launcher need not be called python.exe, and maybe it would be
 better called #@launcher.exe (or similar, depending on its exact
 function details).
 
 
 I do not know that the '#@' part is about, but pygo would be short and
 expressive.
 
 
 
 If my proposal to make a line starting with #@ to be used instead of the Unix
 #! (#@ could be on the first or second line, to allow cross-platform scripts 
 to
 use both, and Windows only scripts to not have #!

You'd need to allow for it to be on the third line as well.  pep-0263
has already taken the second line if it's in a script that has a Unix
shebang.


 ), then #@launcher.exe (and #
 @launcherw.exe I suppose) would reflect the functionality of the launcher,
 which need not be tightly tied to Python, if it uses a separate line.  But the
 launcher should probably not be the thing invoked from the command line, only
 implicitly when running scripts by naming them as the first thing on the
 command line.
 
 I'm of the opinion that attempting to parse a Unix #! line, and intuit what
 would be the equivalent on Windows is unnecessarily complex and error prone,
 and assumes that the variant systems are configured using the same guidelines
 (which the Python community may espouse, but may not be followed by all
 distributions, sysadmins, or users).

I do not have a Windows system so I don't have a horse in this race but if
the argument is to avoid complexity, be careful that your proposed solution
isn't more complex than what you're avoiding.  ie::

 Now that I've had this idea, one might want to create other 2nd character
 codes after the Unix #! line... one could have
 
 #! Unix command processor
 #@ Windows command processor
 #$ OS/2 command processor
 #% Alternate Windows command processor.
 
 One could even port it to Unix:
 
 #!/usr/bin/#@launcher
 #@c:\python2.6\python.exe
 #^/usr/bin/python2.5
 #/usr/bin/mono/IronPython2.6 for .NET 4.0/ipy.exe
 #  I made up the line above, having no knowledge of Mono, but I think you get
 the idea
 
 Choice of command line would be an environment variable, I suppose, that the
 launcher would look at, or if none, then a system-specific default.  It would
 have to search forward in the file until it finds the appropriate prefix or a
 line not starting with #, or starting with #  or ##, at which point it
 would give up.
 
-Toshio




Re: [Python-Dev] Module version variable

2011-03-18 Thread Toshio Kuratomi
On Fri, Mar 18, 2011 at 07:40:43PM -0700, Guido van Rossum wrote:
 On Fri, Mar 18, 2011 at 7:28 PM, Greg Ewing greg.ew...@canterbury.ac.nz 
 wrote:
  Tres Seaver wrote:
 
  I'm not even sure why you would want __version__ in 99% of modules:  in
  the ordinary cases, a module's version should be either the Python
  version (for a module shipped in the stdlib), or the release of the
  distribution which shipped it.
 
  It's useful to be able to find out the version of a module
  you're using at run time so you can cope with API changes.
 
  I had a case just recently where the behaviour of something
  in pywin32 changed between one release and the next. I looked
  for an attribute called 'version' or something similar to
  test, but couldn't find anything.
 
  +1 on having a standard place to look for version info.
 
 I believe __version__ *is* the standard (like __author__). IIRC it was
 proposed by Ping. I think this convention is so old that there isn't a
 PEP for it. So yes, we might as well write it down. But it's really
 nothing new.
 
There is a section in PEP8 about __version__ but it serves a slightly
different purpose there:


Version Bookkeeping

If you have to have Subversion, CVS, or RCS crud in your source file, do
it as follows.

__version__ = $Revision: 88433 $
# $Source$

These lines should be included after the module's docstring, before any
other code, separated by a blank line above and below.


Personally, I've never found a need to access the repository revision
programmatically from my python applications, but I have needed to access the
API version, so it would make sense to me to change the meaning of
__version__.

-Toshio




Re: [Python-Dev] Security implications of pep 383

2011-03-29 Thread Toshio Kuratomi
On Tue, Mar 29, 2011 at 07:23:25PM +0100, Michael Foord wrote:
 Hey all,
 
 Not sure how real the security risk is here:
 
 http://blog.omega-prime.co.uk/?p=107
 
 Basically  he is saying that if you store a list of blacklisted files
 with names encoded in big-5 (or some other non-utf8 compatible
 encoding) if those names are passed at the command line, or otherwise
 read in and decoded from an assumed-utf8 source with surrogate
 escaping, the surrogate escape decoded names will not match the
 properly decoded blacklisted names.
 
The example is correct.  The security risk is real.  However, there's a flaw
in the program, and the question of whether there's also a flaw in
python is not so certain.

Here's the line I'd say is contentious::
  blacklist = open("blacklist.big5", encoding='big5').read().split()

The blacklist file contains a list of filenames.  However, this code treats
it as a list of strings.  This is a logic error in the program, and he should
really be doing this::
  blacklist = open("blacklist.big5", 'rb').read().split()

Then, when comparing it against the values of sys.argv, either sys.argv gets
converted into bytes (using the system locale since that's what was used to
decode to unicode) or the items in blacklist get converted to unicode with
surrogateescape.
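
A minimal sketch of the bytes-based version of that check (the file name and
program structure here are made up for illustration; os.fsencode recovers the
original bytes from a surrogateescape-decoded argument)::

  import os
  import sys

  # compare bytes against bytes, so no encoding guess is involved
  blacklist = open("blacklist.big5", 'rb').read().split()
  for arg in sys.argv[1:]:
      if os.fsencode(arg) in blacklist:
          sys.exit("%s is blacklisted" % arg)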

The possible flaw in python is this:  Code like the blog poster wrote passes
python3 without an error or a warning.  This gives the programmer no
feedback that they're doing something wrong until it actually bites them in
the foot in deployed code.

-Toshio




Re: [Python-Dev] Security implications of pep 383

2011-03-29 Thread Toshio Kuratomi
On Tue, Mar 29, 2011 at 10:55:47PM +0200, Victor Stinner wrote:
 Le mardi 29 mars 2011 à 22:40 +0200, Lennart Regebro a écrit :
  The lesson here seems to be if you have to use blacklists, and you
  use unicode strings for those blacklists, also make sure the string
  you compare with doesn't have surrogates.
 
 No. '\u4f60\u597d'.encode('big5').decode('latin1') gives '§A¦n' which
 doesn't contain any surrogate character.
 
 The lesson is: if you compare Unicode filenames on UNIX, make sure that
 your system is correctly configured (the locale encoding must be the
 filesystem encoding).

You're both wrong :-)

Lennart is missing that you just need to use the same encoding
+ surrogateescape (or stick with bytes) for decoding the byte strings that
you are comparing.

You're missing that on UNIX there is no filesystem encoding so the idea of
locale and filesystem encoding matching is false (and unnecessary -- the
encodings that you use within python just need to be the same.  They don't
even need to match up to the reality of what's used on the filesystem or the
user's locale.)
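
A quick way to check that claim (a small sketch using an arbitrary byte value,
not anything from a real filesystem): any single encoding used consistently
with surrogateescape round-trips the original bytes, so equal bytes stay
equal no matter which encoding you picked::

  raw = b'\xa4\xaf'
  for enc in ('utf-8', 'latin-1', 'euc_jp'):
      # decode and re-encode with the same codec; the original bytes come back
      assert raw.decode(enc, 'surrogateescape').encode(enc, 'surrogateescape') == raw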

-Toshio




Re: [Python-Dev] Security implications of pep 383

2011-03-30 Thread Toshio Kuratomi
On Wed, Mar 30, 2011 at 08:36:43AM +0200, Lennart Regebro wrote:
 On Wed, Mar 30, 2011 at 07:54, Toshio Kuratomi a.bad...@gmail.com wrote:
  Lennart is missing that you just need to use the same encoding
  + surrogateescape (or stick with bytes) for decoding the byte strings that
  you are comparing.
 
 You lost me here. I need to do this for what?

The lesson here seems to be if you have to use blacklists, and you
use unicode strings for those blacklists, also make sure the string
you compare with doesn't have surrogates.


Really, surrogates are a red herring to this whole issue.  The issue is that
the original code was trying to compare two different transformations of
byte sequences and expecting them to be equal.  Let's say that you have the
following byte value::
  b_test_value = b'\xa4\xaf'

This is something that's stored in a file or the filename of something on
a unix filesystem or stored in a database or any number of other things.
Now you want to compare that to another piece of data that you've read in
from somewhere outside of python.  You'd expect any of the following to
work::
  b_test_value == b_other_byte_value
  b_test_value.decode('utf-8', 'surrogateescape') == b_other_byte_value.decode('utf-8', 'surrogateescape')
  b_test_value.decode('latin-1') == b_other_byte_value.decode('latin-1')
  b_test_value.decode('euc_jp') == b_other_byte_value.decode('euc_jp')

You wouldn't expect this to work::
  b_test_value.decode('latin-1') == b_other_byte_value.decode('euc_jp')

Once you see that, you realize that the following is only a specific case of
the former; surrogateescape doesn't really matter::
  b_test_value.decode('utf-8', 'surrogateescape') == b_other_byte_value.decode('euc_jp')
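
For a concrete check of that last case (a small sketch that compares the byte
value from above against itself, decoded two different ways)::

  >>> b_test_value = b'\xa4\xaf'
  >>> b_test_value.decode('utf-8', 'surrogateescape') == b_test_value.decode('euc_jp')
  False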

-Toshio




Re: [Python-Dev] PEP 396, Module Version Numbers

2011-04-07 Thread Toshio Kuratomi
On Wed, Apr 06, 2011 at 11:04:08AM +0200, John Arbash Meinel wrote:
 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA1
 
 
 ...
  #. ``__version_info__`` SHOULD be of the format returned by PEP 386's
 ``parse_version()`` function.
 
 The only reference to parse_version in PEP 386 I could find was the
 setuptools implementation which is pretty odd:
 
  
  In other words, parse_version will return a tuple for each version string, 
  that is compatible with StrictVersion but also accept arbitrary version and 
  deal with them so they can be compared:
  
  >>> from pkg_resources import parse_version as V
  >>> V('1.2')
  ('0001', '0002', '*final')
  >>> V('1.2b2')
  ('0001', '0002', '*b', '0002', '*final')
  >>> V('FunkyVersion')
  ('*funkyversion', '*final')
 
Barry -- I think we want to talk about NormalizedVersion.from_parts() rather
than parse_version().

 bzrlib has certainly used 'version_info' as a tuple indication such as:
 
 version_info = (2, 4, 0, 'dev', 2)
 
 and
 
 version_info = (2, 4, 0, 'beta', 1)
 
 and
 
 version_info = (2, 3, 1, 'final', 0)
 
 etc.
 
 This is mapping what we could sort out from Python's sys.version_info.
 
 The *really* nice bit is that you can do:
 
  if sys.version_info >= (2, 6):
   # do stuff for python 2.6(.0) and beyond
 
<nod>  People like to compare versions and the tuple forms allow that.  Note
that the tuples you give don't compare correctly.  This is the order that
they sort:

(2, 4, 0)
(2, 4, 0, 'beta', 1)
(2, 4, 0, 'dev', 2)
(2, 4, 0, 'final', 0)

So that means, snapshot releases will always sort after the alpha and beta
releases (and release candidate if you use 'c' to mean release candidate).
Since the simple (2, 4, 0) tuple sorts before everything else, a comparison
that shouldn't match the 2.4.0-alpha (or beta or arbitrary dev snapshots)
would need to specify something like:

(2, 4, 0, 'z')
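
To spell out that ordering, a quick interactive check (just plain tuple
comparison, using the version tuples from above)::

  >>> sorted([(2, 4, 0), (2, 4, 0, 'final', 0), (2, 4, 0, 'beta', 1), (2, 4, 0, 'dev', 2)])
  [(2, 4, 0), (2, 4, 0, 'beta', 1), (2, 4, 0, 'dev', 2), (2, 4, 0, 'final', 0)]
  >>> (2, 4, 0, 'z') > (2, 4, 0, 'beta', 1)
  True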

NormalizedVersion.from_parts() uses nested tuples to handle this better.
But I think that even with nested tuples a naive comparison fails since most
of the suffixes are prerelease strings.  ie: ((2, 4, 0),) < ((2, 4, 0),
('beta', 1))

So you can't escape needing a function to compare versions.
(NormalizedVersion does this by letting you compare NormalizedVersions
together).  Barry if this is correct, maybe __version_info__ is useless and
I shouldn't have brought it up at pycon?

-Toshio




Re: [Python-Dev] open(): set the default encoding to 'utf-8' in Python 3.3?

2011-06-28 Thread Toshio Kuratomi
On Tue, Jun 28, 2011 at 03:46:12PM +0100, Paul Moore wrote:
 On 28 June 2011 14:43, Victor Stinner victor.stin...@haypocalc.com wrote:
  As discussed before on this list, I propose to set the default encoding
  of open() to UTF-8 in Python 3.3, and add a warning in Python 3.2 if
  open() is called without an explicit encoding and if the locale encoding
  is not UTF-8. Using the warning, you will quickly notice the potential
  problem (using Python 3.2.2 and -Werror) on Windows or by using a
  different locale encoding (.e.g using LANG=C).
 
 -1. This will make things harder for simple scripts which are not
 intended to be cross-platform.
 
 I use Windows, and come from the UK, so 99% of my text files are
 ASCII. So the majority of my code will be unaffected. But in the
 occasional situation where I use a £ sign, I'll get encoding errors,
 where currently things will just work. And the failures will be data
 dependent, and hence intermittent (the worst type of problem). I'll
 write a quick script, use it once and it'll be fine, then use it later
 on some different data and get an error. :-(

I don't think this change would make things harder.  It will just move
where the pain occurs.  Right now, the failures are intermittent: they show up
A) on computers other than the one that you're using, or B) when a script is
run under a different user than yourself.  Sys admins where I'm at are
constantly writing ad hoc scripts in python that break because you stick
something in a cron job and the locale settings suddenly become C and
therefore the script suddenly only deals with ASCII characters.

I don't know that Victor's proposed solution is the best (I personally would
like it a whole lot more than the current guessing but I never develop on
Windows so I can certainly see that your environment can lead to the
opposite assumption :-) but something should change here.  Issuing a warning
like "open used without explicit encoding may lead to errors" whenever open() is
used without an explicit encoding would help a little (at least, people who
get errors would then have an inkling that the culprit might be an open()
call).  If I read Victor's previous email correctly, though, he said this
was previously rejected.

Another brainstorming solution would be to use different default encodings on
different platforms.  For instance, for writing files, utf-8 on *nix systems
(including Mac OS X) and utf-16 on windows.  For reading files, check for a utf-16
BOM; if not present, operate as utf-8.  That would seem to address your
issue with detection by vim, etc., but I'm not sure about getting £ in your
input stream.  I don't know where your input is coming from and how the Windows
equivalent of locale plays into that.
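
A rough sketch of that read-side idea (the function name and the exact BOM
handling are just illustrative, not a worked-out proposal)::

  import codecs

  def open_for_reading(path):
      # peek at the first two bytes to see whether a UTF-16 BOM is present
      with open(path, 'rb') as f:
          head = f.read(2)
      if head in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
          return open(path, encoding='utf-16')
      # otherwise fall back to the UTF-8 default suggested above
      return open(path, encoding='utf-8')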

-Toshio




Re: [Python-Dev] [PEPs] Rebooting PEP 394 (aka Support the /usr/bin/python2 symlink upstream)

2011-08-12 Thread Toshio Kuratomi
On Fri, Aug 12, 2011 at 12:19:23PM -0400, Barry Warsaw wrote:
 On Aug 12, 2011, at 01:10 PM, Nick Coghlan wrote:
 
 1. Accept the reality of that situation, and propose a mechanism that
 minimises the impact of the resulting ambiguity on end users of Python
 by allowing developers to be explicit about their target language.
 This is the approach advocated in PEP 394.
 
 2. Tell the Arch developers (and anyone else inclined to point the
 python name at python3) that they're wrong, and the python symlink
 should, now and forever, always refer to a version of Python 2.x.
 
 FWIW, although I generally support the PEP, I also think that distros
 themselves have a responsibility to ensure their #! lines are correct, for
 scripts they install.  Meaning, if it requires rewriting the #! line on OS
 package install, so be it.
 
+1 with the one caveat... it's nice to upstream fixes.  If there's a simple
thing like python == python-2 and python3 == python-3 everywhere, this is
possible.  If there's something like python2 == python-2 and python-3 ==
python3 everywhere, this is also possible.  The problem is that the latter
is not the case (python from python.org itself doesn't produce a python2
symlink on install), and while historically the former was the case, since
python-dev rejected the notion that python == python-2 that is no longer true.

As long as it's just Arch, there's still time to go with #2.  #1 is not
a complete solution (especially because /usr/bin/python2 will never exist on
some historical systems [not ones I run though, so someone else will need to
beat that horse :-)]) but is better than where we are now where there is no
guidance on what's right and wrong at all.

-Toshio




Re: [Python-Dev] Using PEP384 Stable ABI for the lzma extension module

2011-10-05 Thread Toshio Kuratomi
On Wed, Oct 05, 2011 at 06:14:08PM +0200, Antoine Pitrou wrote:
 Le mercredi 05 octobre 2011 à 18:12 +0200, Martin v. Löwis a écrit :
   Not sure what you are using it for. If you need to extend the buffer
   in case it is too small, there is absolutely no way this could work
   without copies in the general case because of how computers use
   address space. Even _PyBytes_Resize will copy the data.
  
   That's not a given. Depending on the memory allocator, a copy can be
   avoided. That's why the str += str hack is much more efficient under
   Linux than Windows, AFAIK.
  
  Even Linux will have to copy a block on realloc in certain cases, no?
 
 Probably so. How often is totally unknown to me :)
 
http://www.gnu.org/software/libc/manual/html_node/Changing-Block-Size.html

It depends on whether there's enough free memory after the buffer you
currently have allocated.  I suppose that this becomes a question of what
people consider the general case :-)

-Toshio



