Re: [Python-Dev] Python 3.0.1 (io-in-c)

2009-01-28 Thread Raymond Hettinger


[Scott David Daniels]

Comparison of three cases (including performance ratios):
                                             MB/s     MB/s     MB/s
                                             in C  in py3k   in 2.7   C/3k 2.7/3k
** Text append **
 10M write 1e6 units at a time             261.00  218.000 1540.000   1.20   7.06
 20K write one unit at a time               0.983    0.081     1.33  12.08  16.34
400K write 20 units at a time              16.000    1.510    22.90  10.60  15.17
400K write 4096 units at a time            236.00  118.000 1244.000   2.00  10.54


Do you know why the text-appends fell off so much in the 1st and last cases?



** Text input **
 10M read whole contents at once           89.700   68.700  966.000   1.31  14.06
 20K read whole contents at once          108.000   70.500 1196.000   1.53  16.96

  ...

400K read one line at a time               71.700    3.690   207.00  19.43  56.10

  ...

400K read whole contents at once          112.000   81.000  841.000   1.38  10.38
400K seek forward 1000 units at a time     87.400   67.300  589.000   1.30   8.75
400K seek forward one unit at a time        0.090    0.071    0.873   1.28  12.31


Looks like most of these still have substantial falloffs in performance.
Is this part still a work in progress or is this as good as it's going to get?



** Text overwrite **
 20K modify one unit at a time              0.296    0.072    1.320   4.09  18.26
400K modify 20 units at a time              5.690    1.360   22.500   4.18  16.54
400K modify 4096 units at a time          151.000   88.300  509.000   1.71   5.76


Same question on this batch.


Raymond
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Python 3.0.1 (io-in-c)

2009-01-28 Thread Antoine Pitrou

Hello,

Raymond Hettinger python at rcn.com writes:
 
                                              MB/s     MB/s     MB/s
                                              in C  in py3k   in 2.7   C/3k 2.7/3k
 ** Text append **
  10M write 1e6 units at a time             261.00  218.000 1540.000   1.20   7.06
  20K write one unit at a time               0.983    0.081     1.33  12.08  16.34
 400K write 20 units at a time              16.000    1.510    22.90  10.60  15.17
 400K write 4096 units at a time            236.00  118.000 1244.000   2.00  10.54
 
 Do you know why the text-appends fell off so much in the 1st and last cases?

When writing large chunks of text (4096, 1e6), bookkeeping costs become
marginal and encoding costs dominate. 2.x has no encoding costs, which
explains why it's so much faster.

A quick test tells me utf-8 encoding runs at 280 MB/s on this dataset (the
400KB text file). You can see that there is not much left to optimize on
large writes.
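
For readers who want to reproduce a figure of this kind, here is a minimal
sketch that times str.encode() on a synthetic, mostly-ASCII text. The sample
text, its size and the helper name encode_throughput_mb_s are assumptions made
for illustration; this is not the iobench code, and the result depends on the
machine and the Python version.

```python
import timeit

def encode_throughput_mb_s(text, encoding="utf-8", repeat=5, number=20):
    """Return the best observed encoding throughput in MB/s (output bytes)."""
    nbytes = len(text.encode(encoding))
    # timeit.repeat returns, for each repetition, the total seconds
    # taken by `number` calls; the minimum is the least noisy estimate.
    best = min(timeit.repeat(lambda: text.encode(encoding),
                             repeat=repeat, number=number))
    return nbytes * number / best / 1e6

# Hypothetical sample: roughly 400 KB of mostly-ASCII text with a few accents.
sample = ("The quick brown fox jumps over the lazy dog. " * 20
          + "d\u00e9j\u00e0 vu\n") * 450

if __name__ == "__main__":
    print("utf-8 encode: %.0f MB/s" % encode_throughput_mb_s(sample))
```

Taking the minimum over several repetitions is a common way to reduce noise
from other processes; it says nothing about bookkeeping costs in the I/O layer.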


  ** Text input **
  10M read whole contents at once           89.700   68.700  966.000   1.31  14.06
  20K read whole contents at once          108.000   70.500 1196.000   1.53  16.96
 ...
 400K read one line at a time               71.700    3.690   207.00  19.43  56.10
 ...
 400K read whole contents at once          112.000   81.000  841.000   1.38  10.38
 400K seek forward 1000 units at a time     87.400   67.300  589.000   1.30   8.75
 400K seek forward one unit at a time        0.090    0.071    0.873   1.28  12.31
 
 Looks like most of these still have substantial falloffs in performance.
 Is this part still a work in progress or is this as good as it's going to get?

There is nothing obvious left to optimize in the read() department.
Decoding and newline translation costs dominate. Decoding has already been 
optimized for the most popular encodings in py3k:
http://mail.python.org/pipermail/python-checkins/2009-January/077024.html

Newline translation follows a fast path depending on various heuristics.

I also took particular care of the read one line at a time scenario because
it's the most likely idiom when reading a text file. I think there is hardly
anything left to optimize on this one. Your eyes are welcome, though.

Note that the benchmark is run with the following default settings for text
I/O: utf-8 encoding, universal newlines enabled, text containing only \n 
newlines.
You can play with settings here:
http://svn.python.org/view/sandbox/trunk/iobench/
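
As an illustration of those settings, the sketch below reads a file one line
at a time with utf-8 decoding and universal newlines. It assumes Python 3's
built-in open(); the helper names read_lines and demo are hypothetical and the
file contents are made up.

```python
import os
import tempfile

def read_lines(path):
    """Read a text file line by line with utf-8 and universal newlines."""
    # newline=None (the default) enables universal newlines: "\r\n" and
    # "\r" are translated to "\n" on input.
    with open(path, "r", encoding="utf-8", newline=None) as f:
        return [line for line in f]

def demo():
    """Write a small file with mixed line endings and read it back."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    try:
        with os.fdopen(fd, "wb") as raw:
            raw.write("caf\u00e9\r\nsecond line\n".encode("utf-8"))
        return read_lines(path)
    finally:
        os.remove(path)
```

Iterating over the file object is the "read one line at a time" idiom
mentioned above.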


Text seek() and tell(), on the other hand, are known to be slow, and they
could perhaps be improved. It is assumed, however, that they won't be used a
lot for text files.
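
For context, the io documentation describes the value returned by tell() on a
text stream as an opaque number, which can only be handed back to seek(). A
minimal sketch of that usage, with an in-memory stream standing in for a file
(the helper name reread_from_mark is hypothetical):

```python
import io

def reread_from_mark(text, skip_lines):
    """Skip some lines, remember the position, and re-read from it."""
    stream = io.TextIOWrapper(io.BytesIO(text.encode("utf-8")),
                              encoding="utf-8")
    for _ in range(skip_lines):
        stream.readline()
    mark = stream.tell()        # opaque cookie, not a character index
    rest = stream.read()
    stream.seek(mark)           # only values returned by tell() are valid here
    assert stream.read() == rest
    return rest
```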


  ** Text overwrite **
  20K modify one unit at a time              0.296    0.072    1.320   4.09  18.26
 400K modify 20 units at a time              5.690    1.360   22.500   4.18  16.54
 400K modify 4096 units at a time          151.000   88.300  509.000   1.71   5.76
 
 Same question on this batch.

There seems to be some additional overhead in this case. Perhaps it could be
improved, I'll have to take a look... But I doubt overwriting chunks of text
is a common scenario.

Regards

Antoine.




Re: [Python-Dev] Python 3.0.1 (io-in-c)

2009-01-28 Thread Paul Moore
2009/1/28 Antoine Pitrou solip...@pitrou.net:
 When writing large chunks of text (4096, 1e6), bookkeeping costs become
 marginal and encoding costs dominate. 2.x has no encoding costs, which
 explains why it's so much faster.

Interesting. However, it's still slower in terms of perception. In
2.x, I regularly do the equivalent of

f = open(filename, "r")
... read strings from f ...

Yes, I know this is byte I/O in reality, but for everything I do
(Latin-1 on input and output, and for most practical purposes
ASCII-only) it simply isn't relevant to me.

If Python 3.x makes this substantially slower (working in a naive mode
where I ignore encoding issues), claiming "it's encoding costs"
doesn't make any difference - in a practical sense, I don't get any
benefits and yet I pay the cost. (You can say my approach is wrong,
but so what? I'll just say that 2.x is faster for me, and not migrate.
Ultimately, this is about marketing 3.x...)
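
One way to approximate this naive mode in Python 3 is to open the file as
Latin-1, which maps every byte to exactly one code point and therefore never
fails to decode. A minimal sketch; the helper names read_naive and
roundtrip_all_bytes are hypothetical, and whether this is fast enough is
exactly the open question of this thread:

```python
import os
import tempfile

def read_naive(path):
    """Read a file as text, treating each byte as one Latin-1 character."""
    # newline="" disables newline translation, so the text maps back to
    # the original bytes exactly.
    with open(path, "r", encoding="latin-1", newline="") as f:
        return f.read()

def roundtrip_all_bytes():
    """Check that every byte value survives a read via Latin-1."""
    payload = bytes(range(256))
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as raw:
            raw.write(payload)
        return read_naive(path).encode("latin-1") == payload
    finally:
        os.remove(path)
```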

It would be helpful to limit this cost as much as possible - maybe
that's simply ensuring that the default encoding for open is (in the
majority of cases) a highly-optimised one whose costs *don't* dominate
in the way you describe (although if you're using UTF-8, I'd guess
that would be the usual default on Linux, so it looks like there's
some work needed there). Hmm, I just checked and on Windows, it
appears that sys.getdefaultencoding() is UTF-8. That seems odd - I
would have thought the majority of Windows systems were NOT set to use
UTF-8 by default...

Paul.


Re: [Python-Dev] Python 3.0.1 (io-in-c)

2009-01-28 Thread Victor Stinner
On Wednesday 28 January 2009 11:55:16, Antoine Pitrou wrote:
 2.x has no encoding costs, which explains why it's so much faster.

Why not test io.open() or codecs.open(), which create unicode strings?

-- 
Victor Stinner aka haypo
http://www.haypocalc.com/blog/


Re: [Python-Dev] Python 3.0.1 (io-in-c)

2009-01-28 Thread Antoine Pitrou
Paul Moore p.f.moore at gmail.com writes:
 
 It would be helpful to limit this cost as much as possible - maybe
 that's simply ensuring that the default encoding for open is (in the
 majority of cases) a highly-optimised one whose costs *don't* dominate
 in the way you describe

As I pointed out, utf-8, utf-16 and latin1 decoders have already been optimized
in py3k. For *pure ASCII* input, utf-8 decoding is blazingly fast (1GB/s here).
The dataset for iobench isn't pure ASCII though, and that's why it's not as 
fast.

People are invited to test their own workloads with the io-c branch and report
performance figures (and possible bugs). There are so many possibilities that
the benchmark figures given by a generic tool can only be indicative.

Regards

Antoine.




Re: [Python-Dev] Python 3.0.1 (io-in-c)

2009-01-28 Thread Antoine Pitrou
Victor Stinner victor.stinner at haypocalc.com writes:
 
 On Wednesday 28 January 2009 11:55:16, Antoine Pitrou wrote:
  2.x has no encoding costs, which explains why it's so much faster.
 
 Why not test io.open() or codecs.open(), which create unicode strings?

The goal is to test the idiomatic way of opening text files (the one obvious
way to do it, if you want).
There is no doubt that io.open() and codecs.open() in 2.x are much slower than
the io-c branch. However, nobody is expecting very good performance from
io.open() and codecs.open() in 2.x either.

Regards

Antoine.




Re: [Python-Dev] Python 3.0.1 (io-in-c)

2009-01-28 Thread Paul Moore
2009/1/28 Antoine Pitrou solip...@pitrou.net:
 Paul Moore p.f.moore at gmail.com writes:

 It would be helpful to limit this cost as much as possible - maybe
 that's simply ensuring that the default encoding for open is (in the
 majority of cases) a highly-optimised one whose costs *don't* dominate
 in the way you describe

 As I pointed out, utf-8, utf-16 and latin1 decoders have already been
 optimized in py3k. For *pure ASCII* input, utf-8 decoding is blazingly
 fast (1GB/s here). The dataset for iobench isn't pure ASCII though, and
 that's why it's not as fast.

Ah, thanks. Although you said your data was 95% ASCII, and you're
getting decode speeds of 250MB/s. That's a 75% slowdown for 5% of the
data! Surely that's not right???

 People are invited to test their own workloads with the io-c branch and report
 performance figures (and possible bugs). There are so many possibilities that
 the benchmark figures given by a generic tool can only be indicative.

At the moment, I don't have the time to download and build the branch,
and in any case as I only have Visual Studio Express, I don't get the
PGO optimisations, making any tests I do highly suspect.

Paul.

PS Can anyone comment on why Python defaults to utf-8 on Windows? That
seems like a highly suspect default...


Re: [Python-Dev] Python 3.0.1 (io-in-c)

2009-01-28 Thread Victor Stinner
On Wednesday 28 January 2009 12:41:07, Antoine Pitrou wrote:
  Why not test io.open() or codecs.open(), which create unicode strings?

 There is no doubt that io.open() and codecs.open() in 2.x are much slower
 than the io-c branch. However, nobody is expecting very good performance
 from io.open() and codecs.open() in 2.x either.

I use codecs.open() in my programs, so I'm interested in the benchmark on 
this function ;-)

But if I understand correctly, Python (3.1 ?) will be faster (or much faster) 
at reading/writing files in unicode, and that's great news ;-)

-- 
Victor Stinner aka haypo
http://www.haypocalc.com/blog/


Re: [Python-Dev] Python 3.0.1

2009-01-28 Thread Lawrence Oluyede
On Wed, Jan 28, 2009 at 4:32 AM, Steve Holden st...@holdenweb.com wrote:
 I think that both 3.0 and 2.6 were rushed releases. 2.6 showed it in the
 inclusion (later recognizable as somewhat ill-advised so late in the
 day) of multiprocessing; 3.0 shows it in the very fact that this
 discussion has become necessary.

What about some kind of mechanism to triage third-party modules?
Something like:

module gains popularity -> the core team decides it's worthy -> the
module is included in the library in some kind of contrib/ext package
(like the future mechanism) and stays in that package for one major
release (so developers don't have to rush to fix _all_ the bugs they
can while making a major release) -> after (at least) one major release
the module moves up one level and is considered stable and rock solid.

Meanwhile the documentation must say that the third-party contributed
module is not considered production-ready, though usable, until
release current + 1.

I don't know if it's feasible, if it's insane or what; it's just an idea I had.

-- 
Lawrence, http://oluyede.org - http://twitter.com/lawrenceoluyede
It is difficult to get a man to understand
something when his salary depends on not
understanding it - Upton Sinclair


Re: [Python-Dev] Python 3.0.1 (io-in-c)

2009-01-28 Thread Antoine Pitrou
Paul Moore p.f.moore at gmail.com writes:
 
  As I pointed out, utf-8, utf-16 and latin1 decoders have already been
  optimized in py3k. For *pure ASCII* input, utf-8 decoding is blazingly
  fast (1GB/s here). The dataset for iobench isn't pure ASCII though, and
  that's why it's not as fast.
 
 Ah, thanks. Although you said your data was 95% ASCII, and you're
 getting decode speeds of 250MB/s. That's a 75% slowdown for 5% of the
 data! Surely that's not right???

If you look at how utf-8 decoding is implemented (in unicodeobject.c), it's
quite obvious why it is so :-) There is a (very) fast path for chunks of pure
ASCII data, and (fast but not blazingly fast) fallback for non ASCII data.
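
To see the effect on a given interpreter, one can time bytes.decode() on pure
ASCII input and on input where one character in twenty is non-ASCII. This is
only a sketch: the two inputs and the helper name decode_throughput_mb_s are
assumptions, and decoders in later Python versions differ from the
unicodeobject.c code discussed here, so the gap may look different.

```python
import timeit

def decode_throughput_mb_s(data, repeat=5, number=20):
    """Return the best observed utf-8 decoding throughput in MB/s (input bytes)."""
    best = min(timeit.repeat(lambda: data.decode("utf-8"),
                             repeat=repeat, number=number))
    return len(data) * number / best / 1e6

# Hypothetical inputs of about 400 KB each.
pure_ascii = b"x" * 400000
# 19 ASCII characters followed by one two-byte character: 95% ASCII.
mostly_ascii = ("x" * 19 + "\u00e9").encode("utf-8") * 20000

if __name__ == "__main__":
    print("pure ASCII: %.0f MB/s" % decode_throughput_mb_s(pure_ascii))
    print("95%% ASCII:  %.0f MB/s" % decode_throughput_mb_s(mostly_ascii))
```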

Please don't think of it as a slowdown... It's still much faster than 2.x, which
manages 130MB/s on the same data.

Regards

Antoine.



Re: [Python-Dev] Python 3.0.1 (io-in-c)

2009-01-28 Thread Paul Moore
2009/1/28 Antoine Pitrou solip...@pitrou.net:
 If you look at how utf-8 decoding is implemented (in unicodeobject.c), it's
 quite obvious why it is so :-) There is a (very) fast path for chunks of pure
 ASCII data, and (fast but not blazingly fast) fallback for non ASCII data.

Thanks for the explanation.

 Please don't think of it as a slowdown... It's still much faster than 2.x,
 which manages 130MB/s on the same data.

Don't get me wrong - I'm hugely grateful for this work. And
personally, I don't expect that I/O speed is ever likely to be a real
bottleneck in the type of program I write. But I'm concerned that
(much as with the whole "Python 3.0 is incompatible, and it will be
hard to port to" meme) people will pick up on raw benchmark figures -
no matter how much they aren't comparing like with like - and start
making it sound like Python 3.0 I/O is slower than 2.x - which is a
great disservice to the good work that's been done.

I do think it's worth taking care over the default encoding, though.
Quite apart from performance, getting correct behaviour is
important. I can't speak for Unix, but on Windows, the following
behaviour feels like a bug to me:

echo a£b >a1

python
Python 2.6.1 (r261:67517, Dec  4 2008, 16:51:00) [MSC v.1500 32 bit
(Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> print open("a1").read()
a£b

>>> ^Z


\Apps\Python30\python.exe
Python 3.0 (r30:67507, Dec  3 2008, 20:14:27) [MSC v.1500 32 bit
(Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> print(open("a1").read())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\Apps\Python30\lib\io.py", line 1491, in write
    b = encoder.encode(s)
  File "D:\Apps\Python30\lib\encodings\cp850.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u0153' in
position 1: character maps to <undefined>
>>> ^Z

chcp
Active code page: 850

Paul.


Re: [Python-Dev] Python 3.0.1 (io-in-c)

2009-01-28 Thread Antoine Pitrou
On Wednesday 28 January 2009 at 16:54 +, Paul Moore wrote:
 I do think it's worth taking care over the default encoding, though.
 Quite apart from performance, getting correct behaviour is
 important. I can't speak for Unix, but on Windows, the following
 behaviour feels like a bug to me:
[...]

Please open a bug :)

cheers

Antoine.




Re: [Python-Dev] Python 3.0.1

2009-01-28 Thread M.-A. Lemburg
On 2009-01-27 22:19, Raymond Hettinger wrote:
 From: Martin v. Löwis mar...@v.loewis.de
 Releasing 3.1 6 months after 3.0 sounds reasonable; I don't think
 it should be released earlier (else 3.0 looks fairly ridiculous).
 
 I think it should be released earlier and completely supplant 3.0
 before more third-party developers spend time migrating code.
 We needed 3.0 to get released so we could get the feedback
 necessary to shake it out.  Now, it is time for it to fade into history
 and take advantage of the lessons learned.
 
 The principles for the 2.x series don't really apply here.  In 2.x, there
 was always a useful, stable, clean release already fielded and there
 were tons of third-party apps that needed a slow rate of change.
 
 In contrast, 3.0 has a near zero installed user base (at least in terms
 of being used in production).  It has very few migrated apps.  It is
 not particularly clean and some of the work for it was incomplete
 when it was released.
 
 My preference is to drop 3.0 entirely (no incompatible bugfix release)
 and in early February release 3.1 as the real 3.x that migrators ought
 to aim for and that won't have incompatible bugfix releases.  Then at
 PyCon, we can have a real bug day and fix-up any chips in the paint.
 
 If 3.1 goes out right away, then it doesn't matter if 3.0 looks ridiculous.
 All eyes go to the latest release.  Better to get this done before more
 people download 3.0 to kick the tires.

Why don't we just mark 3.0.x as an experimental branch and keep updating/
fixing things that were not sorted out for the 3.0.0 release?! I think
that's a fair approach, given that the only way to get field testing
for new open-source software is to release early and often.

A 3.1 release should then be the first stable release of the 3.x series
and mark the start of the usual deprecation mechanisms we have
in the 2.x series. Needless to say, rushing 3.1 out now would
only cause yet another experimental release... major releases do take
time to stabilize.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Jan 28 2009)
 Python/Zope Consulting and Support ...http://www.egenix.com/
 mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
 mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/


::: Try our new mxODBC.Connect Python Database Interface for free ! 


   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
   Registered at Amtsgericht Duesseldorf: HRB 46611
   http://www.egenix.com/company/contact/


Re: [Python-Dev] Python 3.0.1 (io-in-c)

2009-01-28 Thread Martin v. Löwis
 PS Can anyone comment on why Python defaults to utf-8 on Windows?

Don't panic. It doesn't, and you are misinterpreting what you are
seeing.

Regards,
Martin


Re: [Python-Dev] Python 3.0.1 (io-in-c)

2009-01-28 Thread Martin v. Löwis
Paul Moore wrote:
 Hmm, I just checked and on Windows, it
 appears that sys.getdefaultencoding() is UTF-8. That seems odd - I
 would have thought the majority of Windows systems were NOT set to use
 UTF-8 by default...

In Python 3, sys.getdefaultencoding() is utf-8 on all platforms, just
as it was ascii in 2.x, on all platforms. The default encoding isn't
used for I/O; check f.encoding to find out what encoding is used to
read the file you are reading.
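
The distinction can be checked directly. In the sketch below (the helper name
default_vs_file_encoding is hypothetical), a temporary text file is opened
without and with an explicit encoding, and its f.encoding is compared with
sys.getdefaultencoding():

```python
import sys
import tempfile

def default_vs_file_encoding(explicit=None):
    """Return (sys.getdefaultencoding(), encoding of a freshly opened text file)."""
    # With encoding=None the text layer picks a platform-dependent encoding;
    # it does not consult sys.getdefaultencoding().
    with tempfile.TemporaryFile("w+", encoding=explicit) as f:
        return sys.getdefaultencoding(), f.encoding

if __name__ == "__main__":
    print(default_vs_file_encoding())            # second item varies by system
    print(default_vs_file_encoding("latin-1"))
```

Passing encoding= explicitly removes the dependence on the platform.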

Regards,
Martin


Re: [Python-Dev] Python 3.0.1 (io-in-c)

2009-01-28 Thread Martin v. Löwis
 >>> print(open("a1").read())
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "D:\Apps\Python30\lib\io.py", line 1491, in write
     b = encoder.encode(s)
   File "D:\Apps\Python30\lib\encodings\cp850.py", line 19, in encode
     return codecs.charmap_encode(input,self.errors,encoding_map)[0]
 UnicodeEncodeError: 'charmap' codec can't encode character '\u0153' in
 position 1: character maps to <undefined>

Looks right to me.

Martin


Re: [Python-Dev] Python 3.0.1 (io-in-c)

2009-01-28 Thread Paul Moore
2009/1/28 Martin v. Löwis mar...@v.loewis.de:
 Paul Moore wrote:
 Hmm, I just checked and on Windows, it
 appears that sys.getdefaultencoding() is UTF-8. That seems odd - I
 would have thought the majority of Windows systems were NOT set to use
 UTF-8 by default...

 In Python 3, sys.getdefaultencoding() is utf-8 on all platforms, just
 as it was ascii in 2.x, on all platforms. The default encoding isn't
 used for I/O; check f.encoding to find out what encoding is used to
 read the file you are reading.

Thanks for the explanation. It might be clearer to document this a
little more explicitly in the docs for open() (on the basis that
people using open() are the most likely to be naive about encodings).
I'll see if I can come up with an appropriate doc patch.

Paul.


Re: [Python-Dev] Python 3.0.1 (io-in-c)

2009-01-28 Thread Paul Moore
2009/1/28 Martin v. Löwis mar...@v.loewis.de:
 >>> print(open("a1").read())
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "D:\Apps\Python30\lib\io.py", line 1491, in write
     b = encoder.encode(s)
   File "D:\Apps\Python30\lib\encodings\cp850.py", line 19, in encode
     return codecs.charmap_encode(input,self.errors,encoding_map)[0]
 UnicodeEncodeError: 'charmap' codec can't encode character '\u0153' in
 position 1: character maps to <undefined>

 Looks right to me.

I don't see why. I wrote the file from the console (cp850), read it in
Python using the default encoding (which I would expect to match the
console encoding), wrote it to sys.stdout (which I would expect to use
the console encoding).

How did the character end up not being encodable, when I've only used
one encoding throughout? (And if my assumptions about the encodings
used are wrong at some point, that's what I'm suggesting is the
error).

Paul.


Re: [Python-Dev] Python 3.0.1 (io-in-c)

2009-01-28 Thread Steven Bethard
On Wed, Jan 28, 2009 at 10:29 AM, Martin v. Löwis mar...@v.loewis.de wrote:
 Notice that the determination of the specific encoding used is fairly
 elaborate:
 - if IO is to a terminal, Python tries to determine the encoding of
  the terminal. This is mostly relevant for Windows (which uses,
  by default, the OEM code page in the terminal).
 - if IO is to a file, Python tries to guess the common encoding
  for the system. On Unix, it queries the locale, and falls back
  to ascii if no locale is set. On Windows, it uses the ANSI
  code page. On OSX, it uses the system encoding.
 - if IO is binary, (clearly) no encoding is used. Network IO is
  always binary.
 - for file names, yet different algorithms apply. On Windows, it
  uses the Unicode API, so no need for an encoding. On Unix, it
  (again) uses the locale encoding. On OSX, it uses UTF-8
  (just to be clear: this applies to the first argument of open(),
   not to the resulting file object)
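
The three text-related cases in this list roughly correspond to three values
that can be inspected from Python. The sketch below only reports them; the
helper name io_encodings is hypothetical, and the mapping of each case to one
call is a simplification of the rules described above.

```python
import locale
import sys

def io_encodings():
    """Report the encodings Python would use for terminal, file and file-name I/O."""
    return {
        # May be None if sys.stdout has been replaced by an object
        # without an encoding attribute.
        "terminal": getattr(sys.stdout, "encoding", None),
        # Encoding used for text files opened without an explicit encoding.
        "file contents": locale.getpreferredencoding(False),
        # Encoding used to convert file names to and from bytes.
        "file names": sys.getfilesystemencoding(),
    }

if __name__ == "__main__":
    for kind, name in io_encodings().items():
        print("%-14s %s" % (kind, name))
```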

This is a very helpful explanation. Is it in the docs somewhere, or if it
isn't, could it be?

Steve
-- 
I'm not *in*-sane. Indeed, I am so far *out* of sane that you appear a
tiny blip on the distant coast of sanity.
--- Bucky Katt, Get Fuzzy


Re: [Python-Dev] Python 3.0.1 (io-in-c)

2009-01-28 Thread Martin v. Löwis
Paul Moore wrote:
 2009/1/28 Martin v. Löwis mar...@v.loewis.de:
 >>> print(open("a1").read())
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "D:\Apps\Python30\lib\io.py", line 1491, in write
     b = encoder.encode(s)
   File "D:\Apps\Python30\lib\encodings\cp850.py", line 19, in encode
     return codecs.charmap_encode(input,self.errors,encoding_map)[0]
 UnicodeEncodeError: 'charmap' codec can't encode character '\u0153' in
 position 1: character maps to <undefined>
 Looks right to me.
 
 I don't see why. I wrote the file from the console (cp850), read it in
 Python using the default encoding (which I would expect to match the
 console encoding), wrote it to sys.stdout (which I would expect to use
 the console encoding).
 
 How did the character end up not being encodable, when I've only used
 one encoding throughout? (And if my assumptions about the encodings
 used are wrong at some point, that's what I'm suggesting is the
 error).

Well, first try to understand what the error *is*:

py> unicodedata.name('\u0153')
'LATIN SMALL LIGATURE OE'
py> unicodedata.name('£')
'POUND SIGN'
py> ascii('£')
'\\xa3'
py> ascii('£'.encode('cp850').decode('cp1252'))
'\\u0153'

So when Python reads the file, it uses cp1252. This is sensible: the
fact that the console uses cp850 doesn't change the fact that the common
encoding of files on your system is cp1252. It is an unfortunate fact
of Windows that the console window uses a different encoding from the
rest of the system (namely, the console uses the OEM code page, and
everything else uses the ANSI code page).

Furthermore, U+0153 does not exist in cp850 (i.e. the terminal doesn't
support œ), hence the exception.
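
The whole chain can be reproduced without a Windows console. The sketch below
encodes the text as cp850 (what the console wrote), decodes it as cp1252 (what
open() used), and then tries to encode the result as cp850 again (what
printing to the console did). The helper names misread and first_unencodable
are hypothetical.

```python
def misread(text, written_as="cp850", read_as="cp1252"):
    """Encode text with one code page and decode the bytes with another."""
    return text.encode(written_as).decode(read_as)

def first_unencodable(text, codec):
    """Return (position, character) of the first encoding failure, or None."""
    try:
        text.encode(codec)
    except UnicodeEncodeError as exc:
        return exc.start, text[exc.start:exc.end]
    return None

# "a\u00a3b" is a, POUND SIGN, b: the contents of the file a1.
garbled = misread("a\u00a3b")

if __name__ == "__main__":
    # ascii() keeps the output printable on any console encoding.
    print(ascii(garbled))
    print(ascii(first_unencodable(garbled, "cp850")))
```

The failure is reported at position 1, matching the traceback quoted above.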

Regards,
Martin


Re: [Python-Dev] Python 3.0.1 (io-in-c)

2009-01-28 Thread Martin v. Löwis
 This is a very helpful explanation. Is it in the docs somewhere, or if it
 isn't, could it be?

I actually don't know.

Regards,
Martin


Re: [Python-Dev] Python 3.0.1 (io-in-c)

2009-01-28 Thread Paul Moore
2009/1/28 Martin v. Löwis mar...@v.loewis.de:
 Well, first try to understand what the error *is*:

 py> unicodedata.name('\u0153')
 'LATIN SMALL LIGATURE OE'
 py> unicodedata.name('£')
 'POUND SIGN'
 py> ascii('£')
 '\\xa3'
 py> ascii('£'.encode('cp850').decode('cp1252'))
 '\\u0153'

 So when Python reads the file, it uses cp1252. This is sensible - just
 that the console uses cp850 doesn't change the fact that the common
 encoding of files on your system is cp1252. It is an unfortunate fact
 of Windows that the console window uses a different encoding from the
 rest of the system (namely, the console uses the OEM code page, and
 everything else uses the ANSI code page).

Ah, I see. That is entirely obvious. The key bit of information is
that the default I/O encoding is cp1252, not cp850. I know that in
theory, and I see the consequences often enough (:-)), but it isn't
instinctive for me. And the simple "default encoding is system
dependent" comment is not very helpful in terms of warning me that
there could be an issue.

I do think that more wording around encoding defaults would be useful
- as I said, I'll think about how best it could be made into a doc
patch. I suspect the best approach would be to have a section (maybe
in the docs for the codecs module) explaining all the details, and
then a cross-reference to that from the various places (open, io)
where default encodings are mentioned.

Paul.


 Furthermore, U+0153 does not exist in cp850 (i.e. the terminal doesn't
 support œ), hence the exception.

 Regards,
 Martin



Re: [Python-Dev] Python 3.0.1 (io-in-c)

2009-01-28 Thread Terry Reedy

Steven Bethard wrote:

On Wed, Jan 28, 2009 at 10:29 AM, Martin v. Löwis mar...@v.loewis.de wrote:

Notice that the determination of the specific encoding used is fairly
elaborate:
- if IO is to a terminal, Python tries to determine the encoding of
 the terminal. This is mostly relevant for Windows (which uses,
 by default, the OEM code page in the terminal).
- if IO is to a file, Python tries to guess the common encoding
 for the system. On Unix, it queries the locale, and falls back
 to ascii if no locale is set. On Windows, it uses the ANSI
 code page. On OSX, it uses the system encoding.
- if IO is binary, (clearly) no encoding is used. Network IO is
 always binary.
- for file names, yet different algorithms apply. On Windows, it
 uses the Unicode API, so no need for an encoding. On Unix, it
 (again) uses the locale encoding. On OSX, it uses UTF-8
 (just to be clear: this applies to the first argument of open(),
  not to the resulting file object)


This is a very helpful explanation. Is it in the docs somewhere, or if it
isn't, could it be?


Here is the current entry on encodings in the Lib ref, built-in types, 
file objects.


file.encoding
The encoding that this file uses. When strings are written to a file, 
they will be converted to byte strings using this encoding. In addition, 
when the file is connected to a terminal, the attribute gives the 
encoding that the terminal is likely to use (that information might be 
incorrect if the user has misconfigured the terminal). The attribute is 
read-only and may not be present on all file-like objects. It may also 
be None, in which case the file uses the system default encoding for 
converting strings.




Re: [Python-Dev] Python 3.0.1 (io-in-c)

2009-01-28 Thread Jean-Paul Calderone

On Wed, 28 Jan 2009 18:52:41 +, Paul Moore p.f.mo...@gmail.com wrote:

2009/1/28 Martin v. Löwis mar...@v.loewis.de:

Well, first try to understand what the error *is*:

py> unicodedata.name('\u0153')
'LATIN SMALL LIGATURE OE'
py> unicodedata.name('£')
'POUND SIGN'
py> ascii('£')
'\\xa3'
py> ascii('£'.encode('cp850').decode('cp1252'))
'\\u0153'

So when Python reads the file, it uses cp1252. This is sensible - just
that the console uses cp850 doesn't change the fact that the common
encoding of files on your system is cp1252. It is an unfortunate fact
of Windows that the console window uses a different encoding from the
rest of the system (namely, the console uses the OEM code page, and
everything else uses the ANSI code page).
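
The mix-up can be reproduced without a Windows console. The following is a minimal sketch, assuming only the two code pages named above; the function name `demo_console_vs_file_encoding` is hypothetical. It writes the pound sign as cp850 bytes, reads the file back as cp1252 (the mismatch described here), and then reads it again with the encoding named explicitly.

```python
import os
import tempfile


def demo_console_vs_file_encoding():
    """Write bytes as a cp850 console would, then read them back two ways."""
    fd, path = tempfile.mkstemp()
    try:
        # Simulate a file whose bytes came from the OEM code page (cp850).
        with os.fdopen(fd, "wb") as raw:
            raw.write("£".encode("cp850"))

        # Reading with the ANSI code page reproduces the mix-up.
        with open(path, encoding="cp1252") as f:
            wrong = f.read()

        # Naming the real encoding avoids relying on the platform default.
        with open(path, encoding="cp850") as f:
            right = f.read()
    finally:
        os.remove(path)
    return wrong, right
```

The first value returned is the ligature U+0153 and the second is the pound sign, which matches the interpreter session quoted above. Passing `encoding=` explicitly is the usual way to keep a script independent of the system default.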


Ah, I see. That is entirely obvious. The key bit of information is
that the default io encoding is cp1252, not cp850. I know that in
theory, I see the consequences often enough (:-)), but it isn't
instinctive for me. And the simple "default encoding is system
dependent" comment is not very helpful in terms of warning me that
there could be an issue.


It probably didn't help that the exception raised told you that the
error was in the "charmap" codec.  This should have said "cp850"
instead.  The fact that cp850 is implemented in terms of charmap
isn't very interesting.  The fact that the error occurred while
encoding some text using cp850 is.
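
Until the message itself changes, calling code can attach the codec name it asked for. This is only a sketch of one way to do that, not anything the codecs module provides: `encode_named` is a hypothetical helper, and it relies solely on the documented `UnicodeEncodeError` attributes (`object`, `start`, `end`, `encoding`).

```python
def encode_named(text, codec):
    """Encode text, re-raising failures with the codec name that was requested."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError as exc:
        bad = exc.object[exc.start:exc.end]
        # exc.encoding holds whatever name the codec machinery reported,
        # which is not necessarily the name the caller passed in.
        raise ValueError(
            f"cannot encode {bad!a} with codec {codec!r} "
            f"(reported by {exc.encoding!r})"
        ) from exc
```

For example, `encode_named('\u0153', 'cp850')` raises a `ValueError` whose message names cp850, while `encode_named('£', 'cp850')` simply returns the encoded bytes.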

Jean-Paul


Re: [Python-Dev] Python 3.0.1

2009-01-28 Thread Terry Reedy

Michael Foord wrote:

M.-A. Lemburg wrote:



Why don't we just mark 3.0.x as experimental branch and keep updating/
fixing things that were not sorted out for the 3.0.0 release ?! I think
that's a fair approach, given that the only way to get field testing
for new open-source software is to release early and often.

A 3.1 release should then be the first stable release of the 3.x series
and mark the start of the usual deprecation mechanisms we have
in the 2.x series. Needless to say, that rushing 3.1 out now would
only cause yet another experimental release... major releases do take
time to stabilize.

  

+1

I don't think we do users any favours by being cautious in removing / 
fixing things in the 3.0 releases.


I have two main reactions to 3.0.

1. It is great for my purpose -- coding algorithms.
  The cleaner object and text models are a mental relief for me.
  So it was a service to me to release it.
  I look forward to it becoming standard Python and have made my small 
contribution by helping clean up the 3.0 version of the docs.


2. It is something of a trial run that should be fixed as soon as
possible.  I seem to remember something from Shakespeare(?): If 'twere
done, 'tis best 'twere done quickly.


Guido said something over a year ago to the effect that he did not
expect 3.0 to be used as a production release, so I do not think it
should be treated as one.  Label it developmental and people will not
try to keep it in use for years and years in the way that, say, 2.4 still is.


tjr



Re: [Python-Dev] Python 3.0.1 (io-in-c)

2009-01-28 Thread Adam Olsen
On Wed, Jan 28, 2009 at 11:52 AM, Paul Moore p.f.mo...@gmail.com wrote:
 Ah, I see. That is entirely obvious. The key bit of information is
 that the default io encoding is cp1252, not cp850. I know that in
 theory, I see the consequences often enough (:-)), but it isn't
 instinctive for me. And the simple "default encoding is system
 dependent" comment is not very helpful in terms of warning me that
 there could be an issue.

 I do think that more wording around encoding defaults would be useful
 - as I said, I'll think about how best it could be made into a doc
 patch. I suspect the best approach would be to have a section (maybe
 in the docs for the codecs module) explaining all the details, and
 then a cross-reference to that from the various places (open, io)
 where default encodings are mentioned.

It'd also help if the file repr gave the encoding:

>>> f = open('/dev/null')
>>> f
<io.TextIOWrapper object at 0x7ff4468d8a90>
>>> import sys
>>> sys.stdout
<io.TextIOWrapper object at 0x7ff4476126d0>

Of course I can check .encoding manually, but it needs to be more intuitive.
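
As a rough illustration of what a more informative repr could look like, here is a small sketch. The helper `describe_text_file` and the output format are made up for this example and are not a proposal for the actual `__repr__`; it only reads the `name`, `mode` and `encoding` attributes, any of which may be missing on a file-like object.

```python
import io


def describe_text_file(f):
    """Return a repr-like string that includes the file's encoding."""
    # Each attribute is optional on file-like objects, so fall back to None.
    name = getattr(f, "name", None)
    mode = getattr(f, "mode", None)
    encoding = getattr(f, "encoding", None)
    return (
        f"<{type(f).__name__} name={name!r} mode={mode!r} "
        f"encoding={encoding!r}>"
    )


def demo():
    # In-memory stand-in for a real file, so the sketch has no side effects.
    wrapper = io.TextIOWrapper(io.BytesIO(), encoding="utf-8")
    return describe_text_file(wrapper)
```

Calling `describe_text_file(sys.stdout)` in an interactive session shows the console encoding at a glance, which is the information the plain object repr above leaves out.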


-- 
Adam Olsen, aka Rhamphoryncus


Re: [Python-Dev] Python 3.0.1 (io-in-c)

2009-01-28 Thread Daniel Stutzbach
On Wed, Jan 28, 2009 at 1:42 PM, Adam Olsen rha...@gmail.com wrote:

 It'd also help if the file repr gave the encoding:


+1

--
Daniel Stutzbach, Ph.D.
President, Stutzbach Enterprises, LLC http://stutzbachenterprises.com


Re: [Python-Dev] Python 3.0.1 (io-in-c)

2009-01-28 Thread Raymond Hettinger

[Adam Olsen]

It'd also help if the file repr gave the encoding:


+1 from me too.  That will be a big help.


Raymond


Re: [Python-Dev] Python 3.0.1

2009-01-28 Thread Steve Holden
Terry Reedy wrote:
 Michael Foord wrote:
 M.-A. Lemburg wrote:
 
 Why don't we just mark 3.0.x as experimental branch and keep updating/
 fixing things that were not sorted out for the 3.0.0 release ?! I think
 that's a fair approach, given that the only way to get field testing
 for new open-source software is to release early and often.

 A 3.1 release should then be the first stable release of the 3.x series
 and mark the start of the usual deprecation mechanisms we have
 in the 2.x series. Needless to say, that rushing 3.1 out now would
 only cause yet another experimental release... major releases do take
 time to stabilize.

   
 +1

 I don't think we do users any favours by being cautious in removing /
 fixing things in the 3.0 releases.
 
 I have two main reactions to 3.0.
 
 1. It is great for my purpose -- coding algorithms.
   The cleaner object and text models are a mental relief for me.
   So it was a service to me to release it.
   I look forward to it becoming standard Python and have made my small
 contribution by helping clean up the 3.0 version of the docs.
 
 2. It is something of a trial run that should be fixed as soon as
 possible.  I seem to remember something from Shakespeare(?): If 'twere
 done, 'tis best 'twere done quickly.
 
 Guido said something over a year ago to the effect that he did not
 expect 3.0 to be used as a production release, so I do not think it
 should be treated as one.  Label it developmental and people will not
 try to keep it in use for years and years in the way that, say, 2.4 still is.
 
It might also be a good idea to take the download link off the front
page of python.org: until that happens newbies are going to keep coming
along and downloading it because it's the newest.

regards
 Steve
-- 
Steve Holden        +1 571 484 6266   +1 800 494 3119
Holden Web LLC  http://www.holdenweb.com/



Re: [Python-Dev] Python 3.0.1

2009-01-28 Thread Nick Coghlan
Steve Holden wrote:
 2.6 showed it in the
 inclusion (later recognizable as somewhat ill-advised so late in the
 day) of multiprocessing;

Given the longstanding fork() bugs that were fixed as a result of that
inclusion, I think that "ill-advised" is too strong... could it have done
with a little more time to bed down multiprocessing in particular?
Possibly. Was it worth holding up the whole release just for that? I
don't think so - we'd already fixed up the problems that the test suite
and python-dev were likely to find, so the cost/benefit ratio on a delay
would have been pretty poor.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
---


Re: [Python-Dev] Python 3.0.1

2009-01-28 Thread Martin v. Löwis
 It might also be a good idea to take the download link off the front
 page of python.org: until that happens newbies are going to keep coming
 along and downloading it because it's the newest.

It was (and probably still is) Guido's position that 3.0 *is* the
version that newbies should be using.

Regards,
Martin


Re: [Python-Dev] Python 3.0.1 (io-in-c)

2009-01-28 Thread Paul Moore
2009/1/28 Raymond Hettinger pyt...@rcn.com:
 [Adam Olsen]

 It'd also help if the file repr gave the encoding:

 +1 from me too.  That will be a big help.

Definitely. People *are* going to get confused by encoding errors -
let's give them all the help we can.
Paul


Re: [Python-Dev] Python 3.0.1

2009-01-28 Thread Stephen J. Turnbull
Martin v. Löwis writes:
   It might also be a good idea to take the download link off the front
   page of python.org: until that happens newbies are going to keep coming
   along and downloading it because it's the newest.
  
  It was (and probably still is) Guido's position that 3.0 *is* the
  version that newbies should be using.

Indeed.  See Terry Reedy's post.

Somebody who is looking for a platform for a production application is
not going to download something because it's the newest.  Sure,
those advocating other platforms will carp about Python 3.0, but hey,
where is Perl 6?  The amazing thing about a dancing bear is *not* how
well it dances.  Let's not get too worried about the PR aspects; just
fixing the bugs as we go along will fix that to the extent that people
are not totally prejudiced anyway.

I think there is definitely something to the notion that the 3.x
vs. 3.0.y distinction should signal something, and I personally like
MAL's suggestion that 3.0.x should be marked some sort of beta in
perpetuity, or at least until 3.1 is ready to ship as stable and
production-ready.  (That's AIUI, MAL's intent may be somewhat
different.)


Re: [Python-Dev] Python 3.0.1

2009-01-28 Thread Terry Reedy

Stephen J. Turnbull wrote:

Martin v. Löwis writes:
   It might also be a good idea to take the download link off the front
   page of python.org: until that happens newbies are going to keep coming
   along and downloading it because it's the newest.


By that logic, I would suggest removing 2.6 ;-)
See below.

  
  It was (and probably still is) Guido's position that 3.0 *is* the
  version that newbies should be using.

Indeed.  See Terry Reedy's post.


When people ask on c.l.p, I recommend either 3.0 for the relative
cleanliness or 2.5 (until now, at least) for the 3rd-party add-on
availability (that will gradually improve for both 2.6 and, more slowly,
for 3.x).  I expect that some newbies would find 2.6 a somewhat
confusing mix of old and new.


tjr




Re: [Python-Dev] Python 3.0.1

2009-01-28 Thread Steve Holden
Terry Reedy wrote:
 Stephen J. Turnbull wrote:
 Martin v. Löwis writes:
   It might also be a good idea to take the download link off the front
   page of python.org: until that happens newbies are going to keep coming
   along and downloading it because it's the newest.
 
 By that logic, I would suggest removing 2.6 ;-)
 See below.
 
 It was (and probably still is) Guido's position that 3.0 *is* the
   version that newbies should be using.

 Indeed.  See Terry Reedy's post.
 
 When people ask on c.l.p, I recommend either 3.0 for the relative
 cleanliness or 2.5 (until now, at least) for the 3rd-party add-on
 availability (that will gradually improve for both 2.6 and, more slowly,
 for 3.x).  I expect that some newbies would find 2.6 a somewhat
 confusing mix of old and new.
 
Fair point. At least we both agree that the current site doesn't best
serve the punters.

regards
 Steve
-- 
Steve Holden        +1 571 484 6266   +1 800 494 3119
Holden Web LLC  http://www.holdenweb.com/
