Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Martin v. Löwis
>> It *is* crazy, but unfortunately rather common.  Wikipedia has a good
>> description of the issues:
>> .  Basically, some
>> Windows text APIs will emit a UTF-8 "BOM" in order to identify the file as
>> being UTF-8, so it's become a convention to do that.  That's not good
>> enough, so you need to guess the encoding as well to make sure, but if there
>> is a BOM and you can otherwise verify that the file is probably UTF-8
>> encoded, you should discard it.
> 
> That doesn't make sense. If the file isn't UTF-8 you can't see the
> BOM, because the BOM itself is UTF-8-encoded.

I think what Glyph meant is this: if a file starts with the UTF-8
signature, assume it's UTF-8. Then validate the assumption against the
rest of the file also, and then process it as UTF-8. If the rest clearly
is not UTF-8, assume that the UTF-8 signature is bogus.
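
The guideline above can be sketched in a few lines of Python. This is only an illustration of the idea for an application such as a text editor, not something the io library does; the function name decode_with_signature_check and the Latin-1 fallback are assumptions made for the example.

```python
import codecs

def decode_with_signature_check(data, fallback="latin-1"):
    """Trust the UTF-8 signature only if the rest of the data is valid UTF-8.

    The fallback encoding is an arbitrary choice for this sketch.
    """
    if data.startswith(codecs.BOM_UTF8):
        try:
            # Strict decoding validates the whole payload; utf-8-sig drops
            # the signature from the result.
            return data.decode("utf-8-sig"), "utf-8-sig"
        except UnicodeDecodeError:
            # The signature was bogus: treat its bytes as ordinary data.
            pass
    return data.decode(fallback), fallback
```

Data without a signature goes straight to the fallback here, which a real application would probably refine.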

I understood this proposal as a general processing guideline, not
something the io library should do (but, say, a text editor).

FWIW, I'm personally in favor of using the UTF-8 signature. If people
consider it crazy talk, that may be because UTF-8 can't possibly have
a byte order - hence I call it a signature, not the BOM. As a signature,
I don't consider it crazy at all. There is a long tradition of having
magic bytes in files (executable files, PostScript, PDF, ... - see
/etc/magic). Having a magic byte sequence for plain text to denote the
encoding is useful and helps reduce mojibake. This is the reason it's
used on Windows: notepad would normally assume that text is in the ANSI
code page, and for compatibility, it can't stop doing that. So the UTF-8
signature gives them an exit strategy.

Regards,
Martin
___
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Martin v. Löwis
> But it should do something sane when reading such files.  I can't
> really see any harm in throwing it away, especially since use of
> ZERO-WIDTH NO-BREAK SPACE as a joining character has been deprecated
> IIRC.

And indeed it does, when you open the file in the utf-8-sig encoding.
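
For reference, a small sketch of the difference between the two codecs when the data starts with the signature (the sample bytes are made up for the example):

```python
import codecs

data = codecs.BOM_UTF8 + "hello".encode("utf-8")

# Plain utf-8 keeps the signature as a leading U+FEFF character.
with_bom = data.decode("utf-8")

# utf-8-sig discards it.
without_bom = data.decode("utf-8-sig")

print(repr(with_bom))
print(repr(without_bom))
```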

Regards,
Martin


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Victor Stinner
On Friday 08 January 2010 03:23:08, MRAB wrote:
> Guido van Rossum wrote:
> > I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy
> > talk. And for the other two, perhaps it would make more sense to have
> > a separate encoding-guessing function that takes a binary stream and
> > returns a text stream wrapping it with the proper encoding?
> 
> Alternatively, have a universal UTF-8/16/32 encoding, ie one that
> expects UTF-8,
> with or without BOM, or UTF-16/32 with BOM.

Do you mean open(filename, encoding="BOM")? I suppose that "BOM" would be a 
magic value that only applies when reading a text file (open(filename, "r")), 
not a real codec?

Otherwise which encoding should be used for open(filename, "w", 
encoding="BOM")?

-- 
Victor Stinner
http://www.haypocalc.com/


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Martin v. Löwis
> The builtin open() function is unable to open a UTF-16/32 file starting with a 
> BOM if the encoding is not specified (it raises a Unicode error). For a UTF-8 
> file starting with a BOM, read()/readline() also returns the BOM, whereas the 
> BOM should be "ignored".

It depends. If you use the utf-8-sig encoding, it *will* ignore the
UTF-8 signature.

> Since my proposal changes the result of TextIOWrapper.read()/readline() for 
> files starting with a BOM, we might introduce an option to open() to enable 
> the new behaviour. But is it really necessary to keep backward compatibility?

Absolutely. And there is no need to produce a new option, but instead
use the existing options: define an encoding that auto-detects the
encoding from the family of BOMs. Maybe you call it encoding="sniff".
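
A rough sketch of what such a "sniffing" step could look like, using the BOM constants from the codecs module. The name sniff_bom and the returned codec names are choices made for this example; no such function exists in the standard library.

```python
import codecs

# Longest signatures first: the UTF-32-LE BOM begins with the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def sniff_bom(head, default=None):
    """Return a codec name for the BOM at the start of head, or default."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return default
```

The "utf-16", "utf-32" and "utf-8-sig" codecs all consume the BOM themselves when decoding, so the caller does not have to strip it. One ambiguity remains: UTF-16-LE text that starts with a BOM followed by U+0000 looks exactly like a UTF-32-LE BOM.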

Regards,
Martin


Re: [Python-Dev] --enabled-shared broken on freebsd5?

2010-01-08 Thread Martin v. Löwis
Nicholas Bastin wrote:
> I think this problem probably needs to move over to distutils-sig, as
> it doesn't seem to be specific to the way that Python itself uses
> distutils.

I'm fairly skeptical that anybody on distutils SIG is interested in
details of the Python build process...

Regards,
Martin


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Victor Stinner
On Friday 08 January 2010 05:21:04, Guido van Rossum wrote:
(...)
> (And yes, I know this happens. Doesn't mean we need to auto-guess by
> default; there are lots of issues e.g. what should happen after
> seeking to offset 0?)

I wrote a new version of my patch (version 3):

 * don't change the default behaviour: use open(filename, encoding="BOM") to 
check for a BOM, if there is one
 * fix for seek(0): always ignore the BOM
 * add a unit test: check that the right encoding is detected, and also that the 
BOM is ignored (especially after a seek(0))

BOM encoding doesn't work for writing into a file, so open(filename, "w", 
encoding="BOM") raises a ValueError.
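
The read-only behaviour can be imitated outside of open() with a small wrapper. This is not the patch itself: open_bom is a hypothetical helper, it needs an extra read of the first four bytes, and the UTF-8 default for files without a BOM is an assumption.

```python
import codecs

def open_bom(filename, mode="r", default="utf-8"):
    """Hypothetical stand-in for open(filename, encoding="BOM")."""
    if mode != "r":
        # A BOM can only be detected, not chosen, so writing is rejected.
        raise ValueError("BOM detection only works for reading")
    with open(filename, "rb") as raw:
        head = raw.read(4)
    # UTF-32 is tested first because its LE BOM starts with the UTF-16-LE BOM.
    if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        encoding = "utf-32"
    elif head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        encoding = "utf-16"
    elif head.startswith(codecs.BOM_UTF8):
        encoding = "utf-8-sig"
    else:
        encoding = default
    return open(filename, "r", encoding=encoding)
```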

-- 
Victor Stinner
http://www.haypocalc.com/


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Victor Stinner
On Friday 08 January 2010 01:52:20, Guido van Rossum wrote:
> And for the other two, perhaps it would make more sense to have
> a separate encoding-guessing function that takes a binary stream and
> returns a text stream wrapping it with the proper encoding?

I chose to modify open()+TextIOWrapper instead of writing a new function 
because I would like to avoid an extra read operation (syscall) on the file. 
With my implementation, no specific read operation is needed to detect the 
BOM. The BOM is simply checked in the first _read_chunk().
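
The same idea, checking the BOM in the first chunk that is read anyway, can be shown outside of TextIOWrapper with an incremental decoder. This is a simplified sketch and not the code of the patch; iter_decoded is a made-up name, and it assumes the first chunk is long enough to hold a complete BOM.

```python
import codecs
import io

def iter_decoded(raw, chunk_size=8192, default="utf-8"):
    """Yield decoded text from a binary stream, sniffing the BOM in the first chunk."""
    decoder = None
    while True:
        chunk = raw.read(chunk_size)
        if decoder is None:
            # First chunk: pick the codec from its leading bytes, no extra read.
            if chunk.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
                name = "utf-32"
            elif chunk.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
                name = "utf-16"
            elif chunk.startswith(codecs.BOM_UTF8):
                name = "utf-8-sig"
            else:
                name = default
            decoder = codecs.getincrementaldecoder(name)()
        # An empty chunk means end of stream, so flush the decoder.
        text = decoder.decode(chunk, final=not chunk)
        if text:
            yield text
        if not chunk:
            break
```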

-- 
Victor Stinner
http://www.haypocalc.com/


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Victor Stinner
On Friday 08 January 2010 10:10:23, Martin v. Löwis wrote:
> > The builtin open() function is unable to open a UTF-16/32 file starting with
> > a BOM if the encoding is not specified (it raises a Unicode error). For a
> > UTF-8 file starting with a BOM, read()/readline() also returns the BOM,
> > whereas the BOM should be "ignored".
> 
> It depends. If you use the utf-8-sig encoding, it *will* ignore the
> UTF-8 signature.

Sure, but it means that you only use UTF-8+BOM files. If you get both UTF-8 and 
UTF-8+BOM files, you have to detect the encoding (not an easy job) or 
remove the BOM after the first read (much harder if you use a module like 
ConfigParser or csv).
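
One point that may soften this concern: when decoding, the utf-8-sig codec treats the signature as optional, so it reads plain UTF-8 input as well. A small sketch with the csv module (read_rows and the sample data are invented for the example):

```python
import csv
import io

def read_rows(data):
    """Parse CSV bytes that may or may not begin with the UTF-8 signature."""
    # utf-8-sig skips a leading signature if present and otherwise behaves
    # like plain UTF-8.
    text = io.TextIOWrapper(io.BytesIO(data), encoding="utf-8-sig", newline="")
    return list(csv.reader(text))

plain = b"name,city\r\nZo\xc3\xab,Li\xc3\xa8ge\r\n"
signed = b"\xef\xbb\xbf" + plain
```

Both inputs give the same rows, and the first header cell does not pick up a stray U+FEFF.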

> > Since my proposal changes the result of TextIOWrapper.read()/readline()
> > for files starting with a BOM, we might introduce an option to open() to
> > enable the new behaviour. But is it really necessary to keep backward
> > compatibility?
> 
> Absolutely. And there is no need to produce a new option, but instead
> use the existing options: define an encoding that auto-detects the
> encoding from the family of BOMs. Maybe you call it encoding="sniff".

Good idea, I chose open(filename, encoding="BOM").

-- 
Victor Stinner
http://www.haypocalc.com/


Re: [Python-Dev] GIL required for _all_ Python calls?

2010-01-08 Thread Antoine Pitrou
On Thu, 07 Jan 2010 22:11:36 +0100, Martin v. Löwis wrote:
> 
> Even if we do use the new API, and correctly, it still might be
> confusing if the contents of the buffer change underneath.

Well, no more confusing than when you compute a SHA1 hash or zlib-
compress the buffer, is it?

Regards

Antoine




Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Antoine Pitrou
Victor Stinner writes:
> 
> I wrote a new version of my patch (version 3):
> 
>  * don't change the default behaviour: use open(filename, encoding="BOM") to 
> check for a BOM, if there is one

Well, I think if we implement this the default behaviour *should* be changed.
It looks a bit senseless to have two different "auto-choose" options, one with
encoding=None and one with encoding="BOM".

Regards

Antoine.




Re: [Python-Dev] GIL required for _all_ Python calls?

2010-01-08 Thread Guido van Rossum
On Fri, Jan 8, 2010 at 6:27 AM, Antoine Pitrou wrote:
> On Thu, 07 Jan 2010 22:11:36 +0100, Martin v. Löwis wrote:
>>
>> Even if we do use the new API, and correctly, it still might be
>> confusing if the contents of the buffer change underneath.
>
> Well, no more confusing than when you compute a SHA1 hash or zlib-
> compress the buffer, is it?

That depends. Algorithms that make exactly one pass over the buffer
will run fine (maybe producing a meaningless result). But the regex
matcher may scan the buffer repeatedly (for backtracking purposes) and
it would take a considerable analysis to prove that cannot mess up its
internal data structures if the data underneath changes. (I give it a
decent chance that it's fine, but since it was written without ever
considering this possibility I'm not 100% sure.)

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Guido van Rossum
On Thu, Jan 7, 2010 at 11:55 PM, Glyph Lefkowitz wrote:
> I'm saying that the BOM itself isn't enough to detect that the file is 
> actually UTF-8.

And I'm saying that it is, with as much certainty as we can ever guess
the encoding of a file.

> If (for whatever reason: explicitly specified, guessed in some other way) the 
> file's encoding is determined to be something else, the bytes comprising the 
> BOM should be decoded as normal.  It's just that the UTF-8 decoding of the 
> BOM at the start of a file should be "".

Sure, a Latin-1-encoded file could start with the same pattern that is
a UTF-8-encoded BOM. But at that point, a UTF-16-encoded file is also
valid Latin-1.

The question was in the context of encoding-guessing; if we're
guessing, a UTF-8-encoded BOM cannot signify anything else but UTF-8.
(Ditto for UTF-16 and UTF-32 BOMs.)
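
To make the Latin-1 point concrete, a short sketch (the sample strings are arbitrary):

```python
import codecs

# The three signature bytes are also three printable Latin-1 characters.
as_latin1 = codecs.BOM_UTF8.decode("latin-1")

# Every byte string decodes as Latin-1, including UTF-16 data with its BOM,
# so "valid Latin-1" says nothing about the real encoding.
utf16_bytes = "hi".encode("utf-16")
also_latin1 = utf16_bytes.decode("latin-1")

print(repr(as_latin1))
print(repr(also_latin1))
```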

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Guido van Rossum
On Fri, Jan 8, 2010 at 6:34 AM, Antoine Pitrou wrote:
> Victor Stinner writes:
>>
>> I wrote a new version of my patch (version 3):
>>
>>  * don't change the default behaviour: use open(filename, encoding="BOM") to
>> check for a BOM, if there is one
>
> Well, I think if we implement this the default behaviour *should* be changed.
> It looks a bit senseless to have two different "auto-choose" options, one with
> encoding=None and one with encoding="BOM".

Well there *are* two different auto options: use the environment
variables (LANG etc.) or inspect the contents of the file. I think it
would be useful to have ways to specify both.

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Guido van Rossum
On Fri, Jan 8, 2010 at 1:05 AM, "Martin v. Löwis" wrote:
>>> It *is* crazy, but unfortunately rather common.  Wikipedia has a good
>>> description of the issues:
>>> .  Basically, some
>>> Windows text APIs will emit a UTF-8 "BOM" in order to identify the file as
>>> being UTF-8, so it's become a convention to do that.  That's not good
>>> enough, so you need to guess the encoding as well to make sure, but if there
>>> is a BOM and you can otherwise verify that the file is probably UTF-8
>>> encoded, you should discard it.
>>
>> That doesn't make sense. If the file isn't UTF-8 you can't see the
>> BOM, because the BOM itself is UTF-8-encoded.
>
> I think what Glyph meant is this: if a file starts with the UTF-8
> signature, assume it's UTF-8. Then validate the assumption against the
> rest of the file also, and then process it as UTF-8. If the rest clearly
> is not UTF-8, assume that the UTF-8 signature is bogus.
>
> I understood this proposal as a general processing guideline, not
> something the io library should do (but, say, a text editor).
>
> FWIW, I'm personally in favor of using the UTF-8 signature. If people
> consider them crazy talk, that may be because UTF-8 can't possibly have
> a byte order - hence I call it a signature, not the BOM. As a signature,
> I don't consider it crazy at all. There is a long tradition of having
> magic bytes in files (executable files, Postscript, PDF, ... - see
> /etc/magic). Having a magic byte sequence for plain text to denote the
> encoding is useful and helps reducing moji-bake. This is the reason it's
> used on Windows: notepad would normally assume that text is in the ANSI
> code page, and for compatibility, it can't stop doing that. So the UTF-8
> signature gives them an exit strategy.

Sure. I said "crazy talk" only to stir up discussion. Which worked. :-)

Also, I don't want Python's default behavior to change -- sniffing the
encoding should be a separate option.

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Guido van Rossum
On Thu, Jan 7, 2010 at 10:12 PM, Tres Seaver wrote:
> The BOM should not be seekable if the file is opened with the proposed
> "guess encoding from BOM" mode:  it isn't properly part of the stream at
> all in that case.

This feels about right to me. There are still questions though:
immediately after opening a file with a BOM, what should .tell()
return? And regardless of that, .seek(0) should put the file in that
same initial state.

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Antoine Pitrou
Guido van Rossum writes:
> 
> > Well, I think if we implement this the default behaviour *should* be 
> > changed.
> > It looks a bit senseless to have two different "auto-choose" options, one
> > with encoding=None and one with encoding="BOM".
> 
> Well there *are* two different auto options: use the environment
> variables (LANG etc.) or inspect the contents of the file. I think it
> would be useful to have ways to specify both.

Yes, perhaps. In the context of open() however I think it would be helpful to
change the default.
Moreover, reading the BOM is certainly much more reliable than our current
guessing based on the locale or the "device encoding".

Regards

Antoine.




Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread M.-A. Lemburg
Guido van Rossum wrote:
> On Fri, Jan 8, 2010 at 6:34 AM, Antoine Pitrou wrote:
>> Victor Stinner writes:
>>>
>>> I wrote a new version of my patch (version 3):
>>>
>>>  * don't change the default behaviour: use open(filename, encoding="BOM") to
>>> check for a BOM, if there is one
>>
>> Well, I think if we implement this the default behaviour *should* be changed.
>> It looks a bit senseless to have two different "auto-choose" options, one
>> with encoding=None and one with encoding="BOM".
> 
> Well there *are* two different auto options: use the environment
> variables (LANG etc.) or inspect the contents of the file. I think it
> would be useful to have ways to specify both.

Shouldn't this encoding guessing be a separate function that you call
on either a file or a seekable stream ?

After all, detecting encodings is just as useful to have for non-file
streams. You'd then avoid having to stuff everything into
a single function call and also open up the door for more complex
application specific guess work or defaults.

The whole process would then have two steps:

 1. guess encoding

  import codecs
  encoding = codecs.guess_file_encoding(filename)

 2. open the file with the found encoding

  f = open(filename, encoding=encoding)

For seekable streams f, you'd have:

 1. guess encoding

  import codecs
  encoding = codecs.guess_stream_encoding(f)

 2. wrap the stream with a reader for the found encoding

  reader_class = codecs.getreader(encoding)
  g = reader_class(f)
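
Neither guess_file_encoding nor guess_stream_encoding exists in the codecs module; they are proposed names. A minimal sketch of how they might be written, assuming BOM detection only and a UTF-8 default when no BOM is found:

```python
import codecs

_SIGNATURES = (
    # UTF-32 first: its LE BOM starts with the UTF-16-LE BOM.
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)

def guess_stream_encoding(stream, default="utf-8"):
    """Hypothetical helper: look at a seekable binary stream, then rewind."""
    start = stream.tell()
    head = stream.read(4)
    stream.seek(start)
    for signature, name in _SIGNATURES:
        if head.startswith(signature):
            return name
    return default

def guess_file_encoding(filename, default="utf-8"):
    """Hypothetical helper: the same guess for a file on disk."""
    with open(filename, "rb") as stream:
        return guess_stream_encoding(stream, default)
```

Because the stream is rewound, the reader from codecs.getreader(encoding) sees the BOM again and consumes it itself.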

-- 
Marc-Andre Lemburg
eGenix.com


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Antoine Pitrou
Guido van Rossum writes:
> 
> On Thu, Jan 7, 2010 at 10:12 PM, Tres Seaver wrote:
> > The BOM should not be seekable if the file is opened with the proposed
> > "guess encoding from BOM" mode:  it isn't properly part of the stream at
> > all in that case.
> 
> This feels about right to me. There are still questions though:
> immediately after opening a file with a BOM, what should .tell()
> return?

tell() in the context of text I/O is specified to return an "opaque cookie". So
whatever value it returns would probably be fine, as long as seeking to that
value leaves the file in an acceptable state.

Rewinding (seeking to 0) in the presence of a BOM is already reasonably
supported by the TextIOWrapper object:

>>> import codecs, io
>>> dec = codecs.getincrementaldecoder('utf-16')()
>>> dec.decode(b'\xff\xfea\x00b\x00')
'ab'
>>> dec.decode(b'\xff\xfea\x00b\x00')
'\ufeffab'
>>> 
>>> bio = io.BytesIO(b'\xff\xfea\x00b\x00')
>>> f = io.TextIOWrapper(bio, encoding='utf-16')
>>> f.read()
'ab'
>>> f.seek(0)
0
>>> f.read()
'ab'

There are tests for this in test_io.py (test_encoded_writes, line 1929, and
test_append_bom and test_seek_bom, line 2045).

Regards

Antoine.




Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread MRAB

Victor Stinner wrote:
> On Friday 08 January 2010 05:21:04, Guido van Rossum wrote:
> (...)
>> (And yes, I know this happens. Doesn't mean we need to auto-guess by
>> default; there are lots of issues e.g. what should happen after
>> seeking to offset 0?)
>
> I wrote a new version of my patch (version 3):
>
>  * don't change the default behaviour: use open(filename, encoding="BOM") to
>    check for a BOM, if there is one
>  * fix for seek(0): always ignore the BOM
>  * add a unit test: check that the right encoding is detected, and also that
>    the BOM is ignored (especially after a seek(0))
>
> BOM encoding doesn't work for writing into a file, so open(filename, "w",
> encoding="BOM") raises a ValueError.

I think it's similar to universal newline mode. You can tell it that
you're reading UTF-something-encoded text (common forms only).

The preference is UTF-8, but it could be UTF-8-sig (from Windows), or
possibly UTF-16/32, which really need a BOM because there are multiple
bytes per codepoint, so the bytes could be big-endian or little-endian.

The BOM (or signature) tells you what the encoding is, defaulting to
UTF-8 if there's none. If it subsequently raises a DecodeError, then
so be it!

Maybe there should also be a way of determining what encoding it decided
it was, so that you can then write a new file in that same encoding.
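
A sketch of that last idea: return the detected encoding together with the text, so the caller can write the result back the same way. The function name decode_utf_any is made up, and the error raised for invalid data is Python's UnicodeDecodeError.

```python
import codecs

def decode_utf_any(data):
    """Return (text, encoding); UTF-8 is assumed when there is no BOM."""
    if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        encoding = "utf-32"
    elif data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        encoding = "utf-16"
    elif data.startswith(codecs.BOM_UTF8):
        encoding = "utf-8-sig"
    else:
        encoding = "utf-8"
    # Invalid data raises UnicodeDecodeError here: "so be it".
    return data.decode(encoding), encoding

original = codecs.BOM_UTF8 + "caf\u00e9".encode("utf-8")
text, encoding = decode_utf_any(original)

# Re-encoding with the reported codec writes the signature again.
rewritten = (text + "!").encode(encoding)
```

One caveat: the "utf-16" and "utf-32" codecs write their BOM in the machine's native byte order, which may differ from the byte order of the file that was read.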


[Python-Dev] Summary of Python tracker Issues

2010-01-08 Thread Python tracker

ACTIVITY SUMMARY (01/01/10 - 01/08/10)
Python tracker at http://bugs.python.org/

To view or respond to any of the issues listed below, click on the issue 
number.  Do NOT respond to this message.


 2544 open (+27) / 16937 closed (+15) / 19481 total (+42)

Open issues with patches:  1017

Average duration of open issues: 708 days.
Median duration of open issues: 464 days.

Open Issues Breakdown
   open     2509 (+27)
   pending    34 ( +0)

Issues Created Or Reopened (43)
___

Extended slicing with classic class behaves strangely  01/07/10
       http://bugs.python.org/issue7532    reopened mark.dickinson
       patch

optparse library documentation has an insignificant formatting i  01/01/10
CLOSED http://bugs.python.org/issue7618    created  vazovsky
       patch

imaplib shouldn't use cause DeprecationWarnings in 2.6  01/01/10
CLOSED http://bugs.python.org/issue7619    created  djc

Vim syntax highlight  01/02/10
       http://bugs.python.org/issue7620    created  july
       patch

Test issue  01/02/10
CLOSED http://bugs.python.org/issue7621    created  georg.brandl

[patch] improve unicode methods: split() rsplit() and replace()  01/03/10
       http://bugs.python.org/issue7622    created  flox
       patch

PropertyType missing in Lib/types.py  01/03/10
CLOSED http://bugs.python.org/issue7623    created  wplappert

isinstance(... ,collections.Callable) fails with oldstyle class  01/03/10
       http://bugs.python.org/issue7624    created  rgammans

bytearray needs more tests for "b.some_method()[0] is not b"  01/03/10
       http://bugs.python.org/issue7625    created  flox
       patch

Entity references without semicolon in HTMLParser  01/03/10
CLOSED http://bugs.python.org/issue7626    created  stefan.schweizer

mailbox.MH.remove() lock handling is broken  01/04/10
       http://bugs.python.org/issue7627    created  sraustein

round() doesn't work correctly in 3.1.1  01/04/10
CLOSED http://bugs.python.org/issue7628    created  bkovt

Compiling with mingw32 gcc, content of variable "extra_postargs"  01/04/10
CLOSED http://bugs.python.org/issue7629    created  popelkopp

Strange behaviour of decimal.Decimal  01/04/10
CLOSED http://bugs.python.org/issue7630    created  parmax

undefined label: bltin-file-objects  01/04/10
CLOSED http://bugs.python.org/issue7631    created  ezio.melotti

dtoa.c: oversize b in quorem  01/04/10
       http://bugs.python.org/issue7632    created  skrah

decimal.py: type conversion in context methods  01/04/10
       http://bugs.python.org/issue7633    created  skrah
       patch, easy

next/previous links in documentation skip some sections  01/05/10
CLOSED http://bugs.python.org/issue7634    created  gagenellina

19.6 xml.dom.pulldom doc: stub?  01/05/10
       http://bugs.python.org/issue7635    created  tjreedy

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Tres Seaver

Guido van Rossum wrote:
> On Thu, Jan 7, 2010 at 10:12 PM, Tres Seaver wrote:
>> The BOM should not be seekable if the file is opened with the proposed
>> "guess encoding from BOM" mode:  it isn't properly part of the stream at
>> all in that case.
> 
> This feels about right to me. There are still questions though:
> immediately after opening a file with a BOM, what should .tell()
> return? And regardless of that, .seek(0) should put the file in that
> same initial state.

I think the behavior should be something like:

 >>> f = open('/path/to/maybe-BOM-encoded-file', 'r', encoding='BOM')
 >>> f.tell()
 0L
 >>> f.seek(-1)
 >>> f.tell() # count of unicode chars in decoded stream
 45L
 >>> f.seek(0)
 >>> f.read(1) # read first unicode char decoded from stream.
 'A'

In other words, the BOM is not readable / seekable at all:  it is
invisible to the consumer of the decoded stream.
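
For UTF-8 at least, the existing utf-8-sig codec already gives roughly this behaviour through TextIOWrapper. A short sketch with an in-memory file (the sample bytes are arbitrary); note that a text-mode tell() returns an opaque position rather than a character count, so only the value at the very start is checked here.

```python
import io

raw = io.BytesIO(b"\xef\xbb\xbfABC")
f = io.TextIOWrapper(raw, encoding="utf-8-sig")

start = f.tell()   # position before anything has been read
first = f.read(1)  # first decoded character, not the signature
f.seek(0)          # rewind to the initial state
again = f.read()   # the signature stays invisible after rewinding
```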


Tres.
- --
===
Tres Seaver  +1 540-429-0999  [email protected]
Palladion Software   "Excellence by Design"http://palladion.com



Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Tres Seaver

M.-A. Lemburg wrote:

> Shouldn't this encoding guessing be a separate function that you call
> on either a file or a seekable stream ?
> 
> After all, detecting encodings is just as useful to have for non-file
> streams.

Other stream sources typically have out-of-band ways to signal the
encoding:  only when reading from the filesystem do we pretty much
*have* to guess, and in that case the BOM / signature is the best
heuristic we have.  Also, some non-file streams are not seekable, and so
can't be guessed via a pre-pass.
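
For what it is worth, a buffered reader can look at the first bytes of a non-seekable source without a separate pre-pass, because peek() does not consume them. The sketch below is an illustration only: OneWay is a made-up stand-in for a pipe or socket, wrap_sniffed is a hypothetical helper, and peek() may return fewer than four bytes if the source delivers data slowly.

```python
import codecs
import io

class OneWay(io.RawIOBase):
    """Hypothetical non-seekable byte source, standing in for a pipe or socket."""

    def __init__(self, data):
        self._inner = io.BytesIO(data)

    def readable(self):
        return True

    def readinto(self, b):
        return self._inner.readinto(b)

def wrap_sniffed(raw, default="utf-8"):
    """Wrap a raw binary stream in a text reader chosen from its BOM."""
    buffered = io.BufferedReader(raw)
    # peek() returns buffered bytes without consuming them, so no seek is needed.
    head = buffered.peek(4)[:4]
    if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        encoding = "utf-32"
    elif head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        encoding = "utf-16"
    elif head.startswith(codecs.BOM_UTF8):
        encoding = "utf-8-sig"
    else:
        encoding = default
    return io.TextIOWrapper(buffered, encoding=encoding)
```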

> You'd then avoid having to stuff everything into
> a single function call and also open up the door for more complex
> application specific guess work or defaults.
> 
> The whole process would then have two steps:
> 
>  1. guess encoding
> 
>   import codecs
>   encoding = codecs.guess_file_encoding(filename)

Filename is not enough information:  or do you mean that API to actually
open the stream?

>  2. open the file with the found encoding
> 
>   f = open(filename, encoding=encoding)
> 
> For seekable streams f, you'd have:
> 
>  1. guess encoding
> 
>   import codecs
>   encoding = codecs.guess_stream_encoding(f)
> 
>  2. wrap the stream with a reader for the found encoding
> 
>   reader_class = codecs.getreader(encoding)
>   g = reader_class(f)
> 


Tres.
- --
===
Tres Seaver  +1 540-429-0999  [email protected]
Palladion Software   "Excellence by Design"http://palladion.com



Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Tres Seaver

Martin v. Löwis wrote:

>>> It *is* crazy, but unfortunately rather common.  Wikipedia has a good
>>> description of the issues:
>>> .  Basically, some
>>> Windows text APIs will emit a UTF-8 "BOM" in order to identify the file as
>>> being UTF-8, so it's become a convention to do that.  That's not good
>>> enough, so you need to guess the encoding as well to make sure, but if there
>>> is a BOM and you can otherwise verify that the file is probably UTF-8
>>> encoded, you should discard it.
>> That doesn't make sense. If the file isn't UTF-8 you can't see the
>> BOM, because the BOM itself is UTF-8-encoded.
> 
> I think what Glyph meant is this: if a file starts with the UTF-8
> signature, assume it's UTF-8. Then validate the assumption against the
> rest of the file also, and then process it as UTF-8. If the rest clearly
> is not UTF-8, assume that the UTF-8 signature is bogus.

If the programmer opens the file using a "guess using the BOM" encoding,
 Python should *not* attempt to verify that the file is properly
encoded:  it should check for (and consume) any BOM, and then return a
stream which uses the encoding inferred from the BOM.  Any errors should
be handled later, when characters are read, exactly as if the file had
been opened with the same encoding guessed from the BOM.
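
This matches how an explicitly chosen encoding behaves today: wrapping the stream succeeds, and a decoding error only appears once data is read. A small sketch with utf-8-sig and deliberately broken sample data:

```python
import io

# A UTF-8 signature followed by text and then a byte that is invalid in UTF-8.
raw = io.BytesIO(b"\xef\xbb\xbfgood line\n\xff\n")

# Wrapping succeeds: nothing past the start of the stream is inspected here.
f = io.TextIOWrapper(raw, encoding="utf-8-sig")

error = None
try:
    f.read()
except UnicodeDecodeError as exc:
    # The problem surfaces only when the characters are actually decoded.
    error = exc
```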

> I understood this proposal as a general processing guideline, not
> something the io library should do (but, say, a text editor).
> 
> FWIW, I'm personally in favor of using the UTF-8 signature. If people
> consider them crazy talk, that may be because UTF-8 can't possibly have
> a byte order - hence I call it a signature, not the BOM. As a signature,
> I don't consider it crazy at all. There is a long tradition of having
> magic bytes in files (executable files, Postscript, PDF, ... - see
> /etc/magic). Having a magic byte sequence for plain text to denote the
> encoding is useful and helps reducing moji-bake. This is the reason it's
> used on Windows: notepad would normally assume that text is in the ANSI
> code page, and for compatibility, it can't stop doing that. So the UTF-8
> signature gives them an exit strategy.

Agreed.  Having that marker at the start of the file makes interop with
other tools *much* easier.


Tres.
-- 
Tres Seaver  +1 540-429-0999
Palladion Software   "Excellence by Design"   http://palladion.com



Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Eric Smith
>> Shouldn't this encoding guessing be a separate function that you call
>> on either a file or a seekable stream ?
>>
>> After all, detecting encodings is just as useful to have for non-file
>> streams.
>
> Other stream sources typically have out-of-band ways to signal the
> encoding:  only when reading from the filesystem do we pretty much
> *have* to guess, and in that case the BOM / signature is the best
> heuristic we have.  Also, some non-file streams are not seekable, and so
> can't be guessed via a pre-pass.

But what if the file were in (for example) a zip file? I think you
definitely want to have access to this functionality outside of open().

Eric.



Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread James Y Knight

On Jan 8, 2010, at 4:14 PM, Tres Seaver wrote:

>> I understood this proposal as a general processing guideline, not
>> something the io library should do (but, say, a text editor).
>>
>> FWIW, I'm personally in favor of using the UTF-8 signature. If people
>> consider them crazy talk, that may be because UTF-8 can't possibly have
>> a byte order - hence I call it a signature, not the BOM. As a signature,
>> I don't consider it crazy at all. There is a long tradition of having
>> magic bytes in files (executable files, Postscript, PDF, ... - see
>> /etc/magic). Having a magic byte sequence for plain text to denote the
>> encoding is useful and helps reducing moji-bake. This is the reason it's
>> used on Windows: notepad would normally assume that text is in the ANSI
>> code page, and for compatibility, it can't stop doing that. So the UTF-8
>> signature gives them an exit strategy.
>
> Agreed.  Having that marker at the start of the file makes interop with
> other tools *much* easier.


Putting the BOM at the beginning of UTF-8 text files is not a good
idea: it makes interop much *worse* on a unix system, not better.
Without the BOM, most commands do the right thing with UTF-8 text.
E.g. to concatenate two files:

$ cat file-1 file-2 > file-3

With a BOM at the beginning of each file, it won't work right: a stray
BOM ends up in the middle of file-3. Of course, you could modify "cat"
(and every other stream processing command) to know how to consume and
emit BOMs, and omit the extra one that would show up in the middle of
the stream... but even that can't work; what about:

$ (cat file-1; cat file-2) > file-3

Should the shell now know that when you run multiple commands, it
should eat the BOM emitted from the second command?

Basically, using a BOM in a UTF-8 file is just not a good idea: it
completely ruins interop with every standard unix tool.

This is not to say that Python shouldn't have a way to read a file
with a UTF-8 BOM: it just shouldn't encourage you to *write* such files.
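
To make the concatenation problem concrete, here is a minimal Python
sketch. The byte strings are only stand-ins for the contents of file-1
and file-2; nothing here comes from the patch under discussion. It uses
the standard "utf-8-sig" codec, which drops a signature at the start of
the data but not one in the middle:

```python
import codecs

# Stand-ins for the contents of file-1 and file-2, each written with a
# UTF-8 signature at the front.
file_1 = codecs.BOM_UTF8 + "one\n".encode("utf-8")
file_2 = codecs.BOM_UTF8 + "two\n".encode("utf-8")

# What "cat file-1 file-2 > file-3" produces: a byte-wise concatenation.
file_3 = file_1 + file_2

# Reading a single BOM-prefixed file is easy: utf-8-sig drops the
# leading signature.
single = file_1.decode("utf-8-sig")
print(repr(single))

# In the concatenated data only the first signature is dropped; the
# second one survives as U+FEFF in the middle of the text.
text = file_3.decode("utf-8-sig")
print(repr(text))
```

So reading such files is cheap to support, while every tool that joins
or splits streams would have to learn about the embedded U+FEFF.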


James


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread M.-A. Lemburg
Tres Seaver wrote:
> M.-A. Lemburg wrote:
> 
>> Shouldn't this encoding guessing be a separate function that you call
>> on either a file or a seekable stream ?
> 
>> After all, detecting encodings is just as useful to have for non-file
>> streams.
> 
> Other stream sources typically have out-of-band ways to signal the
> encoding:  only when reading from the filesystem do we pretty much
> *have* to guess, and in that case the BOM / signature is the best
> heuristic we have.  Also, some non-file streams are not seekable, and so
> can't be guessed via a pre-pass.

Sure there are non-seekable file streams, but at least when
using StringIO-type streams you don't have that restriction.

An encoding detection function would be useful in other
cases as well, so instead of hiding away the functionality in
the open() constructor, I'm suggesting we expose it via
the codecs module.

Applications can then use it where necessary and also provide their
own defaults, using other heuristics.

>> You'd then avoid having to stuff everything into
>> a single function call and also open up the door for more complex
>> application specific guess work or defaults.
> 
>> The whole process would then have two steps:
> 
>>  1. guess encoding
> 
>>   import codecs
>>   encoding = codecs.guess_file_encoding(filename)
> 
> Filename is not enough information:  or do you mean that API to actually
> open the stream?

Yes. The API would open the file, guess the encoding and then
close it again. If you don't want that to happen, you could use
the second API I mentioned below on the already open file.

Note that this function could detect a lot more encodings with
reasonably high probability than just BOM-prefixed ones,
e.g. we could also add support to detect encoding declarations
such as the ones we use in Python source files.
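
For Python source files specifically, Python 3 already ships this kind
of detection as tokenize.detect_encoding(), which looks for a UTF-8
signature and for a coding declaration in the first two lines. A small
illustration (the sample source bytes and the helper name
source_encoding are made up for the example):

```python
import codecs
import io
import tokenize


def source_encoding(data):
    """Return the encoding tokenize detects for bytes holding Python source."""
    encoding, _first_lines = tokenize.detect_encoding(io.BytesIO(data).readline)
    return encoding


with_cookie = b"# -*- coding: latin-1 -*-\nx = 1\n"
with_signature = codecs.BOM_UTF8 + b"x = 1\n"
plain = b"x = 1\n"

print(source_encoding(with_cookie))     # declaration found in the first line
print(source_encoding(with_signature))  # UTF-8 signature found
print(source_encoding(plain))           # neither: falls back to the default
```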

>>  2. open the file with the found encoding
> 
>>   f = open(filename, encoding=encoding)
> 
>> For seekable streams f, you'd have:
> 
>>  1. guess encoding
> 
>>   import codecs
>>   encoding = codecs.guess_stream_encoding(f)

I forgot to mention: This API needs to position the file pointer
to the start of the first data byte.

>>  2. wrap the stream with a reader for the found encoding
> 
>>   reader_class = codecs.getreader(encoding)
>>   g = reader_class(f)
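
A rough sketch of what the stream variant could look like, to make the
two steps concrete. Note that guess_stream_encoding() is only a name
proposed in this thread; it does not exist in the codecs module, and
the "default" parameter is an assumption added for the example. The
sketch only looks at BOMs and leaves the stream positioned on the first
data byte:

```python
import codecs
import io

# UTF-32 is checked before UTF-16 because BOM_UTF32_LE starts with the
# same two bytes as BOM_UTF16_LE.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def guess_stream_encoding(f, default=None):
    """Hypothetical helper: detect a BOM on a seekable binary stream.

    Returns the encoding implied by the BOM and leaves the stream just
    after it; without a BOM, rewinds and returns `default`.
    """
    start = f.tell()
    head = f.read(4)
    for bom, encoding in _BOMS:
        if head.startswith(bom):
            f.seek(start + len(bom))
            return encoding
    f.seek(start)
    return default


# Step 1: guess the encoding; step 2: wrap the stream with a reader.
f = io.BytesIO(b"\xff\xfea\x00b\x00")
encoding = guess_stream_encoding(f)
reader = codecs.getreader(encoding)(f)
decoded = reader.read()
print(encoding, decoded)
```

Because the BOM has already been consumed, the sketch returns the
endian-specific codec names rather than "utf-16" or "utf-8-sig".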

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Tres Seaver
Eric Smith wrote:
>>> Shouldn't this encoding guessing be a separate function that you call
>>> on either a file or a seekable stream ?
>>>
>>> After all, detecting encodings is just as useful to have for non-file
>>> streams.
>> Other stream sources typically have out-of-band ways to signal the
>> encoding:  only when reading from the filesystem do we pretty much
>> *have* to guess, and in that case the BOM / signature is the best
>> heuristic we have.  Also, some non-file streams are not seekable, and so
>> can't be guessed via a pre-pass.
> 
> But what if the file were in (for example) a zip file? I think you
> definitely want to have access to this functionality outside of open().

If the application expects a possibly-BOM-signature-marked file, but you
pass it mismatched garbage:

  >>> f = open('some.zip', encoding='BOM')

the error handling should be the same as if you passed any other
mismatched encoding:

  >>> f = open('some.zip', encoding='UTF8')

i.e., you discover the error when you try to read from the (non)encoded
stream, not when you open it.
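
The existing behaviour for an explicit encoding can be checked with a
short script. This is only an illustration of how open() behaves today
with encoding='utf-8'; the temporary file and its contents are made up,
and encoding='BOM' itself exists only in the proposed patch:

```python
import os
import tempfile

# Bytes that are not valid UTF-8: the start of a zip header followed by
# a few high bytes.
garbage = b"PK\x03\x04\xff\xfe\x80\x81"

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(garbage)
    path = tmp.name

try:
    f = open(path, encoding="utf-8")  # no error here: nothing is decoded yet
    opened_ok = True
    try:
        f.read()
        read_failed = False
    except UnicodeDecodeError:
        read_failed = True            # the mismatch only surfaces on read
    finally:
        f.close()
finally:
    os.remove(path)

print(opened_ok, read_failed)
```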


Tres.
-- 
Tres Seaver  +1 540-429-0999
Palladion Software   "Excellence by Design"   http://palladion.com



Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Victor Stinner
On Friday 08 January 2010 22:40:47, Eric Smith wrote:
> >> Shouldn't this encoding guessing be a separate function that you call
> >> on either a file or a seekable stream ?
> >>
> >> After all, detecting encodings is just as useful to have for non-file
> >> streams.
> >
> > Other stream sources typically have out-of-band ways to signal the
> > encoding:  only when reading from the filesystem do we pretty much
> > *have* to guess, and in that case the BOM / signature is the best
> > heuristic we have.  Also, some non-file streams are not seekable, and so
> > can't be guessed via a pre-pass.
> 
> But what if the file were in (for example) a zip file? I think you
> definitely want to have access to this functionality outside of open().

FYI my patch (encoding="BOM") is implemented in TextIOWrapper, and 
TextIOWrapper takes a binary stream as input, not a filename.
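
As an illustration of the zip case with what the standard library
offers today: encoding="BOM" exists only in the patch, so this sketch
uses the stock "utf-8-sig" codec instead, and the archive and member
name are made up for the example.

```python
import codecs
import io
import zipfile

# Build a small zip archive in memory with one BOM-prefixed UTF-8 member.
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("notes.txt", codecs.BOM_UTF8 + "h\xe9llo\n".encode("utf-8"))

# ZipFile.open() returns a binary stream; TextIOWrapper decodes it, so no
# filename on disk is involved.
with zipfile.ZipFile(archive) as zf:
    with zf.open("notes.txt") as raw:
        text = io.TextIOWrapper(raw, encoding="utf-8-sig").read()

print(ascii(text))
```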

-- 
Victor Stinner
http://www.haypocalc.com/


Re: [Python-Dev] relation between Python.asdl and Tools/compiler/ast.txt

2010-01-08 Thread Yoann Padioleau

On Jan 7, 2010, at 1:16 PM, Martin v. Löwis wrote:

>>> astgen.py is not used to process asdl files; ast.txt lives right
>>> next to astgen.py. Instead, the asdl file is processed by
>>> Parser/asdl_c.py.
>> 
>> Yes, I know that. That's why I asked about the relation between
>> ast.txt and Python.asdl. If internally the parser uses the .asdl, but
>> exposes as a reflection mechanism things that were generated from
>> ast.txt, then there could be a mismatch. Where does ast.txt come
>> from? Shouldn't it be generated itself from Python.asdl?
> 
> What you may not be aware of is that Tools/compiler (and the
> compiler package that it builds on) are both unused and unmaintained.

I see. So if people want to analyze Python code, they have to use
other tools (like rope?) or use reflection?

> 
> If the package stops working correctly - tough luck.
> 
>> So we would have
>> 
>> Python.asdl ---> ast.txt ---> astgen.py ---> ast.py
>> containing all the UnarySub, Expression, classes that represents a
>> Python AST.
> 
> No - what actually happens in Python 3.x is this: both the compiler
> package and Tools/compiler are removed.

Ok. I will then create my own ast classes generator.

Thanks.


> 
> Regards,
> Martin



Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Georg Brandl
On 08.01.2010 22:14, Tres Seaver wrote:

>> FWIW, I'm personally in favor of using the UTF-8 signature. If people
>> consider them crazy talk, that may be because UTF-8 can't possibly have
>> a byte order - hence I call it a signature, not the BOM. As a signature,
>> I don't consider it crazy at all. There is a long tradition of having
>> magic bytes in files (executable files, Postscript, PDF, ... - see
>> /etc/magic). Having a magic byte sequence for plain text to denote the
>> encoding is useful and helps reducing moji-bake. This is the reason it's
>> used on Windows: notepad would normally assume that text is in the ANSI
>> code page, and for compatibility, it can't stop doing that. So the UTF-8
>> signature gives them an exit strategy.
> 
> Agreed.  Having that marker at the start of the file makes interop with
> other tools *much* easier.

Except if only 50% of the other tools support the signature.

Georg

-- 
Thus spake the Lord: Thou shalt indent with four spaces. No more, no less.
Four shall be the number of spaces thou shalt indent, and the number of thy
indenting shall be four. Eight shalt thou not indent, nor either indent thou
two, excepting that thou then proceed to four. Tabs are right out.



Re: [Python-Dev] relation between Python.asdl and Tools/compiler/ast.txt

2010-01-08 Thread Martin v. Löwis
> I see. So if people want to analyze python code they have to use
> other tools (like rope?) ? or use reflection ?

Correct. One such tool might be the true Python compiler, along
with the _ast module.
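
For example, the ast module (the public wrapper around _ast) gives
access to the tree the real compiler builds. A small sketch; the sample
source line is arbitrary:

```python
import ast

source = "x = -y + 1\n"

# ast.parse() runs the real compiler front end and returns its AST.
tree = ast.parse(source)

# compile() with the PyCF_ONLY_AST flag is the same operation spelled out.
same_tree = compile(source, "<example>", "exec", ast.PyCF_ONLY_AST)

# Walk the tree and collect the node class names.
node_types = [type(node).__name__ for node in ast.walk(tree)]
print(node_types)
print(ast.dump(tree) == ast.dump(same_tree))
```

The node classes here are the ones generated from Python.asdl, so a
unary minus shows up as UnaryOp with a USub operator rather than as the
UnarySub class of the old compiler package.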

Regards,
Martin


[Python-Dev] Quick sum up about open() + BOM

2010-01-08 Thread Victor Stinner
Hi,

Thanks for all the answers! I will try to sum up all ideas here.


(1) Change default open() behaviour or make it optional?

Guido would like to add an option and keep open() unchanged. He wrote that 
checking for a BOM and using the system locale are too different to be the 
same option (encoding=None).

Antoine would like to check for a BOM by default, because both options (system 
locale vs checking for BOM) are the same thing.

About Antoine's choice (encoding=None): which file modes would check for a BOM? 
I would like to answer "only the read-only mode", but then open(filename, "r") 
and open(filename, "r+") would behave differently?

  => 1 point for Guido (encoding="BOM" is more explicit)

Antoine's choice has the advantage of directly supporting UTF-8+BOM, UTF-16 and 
UTF-32 for all applications and all modules using open(filename).

  => 1 point for Antoine


(2) Check for a BOM while reading or detect it before?

Everybody agrees that checking for a BOM is an interesting option and should 
not be limited to open().

Marc-Andre proposed a codecs.guess_file_encoding() function accepting a file 
name or a binary file-like object: it returns the encoding and seeks to the 
file start or just after the BOM.

I dislike this function because it requires extra file operations (open 
(optional), read() and seek()) and it doesn't work if the file is not seekable 
(e.g. a pipe). I prefer to check for a BOM at first read in TextIOWrapper to 
avoid extra file operations.

Note: I implemented the BOM check in TextIOWrapper, so it's already usable for 
any file-like object.


(3) tell() and seek() on a text file starting with a BOM

To be consistent with Antoine's example:

   >>> bio = io.BytesIO(b'\xff\xfea\x00b\x00')
   >>> f = io.TextIOWrapper(bio, encoding='utf-16')
   >>> f.read()
   'ab'
   >>> f.seek(0)
   0
   >>> f.read()
   'ab'

TextIOWrapper:

 * tell() should return zero at file start,
 * seek(0) should go back to the file start,
 * and the BOM should always be "ignored".

I mean:

  with open("utf8bom.txt", encoding="BOM") as fp:
      assert fp.tell() == 0
      text = fp.read() # no BOM here
      fp.seek(0)
      assert fp.read() == text
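
The same expectations can be tried without the patch by substituting
the existing "utf-8-sig" codec for encoding="BOM" (an assumption made
only so that the snippet runs on an unpatched Python) and an in-memory
stream for utf8bom.txt:

```python
import codecs
import io

# A UTF-8 file with a signature, held in memory instead of "utf8bom.txt".
bio = io.BytesIO(codecs.BOM_UTF8 + "ab".encode("utf-8"))
fp = io.TextIOWrapper(bio, encoding="utf-8-sig")

start = fp.tell()   # position before anything is read
text = fp.read()    # the signature is not part of the result
fp.seek(0)
again = fp.read()   # rereading from the start skips it again

print(start, repr(text), repr(again))
```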

--

About my patch:

 - BOM check is explicit: open(filename, encoding="BOM")
 - tell() / seek(0) works as expected
 - BOM bytes are always skipped in read() / readlines() result

Hum, I don't know if this email can be called a sum up ;-)

-- 
Victor Stinner
http://www.haypocalc.com/


Re: [Python-Dev] Quick sum up about open() + BOM

2010-01-08 Thread Antoine Pitrou
Hello Victor,

Victor Stinner writes:
> 
> (1) Change default open() behaviour or make it optional?
> 
[...]
> 
> Antoine would like to check BOM by default, because both options (system 
> locale vs checking for BOM) is the same thing.

To be clear, I am not saying it is the same thing. What I think is that it would
be a mistake to use a mildly unreliable heuristic by default (the locale +
device encoding heuristic) but refuse to trust a more reliable heuristic (the
BOM-based detection algorithm).

Regards

Antoine.




Re: [Python-Dev] Quick sum up about open() + BOM

2010-01-08 Thread Michael Foord

On 09/01/2010 00:09, Antoine Pitrou wrote:
> Hello Victor,
>
> Victor Stinner writes:
>> (1) Change default open() behaviour or make it optional?
>>
> [...]
>> Antoine would like to check BOM by default, because both options (system
>> locale vs checking for BOM) is the same thing.
>
> To be clear, I am not saying it is the same thing. What I think is that it would
> be a mistake to use a mildly unreliable heuristic by default (the locale +
> device encoding heuristic) but refuse to trust a more reliable heuristic (the
> BOM-based detection algorithm).

I concur. On Windows both UTF-8 and signature are very common, yet the 
platform default is the truly awful CP1252.

All the best,

Michael

--
http://www.ironpythoninaction.com/
http://www.voidspace.org.uk/blog




Re: [Python-Dev] --enabled-shared broken on freebsd5?

2010-01-08 Thread Floris Bruynooghe
On Fri, Jan 08, 2010 at 10:11:51AM +0100, "Martin v. Löwis" wrote:
> Nicholas Bastin wrote:
> > I think this problem probably needs to move over to distutils-sig, as
> > it doesn't seem to be specific to the way that Python itself uses
> > distutils.
> 
> I'm fairly skeptical that anybody on distutils SIG is interested in
> details of the Python build process...

Uh, hum.  Unfounded skepticism.  ;-)
But as said, filing a bug sounds better in this case.

Regards
Floris

-- 
Debian GNU/Linux -- The Power of Freedom
www.debian.org | www.gnu.org | www.kernel.org


Re: [Python-Dev] Quick sum up about open() + BOM

2010-01-08 Thread Glenn Linderman
On approximately 1/8/2010 3:59 PM, came the following characters from 
the keyboard of Victor Stinner:

> Hi,
>
> Thanks for all the answers! I will try to sum up all ideas here.


One concern I have with this implementation encoding="BOM" is that if 
there is no BOM it assumes UTF-8.  That is probably a good assumption in 
some circumstances, but not in others.


* It is not required that UTF-16LE, UTF-16BE, UTF-32LE, or UTF-32BE 
encoded files include a BOM.  It is only required that UTF-16 and UTF-32 
(cases where the endianness is unspecified) contain a BOM.  Hence, it 
might be that someone would expect a UTF-16LE (or any of the formats 
that don't require a BOM, rather than UTF-8), but be willing to accept 
any BOM-discriminated format.


* Potentially, this could be expanded beyond the various Unicode 
encodings... one could envision that a program whose data files 
historically were in any particular national language locale, could want 
to be enhanced to accept Unicode, and could declare that they will accept 
any BOM-discriminated format, but want to default, in the absence of a 
BOM, to the original national language locale that they historically 
accepted.  That would provide a migration path for their old data files.


So the point is, that it might be nice to have 
"BOM-otherEncodingForDefault" for each other encoding that Python 
supports.  Not sure that is the right API, but I think it is expressive 
enough to handle the cases above.  Whether the cases solve actual 
problems or not, I couldn't say, but they seem like reasonable cases.


It would, of course, be nicest if OS metadata had been invented way back 
when, for all OSes, such that all text files were flagged with their 
encoding... then languages could just read the encoding and do the right 
thing! But we live in the real world, instead.


--
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking



Re: [Python-Dev] Quick sum up about open() + BOM

2010-01-08 Thread MRAB

Glenn Linderman wrote:
> On approximately 1/8/2010 3:59 PM, came the following characters from
> the keyboard of Victor Stinner:
>> Hi,
>>
>> Thanks for all the answers! I will try to sum up all ideas here.
>
> One concern I have with this implementation encoding="BOM" is that if
> there is no BOM it assumes UTF-8.  That is probably a good assumption in
> some circumstances, but not in others.
>
> * It is not required that UTF-16LE, UTF-16BE, UTF-32LE, or UTF-32BE
> encoded files include a BOM.  It is only required that UTF-16 and UTF-32
> (cases where the endianness is unspecified) contain a BOM.  Hence, it
> might be that someone would expect a UTF-16LE (or any of the formats
> that don't require a BOM, rather than UTF-8), but be willing to accept
> any BOM-discriminated format.
>
> * Potentially, this could be expanded beyond the various Unicode
> encodings... one could envision that a program whose data files
> historically were in any particular national language locale, could want
> to be enhanced to accept Unicode, and could declare that they will accept
> any BOM-discriminated format, but want to default, in the absence of a
> BOM, to the original national language locale that they historically
> accepted.  That would provide a migration path for their old data files.
>
> So the point is, that it might be nice to have
> "BOM-otherEncodingForDefault" for each other encoding that Python
> supports.  Not sure that is the right API, but I think it is expressive
> enough to handle the cases above.  Whether the cases solve actual
> problems or not, I couldn't say, but they seem like reasonable cases.
>
> It would, of course, be nicest if OS metadata had been invented way back
> when, for all OSes, such that all text files were flagged with their
> encoding... then languages could just read the encoding and do the right
> thing! But we live in the real world, instead.



What about listing the possible encodings? It would try each in turn
until it found one where the BOM matched or had no BOM:

my_file = open(filename, 'r', encoding='UTF-8-sig|UTF-16|UTF-8')

or is that taking it too far?


Re: [Python-Dev] Quick sum up about open() + BOM

2010-01-08 Thread Martin v. Löwis
>>> Antoine would like to check BOM by default, because both options
>>> (system locale vs checking for BOM) is the same thing.
>>> 
>> To be clear, I am not saying it is the same thing. What I think is 
>> that it would be a mistake to use a mildly unreliable heuristic by
>> default (the locale + device encoding heuristic) but refuse to
>> trust a more reliable heuristic (the BOM-based detection
>> algorithm).
>> 
> 
> I concur. On Windows both UTF-8 and signature are very common, yet
> the platform default is the truly awful CP1252.

While I would support combining BOM detection in the case where a file
is opened for reading and no encoding is specified, I see two problems:
a) if a seek operation is performed before having looked at the BOM,
   no determination would have been made
b) what encoding should it use on writing?

Regards,
Martin



Re: [Python-Dev] Quick sum up about open() + BOM

2010-01-08 Thread Glenn Linderman
On approximately 1/8/2010 5:12 PM, came the following characters from 
the keyboard of MRAB:

> Glenn Linderman wrote:
>> On approximately 1/8/2010 3:59 PM, came the following characters from
>> the keyboard of Victor Stinner:
>>> Hi,
>>>
>>> Thanks for all the answers! I will try to sum up all ideas here.
>>
>> One concern I have with this implementation encoding="BOM" is that if
>> there is no BOM it assumes UTF-8.  That is probably a good assumption
>> in some circumstances, but not in others.
>>
>> * It is not required that UTF-16LE, UTF-16BE, UTF-32LE, or UTF-32BE
>> encoded files include a BOM.  It is only required that UTF-16 and
>> UTF-32 (cases where the endianness is unspecified) contain a BOM.
>> Hence, it might be that someone would expect a UTF-16LE (or any of
>> the formats that don't require a BOM, rather than UTF-8), but be
>> willing to accept any BOM-discriminated format.
>>
>> * Potentially, this could be expanded beyond the various Unicode
>> encodings... one could envision that a program whose data files
>> historically were in any particular national language locale, could
>> want to be enhanced to accept Unicode, and could declare that they
>> will accept any BOM-discriminated format, but want to default, in the
>> absence of a BOM, to the original national language locale that they
>> historically accepted.  That would provide a migration path for their
>> old data files.
>>
>> So the point is, that it might be nice to have
>> "BOM-otherEncodingForDefault" for each other encoding that Python
>> supports.  Not sure that is the right API, but I think it is
>> expressive enough to handle the cases above.  Whether the cases solve
>> actual problems or not, I couldn't say, but they seem like reasonable
>> cases.
>>
>> It would, of course, be nicest if OS metadata had been invented way
>> back when, for all OSes, such that all text files were flagged with
>> their encoding... then languages could just read the encoding and do
>> the right thing! But we live in the real world, instead.
>
> What about listing the possible encodings? It would try each in turn
> until it found one where the BOM matched or had no BOM:
>
> my_file = open(filename, 'r', encoding='UTF-8-sig|UTF-16|UTF-8')
>
> or is that taking it too far?


That sounds very flexible -- but in net effect it would only make 
illegal a subset of the BOM-containing encodings (those not listed) 
without making legal any additional encodings other than the non-BOM 
encoding.  Whether prohibiting a subset of BOM-containing encodings is a 
useful use case, I couldn't say... but my goal would be to include as 
many different file encodings on input as possible: without a BOM, that 
is exactly 1 (unless there are other heuristics), with a BOM, it is 
1+all-BOM-containing encodings.  Your scheme would permit numbers of 
encodings accepted to vary between 1 and 1+all-BOM-containing encodings.


(I think everyone can agree there are 5 different byte sequences that 
can be called a Unicode BOM.  The likelihood of them appearing in any 
other text encoding created by mankind depends on those other encodings 
-- but it is not impossible.  It is truly up to the application to 
decide whether BOM detection could potentially conflict with files in 
some other encoding that would be acceptable to the application.)


So I think it is taking it further than I can see value in, but I'm 
willing to be convinced otherwise.  I see only a need for detecting BOM, 
and specifying a default encoding to be used if there is no BOM.  Note 
that it might be nice to have a specification for using current 
encoding=None heuristic -- perhaps encoding="BOM-None" per my originally 
proposed syntax.  But I'm still not saying that is the best syntax.


--
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking



Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Nick Coghlan
MRAB wrote:
> Maybe there should also be a way of determining what encoding it decided
> it was, so that you can then write a new file in that same encoding.

I thought of that question as well - the f.encoding attribute on the
opened file should be sufficient.
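
A small illustration of that attribute with an explicit encoding as it
works today. The file names are placeholders, and whether
encoding="BOM" would report the detected encoding here is not stated in
this thread:

```python
import os
import shutil
import tempfile

workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "in.txt")
dst_path = os.path.join(workdir, "out.txt")

try:
    with open(src_path, "w", encoding="utf-16") as f:
        f.write("hello\n")

    # The encoding used by the file object is available as .encoding,
    # so a copy can be written the same way.
    with open(src_path, encoding="utf-16") as src:
        detected = src.encoding
        with open(dst_path, "w", encoding=detected) as dst:
            dst.write(src.read())

    with open(src_path, "rb") as a, open(dst_path, "rb") as b:
        same_bytes = a.read() == b.read()
finally:
    shutil.rmtree(workdir)

print(detected, same_bytes)
```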

Cheers,
Nick.

-- 
Nick Coghlan   |   [email protected]   |   Brisbane, Australia
---


Re: [Python-Dev] Quick sum up about open() + BOM

2010-01-08 Thread Lennart Regebro
It seems to me that, for the typical case of opening a file, the
following is the only flow that makes sense:

if encoding is not None:
   use encoding
elif file has BOM:
   use BOM
else:
   use system default

And hence an encoding='BOM' isn't needed there. Although I'm trying to
come up with use cases that don't work with this, I can't. :)

BUT

When writing, things are not so easy, though. Apparently some encodings
require a BOM to be written, others do not require it but allow it, and
some have no byte order mark at all. So there you have to be able to
choose whether to write the BOM or not. And that's either a new
parameter, because you can't use encoding='BOM' since you need to
specify the encoding as well, or a new method.

I would suggest a BOM parameter, and maybe a method as well.

BOM=None|True|False

Where "None" means a sane default behaviour, that is, write a BOM if
the encoding requires it.
"True" means write a BOM if the encoding *supports* it.
"False" means don't write a BOM even if the encoding requires it
(because I know what I'm doing).

if 'w' in mode: # But not 'r' or 'a'
    if BOM == True and encoding in (ENCODINGS THAT ALLOW BOM):
        write_bom = True
    elif BOM == False:
        write_bom = False
    elif BOM == None and encoding in (ENCODINGS THAT REQUIRE BOM):
        write_bom = True
    else:
        write_bom = False
else:
    write_bom = False
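
The same decision table as a runnable sketch. The function name
should_write_bom() and the two sets are assumptions made for
illustration: which encodings "allow" or "require" a BOM is a policy
choice, not something the codecs module defines. Only codecs.lookup(),
used to normalize the encoding name, is an existing API:

```python
import codecs

# Assumed policy sets, keyed by the normalized codec name.
ALLOW_BOM = {
    "utf-8", "utf-16", "utf-16-le", "utf-16-be",
    "utf-32", "utf-32-le", "utf-32-be",
}
REQUIRE_BOM = {"utf-16", "utf-32"}


def should_write_bom(mode, encoding, bom=None):
    """Decide whether a BOM is written, following the table above."""
    if "w" not in mode:           # reading or appending: never write one
        return False
    name = codecs.lookup(encoding).name
    if bom is True:               # write it if the encoding supports it
        return name in ALLOW_BOM
    if bom is False:              # caller knows best: never write it
        return False
    return name in REQUIRE_BOM    # bom is None: the sane default


for mode, encoding, bom in [
    ("w", "UTF8", None),
    ("w", "utf-16", None),
    ("w", "utf-8", True),
    ("w", "utf-16", False),
    ("a", "utf-16", None),
    ("w", "latin-1", True),
]:
    print(mode, encoding, bom, should_write_bom(mode, encoding, bom))
```

For comparison, the existing codecs already make part of this choice by
name: "utf-16" and "utf-32" emit a BOM when encoding, "utf-16-le" and
friends do not, and "utf-8-sig" is the BOM-writing variant of "utf-8".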

For reading this parameter could either be a noop, or possibly change
the behavior somehow, if a usecase where that makes sense can be
imagined.

-- 
Lennart Regebro: http://regebro.wordpress.com/
Python 3 Porting: http://python-incompatibility.googlecode.com/
+33 661 58 14 64