Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-11 Thread Walter Dörwald
On 09.01.10 14:38, Victor Stinner wrote:

 On Saturday 09 January 2010 12:18:33, Walter Dörwald wrote:
 Good idea, I chose open(filename, encoding="BOM").

 On the surface this looks like there's an encoding named BOM, but
 looking at your patch I found that the check is still done in
 TextIOWrapper. IMHO the best approach would be to implement a *real*
 codec named BOM (or sniff). This doesn't require *any* changes to
 the IO library. It could even be developed as a standalone project and
 published in the Cheeseshop.
 
 Why not, this is another solution to point (2) (check for a BOM while
 reading, or detect it before?). Which encoding would be used if there is no
 BOM? UTF-8 sounds like a good choice.

UTF-8 might be a good choice, or the fallback could be specified in the
encoding name, i.e.

   open("file.txt", encoding="BOM-UTF-8")

falls back to UTF-8, if there's no BOM at the start.

This could be implemented via a custom codec search function (see
http://docs.python.org/library/codecs.html#codecs.register for more info).
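
A minimal sketch of such a search function follows. It is only an illustration of the idea, not the patch under discussion: the names bom_search and _make_decoder are hypothetical, and only one-shot decoding is covered. Using the codec through open() would additionally need an incremental decoder, which is left out here.

```python
import codecs

# Known BOMs, longest first so UTF-32-LE is not mistaken for UTF-16-LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def _make_decoder(fallback):
    """Build a stateless decode function that falls back to `fallback`."""
    def decode(data, errors="strict"):
        data = bytes(data)
        for bom, name in _BOMS:
            if data.startswith(bom):
                # Decode everything after the BOM with the detected codec.
                text, _ = codecs.getdecoder(name)(data[len(bom):], errors)
                return text, len(data)
        # No BOM found: use the encoding named after "BOM-".
        return codecs.getdecoder(fallback)(data, errors)
    return decode


def bom_search(name):
    """Hypothetical search function for names of the form 'BOM-<fallback>'."""
    # The name arrives normalized; newer Python versions also turn hyphens
    # into underscores, so accept both spellings.
    normalized = name.lower().replace("-", "_")
    if not normalized.startswith("bom_"):
        return None
    fallback = normalized[len("bom_"):]
    try:
        codecs.lookup(fallback)
    except LookupError:
        return None
    return codecs.CodecInfo(
        name=normalized,
        encode=codecs.getencoder(fallback),
        decode=_make_decoder(fallback),
    )


codecs.register(bom_search)

print(codecs.decode(codecs.BOM_UTF16_BE + "hi".encode("utf-16-be"), "BOM-UTF-8"))
print(codecs.decode(b"plain", "BOM-UTF-8"))
```

Encoding simply delegates to the fallback codec in this sketch, so nothing here decides whether a BOM should be written.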

Servus,
   Walter
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-11 Thread Lennart Regebro
On Mon, Jan 11, 2010 at 11:37, Walter Dörwald wal...@livinglogic.de wrote:
 UTF-8 might be a good choice

No, fallback if there is no BOM should be the local settings, just as
fallback is today if you don't specify a codec.
I mean, what if you want to look for a BOM but fall back to something
else? How far will we go with encoding special information in the
codecs names? codec='BOM else UTF-16 else locale'? :-)

BOM is not a locale, and should not be a locale. Having a locale
called BOM is wrong per se. It should either be the default to look for a
BOM when codec=None, or a special parameter. If none of these are
desired, then we need a special function that takes a filename or file
handle, and looks for a BOM and returns the codec found or None. But
I find that much less natural and obvious than checking the BOM when codec=None.

-- 
Lennart Regebro: http://regebro.wordpress.com/
Python 3 Porting: http://python-incompatibility.googlecode.com/
+33 661 58 14 64


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-11 Thread Lennart Regebro
On Mon, Jan 11, 2010 at 12:12, Lennart Regebro rege...@gmail.com wrote:
 BOM is not a locale, and should not be a locale. Having a locale
 called BOM is wrong per se.

D'oh! I mean codec here obviously.
-- 
Lennart Regebro: Python, Zope, Plone, Grok
http://regebro.wordpress.com/
+33 661 58 14 64


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-11 Thread Walter Dörwald
On 10.01.10 00:40, Martin v. Löwis wrote:
 How does the requirement that it be implemented as a codec miss the
 point?

 If we want it to be the default, it must be able to fall back on the current
 locale-based algorithm if no BOM is found. I don't think it would be easy
 for a codec to do that.
 
 Yes - however, Victor currently apparently *doesn't* want it to be the
 default, but wants the user to specify encoding=BOM. If so, it isn't
 the default, and it is easy to implement as a codec.
 
 FWIW, I agree with Walter that if it is provided through the encoding=
 argument, it should be a codec. If it is built into the open function
 (for whatever reason), it must be provided by some other parameter.

 Why not simply encoding=None?
 
 I don't mind. Please re-read Walter's message - it only said that
 *if* this is activated through encoding=BOM, *then* it must be
 a codec, and could be on PyPI. I don't think Walter was talking about
 the case it is not activated through encoding='BOM' *at all*.

However if this autodetection feature is useful in other cases (no
matter how it's activated), it should be a codec, because as part of the
open() function it isn't reusable.

 The default value should provide the most useful
 behaviour possible. Forcing users to choose between two different
 autodetection strategies (encoding=None and another one) is a little insane IMO.

And encoding=mbcs is a third option on Windows.

 That wouldn't disturb me much. There are a lot of things in that area
 that are a little insane, starting with Microsoft Windows :-)

;)

Servus,
   Walter


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-11 Thread Antoine Pitrou

 However if this autodetection feature is useful in other cases (no
 matter how it's activated), it should be a codec, because as part of the
 open() function it isn't reusable.

It is reusable as part of io.TextIOWrapper, though.

Regards

Antoine.




Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-11 Thread Lennart Regebro
On Mon, Jan 11, 2010 at 13:29, Walter Dörwald wal...@livinglogic.de wrote:
 However if this autodetection feature is useful in other cases (no
 matter how it's activated), it should be a codec, because as part of the
 open() function it isn't reusable.

But an autodetect feature is not a codec. Sure it should be reusable,
but making it a codec seems to be a weird hack to me. And how would
you reuse it if it was a codec? A reusable autodetect feature would be
usable to detect what codec it is. An autodetect codec would not be
useful for that, as it would simply just decode.

-- 
Lennart Regebro: Python, Zope, Plone, Grok
http://regebro.wordpress.com/
+33 661 58 14 64


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-11 Thread Walter Dörwald
On 11.01.10 13:45, Lennart Regebro wrote:

 On Mon, Jan 11, 2010 at 13:29, Walter Dörwald wal...@livinglogic.de wrote:
 However if this autodetection feature is useful in other cases (no
 matter how it's activated), it should be a codec, because as part of the
 open() function it isn't reusable.
 
 But an autodetect feature is not a codec. Sure it should be reusable,
 but making it a codec seems to be  a weird hack to me.

I think we already had this discussion two years ago in the context of
XML decoding ;):

http://mail.python.org/pipermail/python-dev/2007-November/075138.html

 And how would
 you reuse it if it was a codec? A reusable autodetect feature would be
 useable to detect what codec it is. A autodetect codec would not be
 useful for that, as it would simply just decode.

I have implemented an XML codec (as part of XIST:
http://pypi.python.org/pypi/ll-xist), that can do that:

>>> from ll import xml_codec
>>> import codecs
>>> c = codecs.getincrementaldecoder("xml")()
>>> c.encoding
>>> c.decode("<?xml")
u''
>>> c.encoding
>>> c.decode(" version='1.0'")
u''
>>> c.encoding
>>> c.decode(" encoding='iso-8859-1'?>")
u"<?xml version='1.0' encoding='iso-8859-1'?>"
>>> c.encoding
'iso-8859-1'

Servus,
   Walter


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-11 Thread Lennart Regebro
On Mon, Jan 11, 2010 at 14:21, Walter Dörwald wal...@livinglogic.de wrote:
 I think we already had this discussion two years ago in the context of
 XML decoding ;):

Yup. And Martin's answer then is my answer now:

 So the code is good, if it is inside an XML parser, and it's bad if it
 is inside a codec?

Exactly so. This functionality just *isn't* a codec - there is no
encoding. Instead, it is an algorithm for *detecting* an encoding.

The conclusion was that a method to autodetect encodings would be
good. I think the same conclusion applies here.

-- 
Lennart Regebro: Python, Zope, Plone, Grok
http://regebro.wordpress.com/
+33 661 58 14 64


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-11 Thread Martin v. Löwis
 But an autodetect feature is not a codec. Sure it should be reusable,
 but making it a codec seems to be  a weird hack to me.

Well, the existing UTF-16 codec also is an autodetect feature (to
detect the endianness), and I don't consider it a weird hack.
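
The behaviour referred to here can be seen with the standard codecs alone. The snippet below is a small illustration using the BOM constants from the codecs module; the variable names are arbitrary.

```python
import codecs

text = "hi"
big = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
little = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# The generic "utf-16" codec reads the BOM, picks the matching byte order
# and drops the BOM from the decoded text.
from_big = big.decode("utf-16")
from_little = little.decode("utf-16")

print(from_big, from_little)
```

Both inputs decode to the same string even though their byte order differs.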

Regards,
Martin


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-11 Thread Lennart Regebro
On Mon, Jan 11, 2010 at 18:16, Martin v. Löwis mar...@v.loewis.de wrote:
 But an autodetect feature is not a codec. Sure it should be reusable,
 but making it a codec seems to be  a weird hack to me.

 Well, the existing UTF-16 codec also is an autodetect feature (to
 detect the endianness), and I don't consider it a weird hack.

So the BOM codec should raise a UnicodeDecodeError if there is no BOM?
Because that's what it would have to do, in that case, because it
can't fall back on anything, it has to handle and implement all
encodings that have a BOM. And is it then actually very useful? You
would have to do a try/except first with encoding='BOM' and then
encoding=None to get the fallback to the standard.


I must say that I find this whole thing pretty obvious. 'BOM' is not
an encoding. Either there should be a method to get the encoding from
the BOM, returning None if there isn't one, or open() should look at
the BOM when you pass in encoding=None. Or both.

That covers all usecases, is easy and obvious. Either open(file=foo,
encoding=None) or open(file, encoding=encoding_from_bom(file))

I can't see that open(file, encoding='BOM') has any benefit over this,
covers any extra usecase and is clearer in any way. Instead it adds
something confusing: An encoding that isn't an encoding.
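
A sketch of what such an encoding_from_bom() helper might look like is given below. The function does not exist in the standard library; the name is taken from the message above and the body is an assumption. It returns codec names that consume the BOM themselves ("utf-8-sig", "utf-16", "utf-32"), so the result can be passed straight to open(), and None leaves open() on its locale-based default.

```python
import codecs
import os
import tempfile

# BOM -> codec that strips it on read. UTF-32 comes first because the
# UTF-32-LE BOM begins with the same two bytes as the UTF-16-LE BOM.
_BOM_CODECS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def encoding_from_bom(path):
    """Return a codec name implied by a leading BOM, or None if there is none."""
    with open(path, "rb") as f:
        head = f.read(4)
    for bom, name in _BOM_CODECS:
        if head.startswith(bom):
            return name
    return None


# Usage: write a UTF-16-LE file with a BOM and read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "wb") as f:
        f.write(codecs.BOM_UTF16_LE + "hello".encode("utf-16-le"))
    with open(path, encoding=encoding_from_bom(path)) as f:
        print(f.read())
```

One known ambiguity of this approach: a UTF-16-LE file whose first character is U+0000 starts with the same four bytes as the UTF-32-LE BOM and would be misdetected.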

-- 
Lennart Regebro: Python, Zope, Plone, Grok
http://regebro.wordpress.com/
+33 661 58 14 64


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-11 Thread MRAB

Lennart Regebro wrote:

On Mon, Jan 11, 2010 at 11:37, Walter Dörwald wal...@livinglogic.de wrote:

UTF-8 might be a good choice


No, fallback if there is no BOM should be the local settings, just as
fallback is today if you don't specify a codec.
I mean, what if you want to look for a BOM but fall back to something
else? How far will we go with encoding special information in the
codecs names? codec='BOM else UTF-16 else locale'? :-)

BOM is not a locale, and should not be a locale. Having a locale
called BOM is wrong per se. It should either be default to look for a
BOM when codec=None, or a special parameter. If none of these are
desired, then we need a special function that takes a filename or file
handle, and looks for a BOM and returns the codec found or None. But
I find that much less natural and obvious than checking the BOM when codec=None.


Or pass a function that accepts a byte stream or the first few bytes and
returns the encoding and any unused bytes (because the byte stream might
not be seekable)?

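def guess_encoding(byte_stream):
    # Read just enough bytes to recognise a UTF-16BE BOM.
    data = byte_stream.read(2)
    if data == b"\xFE\xFF":
        return "UTF-16BE", b""
    # No BOM: hand the bytes back so they are not lost.
    return "UTF-8", data

# Proposed usage; open() does not currently accept a callable here.
text_file = open(filename, encoding=guess_encoding)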
...
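
With today's io module, the "unused bytes" problem on non-seekable streams can be approached by peeking instead of reading. The sketch below is an assumption about how that might be done, not part of any proposal in this thread; wrap_with_detected_encoding is a hypothetical name. Note that peek() may return fewer bytes than requested on some streams (pipes, for example), which this sketch does not handle.

```python
import codecs
import io
import locale

# Longest BOMs first so UTF-32-LE is not mistaken for UTF-16-LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def wrap_with_detected_encoding(binary_stream, fallback=None):
    """Return a text stream over binary_stream, honouring a leading BOM."""
    # peek() looks ahead without consuming, so no seeking is required.
    if not hasattr(binary_stream, "peek"):
        binary_stream = io.BufferedReader(binary_stream)
    head = binary_stream.peek(4)[:4]
    encoding = fallback or locale.getpreferredencoding(False)
    for bom, name in _BOMS:
        if head.startswith(bom):
            binary_stream.read(len(bom))  # consume only the BOM itself
            encoding = name
            break
    return io.TextIOWrapper(binary_stream, encoding=encoding)


raw = io.BytesIO(codecs.BOM_UTF16_BE + "hello".encode("utf-16-be"))
print(wrap_with_detected_encoding(raw).read())
```

When no BOM is present the fallback argument is used, or the locale's preferred encoding if none is given.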


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-11 Thread Martin v. Löwis
 I must say that I find this whole thing pretty obvious. 'BOM' is not
 an encoding.

That I certainly agree with.

 That covers all usecases, is easy and obvious. Either open(file=foo,
 encoding=None) or open(file, encoding=encoding_from_bom(file))
 
 I can't see that open(file, encoding='BOM') has any benefit over this,

Well, it would have the advantage that Walter pointed out: you can
implement it independent of the open() implementation, and even provide
it in older versions of Python.

Regards,
Martin



Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-11 Thread Olemis Lang
 On Thu, Jan 7, 2010 at 4:10 PM, Victor Stinner
 victor.stin...@haypocalc.com wrote:
 Hi,

 Builtin open() function is unable to open an UTF-16/32 file starting with a
 BOM if the encoding is not specified (raise an unicode error). For an UTF-8
 file starting with a BOM, read()/readline() returns also the BOM whereas the
 BOM should be ignored.

[...]


I had similar issues too (please read below ;o) ...

On Thu, Jan 7, 2010 at 7:52 PM, Guido van Rossum gu...@python.org wrote:
 I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy
 talk. And for the other two, perhaps it would make more sense to have
 a separate encoding-guessing function that takes a binary stream and
 returns a text stream wrapping it with the proper encoding?


About guessing the encoding, I experienced this issue while I was
developing a Trac plugin. What I was doing is as follows :

- I guessed the MIME type + charset encoding using Trac MIME API (it
was a CSV file encoded using UTF-16)
- I read the file using `open`
- Then wrapped the file using `codecs.EncodedFile`
- Then used `csv.reader`

... and still get the BOM in the first value of the first row in the CSV file.

{{{
#!python

>>> mimetype
'utf-16-le'
>>> ef = EncodedFile(f, 'utf-8', mimetype)
}}}

IMO I think I am +1 for leaving `open` just like it is, and use module
`codecs` to deal with encodings, but I am strongly -1 for returning
the BOM while using `EncodedFile` (mainly because encoding is
explicitly supplied in ;o)

 --Guido


CMIIW anyway ...

-- 
Regards,

Olemis.

Blog ES: http://simelo-es.blogspot.com/
Blog EN: http://simelo-en.blogspot.com/

Featured article:


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-11 Thread M.-A. Lemburg
Olemis Lang wrote:
 On Thu, Jan 7, 2010 at 4:10 PM, Victor Stinner
 victor.stin...@haypocalc.com wrote:
 Hi,

 Builtin open() function is unable to open an UTF-16/32 file starting with a
 BOM if the encoding is not specified (raise an unicode error). For an UTF-8
 file starting with a BOM, read()/readline() returns also the BOM whereas the
 BOM should be ignored.

 [...]

 
 I had similar issues too (please read below ;o) ...
 
 On Thu, Jan 7, 2010 at 7:52 PM, Guido van Rossum gu...@python.org wrote:
 I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy
 talk. And for the other two, perhaps it would make more sense to have
 a separate encoding-guessing function that takes a binary stream and
 returns a text stream wrapping it with the proper encoding?

 
 About guessing the encoding, I experienced this issue while I was
 developing a Trac plugin. What I was doing is as follows :
 
 - I guessed the MIME type + charset encoding using Trac MIME API (it
 was a CSV file encoded using UTF-16)
 - I read the file using `open`
 - Then wrapped the file using `codecs.EncodedFile`
 - Then used `csv.reader`
 
 ... and still get the BOM in the first value of the first row in the CSV file.

You didn't say, but I presume that the charset guessing logic
returned either 'utf-16-le' or 'utf-16-be' - those encodings don't
remove the leading BOM. The 'utf-16' codec will remove the BOM.

 {{{
 #!python
 
 mimetype
 'utf-16-le'
 ef = EncodedFile(f, 'utf-8', mimetype)
 }}}

Same here: the UTF-8 codec will not remove the BOM, you have
to use the 'utf-8-sig' codec for that.

 IMO I think I am +1 for leaving `open` just like it is, and use module
 `codecs` to deal with encodings, but I am strongly -1 for returning
 the BOM while using `EncodedFile` (mainly because encoding is
 explicitly supplied in ;o)

Note that EncodedFile() doesn't do any fancy BOM detection or
filtering. This is the job of the codecs.

Also note that BOM removal is only valid at the beginning of
a file. All subsequent BOM-bytes have to be read as-is (they
map to a zero-width non-breaking space) - without removing them.
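
The points above can be checked directly with the standard codecs. This is a small illustration only; the sample data is made up.

```python
import codecs

# A UTF-8 byte string with a BOM at the start and another one mid-stream.
data = codecs.BOM_UTF8 + b"a" + codecs.BOM_UTF8 + b"b"

plain = data.decode("utf-8")         # keeps both as U+FEFF characters
stripped = data.decode("utf-8-sig")  # drops only the leading one

print(ascii(plain))
print(ascii(stripped))

# The endian-specific UTF-16 codecs do no BOM handling either.
utf16 = codecs.BOM_UTF16_LE + "a".encode("utf-16-le")
kept = utf16.decode("utf-16-le")
print(ascii(kept))
```

The second BOM survives in both UTF-8 results, which matches the rule that only a BOM at the very beginning is a signature.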

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-11 Thread Olemis Lang
Probably one part of this is OT , but I think it could complement the
discussion ;o)

On Mon, Jan 11, 2010 at 3:44 PM, M.-A. Lemburg m...@egenix.com wrote:
 Olemis Lang wrote:
 On Thu, Jan 7, 2010 at 4:10 PM, Victor Stinner
 victor.stin...@haypocalc.com wrote:
 Hi,

 Builtin open() function is unable to open an UTF-16/32 file starting with a
 BOM if the encoding is not specified (raise an unicode error). For an UTF-8
 file starting with a BOM, read()/readline() returns also the BOM whereas
 the BOM should be ignored.

 [...]


 I had similar issues too (please read below ;o) ...

 On Thu, Jan 7, 2010 at 7:52 PM, Guido van Rossum gu...@python.org wrote:
 I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy
 talk. And for the other two, perhaps it would make more sense to have
 a separate encoding-guessing function that takes a binary stream and
 returns a text stream wrapping it with the proper encoding?


 About guessing the encoding, I experienced this issue while I was
 developing a Trac plugin. What I was doing is as follows :

 - I guessed the MIME type + charset encoding using Trac MIME API (it
 was a CSV file encoded using UTF-16)
 - I read the file using `open`
 - Then wrapped the file using `codecs.EncodedFile`
 - Then used `csv.reader`

 ... and still get the BOM in the first value of the first row in the CSV file.

 You didn't say, but I presume that the charset guessing logic
 returned either 'utf-16-le' or 'utf-16-be'

Yes. In fact they return the full mimetype 'text/csv; charset=utf-16-le' ... ;o)

 - those encodings don't
 remove the leading BOM.

... and they should ?

 The 'utf-16' codec will remove the BOM.


In this particular case there's nothing I can do, I have to process
whatever charset is detected in the input ;o)

 {{{
 #!python

 mimetype
 'utf-16-le'
 ef = EncodedFile(f, 'utf-8', mimetype)
 }}}

 Same here: the UTF-8 codec will not remove the BOM, you have
 to use the 'utf-8-sig' codec for that.

 IMO I think I am +1 for leaving `open` just like it is, and use module
 `codecs` to deal with encodings, but I am strongly -1 for returning
 the BOM while using `EncodedFile` (mainly because encoding is
 explicitly supplied in ;o)

 Note that EncodedFile() doesn't do any fancy BOM detection or
 filtering.

... directly.

 This is the job of the codecs.


OK ... to come back to the scope of the subject, in the general case,
I think that BOM (if any) should be handled by codecs, and therefore
indirectly by EncodedFile. If that's an explicit way of working with
encodings I'd prefer to use that wrapper explicitly in order to
(encode | decode) the file and let the codec detect whether there's a
BOM or not and «adjust» `tell`, `read` and everything else in that
wrapper (instead of `open`).

 Also note that BOM removal is only valid at the beginning of
 a file. All subsequent BOM-bytes have to be read as-is (they
 map to a zero-width non-breaking space) - without removing them.


;o)

-- 
Regards,

Olemis.

Blog ES: http://simelo-es.blogspot.com/
Blog EN: http://simelo-en.blogspot.com/

Featured article:
Test cases for custom query (i.e report 9) ... PASS (1.0.0)  -
http://simelo.hg.sourceforge.net/hgweb/simelo/trac-gviz/rev/d276011e7297


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-10 Thread Henning von Bargen

If Python should support BOM when reading text files,
it should also be able to *write* such files.

An encoding=BOM argument wouldn't help here, because
it does not specify which encoding to actually use:
UTF-8, UTF-16-LE or what?

That would be a point against encoding=BOM and
in favour of an additional keyword argument, use_bom or whatever,
with the following values:

None:  default (old) behaviour: don't handle the BOM at all.

True:  reading: expect a BOM (raising an exception if it's
                missing). The encoding argument must be None
                or it must match the encoding implied by the
                BOM.
       writing: write a BOM. The encoding argument must be
                one of the UTF encodings.

False: reading: if a BOM is present, use it to determine the
                file encoding. The encoding argument must
                be None or it must match the encoding implied by
                the BOM. (*)
                Otherwise, use the encoding argument to determine
                the encoding.
       writing: do not write a BOM. Use the encoding argument.

(*) This is a question of taste. I think some people would prefer
a fourth value AUTO instead, or to swap the behaviour of
None and False.

Henning

P.S. To make things worse, I have sometimes seen XML files with a
UTF-8 BOM, but an XML encoding declaration of iso-8859-1.
For such files, whatever you guess will be wrong anyway...


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-10 Thread Lennart Regebro
On Sun, Jan 10, 2010 at 12:10, Henning von Bargen
henning.vonbar...@arcor.de wrote:
 If Python should support BOM when reading text files,
 it should also be able to *write* such files.

That's what I thought too. Turns out the UTF-16 codec does write such a
mark. You also have the constants in the codecs module, so you can
write the utf-16-le BOM and then use the utf-16-le encoding if you
want to be sure you write utf-16-le, and the same with BE, of course.

I still think not using BOMs when determining the file format can be
seen as a bug, though, so I don't think the API needs to change at
all.
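
The writing approach described in this message can be sketched with the codecs module constants. The example works on byte strings rather than files to stay short; the variable names are arbitrary.

```python
import codecs

text = "hi"

# The generic codec writes a BOM in the platform's native byte order.
generic = text.encode("utf-16")

# To force little-endian with a BOM, emit the BOM constant yourself and
# then use the endian-specific codec, which never writes one.
explicit_le = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# For UTF-8 the "utf-8-sig" codec writes the signature for you.
utf8_with_sig = text.encode("utf-8-sig")

print(explicit_le)
print(utf8_with_sig)
```

Reading explicit_le back with the generic "utf-16" codec yields the original text, since that codec picks the byte order from the BOM.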
-- 
Lennart Regebro: Python, Zope, Plone, Grok
http://regebro.wordpress.com/
+33 661 58 14 64


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-09 Thread Walter Dörwald

Victor Stinner wrote:

On Friday 08 January 2010 10:10:23, Martin v. Löwis wrote:

Builtin open() function is unable to open an UTF-16/32 file starting with
a BOM if the encoding is not specified (raise an unicode error). For an
UTF-8 file starting with a BOM, read()/readline() returns also the BOM
whereas the BOM should be ignored.

It depends. If you use the utf-8-sig encoding, it *will* ignore the
UTF-8 signature.


Sure, but it means that you only use UTF-8+BOM files. If you get UTF-8 and
UTF-8+BOM files, you have to detect the encoding (not an easy job) or
remove the BOM after the first read (much harder if you use a module like
ConfigParser or csv).



Since my proposition changes the result TextIOWrapper.read()/readline()
for files starting with a BOM, we might introduce an option to open() to
enable the new behaviour. But is it really needed to keep the backward
compatibility?

Absolutely. And there is no need to produce a new option, but instead
use the existing options: define an encoding that auto-detects the
encoding from the family of BOMs. Maybe you call it encoding=sniff.


Good idea, I chose open(filename, encoding="BOM").


On the surface this looks like there's an encoding named BOM, but 
looking at your patch I found that the check is still done in 
TextIOWrapper. IMHO the best approach would be to implement a *real*
codec named BOM (or sniff). This doesn't require *any* changes to 
the IO library. It could even be developed as a standalone project and 
published in the Cheeseshop.


To see how something like this can be done, take a look at the UTF-16
codec, which switches to big-endian or little-endian mode depending on the
first read/decode call.
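
The switching behaviour is visible through the incremental decoder interface. The following is a short illustration using only the standard library; the chunk boundaries are chosen arbitrarily.

```python
import codecs

decoder = codecs.getincrementaldecoder("utf-16")()

# The first chunk carries the BOM, so the decoder commits to big-endian here.
first = decoder.decode(codecs.BOM_UTF16_BE + b"\x00h")

# Later chunks are decoded with the byte order chosen on the first call.
second = decoder.decode(b"\x00i", final=True)

print(first + second)
```

A BOM-sniffing codec could keep the same kind of state: undecided until the first bytes arrive, then fixed for the rest of the stream.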


Servus,
   Walter







Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-09 Thread Victor Stinner
On Saturday 09 January 2010 12:18:33, Walter Dörwald wrote:
  Good idea, I chose open(filename, encoding="BOM").
 
 On the surface this looks like there's an encoding named BOM, but
 looking at your patch I found that the check is still done in
 TextIOWrapper. IMHO the best approach would be to implement a *real*
 codec named BOM (or sniff). This doesn't require *any* changes to
 the IO library. It could even be developed as a standalone project and
 published in the Cheeseshop.

Why not, this is another solution to point (2) (check for a BOM while
reading, or detect it before?). Which encoding would be used if there is no
BOM? UTF-8 sounds like a good choice.

-- 
Victor Stinner
http://www.haypocalc.com/


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-09 Thread Antoine Pitrou
Walter Dörwald walter at livinglogic.de writes:
 
 On the surface this looks like there's an encoding named BOM, but 
 looking at your patch I found that the check is still done in 
 TextIOWrapper. IMHO the best approach would be to implement a *real*
 codec named BOM (or sniff). This doesn't require *any* changes to 
 the IO library. It could even be developed as a standalone project and 
 published in the Cheeseshop.

Sorry but this is missing the point. The point here is to improve the open()
function. I'm sure people who know about encodings are able to install the
chardet library or even whip up their own BOM detection routine...


Antoine.




Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-09 Thread Martin v. Löwis
Antoine Pitrou wrote:
 Walter Dörwald walter at livinglogic.de writes:
 On the surface this looks like there's an encoding named BOM, but 
 looking at your patch I found that the check is still done in 
 TextIOWrapper. IMHO the best approach would be to implement a *real*
 codec named BOM (or sniff). This doesn't require *any* changes to 
 the IO library. It could even be developed as a standalone project and 
 published in the Cheeseshop.
 
 Sorry but this is missing the point. The point here is to improve the open()
 function. I'm sure people who know about encodings are able to install the
 chardet library or even whip up their own BOM detection routine...

How does the requirement that it be implemented as a codec miss the
point?

FWIW, I agree with Walter that if it is provided through the encoding=
argument, it should be a codec. If it is built into the open function
(for whatever reason), it must be provided by some other parameter.

I do see the point that it becomes available to end users only when
released as part of Python. However, this *also* means that applications
won't be using it for another three years or so, since they'll have to
support older Python versions as well (unless it is integrated in the
case where no encoding is specified).

Regards,
Martin


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-09 Thread Antoine Pitrou
Martin v. Löwis martin at v.loewis.de writes:
 
  Sorry but this is missing the point. The point here is to improve the open()
  function. I'm sure people who know about encodings are able to install the
  chardet library or even whip up their own BOM detection routine...
 
 How does the requirement that it be implemented as a codec miss the
 point?

If we want it to be the default, it must be able to fallback on the current
locale-based algorithm if no BOM is found. I don't think it would be easy for a
codec to do that.

 FWIW, I agree with Walter that if it is provided through the encoding=
 argument, it should be a codec. If it is built into the open function
 (for whatever reason), it must be provided by some other parameter.

Why not simply encoding=None? The default value should provide the most useful
behaviour possible. Forcing users to choose between two different autodetection
strategies (encoding=None and another one) is a little insane IMO.

Regards

Antoine.




Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-09 Thread Lennart Regebro
On Sat, Jan 9, 2010 at 21:28, Antoine Pitrou solip...@pitrou.net wrote:
 If we want it to be the default, it must be able to fallback on the current
 locale-based algorithm if no BOM is found. I don't think it would be easy for
 a codec to do that.

Right. It seems like encoding=None is the right way to go there.
encoding='BOM' would probably only work if 'BOM' isn't an encoding but
a special tag, which is ugly.

-- 
Lennart Regebro: Python, Zope, Plone, Grok
http://regebro.wordpress.com/
+33 661 58 14 64


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-09 Thread Michael Foord

On 09/01/2010 22:14, Lennart Regebro wrote:
 On Sat, Jan 9, 2010 at 21:28, Antoine Pitrou solip...@pitrou.net wrote:
  If we want it to be the default, it must be able to fallback on the current
  locale-based algorithm if no BOM is found. I don't think it would be easy for a
  codec to do that.

 Right. It seems like encoding=None is the right way to go there.
 encoding='BOM' would probably only work if 'BOM' isn't an encoding but
 a special tag, which is ugly.

I would rather see it as the default behavior for open without an 
encoding specified.


I know Guido has expressed a preference against this so I won't continue 
to flog it.


The current behavior however is that we have a 'guessing' algorithm 
based on the platform default. Currently if you open a text file in read 
mode that has a UTF-8 signature, but the platform default is something 
other than UTF-8, then we open the file using what is likely to be the 
incorrect encoding. Looking for the signature seems to be better 
behaviour in that case.
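
A minimal sketch of the failure described above, assuming cp1252 stands in
for a non-UTF-8 platform default (the sample text is made up for
illustration):

```python
import codecs

# Bytes of a small text file saved with a UTF-8 signature.
data = codecs.BOM_UTF8 + "café".encode("utf-8")

# Decoding with a non-UTF-8 default does not fail, but it yields mojibake,
# including three junk characters where the signature was.
wrong = data.decode("cp1252")

# Honouring the signature gives the intended text.
right = data.decode("utf-8-sig")

print(repr(wrong))
print(repr(right))
```

The wrong decoding raises no error here, which is why the mistake tends to
go unnoticed until the text is displayed.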


All the best,

Michael

--
http://www.ironpythoninaction.com/
http://www.voidspace.org.uk/blog




Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-09 Thread Martin v. Löwis
 How does the requirement that it be implemented as a codec miss the
 point?
 
 If we want it to be the default, it must be able to fallback on the current
 locale-based algorithm if no BOM is found. I don't think it would be easy for
 a codec to do that.

Yes - however, Victor currently apparently *doesn't* want it to be the
default, but wants the user to specify encoding="BOM". If so, it isn't
the default, and it is easy to implement as a codec.

 FWIW, I agree with Walter that if it is provided through the encoding=
 argument, it should be a codec. If it is built into the open function
 (for whatever reason), it must be provided by some other parameter.
 
 Why not simply encoding=None?

I don't mind. Please re-read Walter's message - it only said that
*if* this is activated through encoding="BOM", *then* it must be
a codec, and could be on PyPI. I don't think Walter was talking about
the case where it is not activated through encoding="BOM" *at all*.

 The default value should provide the most useful
 behaviour possible. Forcing users to choose between two different
 autodetection strategies (encoding=None and another one) is a little insane IMO.

That wouldn't disturb me much. There are a lot of things in that area
that are a little insane, starting with Microsoft Windows :-)

Regards,
Martin


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Martin v. Löwis
 It *is* crazy, but unfortunately rather common.  Wikipedia has a good
 description of the issues:
 http://en.wikipedia.org/wiki/UTF-8#Byte-order_mark.  Basically, some
 Windows text APIs will emit a UTF-8 BOM in order to identify the file as
 being UTF-8, so it's become a convention to do that.  That's not good
 enough, so you need to guess the encoding as well to make sure, but if there
 is a BOM and you can otherwise verify that the file is probably UTF-8
 encoded, you should discard it.
 
 That doesn't make sense. If the file isn't UTF-8 you can't see the
 BOM, because the BOM itself is UTF-8-encoded.

I think what Glyph meant is this: if a file starts with the UTF-8
signature, assume it's UTF-8. Then validate the assumption against the
rest of the file also, and then process it as UTF-8. If the rest clearly
is not UTF-8, assume that the UTF-8 signature is bogus.

I understood this proposal as a general processing guideline, not
something the io library should do (but, say, a text editor).
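
A rough sketch of that guideline, as an application (not the io library)
might apply it. The function name and the latin-1 fallback are assumptions
made for illustration only:

```python
import codecs


def decode_with_signature_check(data, fallback="latin-1"):
    """Trust a UTF-8 signature only if the rest of the data is valid UTF-8.

    Hypothetical helper: returns (text, encoding_used).
    """
    if data.startswith(codecs.BOM_UTF8):
        try:
            # Validate the assumption against the rest of the data.
            return data[len(codecs.BOM_UTF8):].decode("utf-8"), "utf-8"
        except UnicodeDecodeError:
            pass  # The signature looks bogus; treat it as ordinary bytes.
    return data.decode(fallback), fallback


good = decode_with_signature_check(codecs.BOM_UTF8 + "héllo".encode("utf-8"))
bogus = decode_with_signature_check(codecs.BOM_UTF8 + b"\xff")
```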

FWIW, I'm personally in favor of using the UTF-8 signature. If people
consider them crazy talk, that may be because UTF-8 can't possibly have
a byte order - hence I call it a signature, not the BOM. As a signature,
I don't consider it crazy at all. There is a long tradition of having
magic bytes in files (executable files, Postscript, PDF, ... - see
/etc/magic). Having a magic byte sequence for plain text to denote the
encoding is useful and helps reduce mojibake. This is the reason it's
used on Windows: notepad would normally assume that text is in the ANSI
code page, and for compatibility, it can't stop doing that. So the UTF-8
signature gives them an exit strategy.

Regards,
Martin


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Martin v. Löwis
 But it should do something sane when reading such files.  I can't
 really see any harm in throwing it away, especially since use of
 ZERO-WIDTH NO-BREAK SPACE as a joining character has been deprecated
 IIRC.

And indeed it does, when you open the file in the utf-8-sig encoding.
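
For illustration, a small sketch comparing the two codecs on in-memory
bytes (the sample data is made up; io.BytesIO stands in for a real file):

```python
import io

raw = b"\xef\xbb\xbfhello\n"  # UTF-8 signature followed by text

# Plain utf-8 keeps the signature as a leading U+FEFF character.
with io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8") as f:
    plain = f.read()

# utf-8-sig drops the signature when it is present ...
with io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8-sig") as f:
    stripped = f.read()

# ... and still decodes data that has no signature at all.
with io.TextIOWrapper(io.BytesIO(b"hello\n"), encoding="utf-8-sig") as f:
    no_signature = f.read()
```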

Regards,
Martin


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Victor Stinner
Le vendredi 08 janvier 2010 03:23:08, MRAB a écrit :
 Guido van Rossum wrote:
  I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy
  talk. And for the other two, perhaps it would make more sense to have
  a separate encoding-guessing function that takes a binary stream and
  returns a text stream wrapping it with the proper encoding?
 
 Alternatively, have a universal UTF-8/16/32 encoding, ie one that
 expects UTF-8,
 with or without BOM, or UTF-16/32 with BOM.

Do you mean open(filename, encoding="BOM")? I suppose that "BOM" would be a
magical value specific to reading a text file (open(filename, "r")), not a
real codec?

Otherwise, which encoding should be used for open(filename, "w",
encoding="BOM")?

-- 
Victor Stinner
http://www.haypocalc.com/


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Martin v. Löwis
 Builtin open() function is unable to open an UTF-16/32 file starting with a 
 BOM if the encoding is not specified (raise an unicode error). For an UTF-8 
 file starting with a BOM, read()/readline() returns also the BOM whereas the 
 BOM should be ignored.

It depends. If you use the utf-8-sig encoding, it *will* ignore the
UTF-8 signature.

 Since my proposition changes the result TextIOWrapper.read()/readline() for 
 files starting with a BOM, we might introduce an option to open() to enable 
 the new behaviour. But is it really needed to keep the backward compatibility?

Absolutely. And there is no need to produce a new option, but instead
use the existing options: define an encoding that auto-detects the
encoding from the family of BOMs. Maybe you call it encoding="sniff".
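
The detection step such an encoding would need can be sketched as below.
This is only the sniffing part, not a registered codec; the function name
and the table are assumptions for illustration, built on the BOM constants
that the codecs module does provide:

```python
import codecs

# Longer BOMs first: the UTF-32-LE BOM begins with the UTF-16-LE BOM.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def sniff_encoding(head, default=None):
    """Return a codec name for the BOM at the start of `head`, else `default`.

    The returned codecs ("utf-8-sig", "utf-16", "utf-32") consume the BOM
    themselves when decoding, so the caller does not need to strip it.
    """
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return default
```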

Regards,
Martin


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Victor Stinner
Le vendredi 08 janvier 2010 05:21:04, Guido van Rossum a écrit :
(...)
 (And yes, I know this happens. Doesn't mean we need to auto-guess by
 default; there are lots of issues e.g. what should happen after
 seeking to offset 0?)

I wrote a new version of my patch (version 3):

 * don't change the default behaviour: use open(filename, encoding="BOM") to
   check for a BOM, if there is any
 * fix for seek(0): always ignore the BOM
 * add a unit test: check that the right encoding is detected, and also that
   the BOM is ignored (especially after a seek(0))

The BOM encoding doesn't work for writing to a file, so open(filename, "w",
encoding="BOM") raises a ValueError.

-- 
Victor Stinner
http://www.haypocalc.com/


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Victor Stinner
Le vendredi 08 janvier 2010 01:52:20, Guido van Rossum a écrit :
 And for the other two, perhaps it would make more sense to have
 a separate encoding-guessing function that takes a binary stream and
 returns a text stream wrapping it with the proper encoding?

I chose to modify open()+TextIOWrapper instead of writing a new function
because I would like to avoid an extra read operation (syscall) on the file. 
With my implementation, no specific read operation is needed to detect the 
BOM. The BOM is simply checked in the first _read_chunk().
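
This is not the patch itself, but a similar effect can be sketched outside
TextIOWrapper with BufferedReader.peek(): the peeked bytes stay in the
buffer, so the later text read should not normally need a separate read of
the file just for the BOM. The helper name and the UTF-8 default are
assumptions:

```python
import codecs
import io
import os
import tempfile


def open_sniffing_bom(path, default="utf-8"):
    """Hypothetical helper: open `path` as text, choosing the codec from a BOM."""
    binary = open(path, "rb")      # an io.BufferedReader
    head = binary.peek(4)[:4]      # look at buffered bytes without consuming them
    if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        encoding = "utf-32"
    elif head.startswith(codecs.BOM_UTF8):
        encoding = "utf-8-sig"
    elif head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        encoding = "utf-16"
    else:
        encoding = default
    return io.TextIOWrapper(binary, encoding=encoding)


# Usage: write a UTF-16 file with a BOM, then read it back without naming
# the encoding.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "wb") as out:
        out.write("héllo".encode("utf-16"))
    with open_sniffing_bom(path) as f:
        detected, text = f.encoding, f.read()
```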

-- 
Victor Stinner
http://www.haypocalc.com/


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Victor Stinner
Le vendredi 08 janvier 2010 10:10:23, Martin v. Löwis a écrit :
  Builtin open() function is unable to open an UTF-16/32 file starting with
  a BOM if the encoding is not specified (raise an unicode error). For an
  UTF-8 file starting with a BOM, read()/readline() returns also the BOM
  whereas the BOM should be ignored.
 
 It depends. If you use the utf-8-sig encoding, it *will* ignore the
 UTF-8 signature.

Sure, but that means you only use UTF-8+BOM files. If you get both UTF-8 and
UTF-8+BOM files, you have to detect the encoding (not an easy job) or
remove the BOM after the first read (much harder if you use a module like
ConfigParser or csv).

  Since my proposition changes the result TextIOWrapper.read()/readline()
  for files starting with a BOM, we might introduce an option to open() to
  enable the new behaviour. But is it really needed to keep the backward
  compatibility?
 
 Absolutely. And there is no need to produce a new option, but instead
 use the existing options: define an encoding that auto-detects the
 encoding from the family of BOMs. Maybe you call it encoding=sniff.

Good idea, I chose open(filename, encoding="BOM").

-- 
Victor Stinner
http://www.haypocalc.com/


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Antoine Pitrou
Victor Stinner victor.stinner at haypocalc.com writes:
 
 I wrote a new version of my patch (version 3):
 
  * don't change the default behaviour: use open(filename, encoding="BOM") to
    check for a BOM, if there is any

Well, I think if we implement this the default behaviour *should* be changed.
It looks a bit senseless to have two different auto-choose options, one with
encoding=None and one with encoding="BOM".

Regards

Antoine.




Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Guido van Rossum
On Thu, Jan 7, 2010 at 11:55 PM, Glyph Lefkowitz
gl...@twistedmatrix.com wrote:
 I'm saying that the BOM itself isn't enough to detect that the file is 
 actually UTF-8.

And I'm saying that it is, with as much certainty as we can ever guess
the encoding of a file.

 If (for whatever reason: explicitly specified, guessed in some other way) the
 file's encoding is determined to be something else, the bytes comprising the
 BOM should be decoded as normal.  It's just that the UTF-8 decoding of the
 BOM at the start of a file should be "".

Sure, a Latin-1-encoded file could start with the same pattern that is
a UTF-8-encoded BOM. But at that point, a UTF-16-encoded file is also
valid Latin-1.

The question was in the context of encoding-guessing; if we're
guessing, a UTF-8-encoded BOM cannot signify anything else but UTF-8.
(Ditto for UTF-16 and UTF-32 BOMs.)

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Guido van Rossum
On Fri, Jan 8, 2010 at 6:34 AM, Antoine Pitrou solip...@pitrou.net wrote:
 Victor Stinner victor.stinner at haypocalc.com writes:

 I wrote a new version of my patch (version 3):

  * don't change the default behaviour: use open(filename, encoding="BOM") to
    check for a BOM, if there is any

 Well, I think if we implement this the default behaviour *should* be changed.
 It looks a bit senseless to have two different auto-choose options, one with
 encoding=None and one with encoding="BOM".

Well there *are* two different auto options: use the environment
variables (LANG etc.) or inspect the contents of the file. I think it
would be useful to have ways to specify both.

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Guido van Rossum
On Fri, Jan 8, 2010 at 1:05 AM, Martin v. Löwis mar...@v.loewis.de wrote:
 It *is* crazy, but unfortunately rather common.  Wikipedia has a good
 description of the issues:
 http://en.wikipedia.org/wiki/UTF-8#Byte-order_mark.  Basically, some
 Windows text APIs will emit a UTF-8 BOM in order to identify the file as
 being UTF-8, so it's become a convention to do that.  That's not good
 enough, so you need to guess the encoding as well to make sure, but if there
 is a BOM and you can otherwise verify that the file is probably UTF-8
 encoded, you should discard it.

 That doesn't make sense. If the file isn't UTF-8 you can't see the
 BOM, because the BOM itself is UTF-8-encoded.

 I think what Glyph meant is this: if a file starts with the UTF-8
 signature, assume it's UTF-8. Then validate the assumption against the
 rest of the file also, and then process it as UTF-8. If the rest clearly
 is not UTF-8, assume that the UTF-8 signature is bogus.

 I understood this proposal as a general processing guideline, not
 something the io library should do (but, say, a text editor).

 FWIW, I'm personally in favor of using the UTF-8 signature. If people
 consider them crazy talk, that may be because UTF-8 can't possibly have
 a byte order - hence I call it a signature, not the BOM. As a signature,
 I don't consider it crazy at all. There is a long tradition of having
 magic bytes in files (executable files, Postscript, PDF, ... - see
 /etc/magic). Having a magic byte sequence for plain text to denote the
 encoding is useful and helps reducing moji-bake. This is the reason it's
 used on Windows: notepad would normally assume that text is in the ANSI
 code page, and for compatibility, it can't stop doing that. So the UTF-8
 signature gives them an exit strategy.

Sure. I said crazy talk only to stir up discussion. Which worked. :-)

Also, I don't want Python's default behavior to change -- sniffing the
encoding should be a separate option.

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Guido van Rossum
On Thu, Jan 7, 2010 at 10:12 PM, Tres Seaver tsea...@palladion.com wrote:
 The BOM should not be seekable if the file is opened with the proposed
 "guess encoding from BOM" mode:  it isn't properly part of the stream at
 all in that case.

This feels about right to me. There are still questions though:
immediately after opening a file with a BOM, what should .tell()
return? And regardless of that, .seek(0) should put the file in that
same initial state.

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Antoine Pitrou
Guido van Rossum guido at python.org writes:
 
  Well, I think if we implement this the default behaviour *should* be changed.
  It looks a bit senseless to have two different auto-choose options, one with
  encoding=None and one with encoding="BOM".
 
 Well there *are* two different auto options: use the environment
 variables (LANG etc.) or inspect the contents of the file. I think it
 would be useful to have ways to specify both.

Yes, perhaps. In the context of open() however I think it would be helpful to
change the default.
Moreover, reading the BOM is certainly much more reliable than our current
guessing based on the locale or the device encoding.

Regards

Antoine.




Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread M.-A. Lemburg
Guido van Rossum wrote:
 On Fri, Jan 8, 2010 at 6:34 AM, Antoine Pitrou solip...@pitrou.net wrote:
 Victor Stinner victor.stinner at haypocalc.com writes:

 I wrote a new version of my patch (version 3):

  * don't change the default behaviour: use open(filename, encoding="BOM") to
    check for a BOM, if there is any

 Well, I think if we implement this the default behaviour *should* be changed.
 It looks a bit senseless to have two different auto-choose options, one with
 encoding=None and one with encoding="BOM".
 
 Well there *are* two different auto options: use the environment
 variables (LANG etc.) or inspect the contents of the file. I think it
 would be useful to have ways to specify both.

Shouldn't this encoding guessing be a separate function that you call
on either a file or a seekable stream ?

After all, detecting encodings is just as useful to have for non-file
streams. You'd then avoid having to stuff everything into
a single function call and also open up the door for more complex
application specific guess work or defaults.

The whole process would then have two steps:

 1. guess encoding

  import codecs
  encoding = codecs.guess_file_encoding(filename)

 2. open the file with the found encoding

  f = open(filename, encoding=encoding)

For seekable streams f, you'd have:

 1. guess encoding

  import codecs
  encoding = codecs.guess_stream_encoding(f)

 2. wrap the stream with a reader for the found encoding

  reader_class = codecs.getreader(encoding)
  g = reader_class(f)
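
To make the seekable-stream variant concrete, here is a sketch in which
guess_stream_encoding() is written out by hand. No such function exists in
the codecs module; the name comes from the proposal above, and the BOM-only
detection and the UTF-8 default are assumptions:

```python
import codecs
import io


def guess_stream_encoding(stream, default="utf-8"):
    """Hypothetical helper: sniff a BOM from a seekable binary stream."""
    start = stream.tell()
    head = stream.read(4)
    stream.seek(start)  # rewind so the reader sees (and consumes) the BOM
    if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    return default


# 1. guess encoding
f = io.BytesIO("naïve".encode("utf-16"))
encoding = guess_stream_encoding(f)

# 2. wrap the stream with a reader for the found encoding
reader_class = codecs.getreader(encoding)
g = reader_class(f)
text = g.read()
```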

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Jan 08 2010)
 Python/Zope Consulting and Support ...http://www.egenix.com/
 mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
 mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/


::: Try our new mxODBC.Connect Python Database Interface for free ! 


   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
   Registered at Amtsgericht Duesseldorf: HRB 46611
   http://www.egenix.com/company/contact/


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Antoine Pitrou
Guido van Rossum guido at python.org writes:
 
 On Thu, Jan 7, 2010 at 10:12 PM, Tres Seaver tseaver at palladion.com wrote:
  The BOM should not be seekable if the file is opened with the proposed
  "guess encoding from BOM" mode:  it isn't properly part of the stream at
  all in that case.
 
 This feels about right to me. There are still questions though:
 immediately after opening a file with a BOM, what should .tell()
 return?

tell() in the context of text I/O is specified to return an opaque cookie. So
whatever value it returns would probably be fine, as long as seeking to that
value leaves the file in an acceptable state.

Rewinding (seeking to 0) in the presence of a BOM is already reasonably
supported by the TextIOWrapper object:

>>> import codecs, io
>>> dec = codecs.getincrementaldecoder('utf-16')()
>>> dec.decode(b'\xff\xfea\x00b\x00')
'ab'
>>> dec.decode(b'\xff\xfea\x00b\x00')
'\ufeffab'
>>>
>>> bio = io.BytesIO(b'\xff\xfea\x00b\x00')
>>> f = io.TextIOWrapper(bio, encoding='utf-16')
>>> f.read()
'ab'
>>> f.seek(0)
0
>>> f.read()
'ab'
There are tests for this in test_io.py (test_encoded_writes, line 1929, and
test_append_bom and test_seek_bom, line 2045).

Regards

Antoine.




Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread MRAB

Victor Stinner wrote:
 Le vendredi 08 janvier 2010 05:21:04, Guido van Rossum a écrit :
 (...)
  (And yes, I know this happens. Doesn't mean we need to auto-guess by
  default; there are lots of issues e.g. what should happen after
  seeking to offset 0?)

 I wrote a new version of my patch (version 3):

  * don't change the default behaviour: use open(filename, encoding="BOM") to
    check for a BOM, if there is any
  * fix for seek(0): always ignore the BOM
  * add a unit test: check that the right encoding is detected, and also that
    the BOM is ignored (especially after a seek(0))

 The BOM encoding doesn't work for writing to a file, so open(filename, "w",
 encoding="BOM") raises a ValueError.



I think it's similar to universal newline mode. You can tell it that
you're reading UTF-something-encoded text (common forms only).

The preference is UTF-8, but it could be UTF-8-sig (from Windows), or
possibly UTF-16/32, which really need a BOM because there are multiple
bytes per codepoint, so the bytes could be big-endian or little-endian.
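
A short illustration of why the byte order matters for UTF-16, using only
the standard codecs (the sample character is arbitrary):

```python
import codecs

text = "A"

le = text.encode("utf-16-le")     # explicit little-endian, no BOM
be = text.encode("utf-16-be")     # explicit big-endian, no BOM
with_bom = text.encode("utf-16")  # BOM first, then native byte order

# Without a BOM, the same two bytes read with the wrong byte order decode
# to a completely different character rather than raising an error.
misread = le.decode("utf-16-be")

# With a BOM, the generic "utf-16" codec picks the right byte order.
round_trip = with_bom.decode("utf-16")
```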

The BOM (or signature) tells you what the encoding is, defaulting to
UTF-8 if there's none. If it subsequently raises a UnicodeDecodeError, then
so be it!

Maybe there should also be a way of determining what encoding it decided
it was, so that you can then write a new file in that same encoding.


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Tres Seaver
Guido van Rossum wrote:
 On Thu, Jan 7, 2010 at 10:12 PM, Tres Seaver tsea...@palladion.com wrote:
 The BOM should not be seekable if the file is opened with the proposed
 "guess encoding from BOM" mode:  it isn't properly part of the stream at
 all in that case.
 
 This feels about right to me. There are still questions though:
 immediately after opening a file with a BOM, what should .tell()
 return? And regardless of that, .seek(0) should put the file in that
 same initial state.

I think the behavior should be something like:

  >>> f = open('/path/to/maybe-BOM-encoded-file', 'r', encoding='BOM')
  >>> f.tell()
  0L
  >>> f.seek(-1)
  >>> f.tell() # count of unicode chars in decoded stream
  45L
  >>> f.seek(0)
  >>> f.read(1) # read first unicode char decoded from stream.
  'A'

In other words, the BOM is not readable / seekable at all:  it is
invisible to the consumer of the decoded stream.


Tres.
- --
===
Tres Seaver  +1 540-429-0999  tsea...@palladion.com
Palladion Software   Excellence by Designhttp://palladion.com


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Tres Seaver
M.-A. Lemburg wrote:

 Shouldn't this encoding guessing be a separate function that you call
 on either a file or a seekable stream ?
 
 After all, detecting encodings is just as useful to have for non-file
 streams.

Other stream sources typically have out-of-band ways to signal the
encoding:  only when reading from the filesystem do we pretty much
*have* to guess, and in that case the BOM / signature is the best
heuristic we have.  Also, some non-file streams are not seekable, and so
can't be guessed via a pre-pass.

 You'd then avoid having to stuff everything into
 a single function call and also open up the door for more complex
 application specific guess work or defaults.
 
 The whole process would then have two steps:
 
  1. guess encoding
 
   import codecs
   encoding = codecs.guess_file_encoding(filename)

Filename is not enough information:  or do you mean that API to actually
open the stream?

  2. open the file with the found encoding
 
   f = open(filename, encoding=encoding)
 
 For seekable streams f, you'd have:
 
  1. guess encoding
 
   import codecs
   encoding = codecs.guess_stream_encoding(f)
 
  2. wrap the stream with a reader for the found encoding
 
   reader_class = codecs.getreader(encoding)
   g = reader_class(f)
 


Tres.
- --
===
Tres Seaver  +1 540-429-0999  tsea...@palladion.com
Palladion Software   Excellence by Designhttp://palladion.com


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Tres Seaver
Martin v. Löwis wrote:

 It *is* crazy, but unfortunately rather common.  Wikipedia has a good
 description of the issues:
 http://en.wikipedia.org/wiki/UTF-8#Byte-order_mark.  Basically, some
 Windows text APIs will emit a UTF-8 BOM in order to identify the file as
 being UTF-8, so it's become a convention to do that.  That's not good
 enough, so you need to guess the encoding as well to make sure, but if there
 is a BOM and you can otherwise verify that the file is probably UTF-8
 encoded, you should discard it.
 That doesn't make sense. If the file isn't UTF-8 you can't see the
 BOM, because the BOM itself is UTF-8-encoded.
 
 I think what Glyph meant is this: if a file starts with the UTF-8
 signature, assume it's UTF-8. Then validate the assumption against the
 rest of the file also, and then process it as UTF-8. If the rest clearly
 is not UTF-8, assume that the UTF-8 signature is bogus.

If the programmer opens the file using a "guess using the BOM" encoding,
Python should *not* attempt to verify that the file is properly
encoded:  it should check for (and consume) any BOM, and then return a
stream which uses the encoding inferred from the BOM.  Any errors should
be handled later, when characters are read, exactly as if the file had
been opened with the same encoding guessed from the BOM.

 I understood this proposal as a general processing guideline, not
 something the io library should do (but, say, a text editor).
 
 FWIW, I'm personally in favor of using the UTF-8 signature. If people
 consider them crazy talk, that may be because UTF-8 can't possibly have
 a byte order - hence I call it a signature, not the BOM. As a signature,
 I don't consider it crazy at all. There is a long tradition of having
 magic bytes in files (executable files, Postscript, PDF, ... - see
 /etc/magic). Having a magic byte sequence for plain text to denote the
 encoding is useful and helps reducing moji-bake. This is the reason it's
 used on Windows: notepad would normally assume that text is in the ANSI
 code page, and for compatibility, it can't stop doing that. So the UTF-8
 signature gives them an exit strategy.

Agreed.  Having that marker at the start of the file makes interop with
other tools *much* easier.


Tres.
- --
===
Tres Seaver  +1 540-429-0999  tsea...@palladion.com
Palladion Software   Excellence by Designhttp://palladion.com


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Eric Smith
 Shouldn't this encoding guessing be a separate function that you call
 on either a file or a seekable stream ?

 After all, detecting encodings is just as useful to have for non-file
 streams.

 Other stream sources typically have out-of-band ways to signal the
 encoding:  only when reading from the filesystem do we pretty much
 *have* to guess, and in that case the BOM / signature is the best
 heuristic we have.  Also, some non-file streams are not seekable, and so
 can't be guessed via a pre-pass.

But what if the file were in (for example) a zip file? I think you
definitely want to have access to this functionality outside of open().
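
As an illustration of the zip case, the existing utf-8-sig codec can already
be applied to an archive member by wrapping the binary stream that
ZipFile.open() returns. This sketch only covers UTF-8 with or without a
signature, and the archive contents are made up:

```python
import codecs
import io
import zipfile

# Build a small archive in memory: one member with a UTF-8 signature,
# one without.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("with_sig.txt", codecs.BOM_UTF8 + "grüß\n".encode("utf-8"))
    zf.writestr("without_sig.txt", "grüß\n".encode("utf-8"))


def read_member_text(archive, name):
    """Read one archive member as text, dropping a UTF-8 signature if present."""
    # ZipFile.open() returns a binary file-like object, so it can be wrapped
    # like any other byte stream; no filesystem open() is involved.
    with io.TextIOWrapper(archive.open(name), encoding="utf-8-sig") as text:
        return text.read()


with zipfile.ZipFile(buf) as zf:
    with_sig = read_member_text(zf, "with_sig.txt")
    without_sig = read_member_text(zf, "without_sig.txt")
```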

Eric.



Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread James Y Knight

On Jan 8, 2010, at 4:14 PM, Tres Seaver wrote:

I understood this proposal as a general processing guideline, not
something the io library should do (but, say, a text editor).

FWIW, I'm personally in favor of using the UTF-8 signature. If people
consider them crazy talk, that may be because UTF-8 can't possibly have
a byte order - hence I call it a signature, not the BOM. As a signature,
I don't consider it crazy at all. There is a long tradition of having
magic bytes in files (executable files, Postscript, PDF, ... - see
/etc/magic). Having a magic byte sequence for plain text to denote the
encoding is useful and helps reducing moji-bake. This is the reason it's
used on Windows: notepad would normally assume that text is in the ANSI
code page, and for compatibility, it can't stop doing that. So the UTF-8
signature gives them an exit strategy.

Agreed.  Having that marker at the start of the file makes interop with
other tools *much* easier.


Putting the BOM at the beginning of UTF-8 text files is not a good  
idea, it makes interop much *worse* on a unix system, not better.  
Without the BOM, most commands do the right thing with UTF-8 text.  
E.g. to concatenate two files:


$ cat file-1 file-2 > file-3

With a BOM at the beginning of the file, it won't work right. Of  
course, you could modify cat (and every other stream processing  
command) to know how to consume and emit BOMs, and omit the extra one  
that would show up in the middle of the stream...but even that can't  
work; what about:

$ (cat file-1; cat file-2) > file-3

Should the shell now know that when you run multiple commands, it  
should eat the BOM emitted from the second command?


Basically, using a BOM in a utf-8 file is just not a good idea: it  
completely ruins interop with every standard unix tool.


This is not to say that Python shouldn't have a way to read a file  
with a UTF-8 BOM: it just shouldn't encourage you to *write* such files.
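The concatenation problem can be sketched in Python, with two byte strings standing in for file-1 and file-2 (contents made up for illustration):

```python
import codecs

# Two byte strings standing in for file-1 and file-2, each with a UTF-8 BOM.
file_1 = codecs.BOM_UTF8 + "first\n".encode("utf-8")
file_2 = codecs.BOM_UTF8 + "second\n".encode("utf-8")

# What a plain byte-level concatenation (as cat does) produces.
file_3 = file_1 + file_2

# A BOM-aware reader strips only the leading signature, so the second
# BOM survives as U+FEFF in the middle of the text.
text = file_3.decode("utf-8-sig")
print(repr(text))
```

Even a reader that handles the signature correctly ends up with a stray U+FEFF before the second file's content.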


James


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread M.-A. Lemburg
Tres Seaver wrote:
 M.-A. Lemburg wrote:
 
 Shouldn't this encoding guessing be a separate function that you call
 on either a file or a seekable stream ?
 
 After all, detecting encodings is just as useful to have for non-file
 streams.
 
 Other stream sources typically have out-of-band ways to signal the
 encoding:  only when reading from the filesystem do we pretty much
 *have* to guess, and in that case the BOM / signature is the best
 heuristic we have.  Also, some non-file streams are not seekable, and so
 can't be guessed via a pre-pass.

Sure there are non-seekable file streams, but at least when
using StringIO-type streams you don't have that restriction.

An encoding detection function would be useful in other
cases as well, so instead of hiding away the functionality in
the open() constructor, I'm suggesting to expose it via
the codecs module.

Applications can then use it where necessary and also provide their
own defaults, using other heuristics.

 You'd then avoid having to stuff everything into
 a single function call and also open up the door for more complex
 application specific guess work or defaults.
 
 The whole process would then have two steps:
 
  1. guess encoding
 
   import codecs
   encoding = codecs.guess_file_encoding(filename)
 
 Filename is not enough information:  or do you mean that API to actually
 open the stream?

Yes. The API would open the file, guess the encoding and then
close it again. If you don't want that to happen, you could use
the second API I mentioned below on the already open file.

Note that this function could detect a lot more encodings with
reasonably high probability than just BOM-prefixed ones,
e.g. we could also add support to detect encoding declarations
such as the ones we use in Python source files.

  2. open the file with the found encoding
 
   f = open(filename, encoding=encoding)
 
 For seekable streams f, you'd have:
 
  1. guess encoding
 
   import codecs
   encoding = codecs.guess_stream_encoding(f)

I forgot to mention: This API needs to position the file pointer
to the start of the first data byte.

  2. wrap the stream with a reader for the found encoding
 
   reader_class = codecs.getreader(encoding)
   g = reader_class(f)
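
A rough sketch of the stream variant, to make the two steps concrete. Note that guess_stream_encoding does not exist in the codecs module; the version below is a hypothetical stand-in that only looks at BOMs, and the default of UTF-8 is an assumption:

```python
import codecs
import io

# (BOM, codec that assumes the BOM has already been consumed).
# UTF-32 comes first because its LE BOM starts with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def guess_stream_encoding(f, default="utf-8"):
    """Hypothetical helper, not part of the codecs module.

    Inspect the start of the seekable binary stream `f`, return an
    encoding name, and leave `f` positioned at the first data byte.
    """
    start = f.tell()
    head = f.read(4)  # the longest BOM is 4 bytes
    for bom, name in _BOMS:
        if head.startswith(bom):
            f.seek(start + len(bom))  # skip the BOM itself
            return name
    f.seek(start)  # no BOM: rewind to where we started
    return default


# 1. guess the encoding, 2. wrap the stream with a reader for it
f = io.BytesIO(codecs.BOM_UTF16_BE + "text".encode("utf-16-be"))
encoding = guess_stream_encoding(f)
reader_class = codecs.getreader(encoding)
g = reader_class(f)
content = g.read()
```

Because the helper has already consumed the BOM, it returns the endian-specific codec names, which do not expect a BOM in the data.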

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Jan 08 2010)
 Python/Zope Consulting and Support ...http://www.egenix.com/
 mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
 mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/


::: Try our new mxODBC.Connect Python Database Interface for free ! 


   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
   Registered at Amtsgericht Duesseldorf: HRB 46611
   http://www.egenix.com/company/contact/


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Tres Seaver

Eric Smith wrote:
 Shouldn't this encoding guessing be a separate function that you call
 on either a file or a seekable stream ?

 After all, detecting encodings is just as useful to have for non-file
 streams.
 Other stream sources typically have out-of-band ways to signal the
 encoding:  only when reading from the filesystem do we pretty much
 *have* to guess, and in that case the BOM / signature is the best
 heuristic we have.  Also, some non-file streams are not seekable, and so
 can't be guessed via a pre-pass.
 
 But what if the file were in (for example) a zip file? I think you
 definitely want to have access to this functionality outside of open().

If the application expects a possibly-BOM-signature-marked file, but you
pass it mismatched garbage:

   f = open('some.zip', encoding='BOM')

the error handling should be the same as if you passed any other
mismatched encoding:

   f = open('some.zip', encoding='UTF8')

i.e., you discover the error when you try to read from the (non)encoded
stream, not when you open it.
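
A small sketch of that behaviour with an encoding that exists today (the file content is made up, and the path is a temporary file):

```python
import os
import tempfile

# Bytes that are not valid UTF-8 (a zip-like header followed by 0xFF 0xFE).
path = os.path.join(tempfile.mkdtemp(), "some.zip")
with open(path, "wb") as f:
    f.write(b"PK\x03\x04\xff\xfe")

f = open(path, encoding="UTF8")  # opening succeeds
try:
    f.read()                     # decoding fails here
    failed_on_read = False
except UnicodeDecodeError:
    failed_on_read = True
finally:
    f.close()

print(failed_on_read)
```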


Tres.
- --
===
Tres Seaver  +1 540-429-0999  tsea...@palladion.com
Palladion Software   Excellence by Designhttp://palladion.com


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Victor Stinner
Le vendredi 08 janvier 2010 22:40:47, Eric Smith a écrit :
  Shouldn't this encoding guessing be a separate function that you call
  on either a file or a seekable stream ?
 
  After all, detecting encodings is just as useful to have for non-file
  streams.
 
  Other stream sources typically have out-of-band ways to signal the
  encoding:  only when reading from the filesystem do we pretty much
  *have* to guess, and in that case the BOM / signature is the best
  heuristic we have.  Also, some non-file streams are not seekable, and so
  can't be guessed via a pre-pass.
 
 But what if the file were in (for example) a zip file? I think you
 definitely want to have access to this functionality outside of open().

FYI my patch (encoding="BOM") is implemented in TextIOWrapper, and 
TextIOWrapper takes a binary stream as input, not a filename.
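
To illustrate the layering only: the sketch below wraps an in-memory binary stream and uses the existing utf-16 codec, which already reads the BOM to pick the byte order. It does not use the proposed BOM value from the patch, which is not in the standard library.

```python
import codecs
import io

# Any binary stream will do; BytesIO stands in for a pipe or a zip member.
raw = io.BytesIO(codecs.BOM_UTF16_LE + "hi".encode("utf-16-le"))

# The existing "utf-16" codec uses the BOM to pick the byte order
# and does not return it as part of the text.
text = io.TextIOWrapper(raw, encoding="utf-16").read()
print(repr(text))
```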

-- 
Victor Stinner
http://www.haypocalc.com/


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Georg Brandl
Am 08.01.2010 22:14, schrieb Tres Seaver:

 FWIW, I'm personally in favor of using the UTF-8 signature. If people
 consider them crazy talk, that may be because UTF-8 can't possibly have
 a byte order - hence I call it a signature, not the BOM. As a signature,
 I don't consider it crazy at all. There is a long tradition of having
 magic bytes in files (executable files, Postscript, PDF, ... - see
 /etc/magic). Having a magic byte sequence for plain text to denote the
 encoding is useful and helps reducing moji-bake. This is the reason it's
 used on Windows: notepad would normally assume that text is in the ANSI
 code page, and for compatibility, it can't stop doing that. So the UTF-8
 signature gives them an exit strategy.
 
 Agreed.  Having that marker at the start of the file makes interop with
 other tools *much* easier.

Except if only 50% of the other tools support the signature.

Georg

-- 
Thus spake the Lord: Thou shalt indent with four spaces. No more, no less.
Four shall be the number of spaces thou shalt indent, and the number of thy
indenting shall be four. Eight shalt thou not indent, nor either indent thou
two, excepting that thou then proceed to four. Tabs are right out.



Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Nick Coghlan
MRAB wrote:
 Maybe there should also be a way of determining what encoding it decided
 it was, so that you can then write a new file in that same encoding.

I thought of that question as well - the f.encoding attribute on the
opened file should be sufficient.
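
A sketch of how that would be used, with in-memory streams and the existing utf-16 codec standing in for a detected encoding. With today's codecs, f.encoding simply reports the name that was passed in; a BOM-sniffing mode would presumably report the detected name instead, which is an assumption here.

```python
import io

# Python's "utf-16" codec writes a BOM when encoding.
raw = io.BytesIO("caf\u00e9\n".encode("utf-16"))
f = io.TextIOWrapper(raw, encoding="utf-16")
text = f.read()

# Reuse the reader's encoding name for the new output.
# newline="" keeps "\n" as-is on every platform.
out = io.BytesIO()
writer = io.TextIOWrapper(out, encoding=f.encoding, newline="")
writer.write(text)
writer.flush()

print(f.encoding)
```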

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
---


[Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-07 Thread Victor Stinner
Hi,

The builtin open() function is unable to open a UTF-16/32 file starting with a 
BOM if the encoding is not specified (it raises a unicode error). For a UTF-8 
file starting with a BOM, read()/readline() also return the BOM, whereas the 
BOM should be ignored.

See the recent issues related to reading a UTF-8 text file that includes a BOM: 
#7185 (csv) and #7519 (ConfigParser). Such a file can be opened in unicode mode 
with the UTF-8-SIG encoding, but it's possible to do better.
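
A minimal sketch of the current behaviour (the file name and contents are made up for illustration): plain UTF-8 keeps the BOM as a leading U+FEFF, while the existing UTF-8-SIG codec drops it.

```python
import codecs
import os
import tempfile

# Write a small UTF-8 file that starts with a BOM (illustrative content).
path = os.path.join(tempfile.mkdtemp(), "bom.txt")
with open(path, "wb") as f:
    f.write(codecs.BOM_UTF8 + "key = value\n".encode("utf-8"))

# Plain UTF-8 keeps the BOM as a leading U+FEFF character.
with open(path, encoding="utf-8") as f:
    with_bom = f.readline()

# UTF-8-SIG strips the signature if it is present.
with open(path, encoding="utf-8-sig") as f:
    without_bom = f.readline()

print(repr(with_bom))
print(repr(without_bom))
```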

I propose to improve open() (TextIOWrapper) by using the BOM to choose the 
right encoding. I think that only files opened in read-only mode should 
support this new feature. *Reading* the BOM in a *write*-only file would cause 
unexpected behaviour.

Since my proposal changes the result of TextIOWrapper.read()/readline() for 
files starting with a BOM, we might introduce an option to open() to enable 
the new behaviour. But is it really necessary to keep backward compatibility?

I wrote a proof of concept attached to issue #7651. My patch only changes 
the behaviour of TextIOWrapper for reading files starting with a BOM. It 
doesn't work yet if seek() is used before the first read.
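
For illustration only, here is a rough approximation of the idea done outside TextIOWrapper. This is not the patch from the issue: the names detect_bom_encoding and open_bom are hypothetical, and the UTF-8 fallback is an assumption.

```python
import codecs

# BOMs ordered so that UTF-32 is tested before UTF-16:
# the UTF-32-LE BOM begins with the UTF-16-LE BOM bytes.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def detect_bom_encoding(head, fallback="utf-8"):
    """Return a codec name for the BOM at the start of `head`, else `fallback`."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return fallback


def open_bom(path, fallback="utf-8"):
    """Hypothetical helper: open `path` read-only, picking the codec from its BOM."""
    with open(path, "rb") as raw:
        head = raw.read(4)  # the longest BOM is 4 bytes
    return open(path, "r", encoding=detect_bom_encoding(head, fallback))
```

The returned codec names ("utf-16", "utf-32", "utf-8-sig") are the existing ones that consume the BOM themselves, so the text stream never exposes it. One known ambiguity: a UTF-16-LE file whose first character after the BOM is U+0000 looks like UTF-32-LE to this check.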

-- 
Victor Stinner
http://www.haypocalc.com/


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-07 Thread Guido van Rossum
I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy
talk. And for the other two, perhaps it would make more sense to have
a separate encoding-guessing function that takes a binary stream and
returns a text stream wrapping it with the proper encoding?

--Guido

On Thu, Jan 7, 2010 at 4:10 PM, Victor Stinner
victor.stin...@haypocalc.com wrote:
 Hi,

 Builtin open() function is unable to open an UTF-16/32 file starting with a
 BOM if the encoding is not specified (raise an unicode error). For an UTF-8
 file starting with a BOM, read()/readline() returns also the BOM whereas the
 BOM should be ignored.

 See recent issues related to reading an UTF-8 text file including a BOM: #7185
 (csv) and #7519 (ConfigParser). Such file can be opened in unicode mode with
 the UTF-8-SIG encoding, but it's possible to do better.

 I propose to improve open() (TextIOWrapper) by using the BOM to choose the
 right encoding. I think that only files opened in read only mode should
 support this new feature. *Read* the BOM in a *write* only file would cause
 unexpected behaviours.

 Since my proposition changes the result TextIOWrapper.read()/readline() for
 files starting with a BOM, we might introduce an option to open() to enable
 the new behaviour. But is it really needed to keep the backward compatibility?

 I wrote a proof of concept attached to the issue #7651. My patch only changes
 the behaviour of TextIOWrapper for reading files starting with a BOM. It
 doesn't work yet if a seek() is used before the first read.

 --
 Victor Stinner
 http://www.haypocalc.com/




-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-07 Thread MRAB

Guido van Rossum wrote:

I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy
talk. And for the other two, perhaps it would make more sense to have
a separate encoding-guessing function that takes a binary stream and
returns a text stream wrapping it with the proper encoding?

Alternatively, have a universal UTF-8/16/32 encoding, i.e. one that
expects UTF-8, with or without BOM, or UTF-16/32 with BOM.
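
A minimal sketch of what such a universal decoder could do for a complete byte string. The function name decode_universal is made up, and this is a plain function rather than a registered codec:

```python
import codecs


def decode_universal(data):
    """Decode UTF-8 (with or without BOM) or BOM-prefixed UTF-16/32 bytes."""
    # UTF-32 must be checked first: its little-endian BOM starts with
    # the same two bytes as the UTF-16 little-endian BOM.
    if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return data.decode("utf-32")
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return data.decode("utf-16")
    # "utf-8-sig" accepts UTF-8 both with and without the signature.
    return data.decode("utf-8-sig")
```

UTF-16/32 data without a BOM falls through to the UTF-8 branch and will normally fail to decode, which matches the "UTF-16/32 with BOM" restriction.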


On Thu, Jan 7, 2010 at 4:10 PM, Victor Stinner
victor.stin...@haypocalc.com wrote:

Hi,

Builtin open() function is unable to open an UTF-16/32 file starting with a
BOM if the encoding is not specified (raise an unicode error). For an UTF-8
file starting with a BOM, read()/readline() returns also the BOM whereas the
BOM should be ignored.

See recent issues related to reading an UTF-8 text file including a BOM: #7185
(csv) and #7519 (ConfigParser). Such file can be opened in unicode mode with
the UTF-8-SIG encoding, but it's possible to do better.

I propose to improve open() (TextIOWrapper) by using the BOM to choose the
right encoding. I think that only files opened in read only mode should
support this new feature. *Read* the BOM in a *write* only file would cause
unexpected behaviours.

Since my proposition changes the result TextIOWrapper.read()/readline() for
files starting with a BOM, we might introduce an option to open() to enable
the new behaviour. But is it really needed to keep the backward compatibility?

I wrote a proof of concept attached to the issue #7651. My patch only changes
the behaviour of TextIOWrapper for reading files starting with a BOM. It
doesn't work yet if a seek() is used before the first read.




Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-07 Thread Glyph Lefkowitz


On Jan 7, 2010, at 7:52 PM, Guido van Rossum wrote:

 On Thu, Jan 7, 2010 at 4:10 PM, Victor Stinner
 victor.stin...@haypocalc.com wrote:
 Hi,
 
 Builtin open() function is unable to open an UTF-16/32 file starting with a
 BOM if the encoding is not specified (raise an unicode error). For an UTF-8
 file starting with a BOM, read()/readline() returns also the BOM whereas the
 BOM should be ignored.

 I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy
 talk. And for the other two, perhaps it would make more sense to have
 a separate encoding-guessing function that takes a binary stream and
 returns a text stream wrapping it with the proper encoding?

It *is* crazy, but unfortunately rather common.  Wikipedia has a good 
description of the issues: 
http://en.wikipedia.org/wiki/UTF-8#Byte-order_mark.  Basically, some Windows 
text APIs will emit a UTF-8 BOM in order to identify the file as being UTF-8, 
so it's become a convention to do that.  That's not good enough, so you need to 
guess the encoding as well to make sure, but if there is a BOM and you can 
otherwise verify that the file is probably UTF-8 encoded, you should discard it.



Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-07 Thread Guido van Rossum
On Thu, Jan 7, 2010 at 7:34 PM, Glyph Lefkowitz gl...@twistedmatrix.com wrote:


 On Jan 7, 2010, at 7:52 PM, Guido van Rossum wrote:

 On Thu, Jan 7, 2010 at 4:10 PM, Victor Stinner
 victor.stin...@haypocalc.com wrote:

 Hi,

 Builtin open() function is unable to open an UTF-16/32 file starting with a
 BOM if the encoding is not specified (raise an unicode error). For an UTF-8
 file starting with a BOM, read()/readline() returns also the BOM whereas the
 BOM should be ignored.

 I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy
 talk. And for the other two, perhaps it would make more sense to have
 a separate encoding-guessing function that takes a binary stream and
 returns a text stream wrapping it with the proper encoding?

 It *is* crazy, but unfortunately rather common.  Wikipedia has a good
 description of the issues:
 http://en.wikipedia.org/wiki/UTF-8#Byte-order_mark.  Basically, some
 Windows text APIs will emit a UTF-8 BOM in order to identify the file as
 being UTF-8, so it's become a convention to do that.  That's not good
 enough, so you need to guess the encoding as well to make sure, but if there
 is a BOM and you can otherwise verify that the file is probably UTF-8
 encoded, you should discard it.

That doesn't make sense. If the file isn't UTF-8 you can't see the
BOM, because the BOM itself is UTF-8-encoded.
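
In concrete terms, a quick check using only the standard codecs:

```python
import codecs

# U+FEFF encoded as UTF-8 is the three-byte signature EF BB BF.
signature = "\ufeff".encode("utf-8")
print(signature == codecs.BOM_UTF8)

# The same bytes read as Latin-1 are three ordinary characters,
# not a BOM.
as_latin1 = codecs.BOM_UTF8.decode("latin-1")
print(repr(as_latin1))
```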

(And yes, I know this happens. Doesn't mean we need to auto-guess by
default; there are lots of issues e.g. what should happen after
seeking to offset 0?)
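
For comparison, this is a sketch of what the existing utf-8-sig codec does on seek(0), as far as I can tell from CPython's TextIOWrapper; it says nothing about what an auto-guessing mode should do:

```python
import codecs
import io

raw = io.BytesIO(codecs.BOM_UTF8 + b"abc")
f = io.TextIOWrapper(raw, encoding="utf-8-sig")

first = f.read()
f.seek(0)          # rewind the text stream to the very start
second = f.read()  # the decoder is reset, so the BOM is skipped again

print(repr(first), repr(second))
```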

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-07 Thread Stephen J. Turnbull
Guido van Rossum writes:

  I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy
  talk.

That doesn't stop many applications from doing it.  Python should
perhaps <wink, nudge> not produce UTF-8 + BOM without a disclaimer of
indemnification against all resulting damage, signed in blood, from
the user for each instance.

But it should do something sane when reading such files.  I can't
really see any harm in throwing it away, especially since use of
ZERO-WIDTH NO-BREAK SPACE as a joining character has been deprecated
IIRC.






Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-07 Thread Tres Seaver

Guido van Rossum wrote:
 On Thu, Jan 7, 2010 at 7:34 PM, Glyph Lefkowitz gl...@twistedmatrix.com 
 wrote:

 On Jan 7, 2010, at 7:52 PM, Guido van Rossum wrote:

 On Thu, Jan 7, 2010 at 4:10 PM, Victor Stinner
 victor.stin...@haypocalc.com wrote:

 Hi,

 Builtin open() function is unable to open an UTF-16/32 file starting with a
 BOM if the encoding is not specified (raise an unicode error). For an UTF-8
 file starting with a BOM, read()/readline() returns also the BOM whereas the
 BOM should be ignored.

 I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy
 talk. And for the other two, perhaps it would make more sense to have
 a separate encoding-guessing function that takes a binary stream and
 returns a text stream wrapping it with the proper encoding?

 It *is* crazy, but unfortunately rather common.  Wikipedia has a good
 description of the issues:
 http://en.wikipedia.org/wiki/UTF-8#Byte-order_mark.  Basically, some
 Windows text APIs will emit a UTF-8 BOM in order to identify the file as
 being UTF-8, so it's become a convention to do that.  That's not good
 enough, so you need to guess the encoding as well to make sure, but if there
 is a BOM and you can otherwise verify that the file is probably UTF-8
 encoded, you should discard it.
 
 That doesn't make sense. If the file isn't UTF-8 you can't see the
 BOM, because the BOM itself is UTF-8-encoded.
 
 (And yes, I know this happens. Doesn't mean we need to auto-guess by
 default; there are lots of issues e.g. what should happen after
 seeking to offset 0?)

The BOM should not be seekable if the file is opened with the proposed
"guess encoding from BOM" mode:  it isn't properly part of the stream at
all in that case.

A UTF-8 BOM is an absurdity, but it exists *everywhere* in the wild:
Python would do well to make it as easy as possible to consume such
files, as well as the non-insane versions (UTF-16 / UTF-32 BOMs).  In
the best of all possible worlds, I would just try opening the file so:

  f = open('/path/to/file', 'r', encoding="DWIFM")

and any BOM present would set the encoding for the remainder of the stream.



Tres.
- --
===
Tres Seaver  +1 540-429-0999  tsea...@palladion.com
Palladion Software   Excellence by Designhttp://palladion.com


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-07 Thread Glyph Lefkowitz

On Jan 7, 2010, at 11:21 PM, Guido van Rossum wrote:

 On Thu, Jan 7, 2010 at 7:34 PM, Glyph Lefkowitz gl...@twistedmatrix.com 
 wrote:
 
 On Jan 7, 2010, at 7:52 PM, Guido van Rossum wrote:
 
 I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy
 talk. And for the other two, perhaps it would make more sense to have
 a separate encoding-guessing function that takes a binary stream and
 returns a text stream wrapping it with the proper encoding?
 
 It *is* crazy, but unfortunately rather common.  Wikipedia has a good
 description of the issues:
 http://en.wikipedia.org/wiki/UTF-8#Byte-order_mark.  Basically, some
 Windows text APIs will emit a UTF-8 BOM in order to identify the file as
 being UTF-8, so it's become a convention to do that.  That's not good
 enough, so you need to guess the encoding as well to make sure, but if there
 is a BOM and you can otherwise verify that the file is probably UTF-8
 encoded, you should discard it.
 
 That doesn't make sense. If the file isn't UTF-8 you can't see the
 BOM, because the BOM itself is UTF-8-encoded.

I'm saying that the BOM itself isn't enough to detect that the file is actually 
UTF-8.  If (for whatever reason: explicitly specified, guessed in some other 
way) the file's encoding is determined to be something else, the bytes 
comprising the BOM should be decoded as normal.  It's just that the UTF-8 
decoding of the BOM at the start of a file should be the empty string.

 (And yes, I know this happens. Doesn't mean we need to auto-guess by
 default; there are lots of issues e.g. what should happen after
 seeking to offset 0?)

I think it's pretty clear that the BOM should still be skipped in that case ...
