Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
On 09.01.10 14:38, Victor Stinner wrote:
> On Saturday 09 January 2010 12:18:33, Walter Dörwald wrote:
>>> Good idea, I chose open(filename, encoding="BOM").
>>
>> On the surface this looks like there's an encoding named "BOM", but looking at your patch I found that the check is still done in TextIOWrapper. IMHO the best approach would be to implement a *real* codec named "BOM" (or "sniff"). This doesn't require *any* changes to the IO library. It could even be developed as a standalone project and published in the Cheeseshop.
>
> Why not, this is another solution to point (2) (check for a BOM while reading, or detect it before?). Which encoding would be used if there is no BOM? UTF-8 sounds like a good choice.

UTF-8 might be a good choice, or the fallback could be specified in the encoding name, i.e. open("file.txt", encoding="BOM-UTF-8") falls back to UTF-8 if there's no BOM at the start. This could be implemented via a custom codec search function (see http://docs.python.org/library/codecs.html#codecs.register for more info).

Servus,
Walter
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
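A rough sketch of the codec-search-function idea described above, not the patch under discussion. The name scheme "bom-<fallback>" and the helpers `bom_search` and `_make_decoder` are hypothetical. Only the stateless decode path is covered; open() would additionally need an incremental decoder, which is left out here.

```python
import codecs

# UTF-32 is tested before UTF-16 because the UTF-32-LE BOM begins with
# the UTF-16-LE BOM (FF FE). This ordering is an assumption of the sketch.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def _make_decoder(fallback):
    """Build a stateless decode function that sniffs a leading BOM."""
    def decode(data, errors="strict"):
        data = bytes(data)
        for bom, encoding in _BOMS:
            if data.startswith(bom):
                # Strip the BOM and decode with the encoding it implies.
                return data[len(bom):].decode(encoding, errors), len(data)
        # No BOM found: use the fallback taken from the codec name.
        return data.decode(fallback, errors), len(data)
    return decode


def bom_search(name):
    """Hypothetical search function for names such as 'bom-utf-8'."""
    # Newer Python versions hand the name over with '_' instead of '-'.
    name = name.replace("-", "_")
    if not name.startswith("bom_"):
        return None
    try:
        fallback = codecs.lookup(name[len("bom_"):])
    except LookupError:
        return None
    return codecs.CodecInfo(
        encode=fallback.encode,  # encoding simply uses the fallback codec
        decode=_make_decoder(fallback.name),
        name=name,
    )


codecs.register(bom_search)
```

With the search function registered, `some_bytes.decode("bom-utf-8")` uses the BOM if there is one and UTF-8 otherwise.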
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
On Mon, Jan 11, 2010 at 11:37, Walter Dörwald wal...@livinglogic.de wrote:
> UTF-8 might be a good choice

No, the fallback if there is no BOM should be the local settings, just as the fallback is today if you don't specify a codec. I mean, what if you want to look for a BOM but fall back to something else? How far will we go with encoding special information in the codec names? codec='BOM else UTF-16 else locale'? :-)

BOM is not a locale, and should not be a locale. Having a locale called BOM is wrong per se. It should either be the default to look for a BOM when codec=None, or a special parameter. If neither of these is desired, then we need a special function that takes a filename or file handle, looks for a BOM, and returns the codec found or None. But I find that much less natural and obvious than checking the BOM when codec=None.

--
Lennart Regebro: http://regebro.wordpress.com/
Python 3 Porting: http://python-incompatibility.googlecode.com/
+33 661 58 14 64
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
On Mon, Jan 11, 2010 at 12:12, Lennart Regebro rege...@gmail.com wrote:
> BOM is not a locale, and should not be a locale. Having a locale called BOM is wrong per se.

D'oh! I mean codec here, obviously.

--
Lennart Regebro: Python, Zope, Plone, Grok
http://regebro.wordpress.com/
+33 661 58 14 64
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
On 10.01.10 00:40, Martin v. Löwis wrote:
>>> How does the requirement that it be implemented as a codec miss the point?
>>
>> If we want it to be the default, it must be able to fall back on the current locale-based algorithm if no BOM is found. I don't think it would be easy for a codec to do that.
>
> Yes - however, Victor currently apparently *doesn't* want it to be the default, but wants the user to specify encoding=BOM. If so, it isn't the default, and it is easy to implement as a codec.
>
>>> FWIW, I agree with Walter that if it is provided through the encoding= argument, it should be a codec. If it is built into the open function (for whatever reason), it must be provided by some other parameter.
>>
>> Why not simply encoding=None?
>
> I don't mind. Please re-read Walter's message - it only said that *if* this is activated through encoding=BOM, *then* it must be a codec, and could be on PyPI. I don't think Walter was talking about the case it is not activated through encoding='BOM' *at all*.

However if this autodetection feature is useful in other cases (no matter how it's activated), it should be a codec, because as part of the open() function it isn't reusable.

>> The default value should provide the most useful behaviour possible. Forcing users to choose between two different autodetection strategies (encoding=None and another one) is a little insane IMO.

And encoding=mbcs is a third option on Windows.

> That wouldn't disturb me much. There are a lot of things in that area that are a little insane, starting with Microsoft Windows :-)

;)

Servus,
Walter
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
> However if this autodetection feature is useful in other cases (no matter how it's activated), it should be a codec, because as part of the open() function it isn't reusable.

It is reusable as part of io.TextIOWrapper, though.

Regards

Antoine.
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
On Mon, Jan 11, 2010 at 13:29, Walter Dörwald wal...@livinglogic.de wrote:
> However if this autodetection feature is useful in other cases (no matter how it's activated), it should be a codec, because as part of the open() function it isn't reusable.

But an autodetect feature is not a codec. Sure it should be reusable, but making it a codec seems to be a weird hack to me. And how would you reuse it if it was a codec? A reusable autodetect feature would be usable to detect what codec it is. An autodetect codec would not be useful for that, as it would simply just decode.

--
Lennart Regebro: Python, Zope, Plone, Grok
http://regebro.wordpress.com/
+33 661 58 14 64
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
On 11.01.10 13:45, Lennart Regebro wrote:
> On Mon, Jan 11, 2010 at 13:29, Walter Dörwald wal...@livinglogic.de wrote:
>> However if this autodetection feature is useful in other cases (no matter how it's activated), it should be a codec, because as part of the open() function it isn't reusable.
>
> But an autodetect feature is not a codec. Sure it should be reusable, but making it a codec seems to be a weird hack to me.

I think we already had this discussion two years ago in the context of XML decoding ;):

http://mail.python.org/pipermail/python-dev/2007-November/075138.html

> And how would you reuse it if it was a codec? A reusable autodetect feature would be usable to detect what codec it is. An autodetect codec would not be useful for that, as it would simply just decode.

I have implemented an XML codec (as part of XIST: http://pypi.python.org/pypi/ll-xist), that can do that:

>>> from ll import xml_codec
>>> import codecs
>>> c = codecs.getincrementaldecoder("xml")()
>>> c.encoding
>>> c.decode("<?xml")
u''
>>> c.encoding
>>> c.decode(" version='1.0'")
u''
>>> c.encoding
>>> c.decode(" encoding='iso-8859-1'?>")
u"<?xml version='1.0' encoding='iso-8859-1'?>"
>>> c.encoding
'iso-8859-1'

Servus,
Walter
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
On Mon, Jan 11, 2010 at 14:21, Walter Dörwald wal...@livinglogic.de wrote:
> I think we already had this discussion two years ago in the context of XML decoding ;):

Yup. And Martin's answer then is my answer now:

>> So the code is good, if it is inside an XML parser, and it's bad if it is inside a codec?
>
> Exactly so. This functionality just *isn't* a codec - there is no encoding. Instead, it is an algorithm for *detecting* an encoding.

The conclusion was that a method to autodetect encodings would be good. I think the same conclusion applies here.

--
Lennart Regebro: Python, Zope, Plone, Grok
http://regebro.wordpress.com/
+33 661 58 14 64
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
> But an autodetect feature is not a codec. Sure it should be reusable, but making it a codec seems to be a weird hack to me.

Well, the existing UTF-16 codec also is an autodetect feature (to detect the endianness), and I don't consider it a weird hack.

Regards,
Martin
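To illustrate the endianness detection mentioned above, here is a small sketch using the standard library's stateful "utf-16" decoder. The helper name `decode_utf16_chunks` is made up for this example.

```python
import codecs


def decode_utf16_chunks(chunks):
    """Decode byte chunks with the stateful 'utf-16' decoder.

    The decoder inspects the BOM in the first bytes it sees and then
    switches to little-endian or big-endian decoding for the rest.
    """
    decoder = codecs.getincrementaldecoder("utf-16")()
    parts = [decoder.decode(chunk) for chunk in chunks]
    parts.append(decoder.decode(b"", final=True))  # flush the decoder
    return "".join(parts)


big_endian = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")
little_endian = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")

# Both byte orders decode to the same text, and the BOM is not returned.
text_from_be = decode_utf16_chunks([big_endian])
# The BOM may even be split across chunks.
text_from_le = decode_utf16_chunks(
    [little_endian[:1], little_endian[1:3], little_endian[3:]]
)
```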
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
On Mon, Jan 11, 2010 at 18:16, Martin v. Löwis mar...@v.loewis.de wrote:
>> But an autodetect feature is not a codec. Sure it should be reusable, but making it a codec seems to be a weird hack to me.
>
> Well, the existing UTF-16 codec also is an autodetect feature (to detect the endianness), and I don't consider it a weird hack.

So the BOM codec should raise a UnicodeDecodeError if there is no BOM? Because that's what it would have to do in that case: it can't fall back on anything, and it has to handle and implement all encodings that have a BOM. And is it then actually very useful? You would have to do a try/except, first with encoding='BOM' and then with encoding=None, to get the fallback to the standard.

I must say that I find this whole thing pretty obvious. 'BOM' is not an encoding. Either there should be a method to get the encoding from the BOM, returning None if there isn't one, or open() should look at the BOM when you pass in encoding=None. Or both.

That covers all use cases, and is easy and obvious. Either

    open(file=foo, encoding=None)

or

    open(file, encoding=encoding_from_bom(file))

I can't see that open(file, encoding='BOM') has any benefit over this, covers any extra use case or is clearer in any way. Instead it adds something confusing: an encoding that isn't an encoding.

--
Lennart Regebro: Python, Zope, Plone, Grok
http://regebro.wordpress.com/
+33 661 58 14 64
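A minimal sketch of what a helper like the encoding_from_bom() mentioned above might look like. No such function exists in the standard library; the returned codec names ("utf-8-sig", "utf-16", "utf-32") are a choice made here because those codecs strip the BOM while decoding, and `open_text` is a hypothetical convenience wrapper.

```python
import codecs

# UTF-32 is checked before UTF-16 because the UTF-32-LE BOM starts
# with the same two bytes as the UTF-16-LE BOM.
_BOM_ENCODINGS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def encoding_from_bom(path):
    """Return a codec name implied by the file's BOM, or None."""
    with open(path, "rb") as f:
        head = f.read(4)  # the longest BOM is four bytes
    for bom, encoding in _BOM_ENCODINGS:
        if head.startswith(bom):
            return encoding
    return None


def open_text(path):
    """Open for reading; encoding=None means the locale default."""
    return open(path, encoding=encoding_from_bom(path))
```

Because encoding=None already selects the locale-based default, a missing BOM falls through to today's behaviour without a try/except.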
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
Lennart Regebro wrote:
> On Mon, Jan 11, 2010 at 11:37, Walter Dörwald wal...@livinglogic.de wrote:
>> UTF-8 might be a good choice
>
> No, the fallback if there is no BOM should be the local settings, just as the fallback is today if you don't specify a codec. I mean, what if you want to look for a BOM but fall back to something else? How far will we go with encoding special information in the codec names? codec='BOM else UTF-16 else locale'? :-)
>
> BOM is not a locale, and should not be a locale. Having a locale called BOM is wrong per se. It should either be the default to look for a BOM when codec=None, or a special parameter. If neither of these is desired, then we need a special function that takes a filename or file handle, looks for a BOM, and returns the codec found or None. But I find that much less natural and obvious than checking the BOM when codec=None.

Or pass a function that accepts a byte stream or the first few bytes and returns the encoding and any unused bytes (because the byte stream might not be seekable)?

    def guess_encoding(byte_stream):
        data = byte_stream.read(2)
        if data == b"\xFE\xFF":
            return "UTF-16BE", b""
        return "UTF-8", data

    text_file = open(filename, encoding=guess_encoding)
    ...
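Since open() does not accept a callable for encoding, the proposal above can only be approximated today. The sketch below repeats the guess_encoding function from the message and adds a hypothetical helper, `read_text`, that prepends the unused bytes before decoding, which is why it also works on streams that cannot seek.

```python
def guess_encoding(byte_stream):
    """Return (encoding, unused_bytes) after peeking at two bytes."""
    data = byte_stream.read(2)
    if data == b"\xFE\xFF":
        # UTF-16 big-endian BOM: consumed, nothing to hand back.
        return "UTF-16BE", b""
    # No BOM: the two bytes are real content and must not be lost.
    return "UTF-8", data


def read_text(byte_stream, guess=guess_encoding):
    """Read a whole binary stream as text using a guessing function."""
    encoding, unused = guess(byte_stream)
    # Prepend the bytes the guesser read but did not consume.
    return (unused + byte_stream.read()).decode(encoding)
```

For example, `read_text(io.BytesIO(b"\xfe\xff\x00h\x00i"))` decodes the payload as UTF-16BE, while a stream without the BOM is decoded as UTF-8 with its first two bytes intact.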
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
> I must say that I find this whole thing pretty obvious. 'BOM' is not an encoding.

That I certainly agree with.

> That covers all use cases, and is easy and obvious. Either
>
>     open(file=foo, encoding=None)
>
> or
>
>     open(file, encoding=encoding_from_bom(file))
>
> I can't see that open(file, encoding='BOM') has any benefit over this,

Well, it would have the advantage that Walter pointed out: you can implement it independently of the open() implementation, and even provide it in older versions of Python.

Regards,
Martin
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
On Thu, Jan 7, 2010 at 4:10 PM, Victor Stinner victor.stin...@haypocalc.com wrote:
> Hi,
>
> The builtin open() function is unable to open a UTF-16/32 file starting with a BOM if the encoding is not specified (it raises a Unicode error). For a UTF-8 file starting with a BOM, read()/readline() also returns the BOM, whereas the BOM should be ignored.
>
> [...]

I had similar issues too (please read below ;o) ...

On Thu, Jan 7, 2010 at 7:52 PM, Guido van Rossum gu...@python.org wrote:
> I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy talk. And for the other two, perhaps it would make more sense to have a separate encoding-guessing function that takes a binary stream and returns a text stream wrapping it with the proper encoding?

About guessing the encoding, I experienced this issue while I was developing a Trac plugin. What I was doing is as follows:

- I guessed the MIME type + charset encoding using the Trac MIME API (it was a CSV file encoded using UTF-16)
- I read the file using `open`
- Then wrapped the file using `codecs.EncodedFile`
- Then used `csv.reader`

... and still get the BOM in the first value of the first row in the CSV file.

{{{
#!python
>>> mimetype
'utf-16-le'
>>> ef = EncodedFile(f, 'utf-8', mimetype)
}}}

IMO I think I am +1 for leaving `open` just like it is, and using module `codecs` to deal with encodings, but I am strongly -1 for returning the BOM while using `EncodedFile` (mainly because the encoding is explicitly supplied ;o)

> --Guido

CMIIW anyway ...

--
Regards,
Olemis.
Blog ES: http://simelo-es.blogspot.com/
Blog EN: http://simelo-en.blogspot.com/
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
Olemis Lang wrote:
> On Thu, Jan 7, 2010 at 4:10 PM, Victor Stinner victor.stin...@haypocalc.com wrote:
>> Hi,
>>
>> The builtin open() function is unable to open a UTF-16/32 file starting with a BOM if the encoding is not specified (it raises a Unicode error). For a UTF-8 file starting with a BOM, read()/readline() also returns the BOM, whereas the BOM should be ignored.
>>
>> [...]
>
> I had similar issues too (please read below ;o) ...
>
> On Thu, Jan 7, 2010 at 7:52 PM, Guido van Rossum gu...@python.org wrote:
>> I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy talk. And for the other two, perhaps it would make more sense to have a separate encoding-guessing function that takes a binary stream and returns a text stream wrapping it with the proper encoding?
>
> About guessing the encoding, I experienced this issue while I was developing a Trac plugin. What I was doing is as follows:
>
> - I guessed the MIME type + charset encoding using the Trac MIME API (it was a CSV file encoded using UTF-16)
> - I read the file using `open`
> - Then wrapped the file using `codecs.EncodedFile`
> - Then used `csv.reader`
>
> ... and still get the BOM in the first value of the first row in the CSV file.

You didn't say, but I presume that the charset guessing logic returned either 'utf-16-le' or 'utf-16-be' - those encodings don't remove the leading BOM. The 'utf-16' codec will remove the BOM.

> {{{
> #!python
> >>> mimetype
> 'utf-16-le'
> >>> ef = EncodedFile(f, 'utf-8', mimetype)
> }}}

Same here: the UTF-8 codec will not remove the BOM, you have to use the 'utf-8-sig' codec for that.

> IMO I think I am +1 for leaving `open` just like it is, and using module `codecs` to deal with encodings, but I am strongly -1 for returning the BOM while using `EncodedFile` (mainly because the encoding is explicitly supplied ;o)

Note that EncodedFile() doesn't do any fancy BOM detection or filtering. This is the job of the codecs.

Also note that BOM removal is only valid at the beginning of a file. All subsequent BOM bytes have to be read as-is (they map to a zero-width non-breaking space) - without removing them.

--
Marc-Andre Lemburg
eGenix.com
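The codec behaviour described above can be checked with a few lines of standard-library code. This is only an illustration; the sample text "a,b" is arbitrary.

```python
import codecs

text = "a,b"
utf16_le_bytes = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
utf8_bytes = codecs.BOM_UTF8 + text.encode("utf-8")

# Codecs with an explicit byte order keep the BOM as U+FEFF.
kept_16 = utf16_le_bytes.decode("utf-16-le")
# The generic 'utf-16' codec consumes the BOM to pick the byte order.
stripped_16 = utf16_le_bytes.decode("utf-16")

# Plain 'utf-8' keeps the signature, 'utf-8-sig' removes it.
kept_8 = utf8_bytes.decode("utf-8")
stripped_8 = utf8_bytes.decode("utf-8-sig")

# Only a leading BOM is removed; a later one stays in the text.
later_bom = (utf8_bytes + codecs.BOM_UTF8).decode("utf-8-sig")
```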
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
Probably one part of this is OT, but I think it could complement the discussion ;o)

On Mon, Jan 11, 2010 at 3:44 PM, M.-A. Lemburg m...@egenix.com wrote:
> Olemis Lang wrote:
>> On Thu, Jan 7, 2010 at 4:10 PM, Victor Stinner victor.stin...@haypocalc.com wrote:
>>> Hi,
>>>
>>> The builtin open() function is unable to open a UTF-16/32 file starting with a BOM if the encoding is not specified (it raises a Unicode error). For a UTF-8 file starting with a BOM, read()/readline() also returns the BOM, whereas the BOM should be ignored.
>>>
>>> [...]
>>
>> I had similar issues too (please read below ;o) ...
>>
>> On Thu, Jan 7, 2010 at 7:52 PM, Guido van Rossum gu...@python.org wrote:
>>> I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy talk. And for the other two, perhaps it would make more sense to have a separate encoding-guessing function that takes a binary stream and returns a text stream wrapping it with the proper encoding?
>>
>> About guessing the encoding, I experienced this issue while I was developing a Trac plugin. What I was doing is as follows:
>>
>> - I guessed the MIME type + charset encoding using the Trac MIME API (it was a CSV file encoded using UTF-16)
>> - I read the file using `open`
>> - Then wrapped the file using `codecs.EncodedFile`
>> - Then used `csv.reader`
>>
>> ... and still get the BOM in the first value of the first row in the CSV file.
>
> You didn't say, but I presume that the charset guessing logic returned either 'utf-16-le' or 'utf-16-be'

Yes. In fact they return the full mimetype 'text/csv; charset=utf-16-le' ... ;o)

> - those encodings don't remove the leading BOM.

... and they should ?

> The 'utf-16' codec will remove the BOM.

In this particular case there's nothing I can do, I have to process whatever charset is detected in the input ;o)

>> {{{
>> #!python
>> >>> mimetype
>> 'utf-16-le'
>> >>> ef = EncodedFile(f, 'utf-8', mimetype)
>> }}}
>
> Same here: the UTF-8 codec will not remove the BOM, you have to use the 'utf-8-sig' codec for that.
>
>> IMO I think I am +1 for leaving `open` just like it is, and using module `codecs` to deal with encodings, but I am strongly -1 for returning the BOM while using `EncodedFile` (mainly because the encoding is explicitly supplied ;o)
>
> Note that EncodedFile() doesn't do any fancy BOM detection or filtering.

... directly.

> This is the job of the codecs.

OK ... to come back to the scope of the subject, in the general case, I think that the BOM (if any) should be handled by codecs, and therefore indirectly by EncodedFile. If that's an explicit way of working with encodings, I'd prefer to use that wrapper explicitly in order to (encode | decode) the file and let the codec detect whether there's a BOM or not and «adjust» `tell`, `read` and everything else in that wrapper (instead of `open`).

> Also note that BOM removal is only valid at the beginning of a file. All subsequent BOM bytes have to be read as-is (they map to a zero-width non-breaking space) - without removing them.

;o)

--
Regards,
Olemis.
Blog ES: http://simelo-es.blogspot.com/
Blog EN: http://simelo-en.blogspot.com/
Featured article: Test cases for custom query (i.e report 9) ... PASS (1.0.0) - http://simelo.hg.sourceforge.net/hgweb/simelo/trac-gviz/rev/d276011e7297
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
If Python should support BOM when reading text files, it should also be able to *write* such files.

An encoding=BOM argument wouldn't help here, because it does not specify which encoding to actually use: UTF-8, UTF-16-LE or what?

That would be a point against encoding=BOM and in favour of an additional keyword argument use_bom (or whatever) with the following values:

None: default (old) behaviour: don't handle the BOM at all.

True:
  reading: expect a BOM (raising an exception if it's missing). The encoding argument must be None or it must match the encoding implied by the BOM.
  writing: write a BOM. The encoding argument must be one of the UTF encodings.

False:
  reading: if a BOM is present, use it to determine the file encoding. The encoding argument must be None or it must match the encoding implied by the BOM. (*) Otherwise, use the encoding argument to determine the encoding.
  writing: do not write a BOM. Use the encoding argument.

(*) This is a question of taste. I think some people would prefer a fourth value AUTO instead, or to swap the behaviour of None and False.

Henning

P.S. To make things worse, I have sometimes seen XML files with a UTF-8 BOM, but an XML encoding declaration of iso-8859-1. For such files, whatever you guess will be wrong anyway...
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
On Sun, Jan 10, 2010 at 12:10, Henning von Bargen henning.vonbar...@arcor.de wrote:
> If Python should support BOM when reading text files, it should also be able to *write* such files.

That's what I thought too. Turns out the UTF-16 codec does write such a mark. You also have the constants in the codecs module, so you can write the utf-16-le BOM and then use the utf-16-le encoding if you want to be sure you write utf-16-le, and the same with BE, of course.

I still think not using BOMs when determining the file format can be seen as a bug, though, so I don't think the API needs to change at all.

--
Lennart Regebro: Python, Zope, Plone, Grok
http://regebro.wordpress.com/
+33 661 58 14 64
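A small sketch of the two ways of writing a BOM described above, using only the constants and codecs from the standard library. The function name `write_utf16_le_with_bom` is made up for this example.

```python
import codecs


def write_utf16_le_with_bom(path, text):
    """Write the little-endian BOM, then the text as utf-16-le."""
    with open(path, "wb") as f:
        f.write(codecs.BOM_UTF16_LE)
        f.write(text.encode("utf-16-le"))


# The generic 'utf-16' codec writes a BOM on its own, in the
# machine's native byte order, so which of the two BOMs appears
# depends on the platform.
bom_from_codec = "".encode("utf-16")
```

Writing the constant by hand is what pins the byte order; relying on the 'utf-16' codec leaves the choice to the platform.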
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
Victor Stinner wrote:
> On Friday 08 January 2010 10:10:23, Martin v. Löwis wrote:
>>> The builtin open() function is unable to open a UTF-16/32 file starting with a BOM if the encoding is not specified (it raises a Unicode error). For a UTF-8 file starting with a BOM, read()/readline() also returns the BOM, whereas the BOM should be ignored.
>>
>> It depends. If you use the utf-8-sig encoding, it *will* ignore the UTF-8 signature.
>
> Sure, but it means that you only use UTF-8+BOM files. If you get UTF-8 and UTF-8+BOM files, you have to detect the encoding (not an easy job) or remove the BOM after the first read (much harder if you use a module like ConfigParser or csv).
>
>>> Since my proposition changes the result of TextIOWrapper.read()/readline() for files starting with a BOM, we might introduce an option to open() to enable the new behaviour. But is it really needed to keep the backward compatibility?
>>
>> Absolutely. And there is no need to produce a new option, but instead use the existing options: define an encoding that auto-detects the encoding from the family of BOMs. Maybe you call it encoding="sniff".
>
> Good idea, I chose open(filename, encoding="BOM").

On the surface this looks like there's an encoding named "BOM", but looking at your patch I found that the check is still done in TextIOWrapper. IMHO the best approach would be to implement a *real* codec named "BOM" (or "sniff"). This doesn't require *any* changes to the IO library. It could even be developed as a standalone project and published in the Cheeseshop.

To see how something like this can be done, take a look at the UTF-16 codec, which switches to big-endian or little-endian mode depending on the first read/decode call.

Servus,
Walter
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
On Saturday 09 January 2010 12:18:33, Walter Dörwald wrote:
>> Good idea, I chose open(filename, encoding="BOM").
>
> On the surface this looks like there's an encoding named "BOM", but looking at your patch I found that the check is still done in TextIOWrapper. IMHO the best approach would be to implement a *real* codec named "BOM" (or "sniff"). This doesn't require *any* changes to the IO library. It could even be developed as a standalone project and published in the Cheeseshop.

Why not, this is another solution to point (2) (check for a BOM while reading, or detect it before?). Which encoding would be used if there is no BOM? UTF-8 sounds like a good choice.

--
Victor Stinner
http://www.haypocalc.com/
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
Walter Dörwald walter at livinglogic.de writes:
> On the surface this looks like there's an encoding named "BOM", but looking at your patch I found that the check is still done in TextIOWrapper. IMHO the best approach would be to implement a *real* codec named "BOM" (or "sniff"). This doesn't require *any* changes to the IO library. It could even be developed as a standalone project and published in the Cheeseshop.

Sorry but this is missing the point. The point here is to improve the open() function. I'm sure people who know about encodings are able to install the chardet library or even whip up their own BOM detection routine...

Antoine.
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
Antoine Pitrou wrote:
> Walter Dörwald walter at livinglogic.de writes:
>> On the surface this looks like there's an encoding named "BOM", but looking at your patch I found that the check is still done in TextIOWrapper. IMHO the best approach would be to implement a *real* codec named "BOM" (or "sniff"). This doesn't require *any* changes to the IO library. It could even be developed as a standalone project and published in the Cheeseshop.
>
> Sorry but this is missing the point. The point here is to improve the open() function. I'm sure people who know about encodings are able to install the chardet library or even whip up their own BOM detection routine...

How does the requirement that it be implemented as a codec miss the point?

FWIW, I agree with Walter that if it is provided through the encoding= argument, it should be a codec. If it is built into the open function (for whatever reason), it must be provided by some other parameter.

I do see the point that it becomes available to end users only when released as part of Python. However, this *also* means that applications won't be using it for another three years or so, since they'll have to support older Python versions as well (unless it is integrated in the case where no encoding is specified).

Regards,
Martin
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
Martin v. Löwis martin at v.loewis.de writes:
>> Sorry but this is missing the point. The point here is to improve the open() function. I'm sure people who know about encodings are able to install the chardet library or even whip up their own BOM detection routine...
>
> How does the requirement that it be implemented as a codec miss the point?

If we want it to be the default, it must be able to fall back on the current locale-based algorithm if no BOM is found. I don't think it would be easy for a codec to do that.

> FWIW, I agree with Walter that if it is provided through the encoding= argument, it should be a codec. If it is built into the open function (for whatever reason), it must be provided by some other parameter.

Why not simply encoding=None? The default value should provide the most useful behaviour possible. Forcing users to choose between two different autodetection strategies (encoding=None and another one) is a little insane IMO.

Regards

Antoine.
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
On Sat, Jan 9, 2010 at 21:28, Antoine Pitrou solip...@pitrou.net wrote:
> If we want it to be the default, it must be able to fall back on the current locale-based algorithm if no BOM is found. I don't think it would be easy for a codec to do that.

Right. It seems like encoding=None is the right way to go there. encoding='BOM' would probably only work if 'BOM' isn't an encoding but a special tag, which is ugly.

--
Lennart Regebro: Python, Zope, Plone, Grok
http://regebro.wordpress.com/
+33 661 58 14 64
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
On 09/01/2010 22:14, Lennart Regebro wrote:
> On Sat, Jan 9, 2010 at 21:28, Antoine Pitrou solip...@pitrou.net wrote:
>> If we want it to be the default, it must be able to fall back on the current locale-based algorithm if no BOM is found. I don't think it would be easy for a codec to do that.
>
> Right. It seems like encoding=None is the right way to go there. encoding='BOM' would probably only work if 'BOM' isn't an encoding but a special tag, which is ugly.

I would rather see it as the default behavior for open without an encoding specified.

I know Guido has expressed a preference against this so I won't continue to flog it.

The current behavior however is that we have a 'guessing' algorithm based on the platform default. Currently if you open a text file in read mode that has a UTF-8 signature, but the platform default is something other than UTF-8, then we open the file using what is likely to be the incorrect encoding. Looking for the signature seems to be better behaviour in that case.

All the best,

Michael

--
http://www.ironpythoninaction.com/
http://www.voidspace.org.uk/blog
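The mismatch described above is easy to reproduce. In this sketch "cp1252" merely stands in for a non-UTF-8 platform default; which encoding a given machine would actually pick is an assumption.

```python
import codecs

# A UTF-8 file with a signature, containing the single character 'é'.
data = codecs.BOM_UTF8 + "\u00e9".encode("utf-8")

# Decoded with a non-UTF-8 default, both the signature and the text
# turn into mojibake (five unrelated characters).
as_cp1252 = data.decode("cp1252")

# Honouring the signature yields the intended text.
as_utf8_sig = data.decode("utf-8-sig")
```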
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
How does the requirement that it be implemented as a codec miss the point? If we want it to be the default, it must be able to fall back on the current locale-based algorithm if no BOM is found. I don't think it would be easy for a codec to do that. Yes - however, Victor currently apparently *doesn't* want it to be the default, but wants the user to specify encoding="BOM". If so, it isn't the default, and it is easy to implement as a codec. FWIW, I agree with Walter that if it is provided through the encoding= argument, it should be a codec. If it is built into the open function (for whatever reason), it must be provided by some other parameter. Why not simply encoding=None? I don't mind. Please re-read Walter's message - it only said that *if* this is activated through encoding="BOM", *then* it must be a codec, and could be on PyPI. I don't think Walter was talking about the case where it is not activated through encoding='BOM' *at all*. The default value should provide the most useful behaviour possible. Forcing users to choose between two different autodetection strategies (encoding=None and another one) is a little insane IMO. That wouldn't disturb me much. There are a lot of things in that area that are a little insane, starting with Microsoft Windows :-) Regards, Martin
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
It *is* crazy, but unfortunately rather common. Wikipedia has a good description of the issues: http://en.wikipedia.org/wiki/UTF-8#Byte-order_mark. Basically, some Windows text APIs will emit a UTF-8 BOM in order to identify the file as being UTF-8, so it's become a convention to do that. That's not good enough, so you need to guess the encoding as well to make sure, but if there is a BOM and you can otherwise verify that the file is probably UTF-8 encoded, you should discard it. That doesn't make sense. If the file isn't UTF-8 you can't see the BOM, because the BOM itself is UTF-8-encoded. I think what Glyph meant is this: if a file starts with the UTF-8 signature, assume it's UTF-8. Then validate the assumption against the rest of the file also, and then process it as UTF-8. If the rest clearly is not UTF-8, assume that the UTF-8 signature is bogus. I understood this proposal as a general processing guideline, not something the io library should do (but, say, a text editor). FWIW, I'm personally in favor of using the UTF-8 signature. If people consider them crazy talk, that may be because UTF-8 can't possibly have a byte order - hence I call it a signature, not the BOM. As a signature, I don't consider it crazy at all. There is a long tradition of having magic bytes in files (executable files, Postscript, PDF, ... - see /etc/magic). Having a magic byte sequence for plain text to denote the encoding is useful and helps reduce mojibake. This is the reason it's used on Windows: notepad would normally assume that text is in the ANSI code page, and for compatibility, it can't stop doing that. So the UTF-8 signature gives them an exit strategy. Regards, Martin
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
But it should do something sane when reading such files. I can't really see any harm in throwing it away, especially since use of ZERO-WIDTH NO-BREAK SPACE as a joining character has been deprecated IIRC. And indeed it does, when you open the file in the utf-8-sig encoding. Regards, Martin
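To make the utf-8-sig point concrete, here is a minimal sketch using only the standard codecs. The sample string and variable names are made up for illustration; the behaviour shown is that of the existing "utf-8" and "utf-8-sig" codecs.

```python
import codecs

# Hypothetical sample data: a UTF-8 signature followed by UTF-8 text.
payload = "héllo".encode("utf-8")
signed = codecs.BOM_UTF8 + payload

# Plain UTF-8 keeps the signature as a leading U+FEFF character.
kept = signed.decode("utf-8")

# utf-8-sig drops the signature when it is present...
stripped = signed.decode("utf-8-sig")

# ...and still accepts data that carries no signature at all.
unsigned = payload.decode("utf-8-sig")

print(repr(kept), repr(stripped), repr(unsigned))
```

So for UTF-8 alone, reading with utf-8-sig covers both the signed and the unsigned case; it does nothing for UTF-16/32.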
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
On Friday 08 January 2010 03:23:08, MRAB wrote: Guido van Rossum wrote: I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy talk. And for the other two, perhaps it would make more sense to have a separate encoding-guessing function that takes a binary stream and returns a text stream wrapping it with the proper encoding? Alternatively, have a universal UTF-8/16/32 encoding, i.e. one that expects UTF-8, with or without BOM, or UTF-16/32 with BOM. Do you mean open(filename, encoding="BOM")? I suppose that "BOM" would be a magical value specific to reading a text file (open(filename, "r")), not a real codec? Otherwise which encoding should be used for open(filename, "w", encoding="BOM")? -- Victor Stinner http://www.haypocalc.com/
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
The builtin open() function is unable to open a UTF-16/32 file starting with a BOM if the encoding is not specified (it raises a Unicode error). For a UTF-8 file starting with a BOM, read()/readline() also returns the BOM, whereas the BOM should be ignored. It depends. If you use the utf-8-sig encoding, it *will* ignore the UTF-8 signature. Since my proposition changes the result of TextIOWrapper.read()/readline() for files starting with a BOM, we might introduce an option to open() to enable the new behaviour. But is it really needed to keep the backward compatibility? Absolutely. And there is no need to produce a new option, but instead use the existing options: define an encoding that auto-detects the encoding from the family of BOMs. Maybe you call it encoding="sniff". Regards, Martin
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
On Friday 08 January 2010 05:21:04, Guido van Rossum wrote: (...) (And yes, I know this happens. Doesn't mean we need to auto-guess by default; there are lots of issues e.g. what should happen after seeking to offset 0?) I wrote a new version of my patch (version 3):
* don't change the default behaviour: use open(filename, encoding="BOM") to check the BOM if there is any
* fix for seek(0): always ignore the BOM
* add a unit test: check that the right encoding is detected, but also that the BOM is ignored (especially after a seek(0))
The "BOM" encoding doesn't work for writing into a file, so open(filename, "w", encoding="BOM") raises a ValueError. -- Victor Stinner http://www.haypocalc.com/
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
On Friday 08 January 2010 01:52:20, Guido van Rossum wrote: And for the other two, perhaps it would make more sense to have a separate encoding-guessing function that takes a binary stream and returns a text stream wrapping it with the proper encoding? I chose to modify open()+TextIOWrapper instead of writing a new function because I would like to avoid an extra read operation (syscall) on the file. With my implementation, no specific read operation is needed to detect the BOM. The BOM is simply checked in the first _read_chunk(). -- Victor Stinner http://www.haypocalc.com/
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
On Friday 08 January 2010 10:10:23, Martin v. Löwis wrote: The builtin open() function is unable to open a UTF-16/32 file starting with a BOM if the encoding is not specified (it raises a Unicode error). For a UTF-8 file starting with a BOM, read()/readline() also returns the BOM, whereas the BOM should be ignored. It depends. If you use the utf-8-sig encoding, it *will* ignore the UTF-8 signature. Sure, but it means that you only use UTF-8+BOM files. If you get UTF-8 and UTF-8+BOM files, you have to detect the encoding (not an easy job) or remove the BOM after the first read (much harder if you use a module like ConfigParser or csv). Since my proposition changes the result of TextIOWrapper.read()/readline() for files starting with a BOM, we might introduce an option to open() to enable the new behaviour. But is it really needed to keep the backward compatibility? Absolutely. And there is no need to produce a new option, but instead use the existing options: define an encoding that auto-detects the encoding from the family of BOMs. Maybe you call it encoding="sniff". Good idea, I chose open(filename, encoding="BOM"). -- Victor Stinner http://www.haypocalc.com/
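For the csv case mentioned above, a small sketch may help. It assumes the input is known to be UTF-8 either with or without a signature (it does not address UTF-16/32), and the helper name read_rows and the sample data are invented for the example. It relies on the existing utf-8-sig codec accepting both forms.

```python
import codecs
import csv
import io

def read_rows(raw):
    """Parse CSV bytes that are UTF-8 with or without a leading signature."""
    # utf-8-sig skips the signature if present and is plain UTF-8 otherwise.
    text = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8-sig", newline="")
    return list(csv.reader(text))

plain = "name,value\r\na,1\r\n".encode("utf-8")
rows_plain = read_rows(plain)
rows_signed = read_rows(codecs.BOM_UTF8 + plain)
print(rows_plain, rows_signed)
```

With plain "utf-8" instead, the first header cell of the signed input would come back as '\ufeffname', which is the kind of surprise described in the message.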
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
Victor Stinner victor.stinner at haypocalc.com writes: I wrote a new version of my patch (version 3): * don't change the default behaviour: use open(filename, encoding="BOM") to check the BOM if there is any Well, I think if we implement this the default behaviour *should* be changed. It looks a bit senseless to have two different auto-choose options, one with encoding=None and one with encoding="BOM". Regards Antoine.
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
On Thu, Jan 7, 2010 at 11:55 PM, Glyph Lefkowitz gl...@twistedmatrix.com wrote: I'm saying that the BOM itself isn't enough to detect that the file is actually UTF-8. And I'm saying that it is, with as much certainty as we can ever guess the encoding of a file. If (for whatever reason: explicitly specified, guessed in some other way) the file's encoding is determined to be something else, the bytes comprising the BOM should be decoded as normal. It's just that the UTF-8 decoding of the BOM at the start of a file should be . Sure, a Latin-1-encoded file could start with the same pattern that is a UTF-8-encoded BOM. But at that point, a UTF-16-encoded file is also valid Latin-1. The question was in the context of encoding-guessing; if we're guessing, a UTF-8-encoded BOM cannot signify anything else but UTF-8. (Ditto for UTF-16 and UTF-32 BOMs.) -- Guido van Rossum (python.org/~guido)
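The Latin-1 remark can be checked directly. The following is only an illustration with made-up variable names: it shows that the three signature bytes, and UTF-16 data in general, are also decodable as Latin-1, which is why the signature can only ever be a strong hint rather than proof.

```python
import codecs

sig = codecs.BOM_UTF8                      # b'\xef\xbb\xbf'

# Decoded as UTF-8 the three bytes are the single character U+FEFF.
as_utf8 = sig.decode("utf-8")

# Decoded as Latin-1 they are three printable characters: 'ï»¿'.
as_latin1 = sig.decode("latin-1")

# Any byte string is valid Latin-1, including BOM-prefixed UTF-16 data.
utf16_bytes = codecs.BOM_UTF16_LE + "ab".encode("utf-16-le")
utf16_as_latin1 = utf16_bytes.decode("latin-1")

print(repr(as_utf8), repr(as_latin1), repr(utf16_as_latin1))
```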
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
On Fri, Jan 8, 2010 at 6:34 AM, Antoine Pitrou solip...@pitrou.net wrote: Victor Stinner victor.stinner at haypocalc.com writes: I wrote a new version of my patch (version 3): * don't change the default behaviour: use open(filename, encoding="BOM") to check the BOM if there is any Well, I think if we implement this the default behaviour *should* be changed. It looks a bit senseless to have two different auto-choose options, one with encoding=None and one with encoding="BOM". Well there *are* two different auto options: use the environment variables (LANG etc.) or inspect the contents of the file. I think it would be useful to have ways to specify both. -- Guido van Rossum (python.org/~guido)
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
On Fri, Jan 8, 2010 at 1:05 AM, Martin v. Löwis mar...@v.loewis.de wrote: It *is* crazy, but unfortunately rather common. Wikipedia has a good description of the issues: http://en.wikipedia.org/wiki/UTF-8#Byte-order_mark. Basically, some Windows text APIs will emit a UTF-8 BOM in order to identify the file as being UTF-8, so it's become a convention to do that. That's not good enough, so you need to guess the encoding as well to make sure, but if there is a BOM and you can otherwise verify that the file is probably UTF-8 encoded, you should discard it. That doesn't make sense. If the file isn't UTF-8 you can't see the BOM, because the BOM itself is UTF-8-encoded. I think what Glyph meant is this: if a file starts with the UTF-8 signature, assume it's UTF-8. Then validate the assumption against the rest of the file also, and then process it as UTF-8. If the rest clearly is not UTF-8, assume that the UTF-8 signature is bogus. I understood this proposal as a general processing guideline, not something the io library should do (but, say, a text editor). FWIW, I'm personally in favor of using the UTF-8 signature. If people consider them crazy talk, that may be because UTF-8 can't possibly have a byte order - hence I call it a signature, not the BOM. As a signature, I don't consider it crazy at all. There is a long tradition of having magic bytes in files (executable files, Postscript, PDF, ... - see /etc/magic). Having a magic byte sequence for plain text to denote the encoding is useful and helps reduce mojibake. This is the reason it's used on Windows: notepad would normally assume that text is in the ANSI code page, and for compatibility, it can't stop doing that. So the UTF-8 signature gives them an exit strategy. Sure. I said "crazy talk" only to stir up discussion. Which worked. :-) Also, I don't want Python's default behavior to change -- sniffing the encoding should be a separate option.
-- Guido van Rossum (python.org/~guido)
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
On Thu, Jan 7, 2010 at 10:12 PM, Tres Seaver tsea...@palladion.com wrote: The BOM should not be seekable if the file is opened with the proposed "guess encoding from BOM" mode: it isn't properly part of the stream at all in that case. This feels about right to me. There are still questions though: immediately after opening a file with a BOM, what should .tell() return? And regardless of that, .seek(0) should put the file in that same initial state. -- Guido van Rossum (python.org/~guido)
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
Guido van Rossum guido at python.org writes: Well, I think if we implement this the default behaviour *should* be changed. It looks a bit senseless to have two different auto-choose options, one with encoding=None and one with encoding="BOM". Well there *are* two different auto options: use the environment variables (LANG etc.) or inspect the contents of the file. I think it would be useful to have ways to specify both. Yes, perhaps. In the context of open() however I think it would be helpful to change the default. Moreover, reading the BOM is certainly much more reliable than our current guessing based on the locale or the device encoding. Regards Antoine.
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
Guido van Rossum wrote: On Fri, Jan 8, 2010 at 6:34 AM, Antoine Pitrou solip...@pitrou.net wrote: Victor Stinner victor.stinner at haypocalc.com writes: I wrote a new version of my patch (version 3): * don't change the default behaviour: use open(filename, encoding="BOM") to check the BOM if there is any Well, I think if we implement this the default behaviour *should* be changed. It looks a bit senseless to have two different auto-choose options, one with encoding=None and one with encoding="BOM". Well there *are* two different auto options: use the environment variables (LANG etc.) or inspect the contents of the file. I think it would be useful to have ways to specify both.

Shouldn't this encoding guessing be a separate function that you call on either a file or a seekable stream? After all, detecting encodings is just as useful to have for non-file streams. You'd then avoid having to stuff everything into a single function call and also open up the door for more complex application-specific guesswork or defaults. The whole process would then have two steps:

1. guess encoding

    import codecs
    encoding = codecs.guess_file_encoding(filename)

2. open the file with the found encoding

    f = open(filename, encoding=encoding)

For seekable streams f, you'd have:

1. guess encoding

    import codecs
    encoding = codecs.guess_stream_encoding(f)

2. wrap the stream with a reader for the found encoding

    reader_class = codecs.getreader(encoding)
    g = reader_class(f)

-- Marc-Andre Lemburg eGenix.com
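A rough sketch of what the file-based variant of this two-step process could look like follows. Note that codecs.guess_file_encoding() is only a proposal in this thread and does not exist in the codecs module; the standalone guess_file_encoding() below, its _BOMS table, the "utf-8" default and the temporary sample file are all assumptions made for illustration. Only the codecs.BOM_* constants and the codec names are real.

```python
import codecs
import os
import tempfile

# Longer signatures first: the UTF-32-LE BOM begins with the UTF-16-LE BOM.
# The names chosen are existing codecs that consume the BOM themselves.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def guess_file_encoding(filename, default="utf-8"):
    """Hypothetical helper: open the file, look for a BOM, close it again."""
    with open(filename, "rb") as f:
        head = f.read(4)
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return default

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w", encoding="utf-16") as f:   # writes a BOM
        f.write("ab")
    encoding = guess_file_encoding(path)            # step 1
    with open(path, encoding=encoding) as f:        # step 2
        text = f.read()

print(encoding, repr(text))
```

One known ambiguity of any such table: UTF-16-LE text whose first character is U+0000 starts with the same four bytes as a UTF-32-LE BOM, so a real implementation would have to pick a policy for that case.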
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
Guido van Rossum guido at python.org writes: On Thu, Jan 7, 2010 at 10:12 PM, Tres Seaver tseaver at palladion.com wrote: The BOM should not be seekable if the file is opened with the proposed "guess encoding from BOM" mode: it isn't properly part of the stream at all in that case. This feels about right to me. There are still questions though: immediately after opening a file with a BOM, what should .tell() return?

tell() in the context of text I/O is specified to return an opaque cookie. So whatever value it returns would probably be fine, as long as seeking to that value leaves the file in an acceptable state. Rewinding (seeking to 0) in the presence of a BOM is already reasonably supported by the TextIOWrapper object:

    >>> import codecs, io
    >>> dec = codecs.getincrementaldecoder('utf-16')()
    >>> dec.decode(b'\xff\xfea\x00b\x00')
    'ab'
    >>> dec.decode(b'\xff\xfea\x00b\x00')
    '\ufeffab'
    >>> bio = io.BytesIO(b'\xff\xfea\x00b\x00')
    >>> f = io.TextIOWrapper(bio, encoding='utf-16')
    >>> f.read()
    'ab'
    >>> f.seek(0)
    0
    >>> f.read()
    'ab'

There are tests for this in test_io.py (test_encoded_writes, line 1929, and test_append_bom and test_seek_bom, line 2045). Regards Antoine.
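The same session can be written as a small script for anyone who wants to run it. This is just a transcription of the interactive example above into plain statements; the variable names first, second, before, pos and after are added for readability.

```python
import codecs
import io

data = b"\xff\xfea\x00b\x00"   # UTF-16-LE BOM followed by "ab"

# A bare incremental decoder only treats the very first BOM specially;
# feeding the same bytes again yields a literal U+FEFF.
dec = codecs.getincrementaldecoder("utf-16")()
first = dec.decode(data)
second = dec.decode(data)

# TextIOWrapper resets its decoder when rewinding to the start,
# so the BOM is skipped again after seek(0).
f = io.TextIOWrapper(io.BytesIO(data), encoding="utf-16")
before = f.read()
pos = f.seek(0)
after = f.read()

print(repr(first), repr(second), repr(before), pos, repr(after))
```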
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
Victor Stinner wrote: On Friday 08 January 2010 05:21:04, Guido van Rossum wrote: (...) (And yes, I know this happens. Doesn't mean we need to auto-guess by default; there are lots of issues e.g. what should happen after seeking to offset 0?) I wrote a new version of my patch (version 3): * don't change the default behaviour: use open(filename, encoding="BOM") to check the BOM if there is any * fix for seek(0): always ignore the BOM * add a unit test: check that the right encoding is detected, but also that the BOM is ignored (especially after a seek(0)) The "BOM" encoding doesn't work for writing into a file, so open(filename, "w", encoding="BOM") raises a ValueError. I think it's similar to universal newline mode. You can tell it that you're reading UTF-something-encoded text (common forms only). The preference is UTF-8, but it could be UTF-8-sig (from Windows), or possibly UTF-16/32, which really need a BOM because there are multiple bytes per codepoint, so the bytes could be big-endian or little-endian. The BOM (or signature) tells you what the encoding is, defaulting to UTF-8 if there's none. If it subsequently raises a DecodeError, then so be it! Maybe there should also be a way of determining what encoding it decided it was, so that you can then write a new file in that same encoding.
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
Guido van Rossum wrote: On Thu, Jan 7, 2010 at 10:12 PM, Tres Seaver tsea...@palladion.com wrote: The BOM should not be seekable if the file is opened with the proposed "guess encoding from BOM" mode: it isn't properly part of the stream at all in that case. This feels about right to me. There are still questions though: immediately after opening a file with a BOM, what should .tell() return? And regardless of that, .seek(0) should put the file in that same initial state.

I think the behavior should be something like:

    >>> f = open('/path/to/maybe-BOM-encoded-file', 'r', encoding='BOM')
    >>> f.tell()
    0L
    >>> f.seek(-1)
    >>> f.tell()  # count of unicode chars in decoded stream
    45L
    >>> f.seek(0)
    >>> f.read(1)  # read first unicode char decoded from stream.
    'A'

In other words, the BOM is not readable / seekable at all: it is invisible to the consumer of the decoded stream. Tres. -- Tres Seaver +1 540-429-0999 tsea...@palladion.com Palladion Software "Excellence by Design" http://palladion.com
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
M.-A. Lemburg wrote: Shouldn't this encoding guessing be a separate function that you call on either a file or a seekable stream? After all, detecting encodings is just as useful to have for non-file streams.

Other stream sources typically have out-of-band ways to signal the encoding: only when reading from the filesystem do we pretty much *have* to guess, and in that case the BOM / signature is the best heuristic we have. Also, some non-file streams are not seekable, and so can't be guessed via a pre-pass.

You'd then avoid having to stuff everything into a single function call and also open up the door for more complex application-specific guesswork or defaults. The whole process would then have two steps:

1. guess encoding

    import codecs
    encoding = codecs.guess_file_encoding(filename)

Filename is not enough information: or do you mean that API to actually open the stream?

2. open the file with the found encoding

    f = open(filename, encoding=encoding)

For seekable streams f, you'd have:

1. guess encoding

    import codecs
    encoding = codecs.guess_stream_encoding(f)

2. wrap the stream with a reader for the found encoding

    reader_class = codecs.getreader(encoding)
    g = reader_class(f)

Tres. -- Tres Seaver +1 540-429-0999 tsea...@palladion.com Palladion Software "Excellence by Design" http://palladion.com
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
Martin v. Löwis wrote: It *is* crazy, but unfortunately rather common. Wikipedia has a good description of the issues: http://en.wikipedia.org/wiki/UTF-8#Byte-order_mark. Basically, some Windows text APIs will emit a UTF-8 BOM in order to identify the file as being UTF-8, so it's become a convention to do that. That's not good enough, so you need to guess the encoding as well to make sure, but if there is a BOM and you can otherwise verify that the file is probably UTF-8 encoded, you should discard it. That doesn't make sense. If the file isn't UTF-8 you can't see the BOM, because the BOM itself is UTF-8-encoded. I think what Glyph meant is this: if a file starts with the UTF-8 signature, assume it's UTF-8. Then validate the assumption against the rest of the file also, and then process it as UTF-8. If the rest clearly is not UTF-8, assume that the UTF-8 signature is bogus. If the programmer opens the file using a "guess using the BOM" encoding, Python should *not* attempt to verify that the file is properly encoded: it should check for (and consume) any BOM, and then return a stream which uses the encoding inferred from the BOM. Any errors should be handled later, when characters are read, exactly as if the file had been opened with the same encoding guessed from the BOM. I understood this proposal as a general processing guideline, not something the io library should do (but, say, a text editor). FWIW, I'm personally in favor of using the UTF-8 signature. If people consider them crazy talk, that may be because UTF-8 can't possibly have a byte order - hence I call it a signature, not the BOM. As a signature, I don't consider it crazy at all. There is a long tradition of having magic bytes in files (executable files, Postscript, PDF, ... - see /etc/magic). Having a magic byte sequence for plain text to denote the encoding is useful and helps reduce mojibake. This is the reason it's used on Windows: notepad would normally assume that text is in the ANSI code page, and for compatibility, it can't stop doing that. So the UTF-8 signature gives them an exit strategy. Agreed. Having that marker at the start of the file makes interop with other tools *much* easier. Tres. -- Tres Seaver +1 540-429-0999 tsea...@palladion.com Palladion Software "Excellence by Design" http://palladion.com
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
Shouldn't this encoding guessing be a separate function that you call on either a file or a seekable stream? After all, detecting encodings is just as useful to have for non-file streams. Other stream sources typically have out-of-band ways to signal the encoding: only when reading from the filesystem do we pretty much *have* to guess, and in that case the BOM / signature is the best heuristic we have. Also, some non-file streams are not seekable, and so can't be guessed via a pre-pass. But what if the file were in (for example) a zip file? I think you definitely want to have access to this functionality outside of open(). Eric.
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
On Jan 8, 2010, at 4:14 PM, Tres Seaver wrote: I understood this proposal as a general processing guideline, not something the io library should do (but, say, a text editor). FWIW, I'm personally in favor of using the UTF-8 signature. If people consider them crazy talk, that may be because UTF-8 can't possibly have a byte order - hence I call it a signature, not the BOM. As a signature, I don't consider it crazy at all. There is a long tradition of having magic bytes in files (executable files, Postscript, PDF, ... - see /etc/magic). Having a magic byte sequence for plain text to denote the encoding is useful and helps reduce mojibake. This is the reason it's used on Windows: notepad would normally assume that text is in the ANSI code page, and for compatibility, it can't stop doing that. So the UTF-8 signature gives them an exit strategy. Agreed. Having that marker at the start of the file makes interop with other tools *much* easier.

Putting the BOM at the beginning of UTF-8 text files is not a good idea: it makes interop much *worse* on a unix system, not better. Without the BOM, most commands do the right thing with UTF-8 text. E.g. to concatenate two files:

    $ cat file-1 file-2 > file-3

With a BOM at the beginning of each file, it won't work right. Of course, you could modify cat (and every other stream processing command) to know how to consume and emit BOMs, and omit the extra one that would show up in the middle of the stream... but even that can't work; what about:

    $ (cat file-1; cat file-2) > file-3

Should the shell now know that when you run multiple commands, it should eat the BOM emitted from the second command? Basically, using a BOM in a UTF-8 file is just not a good idea: it completely ruins interop with every standard unix tool. This is not to say that Python shouldn't have a way to read a file with a UTF-8 BOM: it just shouldn't encourage you to *write* such files.

James
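The concatenation problem is easy to reproduce in Python without touching the shell. In this sketch the two byte strings stand in for file-1 and file-2 (their contents are invented), and joining them models what cat would write to file-3.

```python
import codecs

# Hypothetical contents of two UTF-8 files that each carry a signature.
file_1 = codecs.BOM_UTF8 + "one\n".encode("utf-8")
file_2 = codecs.BOM_UTF8 + "two\n".encode("utf-8")

# Byte-wise concatenation, as `cat file-1 file-2` would do.
combined = file_1 + file_2

# Even a signature-aware decoder only strips the *leading* signature;
# the second one survives as U+FEFF in the middle of the text.
text = combined.decode("utf-8-sig")
print(repr(text))
```

The stray U+FEFF is invisible in most terminals and editors, which makes this kind of damage hard to spot after the fact.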
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
Tres Seaver wrote: M.-A. Lemburg wrote: Shouldn't this encoding guessing be a separate function that you call on either a file or a seekable stream? After all, detecting encodings is just as useful to have for non-file streams.

Other stream sources typically have out-of-band ways to signal the encoding: only when reading from the filesystem do we pretty much *have* to guess, and in that case the BOM / signature is the best heuristic we have. Also, some non-file streams are not seekable, and so can't be guessed via a pre-pass.

Sure, there are non-seekable file streams, but at least when using StringIO-type streams you don't have that restriction.

An encoding detection function would provide more use in other cases as well, so instead of hiding away the functionality in the open() constructor, I'm suggesting to expose it via the codecs module. Applications can then use it where necessary and also provide their own defaults, using other heuristics. You'd then avoid having to stuff everything into a single function call and also open up the door for more complex application-specific guesswork or defaults.

The whole process would then have two steps:

1. guess encoding

    import codecs
    encoding = codecs.guess_file_encoding(filename)

Filename is not enough information: or do you mean that API to actually open the stream?

Yes. The API would open the file, guess the encoding and then close it again. If you don't want that to happen, you could use the second API I mentioned below on the already open file.

Note that this function could detect a lot more encodings with reasonably high probability than just BOM-prefixed ones, e.g. we could also add support to detect encoding declarations such as the ones we use in Python source files.

2. open the file with the found encoding

    f = open(filename, encoding=encoding)

For seekable streams f, you'd have:

1. guess encoding

    import codecs
    encoding = codecs.guess_stream_encoding(f)

I forgot to mention: This API needs to position the file pointer to the start of the first data byte.

2. wrap the stream with a reader for the found encoding

    reader_class = codecs.getreader(encoding)
    g = reader_class(f)

-- Marc-Andre Lemburg eGenix.com (#1, Jan 08 2010)
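The two-step process for seekable streams could be sketched roughly as follows. Note that guess_stream_encoding is only the name proposed in this message; it does not exist in the codecs module, so the helper below, its default parameter and the _BOMS table are assumptions made for illustration. Only BOM detection is covered, not the wider heuristics mentioned above.

```python
import codecs
import io

# Longest signatures first: the UTF-32-LE BOM starts with the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def guess_stream_encoding(stream, default=None):
    """Hypothetical helper: return the encoding implied by a leading BOM.

    The stream is left positioned at the first data byte: just after the
    BOM if one was found, otherwise back at the offset it started from.
    """
    start = stream.tell()
    head = stream.read(4)
    for bom, name in _BOMS:
        if head.startswith(bom):
            stream.seek(start + len(bom))
            return name
    stream.seek(start)
    return default


# Step 1: guess the encoding of an already open binary stream.
raw = io.BytesIO(codecs.BOM_UTF16_LE + "héllo".encode("utf-16-le"))
encoding = guess_stream_encoding(raw, default="utf-8")

# Step 2: wrap the stream with a reader for the found encoding.
reader = codecs.getreader(encoding)(raw)
text = reader.read()

# A stream without a BOM falls back to the caller's default.
plain = io.BytesIO(b"plain ascii")
plain_encoding = guess_stream_encoding(plain, default="utf-8")
```

Because the helper returns an endian-specific name and skips the BOM itself, the reader never sees the signature bytes. One known ambiguity of this approach: UTF-16-LE data whose first character is U+0000 would be mistaken for UTF-32-LE.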
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
Eric Smith wrote: Shouldn't this encoding guessing be a separate function that you call on either a file or a seekable stream? After all, detecting encodings is just as useful to have for non-file streams. Other stream sources typically have out-of-band ways to signal the encoding: only when reading from the filesystem do we pretty much *have* to guess, and in that case the BOM / signature is the best heuristic we have. Also, some non-file streams are not seekable, and so can't be guessed via a pre-pass. But what if the file were in (for example) a zip file? I think you definitely want to have access to this functionality outside of open().

If the application expects a possibly-BOM-signature-marked file, but you pass it mismatched garbage:

    f = open('some.zip', encoding='BOM')

the error handling should be the same as if you passed any other mismatched encoding:

    f = open('some.zip', encoding='UTF8')

i.e., you discover the error when you try to read from the (non)encoded stream, not when you open it.

Tres.

-- Tres Seaver +1 540-429-0999 tsea...@palladion.com Palladion Software Excellence by Design http://palladion.com
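The existing behaviour for a mismatched encoding can be illustrated with a small sketch. It uses an in-memory binary stream instead of a real some.zip file, and the bytes are made up for the example; the point is only that building the text layer succeeds and the error appears on the first read.

```python
import io

# Made-up bytes resembling the start of a zip archive.
# The byte 0xFF can never occur in valid UTF-8.
garbage = b"PK\x03\x04\xff\xfe\x00\x01"

# Creating the text wrapper does not decode anything yet, so it succeeds.
f = io.TextIOWrapper(io.BytesIO(garbage), encoding="utf-8")

try:
    f.read()
    failed_on_read = False
except UnicodeDecodeError:
    # The mismatch is only discovered once decoding actually starts.
    failed_on_read = True
```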
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
On Friday 08 January 2010 22:40:47, Eric Smith wrote: Shouldn't this encoding guessing be a separate function that you call on either a file or a seekable stream? After all, detecting encodings is just as useful to have for non-file streams. Other stream sources typically have out-of-band ways to signal the encoding: only when reading from the filesystem do we pretty much *have* to guess, and in that case the BOM / signature is the best heuristic we have. Also, some non-file streams are not seekable, and so can't be guessed via a pre-pass. But what if the file were in (for example) a zip file? I think you definitely want to have access to this functionality outside of open(). FYI my patch (encoding=BOM) is implemented in TextIOWrapper, and TextIOWrapper takes a binary stream as input, not a filename. -- Victor Stinner http://www.haypocalc.com/
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
On 08.01.2010 22:14, Tres Seaver wrote: FWIW, I'm personally in favor of using the UTF-8 signature. If people consider them crazy talk, that may be because UTF-8 can't possibly have a byte order - hence I call it a signature, not the BOM. As a signature, I don't consider it crazy at all. There is a long tradition of having magic bytes in files (executable files, Postscript, PDF, ... - see /etc/magic). Having a magic byte sequence for plain text to denote the encoding is useful and helps reduce mojibake. This is the reason it's used on Windows: Notepad would normally assume that text is in the ANSI code page, and for compatibility, it can't stop doing that. So the UTF-8 signature gives them an exit strategy. Agreed. Having that marker at the start of the file makes interop with other tools *much* easier. Except if only 50% of the other tools support the signature. Georg -- Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. Four shall be the number of spaces thou shalt indent, and the number of thy indenting shall be four. Eight shalt thou not indent, nor either indent thou two, excepting that thou then proceed to four. Tabs are right out.
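To make the mojibake point concrete, here is a small sketch of what a tool sees when it assumes an ANSI code page for signed UTF-8 data, and how the signature can be used as a magic number. Using cp1252 to stand in for "the ANSI code page" is an assumption; the actual code page depends on the Windows locale.

```python
import codecs

# UTF-8 text carrying the three-byte signature EF BB BF.
data = codecs.BOM_UTF8 + "café".encode("utf-8")

# A tool assuming cp1252 shows the signature and the accented
# character as mojibake.
as_ansi = data.decode("cp1252")

# A signature-aware decoder strips the marker and decodes correctly.
as_utf8_sig = data.decode("utf-8-sig")

# The signature works like any other magic byte sequence.
looks_like_utf8 = data.startswith(codecs.BOM_UTF8)
```

A tool that does not know about the signature still reads the rest of the file, but with three stray characters at the front, which is the interoperability concern raised in the last reply.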
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
MRAB wrote: Maybe there should also be a way of determining what encoding it decided it was, so that you can then write a new file in that same encoding. I thought of that question as well - the f.encoding attribute on the opened file should be sufficient. Cheers, Nick. -- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
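A short sketch of how the encoding attribute can carry the encoding from a reader to a writer. The encoding is passed explicitly here; with a BOM-sniffing mode as proposed in this thread, the attribute would presumably have to report the detected encoding instead, which is an assumption about the proposal and not existing behaviour. In-memory streams stand in for real files.

```python
import io

# Source "file": UTF-16 with a BOM, held in memory for the example.
src = io.TextIOWrapper(
    io.BytesIO("héllo\n".encode("utf-16")), encoding="utf-16", newline=""
)
text = src.read()

# Write a new "file" in whatever encoding the source reports.
out_buffer = io.BytesIO()
dst = io.TextIOWrapper(out_buffer, encoding=src.encoding, newline="")
dst.write(text)
dst.flush()
```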
[Python-Dev] Improve open() to support reading file starting with an unicode BOM
Hi, the builtin open() function is unable to open a UTF-16/32 file starting with a BOM if the encoding is not specified (it raises a Unicode error). For a UTF-8 file starting with a BOM, read()/readline() also returns the BOM, whereas the BOM should be ignored. See recent issues related to reading a UTF-8 text file including a BOM: #7185 (csv) and #7519 (ConfigParser). Such a file can be opened in Unicode mode with the UTF-8-SIG encoding, but it's possible to do better. I propose to improve open() (TextIOWrapper) by using the BOM to choose the right encoding. I think that only files opened in read-only mode should support this new feature. *Reading* the BOM in a *write*-only file would cause unexpected behaviour. Since my proposal changes the result of TextIOWrapper.read()/readline() for files starting with a BOM, we might introduce an option to open() to enable the new behaviour. But is it really necessary to keep backward compatibility? I wrote a proof of concept attached to issue #7651. My patch only changes the behaviour of TextIOWrapper for reading files starting with a BOM. It doesn't work yet if a seek() is used before the first read. -- Victor Stinner http://www.haypocalc.com/
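The UTF-8 half of the problem can be reproduced with a few lines. This is a sketch using an in-memory stream in place of a file on disk; read_first_line and the sample payload are names made up for the example.

```python
import codecs
import io

# A small CSV-like payload saved as UTF-8 with a leading BOM.
payload = codecs.BOM_UTF8 + "name,value\n".encode("utf-8")


def read_first_line(data, encoding):
    """Decode the bytes through TextIOWrapper and return the first line."""
    with io.TextIOWrapper(io.BytesIO(data), encoding=encoding) as f:
        return f.readline()


# Plain UTF-8 keeps the BOM as U+FEFF at the start of the first line.
with_bom = read_first_line(payload, "utf-8")

# UTF-8-SIG strips it, which is the workaround available today.
without_bom = read_first_line(payload, "utf-8-sig")
```

The stray U+FEFF ends up glued to the first field name, which is the kind of symptom reported for csv and ConfigParser.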
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy talk. And for the other two, perhaps it would make more sense to have a separate encoding-guessing function that takes a binary stream and returns a text stream wrapping it with the proper encoding? --Guido On Thu, Jan 7, 2010 at 4:10 PM, Victor Stinner victor.stin...@haypocalc.com wrote: Hi, the builtin open() function is unable to open a UTF-16/32 file starting with a BOM if the encoding is not specified (it raises a Unicode error). For a UTF-8 file starting with a BOM, read()/readline() also returns the BOM, whereas the BOM should be ignored. See recent issues related to reading a UTF-8 text file including a BOM: #7185 (csv) and #7519 (ConfigParser). Such a file can be opened in Unicode mode with the UTF-8-SIG encoding, but it's possible to do better. I propose to improve open() (TextIOWrapper) by using the BOM to choose the right encoding. I think that only files opened in read-only mode should support this new feature. *Reading* the BOM in a *write*-only file would cause unexpected behaviour. Since my proposal changes the result of TextIOWrapper.read()/readline() for files starting with a BOM, we might introduce an option to open() to enable the new behaviour. But is it really necessary to keep backward compatibility? I wrote a proof of concept attached to issue #7651. My patch only changes the behaviour of TextIOWrapper for reading files starting with a BOM. It doesn't work yet if a seek() is used before the first read. -- Victor Stinner http://www.haypocalc.com/ -- --Guido van Rossum (python.org/~guido)
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
Guido van Rossum wrote: I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy talk. And for the other two, perhaps it would make more sense to have a separate encoding-guessing function that takes a binary stream and returns a text stream wrapping it with the proper encoding? Alternatively, have a universal UTF-8/16/32 encoding, i.e. one that expects UTF-8, with or without BOM, or UTF-16/32 with BOM. On Thu, Jan 7, 2010 at 4:10 PM, Victor Stinner victor.stin...@haypocalc.com wrote: Hi, the builtin open() function is unable to open a UTF-16/32 file starting with a BOM if the encoding is not specified (it raises a Unicode error). For a UTF-8 file starting with a BOM, read()/readline() also returns the BOM, whereas the BOM should be ignored. See recent issues related to reading a UTF-8 text file including a BOM: #7185 (csv) and #7519 (ConfigParser). Such a file can be opened in Unicode mode with the UTF-8-SIG encoding, but it's possible to do better. I propose to improve open() (TextIOWrapper) by using the BOM to choose the right encoding. I think that only files opened in read-only mode should support this new feature. *Reading* the BOM in a *write*-only file would cause unexpected behaviour. Since my proposal changes the result of TextIOWrapper.read()/readline() for files starting with a BOM, we might introduce an option to open() to enable the new behaviour. But is it really necessary to keep backward compatibility? I wrote a proof of concept attached to issue #7651. My patch only changes the behaviour of TextIOWrapper for reading files starting with a BOM. It doesn't work yet if a seek() is used before the first read.
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
On Jan 7, 2010, at 7:52 PM, Guido van Rossum wrote: On Thu, Jan 7, 2010 at 4:10 PM, Victor Stinner victor.stin...@haypocalc.com wrote: Hi, the builtin open() function is unable to open a UTF-16/32 file starting with a BOM if the encoding is not specified (it raises a Unicode error). For a UTF-8 file starting with a BOM, read()/readline() also returns the BOM, whereas the BOM should be ignored. I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy talk. And for the other two, perhaps it would make more sense to have a separate encoding-guessing function that takes a binary stream and returns a text stream wrapping it with the proper encoding? It *is* crazy, but unfortunately rather common. Wikipedia has a good description of the issues: http://en.wikipedia.org/wiki/UTF-8#Byte-order_mark. Basically, some Windows text APIs will emit a UTF-8 BOM in order to identify the file as being UTF-8, so it's become a convention to do that. That's not good enough, so you need to guess the encoding as well to make sure, but if there is a BOM and you can otherwise verify that the file is probably UTF-8 encoded, you should discard it.
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
On Thu, Jan 7, 2010 at 7:34 PM, Glyph Lefkowitz gl...@twistedmatrix.com wrote: On Jan 7, 2010, at 7:52 PM, Guido van Rossum wrote: On Thu, Jan 7, 2010 at 4:10 PM, Victor Stinner victor.stin...@haypocalc.com wrote: Hi, the builtin open() function is unable to open a UTF-16/32 file starting with a BOM if the encoding is not specified (it raises a Unicode error). For a UTF-8 file starting with a BOM, read()/readline() also returns the BOM, whereas the BOM should be ignored. I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy talk. And for the other two, perhaps it would make more sense to have a separate encoding-guessing function that takes a binary stream and returns a text stream wrapping it with the proper encoding? It *is* crazy, but unfortunately rather common. Wikipedia has a good description of the issues: http://en.wikipedia.org/wiki/UTF-8#Byte-order_mark. Basically, some Windows text APIs will emit a UTF-8 BOM in order to identify the file as being UTF-8, so it's become a convention to do that. That's not good enough, so you need to guess the encoding as well to make sure, but if there is a BOM and you can otherwise verify that the file is probably UTF-8 encoded, you should discard it. That doesn't make sense. If the file isn't UTF-8 you can't see the BOM, because the BOM itself is UTF-8-encoded. (And yes, I know this happens. Doesn't mean we need to auto-guess by default; there are lots of issues e.g. what should happen after seeking to offset 0?) -- --Guido van Rossum (python.org/~guido)
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
Guido van Rossum writes: I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy talk. That doesn't stop many applications from doing it. Python should perhaps <wink, nudge> not produce UTF-8 + BOM without a disclaimer of indemnification against all resulting damage, signed in blood, from the user for each instance. But it should do something sane when reading such files. I can't really see any harm in throwing it away, especially since use of ZERO-WIDTH NO-BREAK SPACE as a joining character has been deprecated IIRC.
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
Guido van Rossum wrote: On Thu, Jan 7, 2010 at 7:34 PM, Glyph Lefkowitz gl...@twistedmatrix.com wrote: On Jan 7, 2010, at 7:52 PM, Guido van Rossum wrote: On Thu, Jan 7, 2010 at 4:10 PM, Victor Stinner victor.stin...@haypocalc.com wrote: Hi, the builtin open() function is unable to open a UTF-16/32 file starting with a BOM if the encoding is not specified (it raises a Unicode error). For a UTF-8 file starting with a BOM, read()/readline() also returns the BOM, whereas the BOM should be ignored. I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy talk. And for the other two, perhaps it would make more sense to have a separate encoding-guessing function that takes a binary stream and returns a text stream wrapping it with the proper encoding? It *is* crazy, but unfortunately rather common. Wikipedia has a good description of the issues: http://en.wikipedia.org/wiki/UTF-8#Byte-order_mark. Basically, some Windows text APIs will emit a UTF-8 BOM in order to identify the file as being UTF-8, so it's become a convention to do that. That's not good enough, so you need to guess the encoding as well to make sure, but if there is a BOM and you can otherwise verify that the file is probably UTF-8 encoded, you should discard it. That doesn't make sense. If the file isn't UTF-8 you can't see the BOM, because the BOM itself is UTF-8-encoded. (And yes, I know this happens. Doesn't mean we need to auto-guess by default; there are lots of issues e.g. what should happen after seeking to offset 0?) The BOM should not be seekable if the file is opened with the proposed "guess encoding from BOM" mode: it isn't properly part of the stream at all in that case. A UTF-8 BOM is an absurdity, but it exists *everywhere* in the wild: Python would do well to make it as easy as possible to consume such files, as well as the non-insane versions (UTF-16 / UTF-32 BOMs). In the best of all possible worlds, I would just try opening the file so: f = open('/path/to/file', 'r', encoding=DWIFM) and any BOM present would set the encoding for the remainder of the stream. Tres. -- Tres Seaver +1 540-429-0999 tsea...@palladion.com Palladion Software Excellence by Design http://palladion.com
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
On Jan 7, 2010, at 11:21 PM, Guido van Rossum wrote: On Thu, Jan 7, 2010 at 7:34 PM, Glyph Lefkowitz gl...@twistedmatrix.com wrote: On Jan 7, 2010, at 7:52 PM, Guido van Rossum wrote: I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy talk. And for the other two, perhaps it would make more sense to have a separate encoding-guessing function that takes a binary stream and returns a text stream wrapping it with the proper encoding? It *is* crazy, but unfortunately rather common. Wikipedia has a good description of the issues: http://en.wikipedia.org/wiki/UTF-8#Byte-order_mark. Basically, some Windows text APIs will emit a UTF-8 BOM in order to identify the file as being UTF-8, so it's become a convention to do that. That's not good enough, so you need to guess the encoding as well to make sure, but if there is a BOM and you can otherwise verify that the file is probably UTF-8 encoded, you should discard it. That doesn't make sense. If the file isn't UTF-8 you can't see the BOM, because the BOM itself is UTF-8-encoded. I'm saying that the BOM itself isn't enough to detect that the file is actually UTF-8. If (for whatever reason: explicitly specified, guessed in some other way) the file's encoding is determined to be something else, the bytes comprising the BOM should be decoded as normal. It's just that the UTF-8 decoding of the BOM at the start of a file should be "". (And yes, I know this happens. Doesn't mean we need to auto-guess by default; there are lots of issues e.g. what should happen after seeking to offset 0?) I think it's pretty clear that the BOM should still be skipped in that case ...
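As a point of comparison for the seek question, this is how a text stream opened with the existing utf-8-sig codec behaves in current Python 3: seeking back to offset 0 resets the decoder, so the signature is skipped again on the next read. This is a sketch against an in-memory stream, and it shows the existing codec only, not the BOM-guessing mode discussed in this thread.

```python
import codecs
import io

raw = io.BytesIO(codecs.BOM_UTF8 + "hello".encode("utf-8"))
f = io.TextIOWrapper(raw, encoding="utf-8-sig")

# First pass: the signature is consumed by the decoder, not returned.
first = f.read()

# Seeking to offset 0 resets the decoder, so it expects the
# signature again and skips it a second time.
f.seek(0)
second = f.read()
```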