Re: [Python-Dev] Quick sum up about open() + BOM
Hi,

On Saturday 09 January 2010 13:45:58, you wrote:
> > Note: I implemented the BOM check in TextIOWrapper; so it's already
> > usable for any file-like object.
>
> Yes, but the implementation is limited to just BOM checking
> and thus only supports UTF-8-SIG, UTF-16 and UTF-32.

Sure, but that's already better than no BOM check :-) It looks like many people would appreciate UTF-8-SIG detection, since this encoding is common on Windows.

> BTW: I haven't looked at your implementation, but what happens
> when your BOM check fails ? Will the implementation add the
> already read bytes back to a buffer ?

My implementation sits between buffer.read() and decoder.decode(data). If there is a BOM: set the encoding and remove the BOM bytes from the byte string. Otherwise, use another algorithm to choose the encoding and leave the byte string unchanged.

It can be seen as a codec: it works like the UTF-16 and UTF-32 codecs ;-)

> AFAIK, we currently have a moratorium on changes to Python
> builtins. How does that match up with the proposed changes ?

Oh yes, I forgot the moratorium. Some of the proposed solutions don't change the API. E.g. Antoine proposed to leave the API unchanged: open(file) => open(file) :-) I don't know if it's compatible with the moratorium or not.

-- 
Victor Stinner
http://www.haypocalc.com/
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
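The step described above (look at the first bytes, pick the encoding, drop the BOM, otherwise leave the bytes alone) can be sketched roughly as follows. This is not the actual patch: the name split_bom and the BOMS table are made up for illustration, and the default encoding is simply passed in by the caller.

```python
import codecs

# UTF-32 BOMs are tested first: the UTF-32-LE BOM (FF FE 00 00) starts with
# the UTF-16-LE BOM (FF FE), so the longer match has to win.
BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def split_bom(data, default):
    """Hypothetical helper: return (encoding, payload) for the first chunk.

    If the chunk starts with a BOM, the BOM is removed and the matching
    encoding is returned; otherwise the chunk is returned unchanged together
    with the caller's default encoding.
    """
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding, data[len(bom):]
    return default, data
```

Since nothing is pushed back into the buffer, no rollback is needed when the check fails: the bytes are just handed to the decoder as they are.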
Re: [Python-Dev] Quick sum up about open() + BOM
On Saturday 09 January 2010 02:12:28, MRAB wrote:
> What about listing the possible encodings? It would try each in turn
> until it found one where the BOM matched or had no BOM:
>
>     my_file = open(filename, 'r', encoding='UTF-8-sig|UTF-16|UTF-8')
>
> or is that taking it too far?

Yes, you're taking it too far :-) Checking for a BOM is reliable, whereas *guessing* the charset from the byte stream alone can only be a heuristic. Guessing a charset is a complex problem; there are third-party libraries that do it, like the chardet project.

-- 
Victor Stinner
http://www.haypocalc.com/
Re: [Python-Dev] Quick sum up about open() + BOM
On Saturday 09 January 2010 01:47:38, you wrote:
> One concern I have with this implementation encoding="BOM" is that if
> there is no BOM it assumes UTF-8.

If no BOM is found, it falls back to the current heuristic: os.device_encoding() or the system locale.

> (...) Hence, it might be that someone would expect a UTF-16LE (or any of
> the formats that don't require a BOM, rather than UTF-8), but be willing
> to accept any BOM-discriminated format.
> (...) declare that they will accept
> any BOM-discriminated format, but want to default, in the absence of a
> BOM, to the original national language locale that they historically
> accepted

You mean "if there is a BOM, use it, otherwise fall back to a specific charset"? How could it be declared? Maybe:

    open("file.txt", check_bom=True, encoding="UTF-16LE")
    open("file.txt", check_bom=True, encoding="latin1")

Falling back to UTF-8 would be written:

    open("file.txt", check_bom=True, encoding="UTF-8")

As explained before, check_bom=True is only accepted for read-only file modes.

Well, why not. This is a third choice for my point (1) :-) It sits between Guido's and Antoine's choices, and I like it because we can fall back to UTF-8 instead of the dummy system locale: Windows users will be happy to be able to use UTF-8 :-)

I prefer to fall back to a fixed encoding rather than depend on the system locale.

-- 
Victor Stinner
http://www.haypocalc.com/
Re: [Python-Dev] Quick sum up about open() + BOM
Victor Stinner wrote:
> (2) Check for a BOM while reading or detect it before?
>
> Everybody agree that checking BOM is an interesting option and should not be
> limited to open().
>
> Marc-Andre proposed a codecs.guess_file_encoding() function accepting a file
> name or a binary file-like object: it returns the encoding and seek to the
> file start or just after the BOM.
>
> I dislike this function because it requires extra file operations (open
> (optional), read() and seek()) and it doesn't work if the file is not
> seekable (eg. a pipe). I prefer to check for a BOM at first read in
> TextIOWrapper to avoid extra file operations.
>
> Note: I implemented the BOM check in TextIOWrapper; so it's already usable
> for any file-like object.

Yes, but the implementation is limited to just BOM checking and thus only supports UTF-8-SIG, UTF-16 and UTF-32.

With a codecs module function we could easily extend the encoding detection to more file types, e.g. XML files, Python source code files, etc. that use other mechanisms for defining the encoding.

BTW: I haven't looked at your implementation, but what happens when your BOM check fails ? Will the implementation add the already read bytes back to a buffer ?

This rollback action is the only reason for needing a seekable stream in codecs.guess_stream_encoding().

Another point to consider:

AFAIK, we currently have a moratorium on changes to Python builtins. How does that match up with the proposed changes ?

Using a new codec like Walter suggested would move the implementation into the stdlib, for which the moratorium doesn't apply.

-- 
Marc-Andre Lemburg
eGenix.com
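To make the rollback point concrete, here is a rough sketch of what a codecs-level helper could look like. No such function exists in the codecs module; the name follows the proposal in this thread, and the signature and return values are assumptions made for illustration.

```python
import codecs
import io

# UTF-32 first: its LE BOM begins with the UTF-16-LE BOM.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def guess_stream_encoding(stream, default=None):
    """Hypothetical helper: detect a BOM on a seekable binary stream.

    Returns the encoding name and leaves the stream just after the BOM,
    or returns `default` and rewinds to where the stream was.
    """
    start = stream.tell()
    head = stream.read(4)
    for bom, name in _BOMS:
        if head.startswith(bom):
            stream.seek(start + len(bom))
            return name
    stream.seek(start)  # the rollback: this is what requires seek()
    return default


bio = io.BytesIO(b"\xff\xfea\x00b\x00")
encoding = guess_stream_encoding(bio)
text = bio.read().decode(encoding)
```

The two seek() calls are the only places where seekability matters. Detection for XML or Python source files would presumably plug in next to the BOM loop, but that part is not sketched here.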
Re: [Python-Dev] Quick sum up about open() + BOM
On Saturday 09 January 2010 02:23:07, Martin v. Löwis wrote:
> While I would support combining BOM detection in the case where a file
> is opened for reading and no encoding is specified, I see two problems:
> a) if a seek operations is performed before having looked at the BOM,
>    no determination would have been made

TextIOWrapper doesn't support seeking to an arbitrary byte offset. It uses a "cookie", which is an opaque value. Reusing a cookie from another file, or a stale cookie, is forbidden (although it doesn't raise an error). This is not specific to the BOM check: the problem already exists for encodings using a BOM (e.g. UTF-16).

> b) what encoding should it use on writing?

Don't change anything for writing.

With Antoine's choice: open('file.txt', 'w', encoding=None) continues to use the current heuristic (os.device_encoding() or the system locale).

With Guido's choice, encoding="BOM": it raises an error, because the BOM check is not supported when writing into a file. How could the BOM be checked when creating a new (empty) file!?

-- 
Victor Stinner
http://www.haypocalc.com/
Re: [Python-Dev] Quick sum up about open() + BOM
On 09.01.10 01:47, Glenn Linderman wrote:
> On approximately 1/8/2010 3:59 PM, came the following characters from the keyboard of Victor Stinner:
>> Hi,
>>
>> Thanks for all the answers! I will try to sum up all ideas here.
>
> One concern I have with this implementation encoding="BOM" is that if there is no BOM it assumes UTF-8. That is probably a good assumption in some circumstances, but not in others.
>
> * It is not required that UTF-16LE, UTF-16BE, UTF-32LE, or UTF-32BE encoded files include a BOM. It is only required that UTF-16 and UTF-32 (cases where the endianness is unspecified) contain a BOM. Hence, it might be that someone would expect a UTF-16LE (or any of the formats that don't require a BOM, rather than UTF-8), but be willing to accept any BOM-discriminated format.
>
> * Potentially, this could be expanded beyond the various Unicode encodings... one could envision that a program whose data files historically were in any particular national language locale, could want to be enhanced to accept Unicode, and could declare that they will accept any BOM-discriminated format, but want to default, in the absence of a BOM, to the original national language locale that they historically accepted. That would provide a migration path for their old data files.
>
> So the point is, that it might be nice to have "BOM-otherEncodingForDefault" for each other encoding that Python supports. Not sure that is the right API, but I think it is expressive enough to handle the cases above. Whether the cases solve actual problems or not, I couldn't say, but they seem like reasonable cases.

This is doable with the current API. Simply define a codec search function that handles all encoding names that start with "BOM-" and pass the "otherEncodingForDefault" part along to the codec.

> It would, of course, be nicest if OS metadata had been invented way back when, for all OSes, such that all text files were flagged with their encoding... then languages could just read the encoding and do the right thing! But we live in the real world, instead.

Servus,
Walter
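A minimal sketch of such a search function, under a few assumptions: the names bom_search and _make_decode are invented here, only whole-buffer decoding is covered (no incremental decoder, so this version would not work with open() / TextIOWrapper), and encoding simply delegates to the fallback codec.

```python
import codecs

# UTF-32 first: its LE BOM begins with the UTF-16-LE BOM.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def _make_decode(fallback):
    def decode(data, errors="strict"):
        data = bytes(data)  # bytes.decode() may hand the codec a memoryview
        for bom, name in _BOMS:
            if data.startswith(bom):
                return data[len(bom):].decode(name, errors), len(data)
        return data.decode(fallback, errors), len(data)
    return decode


def bom_search(name):
    """Handle "BOM-<fallback>" names; return None for everything else."""
    # The registry lowercases names; newer Pythons also turn "-" into "_".
    key = name.lower().replace("-", "_")
    if not key.startswith("bom_"):
        return None
    fallback = key[len("bom_"):]
    try:
        info = codecs.lookup(fallback)
    except LookupError:
        return None
    return codecs.CodecInfo(
        name=key,
        encode=info.encode,  # writing is left to the fallback codec
        decode=_make_decode(fallback),
    )


codecs.register(bom_search)

from_bom = b"\xff\xfea\x00b\x00".decode("BOM-latin1")
from_fallback = b"caf\xe9".decode("BOM-latin1")
```

Making this usable from open() would additionally need incremental decoder and stream reader classes in the CodecInfo, which is where most of the real work would be.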
Re: [Python-Dev] Quick sum up about open() + BOM
It seems to me that when opening a file, the following is the only flow that makes sense for the typical case:

    if encoding is not None:
        use encoding
    elif file has BOM:
        use BOM
    else:
        use system default

And hence an encoding='BOM' isn't needed there. Although I'm trying to come up with use cases that don't work with this, I can't. :)

BUT

When writing, things are not so easy. Apparently some encodings require a BOM to be written, others do not require it but allow it, and some have no byte order mark at all. So there you have to be able to write the BOM, or not. And that's either a new parameter, because you can't use encoding='BOM' since you need to specify the encoding as well, or a new method. I would suggest a BOM parameter, and maybe a method as well.

    BOM=None|True|False

Where "None" means a sane default behaviour, that is, write a BOM if the encoding requires it. "True" means write a BOM if the encoding *supports* it. "False" means don't write a BOM even if the encoding requires it (because I know what I'm doing).

    if 'w' in mode:  # But not 'r' or 'a'
        if BOM is True and encoding in (ENCODINGS THAT ALLOW BOM):
            write_bom = True
        elif BOM is False:
            write_bom = False
        elif BOM is None and encoding in (ENCODINGS THAT REQUIRE BOM):
            write_bom = True
        else:
            write_bom = False
    else:
        write_bom = False

For reading, this parameter could either be a no-op, or possibly change the behaviour somehow, if a use case where that makes sense can be imagined.

-- 
Lennart Regebro: http://regebro.wordpress.com/
Python 3 Porting: http://python-incompatibility.googlecode.com/
+33 661 58 14 64
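The write-side rule above can be turned into a small runnable function. The two sets are assumptions for the sake of the example: following the thread, "utf-16" and "utf-32" (endianness unspecified) are treated as requiring a BOM, while UTF-8 and the explicit-endianness variants merely allow one. The function name and the lowercase spelling of the encoding names are also made up here.

```python
# Assumed classification, based on the discussion in this thread.
REQUIRE_BOM = {"utf-16", "utf-32"}
ALLOW_BOM = REQUIRE_BOM | {
    "utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be",
}


def should_write_bom(mode, encoding, bom=None):
    """Decide whether a BOM should be written (hypothetical helper).

    bom=None  -> write one only if the encoding requires it
    bom=True  -> write one if the encoding supports it
    bom=False -> never write one
    """
    if "w" not in mode:  # 'r' and 'a' never write a BOM
        return False
    name = encoding.lower()
    if bom is True:
        return name in ALLOW_BOM
    if bom is False:
        return False
    return name in REQUIRE_BOM
```

For example, should_write_bom("w", "utf-8") is False while should_write_bom("w", "utf-8", bom=True) is True, which matches the three-state parameter described above.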
Re: [Python-Dev] Quick sum up about open() + BOM
On approximately 1/8/2010 5:12 PM, came the following characters from the keyboard of MRAB:
> Glenn Linderman wrote:
>> On approximately 1/8/2010 3:59 PM, came the following characters from the keyboard of Victor Stinner:
>>> Hi,
>>>
>>> Thanks for all the answers! I will try to sum up all ideas here.
>>
>> One concern I have with this implementation encoding="BOM" is that if there is no BOM it assumes UTF-8. That is probably a good assumption in some circumstances, but not in others.
>>
>> * It is not required that UTF-16LE, UTF-16BE, UTF-32LE, or UTF-32BE encoded files include a BOM. It is only required that UTF-16 and UTF-32 (cases where the endianness is unspecified) contain a BOM. Hence, it might be that someone would expect a UTF-16LE (or any of the formats that don't require a BOM, rather than UTF-8), but be willing to accept any BOM-discriminated format.
>>
>> * Potentially, this could be expanded beyond the various Unicode encodings... one could envision that a program whose data files historically were in any particular national language locale, could want to be enhanced to accept Unicode, and could declare that they will accept any BOM-discriminated format, but want to default, in the absence of a BOM, to the original national language locale that they historically accepted. That would provide a migration path for their old data files.
>>
>> So the point is, that it might be nice to have "BOM-otherEncodingForDefault" for each other encoding that Python supports. Not sure that is the right API, but I think it is expressive enough to handle the cases above. Whether the cases solve actual problems or not, I couldn't say, but they seem like reasonable cases.
>>
>> It would, of course, be nicest if OS metadata had been invented way back when, for all OSes, such that all text files were flagged with their encoding... then languages could just read the encoding and do the right thing! But we live in the real world, instead.
>
> What about listing the possible encodings? It would try each in turn until it found one where the BOM matched or had no BOM:
>
>     my_file = open(filename, 'r', encoding='UTF-8-sig|UTF-16|UTF-8')
>
> or is that taking it too far?

That sounds very flexible -- but in net effect it would only make illegal a subset of the BOM-containing encodings (those not listed) without making legal any additional encodings other than the non-BOM encoding. Whether prohibiting a subset of BOM-containing encodings is a useful use case, I couldn't say... but my goal would be to include as many different file encodings on input as possible: without a BOM, that is exactly 1 (unless there are other heuristics), with a BOM, it is 1+all-BOM-containing encodings. Your scheme would permit the number of encodings accepted to vary between 1 and 1+all-BOM-containing encodings.

(I think everyone can agree there are 5 different byte sequences that can be called a Unicode BOM. The likelihood of them appearing in any other text encoding created by mankind depends on those other encodings -- but it is not impossible. It is truly up to the application to decide whether BOM detection could potentially conflict with files in some other encoding that would be acceptable to the application.)

So I think it is taking it further than I can see value in, but I'm willing to be convinced otherwise. I see only a need for detecting BOM, and specifying a default encoding to be used if there is no BOM. Note that it might be nice to have a specification for using the current encoding=None heuristic -- perhaps encoding="BOM-None" per my originally proposed syntax. But I'm still not saying that is the best syntax.

-- 
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
Re: [Python-Dev] Quick sum up about open() + BOM
>>> Antoine would like to check BOM by default, because both options
>>> (system locale vs checking for BOM) is the same thing.
>>
>> To be clear, I am not saying it is the same thing. What I think is
>> that it would be a mistake to use a mildly unreliable heuristic by
>> default (the locale + device encoding heuristic) but refuse to
>> trust a more reliable heuristic (the BOM-based detection
>> algorithm).
>
> I concur. On Windows both UTF-8 and signature are very common, yet
> the platform default is the truly awful CP1252.

While I would support combining BOM detection in the case where a file is opened for reading and no encoding is specified, I see two problems:

a) if a seek operation is performed before having looked at the BOM, no determination would have been made
b) what encoding should it use on writing?

Regards,
Martin
Re: [Python-Dev] Quick sum up about open() + BOM
Glenn Linderman wrote:
> On approximately 1/8/2010 3:59 PM, came the following characters from the keyboard of Victor Stinner:
>> Hi,
>>
>> Thanks for all the answers! I will try to sum up all ideas here.
>
> One concern I have with this implementation encoding="BOM" is that if there is no BOM it assumes UTF-8. That is probably a good assumption in some circumstances, but not in others.
>
> * It is not required that UTF-16LE, UTF-16BE, UTF-32LE, or UTF-32BE encoded files include a BOM. It is only required that UTF-16 and UTF-32 (cases where the endianness is unspecified) contain a BOM. Hence, it might be that someone would expect a UTF-16LE (or any of the formats that don't require a BOM, rather than UTF-8), but be willing to accept any BOM-discriminated format.
>
> * Potentially, this could be expanded beyond the various Unicode encodings... one could envision that a program whose data files historically were in any particular national language locale, could want to be enhanced to accept Unicode, and could declare that they will accept any BOM-discriminated format, but want to default, in the absence of a BOM, to the original national language locale that they historically accepted. That would provide a migration path for their old data files.
>
> So the point is, that it might be nice to have "BOM-otherEncodingForDefault" for each other encoding that Python supports. Not sure that is the right API, but I think it is expressive enough to handle the cases above. Whether the cases solve actual problems or not, I couldn't say, but they seem like reasonable cases.
>
> It would, of course, be nicest if OS metadata had been invented way back when, for all OSes, such that all text files were flagged with their encoding... then languages could just read the encoding and do the right thing! But we live in the real world, instead.

What about listing the possible encodings? It would try each in turn until it found one where the BOM matched or had no BOM:

    my_file = open(filename, 'r', encoding='UTF-8-sig|UTF-16|UTF-8')

or is that taking it too far?
Re: [Python-Dev] Quick sum up about open() + BOM
On approximately 1/8/2010 3:59 PM, came the following characters from the keyboard of Victor Stinner:
> Hi,
>
> Thanks for all the answers! I will try to sum up all ideas here.

One concern I have with this implementation encoding="BOM" is that if there is no BOM it assumes UTF-8. That is probably a good assumption in some circumstances, but not in others.

* It is not required that UTF-16LE, UTF-16BE, UTF-32LE, or UTF-32BE encoded files include a BOM. It is only required that UTF-16 and UTF-32 (cases where the endianness is unspecified) contain a BOM. Hence, it might be that someone would expect a UTF-16LE (or any of the formats that don't require a BOM, rather than UTF-8), but be willing to accept any BOM-discriminated format.

* Potentially, this could be expanded beyond the various Unicode encodings... one could envision that a program whose data files historically were in any particular national language locale, could want to be enhanced to accept Unicode, and could declare that they will accept any BOM-discriminated format, but want to default, in the absence of a BOM, to the original national language locale that they historically accepted. That would provide a migration path for their old data files.

So the point is, that it might be nice to have "BOM-otherEncodingForDefault" for each other encoding that Python supports. Not sure that is the right API, but I think it is expressive enough to handle the cases above. Whether the cases solve actual problems or not, I couldn't say, but they seem like reasonable cases.

It would, of course, be nicest if OS metadata had been invented way back when, for all OSes, such that all text files were flagged with their encoding... then languages could just read the encoding and do the right thing! But we live in the real world, instead.

-- 
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
Re: [Python-Dev] Quick sum up about open() + BOM
On 09/01/2010 00:09, Antoine Pitrou wrote:
> Hello Victor,
>
> Victor Stinner haypocalc.com> writes:
>> (1) Change default open() behaviour or make it optional?
>> [...]
>>
>> Antoine would like to check BOM by default, because both options (system
>> locale vs checking for BOM) is the same thing.
>
> To be clear, I am not saying it is the same thing. What I think is that it would be a mistake to use a mildly unreliable heuristic by default (the locale + device encoding heuristic) but refuse to trust a more reliable heuristic (the BOM-based detection algorithm).

I concur. On Windows both UTF-8 and signature are very common, yet the platform default is the truly awful CP1252.

All the best,

Michael

> Regards
>
> Antoine.

-- 
http://www.ironpythoninaction.com/
http://www.voidspace.org.uk/blog
Re: [Python-Dev] Quick sum up about open() + BOM
Hello Victor,

Victor Stinner haypocalc.com> writes:
>
> (1) Change default open() behaviour or make it optional?
> [...]
>
> Antoine would like to check BOM by default, because both options (system
> locale vs checking for BOM) is the same thing.

To be clear, I am not saying it is the same thing. What I think is that it would be a mistake to use a mildly unreliable heuristic by default (the locale + device encoding heuristic) but refuse to trust a more reliable heuristic (the BOM-based detection algorithm).

Regards

Antoine.
[Python-Dev] Quick sum up about open() + BOM
Hi,

Thanks for all the answers! I will try to sum up all ideas here.

(1) Change default open() behaviour or make it optional?

Guido would like to add an option and keep open() unchanged. He wrote that checking for a BOM and using the system locale are too different to be the same option (encoding=None).

Antoine would like to check BOM by default, because both options (system locale vs checking for BOM) are the same thing.

About Antoine's choice (encoding=None): which file modes would check for a BOM? I would like to answer "only the read-only mode", but then open(filename, "r") and open(filename, "r+") would behave differently?
=> 1 point for Guido (encoding="BOM" is more explicit)

Antoine's choice has the advantage of directly supporting UTF-8+BOM, UTF-16 and UTF-32 for all applications and all modules using open(filename).
=> 1 point for Antoine

(2) Check for a BOM while reading or detect it before?

Everybody agrees that checking for a BOM is an interesting option and should not be limited to open().

Marc-Andre proposed a codecs.guess_file_encoding() function accepting a file name or a binary file-like object: it returns the encoding and seeks to the file start or just after the BOM.

I dislike this function because it requires extra file operations (open (optional), read() and seek()) and it doesn't work if the file is not seekable (e.g. a pipe). I prefer to check for a BOM at first read in TextIOWrapper to avoid extra file operations.

Note: I implemented the BOM check in TextIOWrapper; so it's already usable for any file-like object.

(3) tell() and seek() on a text file starting with a BOM

To be consistent with Antoine's example:

    >>> bio = io.BytesIO(b'\xff\xfea\x00b\x00')
    >>> f = io.TextIOWrapper(bio, encoding='utf-16')
    >>> f.read()
    'ab'
    >>> f.seek(0)
    0
    >>> f.read()
    'ab'

TextIOWrapper:
 * tell() should return zero at file start,
 * seek(0) should go back to the file start,
 * and the BOM should always be "ignored".

I mean:

    with open("utf8bom.txt", encoding="BOM") as fp:
        assert fp.tell() == 0
        text = fp.read()  # no BOM here
        fp.seek(0)
        assert fp.read() == text

--

About my patch:
 - BOM check is explicit: open(filename, encoding="BOM")
 - tell() / seek(0) work as expected
 - BOM bytes are always skipped in the read() / readlines() result

Hum, I don't know if this email can be called a sum up ;-)

-- 
Victor Stinner
http://www.haypocalc.com/
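The expected tell() / seek(0) behaviour from point (3) can already be checked against the existing BOM-aware codecs, without any patch. The sketch below uses only io.BytesIO and io.TextIOWrapper; the roundtrip helper is just a name chosen for this example, and encoding="BOM" itself is not involved since it only exists in the proposed patch.

```python
import io


def roundtrip(raw, encoding):
    """Read a BOM-prefixed byte string twice through TextIOWrapper."""
    f = io.TextIOWrapper(io.BytesIO(raw), encoding=encoding)
    start = f.tell()   # position before anything has been read
    first = f.read()   # text without the BOM
    f.seek(0)          # back to the file start
    second = f.read()  # the BOM must be skipped again
    return start, first, second


utf16 = roundtrip(b"\xff\xfea\x00b\x00", "utf-16")
utf8sig = roundtrip(b"\xef\xbb\xbfab", "utf-8-sig")
```

In both cases tell() is 0 at the start and the two reads return the same text with no BOM character in it, which is the behaviour an encoding="BOM" mode would have to preserve.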