
Re: [Python-Dev] Quick sum up about open() + BOM

2010-01-09 Thread Victor Stinner

Hi,

On Saturday 09 January 2010 13:45:58, you wrote:
> > Note: I implemented the BOM check in TextIOWrapper; so it's already
> > usable for any file-like object.
> 
> Yes, but the implementation is limited to just BOM checking
> and thus only supports UTF-8-SIG, UTF-16 and UTF-32.

Sure, but that's already better than no BOM check :-) It looks like many 
people would appreciate UTF-8-SIG detection, since this encoding is common on 
Windows.

> BTW: I haven't looked at your implementation, but what happens
> when your BOM check fails ? Will the implementation add the
> already read bytes back to a buffer ?

My implementation sits between buffer.read() and decoder.decode(data). If 
there is a BOM, it sets the encoding and removes the BOM bytes from the byte 
string. Otherwise, it uses another algorithm to choose the encoding and leaves 
the byte string unchanged.

It can be seen as a codec: it works like UTF-16 and UTF-32 codecs ;-)
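
A minimal sketch of the step described above, not the actual patch: it looks at the first bytes returned by buffer.read(), picks an encoding if a BOM is present and strips the BOM, otherwise returns the bytes unchanged with a caller-supplied fallback. The function name split_bom and the fallback parameter are hypothetical; only the codecs.BOM_* constants come from the standard library.

```python
import codecs

# Longest signatures first: the UTF-32-LE BOM begins with the UTF-16-LE BOM.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def split_bom(data, fallback):
    """Return (encoding, payload) for the first chunk of a byte stream.

    If data starts with a known BOM, the BOM is removed and the matching
    encoding is returned; otherwise data is returned unchanged together
    with the fallback encoding (hypothetical parameter).
    """
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding, data[len(bom):]
    return fallback, data
```

Note that a UTF-16-LE file whose first character is U+0000 is indistinguishable from UTF-32-LE with this check, so the ordering above is a choice, not a proof.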

> AFAIK, we currently have a moratorium on changes to Python
> builtins. How does that match up with the proposed changes ?

Oh yes, I forgot the moratorium. Among the proposed solutions, some don't 
change the API. E.g. Antoine proposed to leave the API unchanged: open(file) => 
open(file) :-) I don't know if it's compatible with the moratorium or not.

-- 
Victor Stinner
http://www.haypocalc.com/
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Quick sum up about open() + BOM

2010-01-09 Thread Victor Stinner

On Saturday 09 January 2010 02:12:28, MRAB wrote:
> What about listing the possible encodings? It would try each in turn
> until it found one where the BOM matched or had no BOM:
> 
>  my_file = open(filename, 'r', encoding='UTF-8-sig|UTF-16|UTF-8')
>
> or is that taking it too far?

Yes, you're taking it too far :-) Checking for a BOM is reliable, whereas 
*guessing* the charset using only the byte stream can only be a heuristic. 
Guessing a charset is a complex problem; there are third-party libraries to do 
that, like the chardet project.

-- 
Victor Stinner
http://www.haypocalc.com/

Re: [Python-Dev] Quick sum up about open() + BOM

2010-01-09 Thread Victor Stinner

On Saturday 09 January 2010 01:47:38, you wrote:
> One concern I have with this implementation encoding="BOM" is that if
> there is no BOM it assumes UTF-8.

If no BOM is found, it falls back to the current heuristic: 
os.device_encoding() or the system locale.

> (...) Hence, it might be that someone would expect a UTF-16LE (or any of 
> the formats that don't require a BOM, rather than UTF-8), but be willing 
> to accept any BOM-discriminated format.
> (...) declare that they will accept
> any BOM-discriminated format, but want to default, in the absence of a
> BOM, to the original national language locale that they historically
> accepted

You mean "if there is a BOM, use it, otherwise fall back to a specific 
charset"? How could it be declared? Maybe:

   open("file.txt", check_bom=True, encoding="UTF-16-LE")
   open("file.txt", check_bom=True, encoding="latin1")

About falling back to UTF-8, it would be written:

   open("file.txt", check_bom=True, encoding="UTF-8")

As explained before, check_bom=True is only accepted for read-only file modes.

Well, why not. This is a third choice for my point (1) :-) It's between Guido's 
and Antoine's choices, and I like it because we can fall back to UTF-8 instead 
of the dummy system locale: Windows users will be happy to be able to use UTF-8 
:-) I prefer falling back to a fixed encoding rather than depending on the 
system locale.
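
The "use the BOM if present, otherwise a fixed fallback" behaviour can be approximated today without any new open() parameter. The sketch below is only an illustration under stated assumptions: wrap_check_bom is a hypothetical helper, it needs a seekable binary file (so it does not cover pipes), and it maps each BOM to one of the existing BOM-aware codecs (utf-8-sig, utf-16, utf-32) so that the codec itself consumes the signature.

```python
import codecs
import io

# Longest signatures first: the UTF-32-LE BOM begins with the UTF-16-LE BOM.
_BOM_CODECS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def wrap_check_bom(binary_file, encoding):
    """Wrap a seekable binary file as text, preferring a BOM over encoding."""
    head = binary_file.read(4)   # the longest BOM is 4 bytes
    binary_file.seek(0)          # rewind: the chosen codec skips the BOM itself
    for bom, bom_codec in _BOM_CODECS:
        if head.startswith(bom):
            encoding = bom_codec
            break
    return io.TextIOWrapper(binary_file, encoding=encoding)
```

For example, wrap_check_bom(open("file.txt", "rb"), "latin1") would read a UTF-8 file with a signature as UTF-8 and anything without a BOM as Latin-1.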

-- 
Victor Stinner
http://www.haypocalc.com/

Re: [Python-Dev] Quick sum up about open() + BOM

2010-01-09 Thread M.-A. Lemburg

Victor Stinner wrote:
> (2) Check for a BOM while reading or detect it before?
> 
> Everybody agree that checking BOM is an interesting option and should not be 
> limited to open().
> 
> Marc-Andre proposed a codecs.guess_file_encoding() function accepting a file 
> name or a binary file-like object: it returns the encoding and seek to the 
> file start or just after the BOM.
> 
> I dislike this function because it requires extra file operations (open 
> (optional), read() and seek()) and it doesn't work if the file is not 
> seekable 
> (eg. a pipe). I prefer to check for a BOM at first read in TextIOWrapper to 
> avoid extra file operations.
> 
> Note: I implemented the BOM check in TextIOWrapper; so it's already usable 
> for 
> any file-like object.

Yes, but the implementation is limited to just BOM checking
and thus only supports UTF-8-SIG, UTF-16 and UTF-32.

With a codecs module function we could easily extend the
encoding detection to more file types, e.g. XML files,
Python source code files, etc. that use other mechanisms
for defining the encoding.

BTW: I haven't looked at your implementation, but what happens
when your BOM check fails ? Will the implementation add the
already read bytes back to a buffer ?

This rollback action is the only reason for needing a
seekable stream in codecs.guess_stream_encoding().

Another point to consider:

AFAIK, we currently have a moratorium on changes to Python
builtins. How does that match up with the proposed changes ?

Using a new codec like Walter suggested would move the
implementation into the stdlib, to which the
moratorium doesn't apply.
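
As an illustration of detection that goes beyond a bare BOM check, the standard library already handles one of the file types mentioned above: tokenize.detect_encoding() reads the BOM and the coding declaration of Python source. The wrapper name python_source_encoding below is hypothetical; it is a small sketch, not the proposed codecs function.

```python
import io
import tokenize


def python_source_encoding(data):
    """Return the encoding declared by Python source given as bytes."""
    # detect_encoding() takes a readline callable and returns the encoding
    # name plus the raw lines it had to read to decide.
    encoding, _lines_read = tokenize.detect_encoding(io.BytesIO(data).readline)
    return encoding
```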

-- 
Marc-Andre Lemburg
eGenix.com


Re: [Python-Dev] Quick sum up about open() + BOM

2010-01-09 Thread Victor Stinner

On Saturday 09 January 2010 02:23:07, Martin v. Löwis wrote:
> While I would support combining BOM detection in the case where a file
> is opened for reading and no encoding is specified, I see two problems:
> a) if a seek operations is performed before having looked at the BOM,
>    no determination would have been made

TextIOWrapper doesn't support seeking to an arbitrary byte. It uses a "cookie", 
which is an opaque value. Reusing a cookie from another file, or an old cookie, 
is forbidden (but it doesn't raise an error). This is not specific to BOM 
checking: the problem already exists for encodings using a BOM (e.g. UTF-16).
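
A short illustration of the cookie behaviour with the existing UTF-16 codec: the value returned by tell() is only meaningful when passed back to seek() on the same file. The variable names are arbitrary.

```python
import io

raw = io.BytesIO("first\nsecond\n".encode("utf-16"))
f = io.TextIOWrapper(raw, encoding="utf-16")

first = f.readline()
cookie = f.tell()   # opaque value, not guaranteed to be a plain byte offset
rest = f.read()

f.seek(cookie)      # valid: the cookie came from tell() on this same file
again = f.read()
```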

> b) what encoding should it use on writing?

Don't change anything for writing.

With Antoine's choice: open('file.txt', 'w', encoding=None) continues to use 
the current heuristic (os.device_encoding() or system locale).

With Guido's choice, encoding="BOM": it raises an error, because the BOM check 
is not supported when writing into a file. How could the BOM be checked when 
creating a new (empty) file!?

-- 
Victor Stinner
http://www.haypocalc.com/

Re: [Python-Dev] Quick sum up about open() + BOM

2010-01-09 Thread Walter Dörwald

On 09.01.10 01:47, Glenn Linderman wrote:

> On approximately 1/8/2010 3:59 PM, came the following characters from
> the keyboard of Victor Stinner:
>> Hi,
>>
>> Thanks for all the answers! I will try to sum up all ideas here.
> 
> One concern I have with this implementation encoding="BOM" is that if
> there is no BOM it assumes UTF-8.  That is probably a good assumption in
> some circumstances, but not in others.
> 
> * It is not required that UTF-16LE, UTF-16BE, UTF-32LE, or UTF-32BE
> encoded files include a BOM.  It is only required that UTF-16 and UTF-32
> (cases where the endianness is unspecified) contain a BOM.  Hence, it
> might be that someone would expect a UTF-16LE (or any of the formats
> that don't require a BOM, rather than UTF-8), but be willing to accept
> any BOM-discriminated format.
> 
> * Potentially, this could be expanded beyond the various Unicode
> encodings... one could envision that a program whose data files
> historically were in any particular national language locale, could want
> to be enhance to accept Unicode, and could declare that they will accept
> any BOM-discriminated format, but want to default, in the absence of a
> BOM, to the original national language locale that they historically
> accepted.  That would provide a migration path for their old data files.
> 
> So the point is, that it might be nice to have
> "BOM-otherEncodingForDefault" for each other encoding that Python
> supports.  Not sure that is the right API, but I think it is expressive
> enough to handle the cases above.  Whether the cases solve actual
> problems or not, I couldn't say, but they seem like reasonable cases.

This is doable with the current API. Simply define a codec search
function that handles all encoding names that start with "BOM-" and pass
the "otherEncodingForDefault" part along to the codec.
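
A rough sketch of such a search function, assuming the "BOM-<fallback>" naming from the quoted proposal. It only provides the stateless encode/decode pair, so it works with bytes.decode() but not with open(), which would also need an incremental decoder. The names _make_bom_codec and bom_search are hypothetical; codecs.register(), codecs.lookup() and codecs.CodecInfo are the real stdlib API.

```python
import codecs

# Longest signatures first: the UTF-32-LE BOM begins with the UTF-16-LE BOM.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def _make_bom_codec(fallback):
    fallback_info = codecs.lookup(fallback)  # raises LookupError if unknown

    def decode(data, errors="strict"):
        data = bytes(data)  # the codec machinery may pass a memoryview
        for bom, name in _BOMS:
            if data.startswith(bom):
                text = data[len(bom):].decode(name, errors)
                return text, len(data)
        return fallback_info.decode(data, errors)

    return codecs.CodecInfo(
        encode=fallback_info.encode,  # writing is left to the fallback codec
        decode=decode,
        name="bom-" + fallback_info.name,
    )


def bom_search(name):
    # Depending on the Python version, hyphens may or may not already have
    # been normalized to underscores before the search function is called.
    normalized = name.lower().replace("-", "_")
    if not normalized.startswith("bom_"):
        return None
    try:
        return _make_bom_codec(normalized[len("bom_"):])
    except LookupError:
        return None


codecs.register(bom_search)
```

With this registered, b"...".decode("BOM-latin1") honours a BOM when there is one and decodes as Latin-1 otherwise.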

> It would, of course, be nicest if OS metadata had been invented way back
> when, for all OSes, such that all text files were flagged with their
> encoding... then languages could just read the encoding and do the right
> thing! But we live in the real world, instead.

Servus,
   Walter

Re: [Python-Dev] Quick sum up about open() + BOM

2010-01-08 Thread Lennart Regebro

It seems to me that the following is the only flow that makes sense
for the typical case of opening a file:

if encoding is not None:
   use encoding
elif file has BOM:
   use BOM
else:
   use system default

And hence an encoding='BOM' isn't needed there. Although I'm trying to
come up with use cases that don't work with this, I can't. :)

BUT

When writing, things are not so easy though. Apparently some encodings
require a BOM to be written, others do not require it but allow it, and
some have no byte order mark at all. So there you have to be able to write
the BOM, or not. And that's either a new parameter, because you can't use
encoding='BOM' since you need to specify the encoding as well, or a
new method.
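
For reference, this is how the existing codecs behave when encoding, shown as a small check rather than a proposal: "utf-16" writes a BOM, "utf-16-le" does not, and for UTF-8 the choice is made by picking "utf-8" or "utf-8-sig".

```python
import codecs

utf16 = "a".encode("utf-16")        # BOM written, native byte order
utf16_le = "a".encode("utf-16-le")  # explicit byte order, no BOM
utf8 = "a".encode("utf-8")          # no signature
utf8_sig = "a".encode("utf-8-sig")  # UTF-8 signature prepended
```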

I would suggest a BOM parameter, and maybe a method as well.

BOM=None|True|False

Where "None" means a sane default behaviour, that is, write a BOM if
the encoding requires it.
"True" means write a BOM if the encoding *supports* it.
"False" means don't write a BOM even if the encoding requires it
(because I know what I'm doing).

if 'w' in mode: # But not 'r' or 'a'
    if BOM == True and encoding in (ENCODINGS THAT ALLOW BOM):
        write_bom = True
    elif BOM == False:
        write_bom = False
    elif BOM == None and encoding in (ENCODINGS THAT REQUIRE BOM):
        write_bom = True
    else:
        write_bom = False
else:
    write_bom = False
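
The pseudocode above can be turned into a runnable sketch. The two sets are assumptions made for illustration only, since the thread does not say which encodings fall into which group, and should_write_bom is a hypothetical name.

```python
import codecs

# Assumed classification; the thread leaves both sets unspecified.
REQUIRE_BOM = {"utf-16", "utf-32"}
ALLOW_BOM = REQUIRE_BOM | {"utf-8"}


def should_write_bom(mode, encoding, bom=None):
    """Decide whether a BOM would be written, following the flow above."""
    if "w" not in mode:              # never for 'r' or 'a'
        return False
    name = codecs.lookup(encoding).name   # canonical codec name, e.g. 'utf-16'
    if bom is True:
        return name in ALLOW_BOM
    if bom is False:
        return False
    return name in REQUIRE_BOM       # bom is None: the "sane default"
```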

For reading, this parameter could either be a no-op, or possibly change
the behavior somehow, if a use case where that makes sense can be
imagined.

-- 
Lennart Regebro: http://regebro.wordpress.com/
Python 3 Porting: http://python-incompatibility.googlecode.com/
+33 661 58 14 64

Re: [Python-Dev] Quick sum up about open() + BOM

2010-01-08 Thread Glenn Linderman

On approximately 1/8/2010 5:12 PM, came the following characters from 
the keyboard of MRAB:

> Glenn Linderman wrote:
>> On approximately 1/8/2010 3:59 PM, came the following characters from
>> the keyboard of Victor Stinner:
>>> Hi,
>>>
>>> Thanks for all the answers! I will try to sum up all ideas here.
>>
>> One concern I have with this implementation encoding="BOM" is that if
>> there is no BOM it assumes UTF-8.  That is probably a good assumption
>> in some circumstances, but not in others.
>>
>> * It is not required that UTF-16LE, UTF-16BE, UTF-32LE, or UTF-32BE
>> encoded files include a BOM.  It is only required that UTF-16 and
>> UTF-32 (cases where the endianness is unspecified) contain a BOM.
>> Hence, it might be that someone would expect a UTF-16LE (or any of
>> the formats that don't require a BOM, rather than UTF-8), but be
>> willing to accept any BOM-discriminated format.
>>
>> * Potentially, this could be expanded beyond the various Unicode
>> encodings... one could envision that a program whose data files
>> historically were in any particular national language locale, could
>> want to be enhance to accept Unicode, and could declare that they
>> will accept any BOM-discriminated format, but want to default, in the
>> absence of a BOM, to the original national language locale that they
>> historically accepted.  That would provide a migration path for their
>> old data files.
>>
>> So the point is, that it might be nice to have
>> "BOM-otherEncodingForDefault" for each other encoding that Python
>> supports.  Not sure that is the right API, but I think it is
>> expressive enough to handle the cases above.  Whether the cases solve
>> actual problems or not, I couldn't say, but they seem like reasonable
>> cases.
>>
>> It would, of course, be nicest if OS metadata had been invented way
>> back when, for all OSes, such that all text files were flagged with
>> their encoding... then languages could just read the encoding and do
>> the right thing! But we live in the real world, instead.
>
> What about listing the possible encodings? It would try each in turn
> until it found one where the BOM matched or had no BOM:
>
>     my_file = open(filename, 'r', encoding='UTF-8-sig|UTF-16|UTF-8')
>
> or is that taking it too far?


That sounds very flexible -- but in net effect it would only make 
illegal a subset of the BOM-containing encodings (those not listed) 
without making legal any additional encodings other than the non-BOM 
encoding.  Whether prohibiting a subset of BOM-containing encodings is a 
useful use case, I couldn't say... but my goal would be to include as 
many different file encodings on input as possible: without a BOM, that 
is exactly 1 (unless there are other heuristics), with a BOM, it is 
1+all-BOM-containing encodings.  Your scheme would permit numbers of 
encodings accepted to vary between 1 and 1+all-BOM-containing encodings.


(I think everyone can agree there are 5 different byte sequences that 
can be called a Unicode BOM.  The likelihood of them appearing in any 
other text encoding created by mankind depends on those other encodings 
-- but it is not impossible.  It is truly up to the application to 
decide whether BOM detection could potentially conflict with files in 
some other encoding that would be acceptable to the application.)


So I think it is taking it further than I can see value in, but I'm 
willing to be convinced otherwise.  I see only a need for detecting BOM, 
and specifying a default encoding to be used if there is no BOM.  Note 
that it might be nice to have a specification for using current 
encoding=None heuristic -- perhaps encoding="BOM-None" per my originally 
proposed syntax.  But I'm still not saying that is the best syntax.


--
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


Re: [Python-Dev] Quick sum up about open() + BOM

2010-01-08 Thread Martin v. Löwis

>>> Antoine would like to check BOM by default, because both options
>>> (system locale vs checking for BOM) is the same thing.
>>> 
>> To be clear, I am not saying it is the same thing. What I think is 
>> that it would be a mistake to use a mildly unreliable heuristic by
>> default (the locale + device encoding heuristic) but refuse to
>> trust a more reliable heuristic (the BOM-based detection
>> algorithm).
>> 
> 
> I concur. On Windows both UTF-8 and signature are very common, yet
> the platform default is the truly awful CP1252.

While I would support combining BOM detection in the case where a file
is opened for reading and no encoding is specified, I see two problems:
a) if a seek operation is performed before having looked at the BOM,
   no determination would have been made
b) what encoding should it use on writing?

Regards,
Martin


Re: [Python-Dev] Quick sum up about open() + BOM

2010-01-08 Thread MRAB


Glenn Linderman wrote:
> On approximately 1/8/2010 3:59 PM, came the following characters from
> the keyboard of Victor Stinner:
>> Hi,
>>
>> Thanks for all the answers! I will try to sum up all ideas here.
>
> One concern I have with this implementation encoding="BOM" is that if
> there is no BOM it assumes UTF-8.  That is probably a good assumption in
> some circumstances, but not in others.
>
> * It is not required that UTF-16LE, UTF-16BE, UTF-32LE, or UTF-32BE
> encoded files include a BOM.  It is only required that UTF-16 and UTF-32
> (cases where the endianness is unspecified) contain a BOM.  Hence, it
> might be that someone would expect a UTF-16LE (or any of the formats
> that don't require a BOM, rather than UTF-8), but be willing to accept
> any BOM-discriminated format.
>
> * Potentially, this could be expanded beyond the various Unicode
> encodings... one could envision that a program whose data files
> historically were in any particular national language locale, could want
> to be enhance to accept Unicode, and could declare that they will accept
> any BOM-discriminated format, but want to default, in the absence of a
> BOM, to the original national language locale that they historically
> accepted.  That would provide a migration path for their old data files.
>
> So the point is, that it might be nice to have
> "BOM-otherEncodingForDefault" for each other encoding that Python
> supports.  Not sure that is the right API, but I think it is expressive
> enough to handle the cases above.  Whether the cases solve actual
> problems or not, I couldn't say, but they seem like reasonable cases.
>
> It would, of course, be nicest if OS metadata had been invented way back
> when, for all OSes, such that all text files were flagged with their
> encoding... then languages could just read the encoding and do the right
> thing! But we live in the real world, instead.

What about listing the possible encodings? It would try each in turn
until it found one where the BOM matched or had no BOM:

my_file = open(filename, 'r', encoding='UTF-8-sig|UTF-16|UTF-8')

or is that taking it too far?

Re: [Python-Dev] Quick sum up about open() + BOM

2010-01-08 Thread Glenn Linderman

On approximately 1/8/2010 3:59 PM, came the following characters from 
the keyboard of Victor Stinner:

> Hi,
>
> Thanks for all the answers! I will try to sum up all ideas here.

One concern I have with this implementation encoding="BOM" is that if 
there is no BOM it assumes UTF-8.  That is probably a good assumption in 
some circumstances, but not in others.


* It is not required that UTF-16LE, UTF-16BE, UTF-32LE, or UTF-32BE 
encoded files include a BOM.  It is only required that UTF-16 and UTF-32 
(cases where the endianness is unspecified) contain a BOM.  Hence, it 
might be that someone would expect a UTF-16LE (or any of the formats 
that don't require a BOM, rather than UTF-8), but be willing to accept 
any BOM-discriminated format.


* Potentially, this could be expanded beyond the various Unicode 
encodings... one could envision that a program whose data files 
historically were in any particular national language locale, could want 
to be enhanced to accept Unicode, and could declare that they will accept 
any BOM-discriminated format, but want to default, in the absence of a 
BOM, to the original national language locale that they historically 
accepted.  That would provide a migration path for their old data files.


So the point is, that it might be nice to have 
"BOM-otherEncodingForDefault" for each other encoding that Python 
supports.  Not sure that is the right API, but I think it is expressive 
enough to handle the cases above.  Whether the cases solve actual 
problems or not, I couldn't say, but they seem like reasonable cases.


It would, of course, be nicest if OS metadata had been invented way back 
when, for all OSes, such that all text files were flagged with their 
encoding... then languages could just read the encoding and do the right 
thing! But we live in the real world, instead.


--
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


Re: [Python-Dev] Quick sum up about open() + BOM

2010-01-08 Thread Michael Foord


On 09/01/2010 00:09, Antoine Pitrou wrote:
> Hello Victor,
>
> Victor Stinner writes:
>> (1) Change default open() behaviour or make it optional?
>
> [...]
>
>> Antoine would like to check BOM by default, because both options (system
>> locale vs checking for BOM) is the same thing.
>
> To be clear, I am not saying it is the same thing. What I think is that it would
> be a mistake to use a mildly unreliable heuristic by default (the locale +
> device encoding heuristic) but refuse to trust a more reliable heuristic (the
> BOM-based detection algorithm).

I concur. On Windows both UTF-8 and signature are very common, yet the 
platform default is the truly awful CP1252.

All the best,

Michael

> Regards
>
> Antoine.

--
http://www.ironpythoninaction.com/
http://www.voidspace.org.uk/blog



Re: [Python-Dev] Quick sum up about open() + BOM

2010-01-08 Thread Antoine Pitrou

Hello Victor,

Victor Stinner writes:
> 
> (1) Change default open() behaviour or make it optional?
> 
[...]
> 
> Antoine would like to check BOM by default, because both options (system 
> locale vs checking for BOM) is the same thing.

To be clear, I am not saying it is the same thing. What I think is that it would
be a mistake to use a mildly unreliable heuristic by default (the locale +
device encoding heuristic) but refuse to trust a more reliable heuristic (the
BOM-based detection algorithm).

Regards

Antoine.


[Python-Dev] Quick sum up about open() + BOM

2010-01-08 Thread Victor Stinner

Hi,

Thanks for all the answers! I will try to sum up all ideas here.


(1) Change default open() behaviour or make it optional?

Guido would like to add an option and keep open() unchanged. He wrote that 
checking for a BOM and using the system locale are too different to be the 
same option (encoding=None).

Antoine would like to check for a BOM by default, because both options (system 
locale vs checking for BOM) are the same thing.

About Antoine's choice (encoding=None): which file modes would check for a BOM? 
I would like to answer only the read-only mode, but then open(filename, "r") 
and open(filename, "r+") would behave differently?

  => 1 point for Guido (encoding="BOM" is more explicit)

Antoine's choice has the advantage of directly supporting UTF-8+BOM, UTF-16 and 
UTF-32 for all applications and all modules using open(filename).

  => 1 point for Antoine


(2) Check for a BOM while reading or detect it before?

Everybody agrees that checking for a BOM is an interesting option and should 
not be limited to open().

Marc-Andre proposed a codecs.guess_file_encoding() function accepting a file 
name or a binary file-like object: it returns the encoding and seeks to the 
file start or just after the BOM.

I dislike this function because it requires extra file operations (open 
(optional), read() and seek()) and it doesn't work if the file is not seekable 
(eg. a pipe). I prefer to check for a BOM at first read in TextIOWrapper to 
avoid extra file operations.

Note: I implemented the BOM check in TextIOWrapper; so it's already usable for 
any file-like object.


(3) tell() and seek() on a text file starting with a BOM

To be consistent with Antoine's example:

   >>> bio = io.BytesIO(b'\xff\xfea\x00b\x00')
   >>> f = io.TextIOWrapper(bio, encoding='utf-16')
   >>> f.read()
   'ab'
   >>> f.seek(0)
   0
   >>> f.read()
   'ab'

TextIOWrapper:

 * tell() should return zero at file start,
 * seek(0) should go back to the file start,
 * and the BOM should always be "ignored".

I mean:

  with open("utf8bom.txt", encoding="BOM") as fp:
      assert fp.tell() == 0
      text = fp.read() # no BOM here
      fp.seek(0)
      assert fp.read() == text
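
For comparison, the same expectations can be checked today against the existing utf-8-sig codec, which skips the signature on its own. This is only a sketch using an in-memory file; encoding="BOM" itself is the proposal and does not exist.

```python
import codecs
import io

bio = io.BytesIO(codecs.BOM_UTF8 + "abc".encode("utf-8"))
fp = io.TextIOWrapper(bio, encoding="utf-8-sig")

start = fp.tell()    # position before anything has been read
text = fp.read()     # the signature is not part of the result
fp.seek(0)           # rewinding resets the decoder, so the BOM is skipped again
text_again = fp.read()
```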

--

About my patch:

 - BOM check is explicit: open(filename, encoding="BOM")
 - tell() / seek(0) works as expected
 - BOM bytes are always skipped in read() / readlines() result

Hum, I don't know if this email can be called a sum up ;-)

-- 
Victor Stinner
http://www.haypocalc.com/
