[issue25325] UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE encodings don't add/remove BOM on encode/decode

2015-10-06 Thread R. David Murray

Changes by R. David Murray :


--
resolution:  -> not a bug
status: open -> closed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue25325] UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE encodings don't add/remove BOM on encode/decode

2015-10-06 Thread Daniel Blanchard

Daniel Blanchard added the comment:

Thanks for straightening me out there! I had not noticed this in the Unicode 
FAQ before:

>  Where the data has an associated type, such as a field in a database, a BOM 
> is unnecessary. In particular, if a text data stream is marked as UTF-16BE, 
> UTF-16LE, UTF-32BE or UTF-32LE, a BOM is neither necessary nor permitted. Any 
> U+FEFF would be interpreted as a ZWNBSP.

Anyway, the thing that brought this up is that in chardet we detect codecs of 
files for people, and we've been returning UTF-16BE or UTF-16LE when we detect 
a BOM at the front of the file; but we recently learned that if people try 
to decode with those codecs, things don't work as expected.  It seems the 
correct behavior in our case is to just return UTF-16 in these cases.
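For what it's worth, the BOM-sniffing side of that can be sketched roughly like this (a simplified illustration, not chardet's actual code; the function name is hypothetical):

```python
import codecs

def sniff_bom_codec(data):
    """Return an endian-agnostic codec name for BOM-prefixed data, or None.

    Returning 'UTF-16'/'UTF-32' (rather than 'UTF-16LE' etc.) means a
    subsequent decode will consume the BOM instead of leaving U+FEFF in
    the string.  The UTF-32 checks must come first, because BOM_UTF32_LE
    begins with the same two bytes as BOM_UTF16_LE.
    """
    if data.startswith(codecs.BOM_UTF32_LE) or data.startswith(codecs.BOM_UTF32_BE):
        return 'UTF-32'
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        return 'UTF-16'
    return None
```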

--




[issue25325] UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE encodings don't add/remove BOM on encode/decode

2015-10-06 Thread eryksun

eryksun added the comment:

Yes, if you explicitly use a big-endian or little-endian UTF codec, then you 
need to include a BOM manually if one is required. That said, if a file format 
or data field is specified with a particular byte order, then using a BOM is 
strictly incorrect. See the UTF BOM FAQ:

http://www.unicode.org/faq/utf_bom.html#BOM

For regular text documents, in which the byte order doesn't really matter, use 
the native byte order of your platform via UTF-16 or UTF-32. Also, instead of 
manually encoding strings, use the "encoding" parameter of the built-in open 
function, or io.open or codecs.open in Python 2. This only writes a single BOM, 
even when writing to a file multiple times.
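A minimal illustration of the single-BOM behavior (the file path here is just an example):

```python
import codecs
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'example.txt')

# Writing to a UTF-16 file in several calls: the codec emits the
# BOM once, before the first chunk, not before every write.
with open(path, 'w', encoding='utf-16') as f:
    f.write('foo')   # BOM written here
    f.write('bar')   # no second BOM

with open(path, 'rb') as f:
    raw = f.read()

# Exactly one BOM (native byte order) precedes the text.
assert raw[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
assert len(raw) == 2 + len('foobar') * 2
assert raw.decode('utf-16') == 'foobar'   # decoding strips the BOM
```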

--
nosy: +eryksun
resolution:  -> not a bug
stage:  -> resolved
status: open -> closed




[issue25325] UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE encodings don't add/remove BOM on encode/decode

2015-10-06 Thread R. David Murray

R. David Murray added the comment:

eryksun beat me to the answer, but I'm going to post mine anyway :)

If I understand the codecs docs correctly, this is because if you are 
specifying the endianness you want, it is a sign that you are only going to 
interpret the data as that endianness, so there's no need for a BOM.  If you 
want a BOM, use utf-16/32.

In short, what is your use case for producing a UTF string with non-native byte 
order?  But as eryksun said, the Python supported way to do that and include a 
BOM is to write the BOM yourself.

--
nosy: +lemburg, r.david.murray -eryksun
resolution: not a bug -> 
stage: resolved -> 
status: closed -> open




[issue25325] UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE encodings don't add/remove BOM on encode/decode

2015-10-06 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

Just to add some more background:

The LE and BE codecs are meant to be used when you already know the endianness 
of the data you are targeting, e.g. in case you work on strings that were 
read after the initial BOM, or write to an output stream in chunks after having 
written the initial BOM. As such, they don't treat the BOM specially: since it 
is a valid code point, they pass it through as-is.

If you do want BOM handling, the UTF-16 codec is the right choice. It defaults 
to the platform's endianness and uses the BOM to indicate which choice it made.
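A small sketch of that chunked-output pattern, assuming one deliberately wants big-endian output regardless of platform:

```python
import codecs

chunks = ['foo', 'bar']

# Emit the BOM once ourselves, then encode each chunk with the
# endian-specific codec, which passes data through without adding
# (or stripping) any BOM of its own.
out = bytearray(codecs.BOM_UTF16_BE)
for chunk in chunks:
    out += chunk.encode('utf-16-be')

result = bytes(out)
assert result == b'\xfe\xff\x00f\x00o\x00o\x00b\x00a\x00r'
# A BOM-aware decoder recognises the byte order and drops U+FEFF:
assert result.decode('utf-16') == 'foobar'
```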

--




[issue25325] UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE encodings don't add/remove BOM on encode/decode

2015-10-06 Thread R. David Murray

Changes by R. David Murray :


--
stage:  -> resolved




[issue25325] UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE encodings don't add/remove BOM on encode/decode

2015-10-06 Thread Daniel Blanchard

New submission from Daniel Blanchard:

As I recently discovered when someone filed a PR on chardet (see 
https://github.com/chardet/chardet/issues/70), BOMs are not handled correctly 
by the endian-specific encodings UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE, 
but they are by the UTF-16 and UTF-32 encodings.

For example:

>>> 'foo'.encode('utf-16le')
b'f\x00o\x00o\x00'
>>> 'foo'.encode('utf-16')
b'\xff\xfef\x00o\x00o\x00'

You can see that when using UTF-16 (instead of UTF-16LE), you get the BOM 
correctly prepended to the bytes.

If you were on a little endian system and purposefully wanted to create a 
UTF-16BE file, the only way to do it is:

>>> codecs.BOM_UTF16_BE + 'foo'.encode('utf-16be')
b'\xfe\xff\x00f\x00o\x00o'

This doesn't make a lot of sense to me.  Why is the BOM not prepended 
automatically when encoding with UTF-16BE?

Furthermore, if you were given a UTF-16BE file on a little endian system, you 
might think that this would be the correct way to decode it:

>>> (codecs.BOM_UTF16_BE + 'foo'.encode('utf-16be')).decode('utf-16be')
'\ufefffoo'

but as you can see, that leaves the BOM in the decoded string.  Strangely, 
however, decoding with UTF-16 works fine:

>>> (codecs.BOM_UTF16_BE + 'foo'.encode('utf-16be')).decode('utf-16')
'foo'

It seems to me that the endian-specific versions of UTF-16 and UTF-32 should be 
adding/removing the appropriate BOMs, and this is a long-standing bug.

--
components: Unicode
messages: 252406
nosy: Daniel.Blanchard, ezio.melotti, haypo
priority: normal
severity: normal
status: open
title: UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE encodings don't add/remove 
BOM on encode/decode
versions: Python 2.7, Python 3.2, Python 3.3, Python 3.4, Python 3.5, Python 3.6
