Re: [Python-Dev] Unicode byte order mark decoding

2005-04-07 Thread Nicholas Bastin
On Apr 7, 2005, at 5:07 AM, M.-A. Lemburg wrote:
The current implementation of the utf-16 codecs makes for some
irritating gymnastics to write the BOM into the file before reading it
if it contains no BOM, which seems quite like a bug in the codec.
The codec writes a BOM in the first call to .write() - it
doesn't write a BOM before reading from the file.
Yes, see, I read a *lot* of UTF-16 that comes from other sources.  It's 
not a matter of writing with python and reading with python.

--
Nick
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Unicode byte order mark decoding

2005-04-07 Thread M.-A. Lemburg
Nicholas Bastin wrote:
 
 On Apr 7, 2005, at 5:07 AM, M.-A. Lemburg wrote:
 
 The current implementation of the utf-16 codecs makes for some
 irritating gymnastics to write the BOM into the file before reading it
 if it contains no BOM, which seems quite like a bug in the codec.


 The codec writes a BOM in the first call to .write() - it
 doesn't write a BOM before reading from the file.
 
 
 Yes, see, I read a *lot* of UTF-16 that comes from other sources.  It's
 not a matter of writing with python and reading with python.

Ok, but I don't really follow you here: you are suggesting to
relax the current UTF-16 behavior and to start defaulting to
UTF-16-BE if no BOM is present - that's most likely going to
cause more problems than it seems to solve: namely complete
garbage if the data turns out to be UTF-16-LE encoded and,
what's worse, enters the application undetected.

If you do have UTF-16 without a BOM mark it's much better
to let a short function analyze the text by reading the first
few bytes of the file and then make an educated guess based
on the findings. You can then process the file using one
of the other codecs UTF-16-LE or -BE.
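A guessing function of the kind described here might be sketched as follows (modern Python 3 syntax; the heuristic that the text is mostly ASCII, so NUL bytes cluster on one side of each 16-bit unit, is an assumption for illustration, not something specified in the thread):

```python
def guess_utf16_byteorder(data):
    """Guess the byte order of BOM-less UTF-16 data (heuristic sketch)."""
    sample = data[:1024]
    # For mostly-ASCII text, UTF-16-BE puts the NUL in the even (high)
    # byte of each code unit, UTF-16-LE in the odd (low) byte.
    even_zeros = sample[0::2].count(0)
    odd_zeros = sample[1::2].count(0)
    return "utf-16-be" if even_zeros >= odd_zeros else "utf-16-le"

raw = "hello".encode("utf-16-le")
assert raw.decode(guess_utf16_byteorder(raw)) == "hello"
```

The caller can then hand the guessed name to the ordinary UTF-16-LE or -BE codec, exactly as suggested above.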

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Apr 07 2005)
 Python/Zope Consulting and Support ...http://www.egenix.com/
 mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
 mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/


::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! 
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Unicode byte order mark decoding

2005-04-07 Thread Nicholas Bastin
On Apr 7, 2005, at 11:35 AM, M.-A. Lemburg wrote:
Ok, but I don't really follow you here: you are suggesting to
relax the current UTF-16 behavior and to start defaulting to
UTF-16-BE if no BOM is present - that's most likely going to
cause more problems than it seems to solve: namely complete
garbage if the data turns out to be UTF-16-LE encoded and,
what's worse, enters the application undetected.
The crux of my argument is that the spec declares that UTF-16 without a 
BOM is BE.  If the file is encoded in UTF-16LE and it doesn't have a 
BOM, it doesn't deserve to be processed correctly.  That being said, 
treating it as UTF-16BE if it's LE will result in a lot of invalid code 
points, so it should be obvious that something has gone wrong.

If you do have UTF-16 without a BOM mark it's much better
to let a short function analyze the text by reading the first
few bytes of the file and then make an educated guess based
on the findings. You can then process the file using one
of the other codecs UTF-16-LE or -BE.
This is about what we do now - we catch UnicodeError and then add a BOM 
to the file, and read it again.  We know our files are UTF-16BE if they 
don't have a BOM, as the files are written by code which observes the 
spec.  We can't use UTF-16BE all the time, because sometimes they're 
UTF-16LE, and in those cases the BOM is set.

It would be nice if you could optionally specify that the codec would 
assume UTF-16BE if no BOM was present, and not raise UnicodeError in 
that case, which would preserve the current behaviour as well as allow 
users to ask for behaviour which conforms to the standard.

I'm not saying that you can't work around the issue now, what I'm 
saying is that you shouldn't *have* to - I think there is a reasonable 
expectation that the UTF-16 codec conforms to the spec, and if you 
wanted it to do something else, it is those users who should be forced 
to come up with a workaround.
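The behaviour being asked for here can be sketched as a small wrapper (modern Python 3 syntax; the helper name is made up for illustration):

```python
def decode_utf16_per_spec(data):
    """Decode UTF-16 the way the Unicode spec describes: honor a BOM if
    present, otherwise assume big-endian (sketch of the requested behaviour)."""
    if data[:2] in (b"\xfe\xff", b"\xff\xfe"):
        return data.decode("utf-16")   # the codec reads and strips the BOM
    return data.decode("utf-16-be")    # no BOM: spec says big-endian

assert decode_utf16_per_spec(b"\xff\xfeh\x00i\x00") == "hi"  # LE with BOM
assert decode_utf16_per_spec(b"\x00h\x00i") == "hi"          # no BOM: BE
```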

--
Nick


Re: [Python-Dev] Unicode byte order mark decoding

2005-04-07 Thread Walter Dörwald
Nicholas Bastin sagte:

 On Apr 7, 2005, at 11:35 AM, M.-A. Lemburg wrote:

 [...]
 If you do have UTF-16 without a BOM mark it's much better
 to let a short function analyze the text by reading the first
 few bytes of the file and then make an educated guess based
 on the findings. You can then process the file using one
 of the other codecs UTF-16-LE or -BE.

 This is about what we do now - we catch UnicodeError and
 then add a BOM  to the file, and read it again.  We know
 our files are UTF-16BE if they  don't have a BOM, as the
 files are written by code which observes the  spec.
 We can't use UTF-16BE all the time, because sometimes
 they're UTF-16LE, and in those cases the BOM is set.

 It would be nice if you could optionally specify that the
 codec would assume UTF-16BE if no BOM was present,
 and not raise UnicodeError in  that case, which would
 preserve the current behaviour as well as allow users
 to ask for behaviour which conforms to the standard.

It should be feasible to implement your own codec for that
based on Lib/encodings/utf_16.py. Simply replace the line
in StreamReader.decode():
   raise UnicodeError, "UTF-16 stream does not start with BOM"
with:
   self.decode = codecs.utf_16_be_decode
and you should be done.

 [...]

Bye,
   Walter Dörwald





Re: [Python-Dev] Unicode byte order mark decoding

2005-04-07 Thread Martin v. Löwis
Nicholas Bastin wrote:
 It would be nice if you could optionally specify that the codec would
 assume UTF-16BE if no BOM was present, and not raise UnicodeError in
 that case, which would preserve the current behaviour as well as allow
 users to ask for behaviour which conforms to the standard.

Alternatively, the UTF-16BE codec could support the BOM, and do
UTF-16LE if the other BOM is found.

This would also support your use case, and in a better way. The
Unicode assertion that UTF-16 is BE by default is void these
days - there is *always* a higher layer protocol, and it more
often than not specifies (perhaps not in English words, but
only in the source code of the generator) that the default should
be LE.

Regards,
Martin


Re: [Python-Dev] Unicode byte order mark decoding

2005-04-06 Thread Martin v. Löwis
Stephen J. Turnbull wrote:
Of course it must be supported.  My point is that many strings (in my
applications, all but those strings that result from slurping in a
file or process output in one go -- example, not a statistically valid
sample!) are not the beginning of what once was a stream.  It is
error-prone (not to mention unaesthetic) to not make that distinction.
Explicit is better than implicit.
I can't put these two paragraphs together. If you think that explicit
is better than implicit, why do you not want to make different calls
for the first chunk of a stream, and the subsequent chunks?
>>> s=cStringIO.StringIO()
>>> s1=codecs.getwriter("utf-8")(s)
>>> s1.write(u"Hallo")
>>> s.getvalue()
'Hallo'
Yes!  Exactly (except in reverse, we want to _read_ from the slurped
stream-as-string, not write to one)!  ... and there's no need for a
utf-8-sig codec for strings, since you can support the usage in
exactly this way.
However, if there is a utf-8-sig codec for streams, there is currently
no way of *preventing* this codec from also being available for strings. The
very same code is used for streams and for strings, and automatically
so.
Regards,
Martin


Re: [Python-Dev] Unicode byte order mark decoding

2005-04-06 Thread Walter Dörwald
Martin v. Löwis sagte:
 Walter Dörwald wrote:
 There are situations where the byte stream might be temporarily
 exhausted, e.g. an XML parser that tries to support the
 IncrementalParser interface, or when you want to decode
 encoded data piecewise, because you want to give a progress
 report.

 Yes, but these are not file-like objects.

True, on the outside there are no file-like objects. But the
IncrementalParser gets passed the XML bytes in chunks,
so it has to use a stateful decoder for decoding. Unfortunately
this means that it has to use a stream API. (See
http://www.python.org/sf/1101097 for a patch that somewhat
fixes that.)

(Another option would be to completely ignore the stateful API
and handcraft stateful decoding (or only support stateless
decoding), like most XML parsers for Python do now.)

 In the IncrementalParser,
 it is *not* the case that a read operation returns an empty
 string. Instead, the application repeatedly feeds data explicitly.

That's true, but the parser has to wrap this data into an object
that can be passed to the StreamReader constructor. (See the
Queue class in Lib/test/test_codecs.py for an example.)

 For a file-like object, returning "" indicates EOF.

Not necessarily. In the example above the IncrementalParser
gets fed a chunk of data, it stuffs this data into the Queue,
so that the StreamReader can decode it. Once the data
from the Queue is exhausted, there won't be any further
data until the user calls feed() on the IncrementalParser again.
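In today's codecs module, this kind of piecewise, stateful decoding is exposed directly through incremental decoders, which avoids wrapping chunks in a file-like Queue object; a minimal sketch:

```python
import codecs

# A multi-byte UTF-8 sequence split across two chunks: the decoder
# buffers the incomplete tail instead of raising an error.
decoder = codecs.getincrementaldecoder("utf-8")()
chunks = [b"Stra\xc3", b"\x9fe"]          # "Straße" split mid-character
out = "".join(decoder.decode(chunk) for chunk in chunks)
out += decoder.decode(b"", final=True)     # flush any buffered state
assert out == "Stra\u00dfe"
```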

Bye,
   Walter Dörwald





Re: [Python-Dev] Unicode byte order mark decoding

2005-04-06 Thread Walter Dörwald
Stephen J. Turnbull wrote:
Martin == Martin v Löwis [EMAIL PROTECTED] writes:
Martin I can't put these two paragraphs together. If you think
Martin that explicit is better than implicit, why do you not want
Martin to make different calls for the first chunk of a stream,
Martin and the subsequent chunks?
Because the signature/BOM is not a chunk, it's a header.  Handling the
signature/BOM is part of stream initialization, not translation, to my
mind.
The point is that explicitly using a stream shows that initialization
(and finalization) matter.  The default can be BOM or not, as a
pragmatic matter.  But then the stream data itself can be treated
homogeneously, as implied by the notion of stream.
I think it probably also would solve Walter's conundrum about
buffering the signature/BOM if responsibility for that were moved out
of the codecs and into the objects where signatures make sense.
Not really. In every encoding where a sequence of more than one byte 
maps to one Unicode character, you will always need some kind of 
buffering. If we remove the handling of initial BOMs from the codecs 
(except for UTF-16 where it is required), this wouldn't change any 
buffering requirements.

I don't know whether that's really feasible in the short run---I
suspect there may be a lot of stream-like modules that would need to
be updated---but it would be saner in the long run.
I'm not exactly sure, what you're proposing here. That all codecs (even 
UTF-16) pass the BOM through and some other infrastructure is 
responsible for dropping it?

[...]
Bye,
   Walter Dörwald


Re: [Python-Dev] Unicode byte order mark decoding

2005-04-06 Thread Nicholas Bastin
On Apr 5, 2005, at 6:19 AM, M.-A. Lemburg wrote:
Note that the UTF-16 codec is strict w/r to the presence
of the BOM mark: you get a UnicodeError if a stream does
not start with a BOM mark. For the UTF-8-SIG codec, this
should probably be relaxed to not require the BOM.
I've actually been confused about this point for quite some time now, 
but never had a chance to bring it up.  I do not understand why 
UnicodeError should be raised if there is no BOM.  I know that PEP-100 
says:

'utf-16': 16-bit variable length encoding (little/big 
endian)

and:
Note: 'utf-16' should be implemented by using and requiring byte order 
marks (BOM) for file input/output.

But this appears to be in error, at least in the current unicode 
standard.  'utf-16', as defined by the unicode standard, is big-endian 
in the absence of a BOM:

---
3.10.D42:  UTF-16 encoding scheme:
...
* The UTF-16 encoding scheme may or may not begin with a BOM.  However, 
when there is no BOM, and in the absence of a higher-level protocol, 
the byte order of the UTF-16 encoding scheme is big-endian.
---

The current implementation of the utf-16 codecs makes for some 
irritating gymnastics to write the BOM into the file before reading it 
if it contains no BOM, which seems quite like a bug in the codec.  I 
allow for the possibility that this was ambiguous in the standard when 
the PEP was written, but it is certainly not ambiguous now.

--
Nick


Re: [Python-Dev] Unicode byte order mark decoding

2005-04-06 Thread Stephen J. Turnbull
 Walter == Walter Dörwald [EMAIL PROTECTED] writes:

Walter Not really. In every encoding where a sequence of more
Walter than one byte maps to one Unicode character, you will
Walter always need some kind of buffering. If we remove the
Walter handling of initial BOMs from the codecs (except for
Walter UTF-16 where it is required), this wouldn't change any
Walter buffering requirements.

Sure.  My point is that codecs should be stateful only to the extent
needed to assemble semantically meaningful units (ie, multioctet coded
characters).  In particular, they should not need to know about
location at the beginning, middle, or end of some stream---because in
the context of operating on a string they _can't_.

 I don't know whether that's really feasible in the short
 run---I suspect there may be a lot of stream-like modules that
 would need to be updated---but it would be saner in the long
 run.

Walter I'm not exactly sure, what you're proposing here. That all
Walter codecs (even UTF-16) pass the BOM through and some other
Walter infrastructure is responsible for dropping it?

Not exactly.  I think that at the lowest level codecs should not
implement complex mode-switching internally, but rather explicitly
abdicate responsibility to a more appropriate codec.

For example, autodetecting UTF-16 on input would be implemented by a
Python program that does something like

data = stream.read()
for detector in [ "utf-16-signature", "utf-16-statistical" ]:
    # for the UTF-16 detectors, OUT will always be u"" or None
    out, data, codec = data.decode(detector)
    if codec: break
while codec:
    more_out, data, codec = data.decode(codec)
    out = out + more_out
if data:
    # a real program would complain about it
    pass
process(out)

where decode("utf-16-signature") would be implemented

def utf_16_signature_internal(data):
    if data[0:2] == "\xfe\xff":
        return (u"", data[2:], "utf-16-be")
    elif data[0:2] == "\xff\xfe":
        return (u"", data[2:], "utf-16-le")
    else:
        # note: data is undisturbed if the detector fails
        return (None, data, None)

The main point is that the detector is just a codec that stops when it
figures out what the next codec should be, touches only data that
would be incorrect to pass to the next codec, and leaves the data
alone if detection fails.  utf-16-signature only handles the BOM (if
present), and does not handle arbitrary chunks of data.  Instead, it
passes on the rest of the data (including the first chunk) to be
handled by the appropriate utf-16-?e codec.

I think that the temptation to encapsulate this logic in a utf-16
codec that simplifies things by calling the appropriate utf-16-?e
codec itself should be deprecated, but YMMV.  What I would really like
is for the above style to be easier to achieve than it currently is.
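A runnable approximation of this detector-chain style, using plain functions instead of the hypothetical tuple-returning `data.decode(detector)` API (which ordinary byte strings do not support), might look like this:

```python
def utf16_signature_detector(data):
    # Detector sketch: consume only the BOM and name the next codec.
    if data[:2] == b"\xfe\xff":
        return "", data[2:], "utf-16-be"
    if data[:2] == b"\xff\xfe":
        return "", data[2:], "utf-16-le"
    return None, data, None        # data is undisturbed on failure

def decode_with_detectors(data, detectors):
    for detect in detectors:
        out, rest, codec = detect(data)
        if codec:
            return out + rest.decode(codec)
    raise UnicodeError("no detector matched")

assert decode_with_detectors(b"\xfe\xff\x00h\x00i",
                             [utf16_signature_detector]) == "hi"
```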

BTW, I appreciate your patience in exploring this; after Martin's
remark about different mental models I have to suspect this approach
is just somehow un-Pythonic, but fleshing it out this way I can see
how it will be useful in the context of a different project.

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of TsukubaTennodai 1-1-1 Tsukuba 305-8573 JAPAN
   Ask not how you can do free software business;
  ask what your business can do for free software.


Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Stephen J. Turnbull
 MAL == M  [EMAIL PROTECTED] writes:

MAL The BOM (byte order mark) was a non-standard Microsoft
MAL invention to detect Unicode text data as such (MS always uses
MAL UTF-16-LE for Unicode text files).

The Japanese memopado (Notepad) uses UTF-8 signatures; it even adds
them to existing UTF-8 files lacking them.

MAL -1; there's no standard for UTF-8 BOMs - adding it to the
MAL codecs module was probably a mistake to begin with. You
MAL usually only get UTF-8 files with BOM marks as the result of
MAL recoding UTF-16 files into UTF-8.

There is a standard for UTF-8 _signatures_, however.  I don't have the
most recent version of the ISO-10646 standard, but Amendment 2 (which
defined UTF-8 for ISO-10646) specifically added the UTF-8 signature to
Annex F of that standard.  Evan quotes Version 4 of the Unicode
standard, which explicitly defines the UTF-8 signature.

So there is a standard for the UTF-8 signature, and I know of
applications which produce it.  While I agree with you that Python's
codecs shouldn't produce it (by default), providing an option to strip
is a good idea.

However, this option should be part of the initialization of an IO
stream which produces Unicodes, _not_ an operation on arbitrary
internal strings (whether raw or Unicode).

MAL BTW, how do you know that s came from the start of a file and
MAL not from slicing some already loaded file somewhere in the
MAL middle ?

The programmer or the application might, but Python's codecs don't.
The point is that this is also true of rawstrings that happen to
contain UTF-16 or UTF-32 data.  The UTF-16 (auto-endian) codec
shouldn't strip leading BOMs either, unless it has been told it has
the beginning of the string.

MAL Evan Jones wrote:

 This is *not* a valid Unicode character. The Unicode
 specification (version 4, section 15.8) says the following
 about non-characters:
 
 Applications are free to use any of these noncharacter code
 points internally but should never attempt to exchange
 them. If a noncharacter is received in open interchange, an
 application is not required to interpret it in any way. It is
 good practice, however, to recognize it as a noncharacter and
 to take appropriate action, such as removing it from the
 text. Note that Unicode conformance freely allows the removal
 of these characters. (See C10 in Section 3.2, Conformance
 Requirements.)
 
 My interpretation of the specification means that Python should

The specification _permits_ silent removal; it does not recommend.

 silently remove the character, resulting in a zero length
 Unicode string.  Similarly, both of the following lines should
 also result in a zero length Unicode string:

 '\xff\xfe\xfe\xff'.decode( "utf16" )
 u'\ufffe'
 '\xff\xfe\xff\xff'.decode( "utf16" )
 u'\uffff'

I strongly disagree; these decisions should be left to a higher layer.
In the case of specified UTFs, the codecs should simply invert the UTF
to Python's internal encoding.
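For what it's worth, modern Python took the position argued here: the UTF-16 codec simply inverts the encoding and passes noncharacters through, rather than removing them or raising. A quick check (Python 3 syntax):

```python
# The leading BOM (FF FE, little-endian) is consumed by the utf-16
# codec; the following U+FFFE / U+FFFF noncharacters are decoded
# as-is, not silently removed.
assert b"\xff\xfe\xfe\xff".decode("utf-16") == "\ufffe"
assert b"\xff\xfe\xff\xff".decode("utf-16") == "\uffff"
```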

MAL Hmm, wouldn't it be better to raise an error ? After all, a
MAL reversed BOM mark in the stream looks a lot like you're
MAL trying to decode a UTF-16 stream assuming the wrong byte
MAL order ?!

+1 on (optionally) raising an error.  -1 on removing it or anything
like that, unless under control of the application (ie, the program
written in Python, not Python itself).  It's far too easy for software
to generate broken Unicode streams[1], and the choice of how to deal
with those should be with the application, not with the implementation
language.



Footnotes: 
[1]  An egregious example was the Outlook Express distributed with
early Win2k betas, which produced MIME bodies with apparent
"Content-Type: text/html; charset=utf-16", but the HTML tags and
newlines were 7-bit ASCII!

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of TsukubaTennodai 1-1-1 Tsukuba 305-8573 JAPAN
   Ask not how you can do free software business;
  ask what your business can do for free software.


Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Martin v. Löwis
Stephen J. Turnbull wrote:
So there is a standard for the UTF-8 signature, and I know of
applications which produce it.  While I agree with you that Python's
codecs shouldn't produce it (by default), providing an option to strip
is a good idea.
I would personally like to see an utf-8-bom codec (perhaps better
named utf-8-sig), which strips the BOM on reading (if present)
and generates it on writing.
However, this option should be part of the initialization of an IO
stream which produces Unicodes, _not_ an operation on arbitrary
internal strings (whether raw or Unicode).
With the UTF-8-SIG codec, it would apply to all operation modes of
the codec, whether stream-based or from strings. Whether or not to
use the codec would be the application's choice.
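(Historical note: exactly this codec, named utf-8-sig, did land in Python 2.5 and remains in Python 3; its behaviour matches the description above:)

```python
# The writer prepends the signature; the reader strips it if present.
data = "hello".encode("utf-8-sig")
assert data == b"\xef\xbb\xbfhello"
assert data.decode("utf-8-sig") == "hello"
assert b"hello".decode("utf-8-sig") == "hello"  # BOM is optional on decode
```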
Regards,
Martin


Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread M.-A. Lemburg
Martin v. Löwis wrote:
 Stephen J. Turnbull wrote:
 
 So there is a standard for the UTF-8 signature, and I know of
 applications which produce it.  While I agree with you that Python's
 codecs shouldn't produce it (by default), providing an option to strip
 is a good idea.
 
 I would personally like to see an utf-8-bom codec (perhaps better
 named utf-8-sig), which strips the BOM on reading (if present)
 and generates it on writing.

+1.

 However, this option should be part of the initialization of an IO
 stream which produces Unicodes, _not_ an operation on arbitrary
 internal strings (whether raw or Unicode).
 
 
 With the UTF-8-SIG codec, it would apply to all operation modes of
 the codec, whether stream-based or from strings. Whether or not to
 use the codec would be the application's choice.

I'd suggest to use the same mode of operation as we have in
the UTF-16 codec: it removes the BOM mark on the first call
to the StreamReader .decode() method and writes a BOM mark
on the first call to .encode() on a StreamWriter.

Note that the UTF-16 codec is strict w/r to the presence
of the BOM mark: you get a UnicodeError if a stream does
not start with a BOM mark. For the UTF-8-SIG codec, this
should probably be relaxed to not require the BOM.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Apr 05 2005)
 Python/Zope Consulting and Support ...http://www.egenix.com/
 mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
 mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/


::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! 


Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Walter Dörwald
M.-A. Lemburg wrote:
[...]
With the UTF-8-SIG codec, it would apply to all operation modes of
the codec, whether stream-based or from strings. Whether or not to
use the codec would be the application's choice.
I'd suggest to use the same mode of operation as we have in
the UTF-16 codec: it removes the BOM mark on the first call
to the StreamReader .decode() method and writes a BOM mark
on the first call to .encode() on a StreamWriter.
Note that the UTF-16 codec is strict w/r to the presence
of the BOM mark: you get a UnicodeError if a stream does
not start with a BOM mark. For the UTF-8-SIG codec, this
should probably be relaxed to not require the BOM.
I've started writing such a codec. Making the BOM optional on decoding 
definitely simplifies the implementation.

Bye,
   Walter Dörwald


Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread M.-A. Lemburg
Stephen J. Turnbull wrote:
MAL == M  [EMAIL PROTECTED] writes:
 
 
 MAL The BOM (byte order mark) was a non-standard Microsoft
 MAL invention to detect Unicode text data as such (MS always uses
 MAL UTF-16-LE for Unicode text files).
 
 The Japanese memopado (Notepad) uses UTF-8 signatures; it even adds
 them to existing UTF-8 files lacking them.

Is that a MS application ? AFAIK, notepad, wordpad and MS Office
always use UTF-16-LE + BOM when saving text as Unicode text.

 MAL -1; there's no standard for UTF-8 BOMs - adding it to the
 MAL codecs module was probably a mistake to begin with. You
 MAL usually only get UTF-8 files with BOM marks as the result of
 MAL recoding UTF-16 files into UTF-8.
 
 There is a standard for UTF-8 _signatures_, however.  I don't have the
 most recent version of the ISO-10646 standard, but Amendment 2 (which
 defined UTF-8 for ISO-10646) specifically added the UTF-8 signature to
 Annex F of that standard.  Evan quotes Version 4 of the Unicode
 standard, which explicitly defines the UTF-8 signature.

Ok, as signature the BOM does make some sense - whether to
strip signatures from a document is a good idea or not
is a different matter, though.

Here's the Unicode Cons. FAQ on the subject:

http://www.unicode.org/faq/utf_bom.html#22

They also explicitly warn about adding BOMs to UTF-8 data
since it can break applications and protocols that do not
expect such a signature.

 So there is a standard for the UTF-8 signature, and I know of
 applications which produce it.  While I agree with you that Python's
 codecs shouldn't produce it (by default), providing an option to strip
 is a good idea.
 
 However, this option should be part of the initialization of an IO
 stream which produces Unicodes, _not_ an operation on arbitrary
 internal strings (whether raw or Unicode).

Right.

 MAL BTW, how do you know that s came from the start of a file and
 MAL not from slicing some already loaded file somewhere in the
 MAL middle ?
 
 The programmer or the application might, but Python's codecs don't.
 The point is that this is also true of rawstrings that happen to
 contain UTF-16 or UTF-32 data.  The UTF-16 (auto-endian) codec
 shouldn't strip leading BOMs either, unless it has been told it has
 the beginning of the string.

The UTF-16 stream codecs implement this logic.

The UTF-16 encode and decode functions will however always strip
the BOM mark from the beginning of a string.

If the application doesn't want this stripping to happen,
it should use the UTF-16-LE or -BE codec resp.
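The difference between the auto-detecting codec and its endian-specific variants can be seen directly (modern Python 3 syntax):

```python
# "utf-16" prepends a BOM on encode and consumes it on decode ...
data = "hi".encode("utf-16")
assert data[:2] in (b"\xff\xfe", b"\xfe\xff")   # native-order BOM
assert data.decode("utf-16") == "hi"
# ... while the -LE/-BE codecs pass a leading U+FEFF through untouched.
assert b"\xff\xfeh\x00".decode("utf-16-le") == "\ufeffh"
```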

 MAL Evan Jones wrote:
 
  This is *not* a valid Unicode character. The Unicode
  specification (version 4, section 15.8) says the following
  about non-characters:
  
  Applications are free to use any of these noncharacter code
  points internally but should never attempt to exchange
  them. If a noncharacter is received in open interchange, an
  application is not required to interpret it in any way. It is
  good practice, however, to recognize it as a noncharacter and
  to take appropriate action, such as removing it from the
  text. Note that Unicode conformance freely allows the removal
  of these characters. (See C10 in Section 3.2, Conformance
  Requirements.)
  
  My interpretation of the specification means that Python should
 
 The specification _permits_ silent removal; it does not recommend.
 
  silently remove the character, resulting in a zero length
  Unicode string.  Similarly, both of the following lines should
  also result in a zero length Unicode string:
 
  '\xff\xfe\xfe\xff'.decode( "utf16" )
  u'\ufffe'
  '\xff\xfe\xff\xff'.decode( "utf16" )
  u'\uffff'
 
 I strongly disagree; these decisions should be left to a higher layer.
 In the case of specified UTFs, the codecs should simply invert the UTF
 to Python's internal encoding.
 
 MAL Hmm, wouldn't it be better to raise an error ? After all, a
 MAL reversed BOM mark in the stream looks a lot like you're
 MAL trying to decode a UTF-16 stream assuming the wrong byte
 MAL order ?!
 
 +1 on (optionally) raising an error. 

The advantage of raising an error is that the application
can deal with the situation in whatever way seems fit (by
registering a special error handler or by simply using
"ignore" or "replace").

I agree that much of this lies outside the scope of codecs
and should be handled at an application or protocol level.

 -1 on removing it or anything
 like that, unless under control of the application (ie, the program
 written in Python, not Python itself).  It's far too easy for software
 to generate broken Unicode streams[1], and the choice of how to deal
 with those should be with the application, not with the implementation
 language.
 
 Footnotes: 
 [1]  An egregious example was the Outlook Express distributed with
 early Win2k betas, which produced MIME bodies with apparent
 

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Stephen J. Turnbull
 Martin == Martin v Löwis [EMAIL PROTECTED] writes:

Martin Stephen J. Turnbull wrote:

 However, this option should be part of the initialization of an
 IO stream which produces Unicodes, _not_ an operation on
 arbitrary internal strings (whether raw or Unicode).

Martin With the UTF-8-SIG codec, it would apply to all operation
Martin modes of the codec, whether stream-based or from strings.

I had in mind the ability to treat a string as a stream.

Martin Whether or not to use the codec would be the application's
Martin choice.

What I think should be provided is a stateful object encapsulating the
codec.  Ie, to avoid the need to write

out = chunk[0].encode("utf-8-sig") + chunk[1].encode("utf-8")



-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of TsukubaTennodai 1-1-1 Tsukuba 305-8573 JAPAN
   Ask not how you can do free software business;
  ask what your business can do for free software.


Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Walter Dörwald
Walter Dörwald said:

 M.-A. Lemburg wrote:

 [...]
With the UTF-8-SIG codec, it would apply to all operation
 modes of the codec, whether stream-based or from strings. Whether
or not to use the codec would be the application's choice.

 I'd suggest to use the same mode of operation as we have in
 the UTF-16 codec: it removes the BOM mark on the first call
 to the StreamReader .decode() method and writes a BOM mark
 on the first call to .encode() on a StreamWriter.

 Note that the UTF-16 codec is strict w/r to the presence
 of the BOM mark: you get a UnicodeError if a stream does
 not start with a BOM mark. For the UTF-8-SIG codec, this
 should probably be relaxed to not require the BOM.

 I've started writing such a codec. Making the BOM optional
 on decoding definitely simplifies the implementation.

OK, here is the patch: http://www.python.org/sf/1177307

The stateful decoder has a little problem: At least three bytes
have to be available from the stream until the StreamReader
decides whether these bytes are a BOM that has to be skipped.
This means that if the file only contains "ab", the user will
never see these two characters.

A solution for this would be to add an argument named final to
the decode and read methods that tells the decoder that the
stream has ended and the remaining buffered bytes have to be
handled now.
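The `final` flag proposed here is essentially what the incremental-decoder API in later Python versions provides; a sketch using today's `codecs` module:

```python
# Python 3 sketch: IncrementalDecoder.decode() takes exactly such a flag,
# decode(input, final=False), to flush pending bytes at end of stream.
import codecs

dec = codecs.getincrementaldecoder('utf-16')()
out1 = dec.decode(b'\xff\xfe')       # just a BOM: consumed, nothing emitted
out2 = dec.decode(b'a\x00')          # one UTF-16-LE code unit
out3 = dec.decode(b'', final=True)   # final=True: flush any pending bytes
```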

Bye,
   Walter Dörwald





Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Evan Jones
On Apr 5, 2005, at 15:33, Walter Dörwald wrote:
The stateful decoder has a little problem: At least three bytes
have to be available from the stream until the StreamReader
decides whether these bytes are a BOM that has to be skipped.
This means that if the file only contains "ab", the user will
never see these two characters.
Shouldn't the decoder be capable of doing a partial match and quitting 
early? After all, "ab" is encoded in UTF-8 as "61 62" but the BOM is 
"ef bb bf". If it did this type of partial matching, this issue 
would be avoided except in rare situations.
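Current Pythons implement exactly this partial matching in the 'utf-8-sig' incremental decoder (added after this thread); a quick sketch:

```python
# Bytes are buffered only while they could still be a BOM prefix.
import codecs

d1 = codecs.getincrementaldecoder('utf-8-sig')()
a = d1.decode(b'a')          # 0x61 cannot start a BOM: emitted immediately

d2 = codecs.getincrementaldecoder('utf-8-sig')()
b = d2.decode(b'\xef\xbb')   # still a possible BOM prefix: buffered
c = d2.decode(b'\xbfab')     # completes the BOM, which is skipped
```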

A solution for this would be to add an argument named final to
the decode and read methods that tells the decoder that the
stream has ended and the remaining buffered bytes have to be
handled now.
This functionality is provided by a flush() method on similar objects, 
such as the zlib compression objects.

Evan Jones


Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Fred Drake
On Tuesday 05 April 2005 15:53, Evan Jones wrote:
  This functionality is provided by a flush() method on similar objects,
  such as the zlib compression objects.

Or by close() on other objects (htmllib, HTMLParser, the SAX incremental 
parser, etc.).

Too bad there's more than one way to do it.  :-(


  -Fred

-- 
Fred L. Drake, Jr.    fdrake at acm.org


Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Martin v. Löwis
Walter Dörwald wrote:
The stateful decoder has a little problem: At least three bytes
have to be available from the stream until the StreamReader
decides whether these bytes are a BOM that has to be skipped.
This means that if the file only contains "ab", the user will
never see these two characters.
This can be improved, of course: If the first byte is "a", it most
definitely is *not* a UTF-8 signature.
So we only need a second byte for the characters between U+F000
and U+FFFF, and a third byte only for the characters
U+FEC0...U+FEFF. But with the first byte being "\xef", we need
three bytes *anyway*, so we can always decide with the first
byte only whether we need to wait for three bytes.
A solution for this would be to add an argument named final to
the decode and read methods that tells the decoder that the
stream has ended and the remaining buffered bytes have to be
handled now.
Shouldn't an empty read from the underlying stream be taken
as an EOF?
Regards,
Martin


Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Walter Dörwald
Martin v. Löwis said:
 Walter Dörwald wrote:
 The stateful decoder has a little problem: At least three bytes
 have to be available from the stream until the StreamReader
 decides whether these bytes are a BOM that has to be skipped.
 This means that if the file only contains "ab", the user will
 never see these two characters.

 This can be improved, of course: If the first byte is "a",
 it most definitely is *not* a UTF-8 signature.

 So we only need a second byte for the characters between U+F000
 and U+FFFF, and a third byte only for the characters
 U+FEC0...U+FEFF. But with the first byte being "\xef", we need
 three bytes *anyway*, so we can always decide with the first
 byte only whether we need to wait for three bytes.

OK, I've updated the patch so that the first bytes will only be kept
in the buffer if they are a prefix of the BOM.
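The buffering rule can be sketched as a small pure function (hypothetical helper `split_pending`, not the actual patch code):

```python
BOM = b'\xef\xbb\xbf'  # UTF-8 signature

def split_pending(buf):
    """Return (ready_to_decode, still_buffered) at the start of a stream."""
    if buf.startswith(BOM):
        return buf[len(BOM):], b''   # complete BOM: skip it
    if BOM.startswith(buf):
        return b'', buf              # could still become a BOM: keep waiting
    return buf, b''                  # definitely not a BOM: pass through
```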

 A solution for this would be to add an argument named final to
 the decode and read methods that tells the decoder that the
 stream has ended and the remaining buffered bytes have to be
 handled now.

 Shouldn't an empty read from the underlying stream be taken
 as an EOF?

There are situations where the byte stream might be temporarily
exhausted, e.g. an XML parser that tries to support the
IncrementalParser interface, or when you want to decode
encoded data piecewise, because you want to give a progress
report.
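For such piecewise decoding, later Python versions offer `codecs.iterdecode()`, which keeps an incomplete multi-byte sequence buffered until the next chunk arrives; a minimal sketch:

```python
import codecs

# U+00E9 (0xC3 0xA9 in UTF-8) is split across two chunks.
chunks = [b'a\xc3', b'\xa9b']
text = ''.join(codecs.iterdecode(iter(chunks), 'utf-8'))
```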

Bye,
   Walter Dörwald





Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Walter Dörwald
Evan Jones said:
 On Apr 5, 2005, at 15:33, Walter Dörwald wrote:
 The stateful decoder has a little problem: At least three bytes
 have to be available from the stream until the StreamReader
 decides whether these bytes are a BOM that has to be skipped.
 This means that if the file only contains "ab", the user will
 never see these two characters.

 Shouldn't the decoder be capable of doing a partial match and quitting
 early? After all, "ab" is encoded in UTF-8 as "61 62" but the BOM is
 "ef bb bf". If it did this type of partial matching, this issue
 would be avoided except in rare situations.

 A solution for this would be to add an argument named final to
 the decode and read methods that tells the decoder that the
 stream has ended and the remaining buffered bytes have to be
 handled now.

 This functionality is provided by a flush() method on similar objects,  such 
 as the zlib compression objects.

Theoretically the name is unimportant, but read(..., final=True) or flush()
or close() should subject the pending bytes to normal error handling and
must return the result of decoding these pending bytes just like the
other methods do. This would mean that we would have to implement
a decodeclose(), a readclose() and a readlineclose(). IMHO it would be
best to add this argument to decode, read and readline directly. But I'm
not sure, what this would mean for iterating through a StreamReader.

Bye,
Walter Dörwald





Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Martin v. Löwis
Walter Dörwald wrote:
There are situations where the byte stream might be temporarily
exhausted, e.g. an XML parser that tries to support the
IncrementalParser interface, or when you want to decode
encoded data piecewise, because you want to give a progress
report.
Yes, but these are not file-like objects. In the IncrementalParser,
it is *not* the case that a read operation returns an empty
string. Instead, the application repeatedly feeds data explicitly.
For a file-like object, returning "" indicates EOF.
Regards,
Martin


[Python-Dev] Unicode byte order mark decoding

2005-04-01 Thread Evan Jones
I recently rediscovered this strange behaviour in Python's Unicode 
handling. I *think* it is a bug, but before I go and try to hack 
together a patch, I figure I should run it by the experts here on 
Python-Dev. If you understand Unicode, please let me know if there are 
problems with making these minor changes.

>>> import codecs
>>> codecs.BOM_UTF8.decode( "utf8" )
u'\ufeff'
>>> codecs.BOM_UTF16.decode( "utf16" )
u''
Why does the UTF-16 decoder discard the BOM, while the UTF-8 decoder 
turns it into a character? The UTF-16 decoder contains logic to 
correctly handle the BOM. It even handles byte swapping, if necessary. 
I propose that  the UTF-8 decoder should have the same logic: it should 
remove the BOM if it is detected at the beginning of a string. This 
will remove a bit of manual work for Python programs that deal with 
UTF-8 files created on Windows, which frequently have the BOM at the 
beginning. The Unicode standard is unclear about how it should be 
handled (version 4, section 15.9):

Although there are never any questions of byte order with UTF-8 text, 
this sequence can serve as signature for UTF-8 encoded text where the 
character set is unmarked. [...] Systems that use the byte order mark 
must recognize when an initial U+FEFF signals the byte order. In those 
cases, it is not part of the textual content and should be removed 
before processing, because otherwise it may be mistaken for a 
legitimate zero width no-break space.
At the very least, it would be nice to add a note about this to the 
documentation, and possibly add this example function that implements 
the "UTF-8 or ASCII?" logic:

def autodecode( s ):
    if s.startswith( codecs.BOM_UTF8 ):
        # The byte string s is UTF-8
        out = s.decode( "utf8" )
        return out[1:]
    else: return s.decode( "ascii" )
As a second issue, the UTF-16LE and UTF-16BE encoders almost do the 
right thing: They turn the BOM into a character, just like the Unicode 
specification says they should.

>>> codecs.BOM_UTF16_LE.decode( "utf-16le" )
u'\ufeff'
>>> codecs.BOM_UTF16_BE.decode( "utf-16be" )
u'\ufeff'
However, they also *incorrectly* handle the reversed byte order mark:
>>> codecs.BOM_UTF16_BE.decode( "utf-16le" )
u'\ufffe'
This is *not* a valid Unicode character. The Unicode specification 
(version 4, section 15.8) says the following about non-characters:

Applications are free to use any of these noncharacter code points 
internally but should never attempt to exchange them. If a 
noncharacter is received in open interchange, an application is not 
required to interpret it in any way. It is good practice, however, to 
recognize it as a noncharacter and to take appropriate action, such as 
removing it from the text. Note that Unicode conformance freely allows 
the removal of these characters. (See C10 in Section 3.2, Conformance 
Requirements.)
My interpretation of the specification means that Python should 
silently remove the character, resulting in a zero length Unicode 
string. Similarly, both of the following lines should also result in a 
zero length Unicode string:

>>> '\xff\xfe\xfe\xff'.decode( "utf16" )
u'\ufffe'
>>> '\xff\xfe\xff\xff'.decode( "utf16" )
u'\uffff'
Thanks for your feedback,
Evan Jones


Re: [Python-Dev] Unicode byte order mark decoding

2005-04-01 Thread M.-A. Lemburg
Evan Jones wrote:
 I recently rediscovered this strange behaviour in Python's Unicode
 handling. I *think* it is a bug, but before I go and try to hack
 together a patch, I figure I should run it by the experts here on
 Python-Dev. If you understand Unicode, please let me know if there are
 problems with making these minor changes.
 
 
 import codecs
 codecs.BOM_UTF8.decode( "utf8" )
 u'\ufeff'
 codecs.BOM_UTF16.decode( "utf16" )
 u''
 
 Why does the UTF-16 decoder discard the BOM, while the UTF-8 decoder
 turns it into a character? 

The BOM (byte order mark) was a non-standard Microsoft invention
to detect Unicode text data as such (MS always uses UTF-16-LE for
Unicode text files).

It is not needed for the UTF-8 because that format doesn't rely on
the byte order and the BOM character at the beginning of a stream is
a legitimate ZWNBSP (zero width non breakable space) code point.

The utf-16 codec detects and removes the mark, while the
two others utf-16-le (little endian byte order) and utf-16-be
(big endian byte order) don't.
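The difference is easy to demonstrate (a sketch with Python 3 bytes literals, which postdate this thread; the codec behaviour is the same):

```python
le_bom = b'\xff\xfe'  # little-endian UTF-16 BOM

stripped = (le_bom + b'a\x00').decode('utf-16')  # BOM detected and removed
kept = le_bom.decode('utf-16-le')                # BOM left in as U+FEFF
be = b'\xfe\xff\x00a'.decode('utf-16')           # byte order auto-detected
```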

 The UTF-16 decoder contains logic to
 correctly handle the BOM. It even handles byte swapping, if necessary. I
 propose that  the UTF-8 decoder should have the same logic: it should
 remove the BOM if it is detected at the beginning of a string. 

-1; there's no standard for UTF-8 BOMs - adding it to the
codecs module was probably a mistake to begin with. You usually
only get UTF-8 files with BOM marks as the result of recoding
UTF-16 files into UTF-8.

 This will
 remove a bit of manual work for Python programs that deal with UTF-8
 files created on Windows, which frequently have the BOM at the
 beginning. The Unicode standard is unclear about how it should be
 handled (version 4, section 15.9):
 
 Although there are never any questions of byte order with UTF-8 text,
 this sequence can serve as signature for UTF-8 encoded text where the
 character set is unmarked. [...] Systems that use the byte order mark
 must recognize when an initial U+FEFF signals the byte order. In those
 cases, it is not part of the textual content and should be removed
 before processing, because otherwise it may be mistaken for a
 legitimate zero width no-break space.
 
 
 At the very least, it would be nice to add a note about this to the
 documentation, and possibly add this example function that implements
 the "UTF-8 or ASCII?" logic:
 
 def autodecode( s ):
     if s.startswith( codecs.BOM_UTF8 ):
         # The byte string s is UTF-8
         out = s.decode( "utf8" )
         return out[1:]
     else: return s.decode( "ascii" )

Well, I'd say that's a very English way of dealing with encoded
text ;-)

BTW, how do you know that s came from the start of a file
and not from slicing some already loaded file somewhere
in the middle ?

 As a second issue, the UTF-16LE and UTF-16BE encoders almost do the
 right thing: They turn the BOM into a character, just like the Unicode
 specification says they should.
 
 codecs.BOM_UTF16_LE.decode( "utf-16le" )
 u'\ufeff'
 codecs.BOM_UTF16_BE.decode( "utf-16be" )
 u'\ufeff'
 
 However, they also *incorrectly* handle the reversed byte order mark:
 
 codecs.BOM_UTF16_BE.decode( "utf-16le" )
 u'\ufffe'
 
 This is *not* a valid Unicode character. The Unicode specification
 (version 4, section 15.8) says the following about non-characters:
 
 Applications are free to use any of these noncharacter code points
 internally but should never attempt to exchange them. If a
 noncharacter is received in open interchange, an application is not
 required to interpret it in any way. It is good practice, however, to
 recognize it as a noncharacter and to take appropriate action, such as
 removing it from the text. Note that Unicode conformance freely allows
 the removal of these characters. (See C10 in Section 3.2, Conformance
 Requirements.)
 
 
 My interpretation of the specification means that Python should silently
 remove the character, resulting in a zero length Unicode string.
 Similarly, both of the following lines should also result in a zero
 length Unicode string:
 
 '\xff\xfe\xfe\xff'.decode( "utf16" )
 u'\ufffe'
 '\xff\xfe\xff\xff'.decode( "utf16" )
 u'\uffff'

Hmm, wouldn't it be better to raise an error ? After all,
a reversed BOM mark in the stream looks a lot like you're
trying to decode a UTF-16 stream assuming the wrong
byte order ?!

Other than that: +1 on fixing this case.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Apr 01 2005)
 Python/Zope Consulting and Support ...http://www.egenix.com/
 mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
 mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/


::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! 

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-01 Thread Evan Jones
On Apr 1, 2005, at 15:19, M.-A. Lemburg wrote:
The BOM (byte order mark) was a non-standard Microsoft invention
to detect Unicode text data as such (MS always uses UTF-16-LE for
Unicode text files).
Well, its origins do not really matter since at this point the BOM is 
firmly encoded in the Unicode standard. It seems to me that it is in 
everyone's best interest to support it.

It is not needed for the UTF-8 because that format doesn't rely on
the byte order and the BOM character at the beginning of a stream is
a legitimate ZWNBSP (zero width non breakable space) code point.
You are correct: it is a legitimate character. However, its use as a 
ZWNBSP character has been deprecated:

The overloading of semantics for this code point has caused problems 
for programs and protocols. The new character U+2060 WORD JOINER has 
the same semantics in all cases as U+FEFF, except that it cannot be 
used as a signature. Implementers are strongly encouraged to use word 
joiner in those circumstances whenever word joining semantics is 
intended.
Also, the Unicode specification is ambiguous on what an implementation 
should do about a leading ZWNBSP that is encoded in UTF-16. Like I 
mentioned, if you look at the Unicode standard, version 4, section 
15.9, it says:

2. Unmarked Character Set. In some circumstances, the character set 
information for a stream of coded characters (such as a file) is not 
available. The only information available is that the stream contains 
text, but the precise character set is not known.
This seems to indicate that it is permitted to strip the BOM from the 
beginning of UTF-8 text.
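That reading is what the 'utf-8-sig' codec added in later Python versions implements; a short sketch:

```python
# Strip a leading BOM if present; accept input without one.
with_bom = b'\xef\xbb\xbfhello'

sig = with_bom.decode('utf-8-sig')      # BOM stripped
plain = b'hello'.decode('utf-8-sig')    # BOM is optional
raw = with_bom.decode('utf-8')          # plain utf-8 keeps U+FEFF
```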

-1; there's no standard for UTF-8 BOMs - adding it to the
codecs module was probably a mistake to begin with. You usually
only get UTF-8 files with BOM marks as the result of recoding
UTF-16 files into UTF-8.
This is clearly incorrect. The UTF-8 BOM is specified in the Unicode 
standard version 4, section 15.9:

In UTF-8, the BOM corresponds to the byte sequence EF BB BF.
I normally find files with UTF-8 BOMs from many Windows applications 
when you save a text file as UTF8. I think that Notepad or WordPad does 
this, for example. I think UltraEdit also does the same thing. I know 
that Scintilla definitely does.

At the very least, it would be nice to add a note about this to the
documentation, and possibly add this example function that implements
the "UTF-8 or ASCII?" logic.
Well, I'd say that's a very English way of dealing with encoded
text ;-)
Please note I am saying only that something like this might be 
considered for addition to the documentation, and not to the Python 
standard library. This example function more closely replicates the 
logic that is used on those Windows applications when opening .txt 
files. It uses the default locale if there is no BOM:

def autodecode( s ):
    if s.startswith( codecs.BOM_UTF8 ):
        # The byte string s is UTF-8
        out = s.decode( "utf8" )
        return out[1:]
    else: return s.decode()
BTW, how do you know that s came from the start of a file
and not from slicing some already loaded file somewhere
in the middle ?
Well, the same argument could be applied to the UTF-16 decoder: how does 
*it* know that the string came from the start of a file, and not from 
slicing some already loaded file? The standard states that:

In the UTF-16 encoding scheme, U+FEFF at the very beginning of a file 
or stream explicitly signals the byte order.
So it is perfectly permissible to perform this type of processing if 
you consider a string to be equivalent to a stream.

My interpretation of the specification means that Python should 
silently
remove the character, resulting in a zero length Unicode string.
Hmm, wouldn't it be better to raise an error ? After all,
a reversed BOM mark in the stream looks a lot like you're
trying to decode a UTF-16 stream assuming the wrong
byte order ?!
Well, either one is possible, however the Unicode standard suggests, 
but does not require, silently removing them:

It is good practice, however, to recognize it as a noncharacter and to 
take appropriate action, such as removing it from the text. Note that 
Unicode conformance freely allows the removal of these characters.
I would prefer silently ignoring them from the str.decode() function, 
since I believe in be strict in what you emit, but liberal in what you 
accept. I think that this only applies to str.decode(). Any other 
attempt to create non-characters, such as unichr( 0xFFFF ), *should* 
raise an exception because clearly the programmer is making a mistake.

Other than that: +1 on fixing this case.
Cool!
Evan Jones