On Apr 7, 2005, at 5:07 AM, M.-A. Lemburg wrote:
The current implementation of the utf-16 codecs makes for some
irritating gymnastics to write the BOM into the file before reading it
if it contains no BOM, which seems quite like a bug in the codec.
The codec writes a BOM in the first call to
Nicholas Bastin wrote:
On Apr 7, 2005, at 5:07 AM, M.-A. Lemburg wrote:
The current implementation of the utf-16 codecs makes for some
irritating gymnastics to write the BOM into the file before reading it
if it contains no BOM, which seems quite like a bug in the codec.
The codec
On Apr 7, 2005, at 11:35 AM, M.-A. Lemburg wrote:
Ok, but I don't really follow you here: you are suggesting to
relax the current UTF-16 behavior and to start defaulting to
UTF-16-BE if no BOM is present - that's most likely going to
cause more problems than it seems to solve: namely complete
Nicholas Bastin wrote:
On Apr 7, 2005, at 11:35 AM, M.-A. Lemburg wrote:
[...]
If you do have UTF-16 without a BOM mark it's much better
to let a short function analyze the text by reading the first
few bytes of the file and then make an educated guess based
on the findings. You can then
Nicholas Bastin wrote:
It would be nice if you could optionally specify that the codec would
assume UTF-16BE if no BOM was present, and not raise UnicodeError in
that case, which would preserve the current behaviour as well as allow
users to ask for behaviour which conforms to the standard.
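
A minimal sketch of the optional fallback described above, written against current Python rather than the 2005 codec under discussion. The function name `decode_utf16` and its `default` parameter are hypothetical and are not part of the codecs module; only the `codecs.BOM_UTF16_BE` and `codecs.BOM_UTF16_LE` constants come from the standard library.

```python
import codecs


def decode_utf16(data: bytes, default: str = "utf-16-be") -> str:
    """Decode UTF-16 bytes, falling back to `default` when no BOM is present.

    Hypothetical helper for illustration; not an API of the codecs module.
    """
    if data.startswith(codecs.BOM_UTF16_BE):
        # Big-endian BOM found: strip it and decode the rest as big-endian.
        return data[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        # Little-endian BOM found: strip it and decode the rest as little-endian.
        return data[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    # No BOM: use the caller-supplied default (big-endian unless overridden).
    return data.decode(default)
```

Keeping the fallback as an explicit argument means the caller has to opt in, so a stricter behaviour can remain the default elsewhere.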
Stephen J. Turnbull wrote:
Of course it must be supported. My point is that many strings (in my
applications, all but those strings that result from slurping in a
file or process output in one go -- example, not a statistically valid
sample!) are not the beginning of what once was a stream. It
Martin v. Löwis wrote:
Walter Dörwald wrote:
There are situations where the byte stream might be temporarily
exhausted, e.g. an XML parser that tries to support the
IncrementalParser interface, or when you want to decode
encoded data piecewise, because you want to give a progress
report.
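
The piecewise case can be sketched with the incremental decoder interface found in current Python's codecs module, which was added after this thread. The generator name `decode_in_chunks` is hypothetical; it assumes the input arrives as an iterable of byte chunks and reports how many bytes have been seen so far, which is enough for a progress report.

```python
import codecs


def decode_in_chunks(chunks, encoding="utf-8"):
    """Yield (text, bytes_seen) per chunk, buffering incomplete byte sequences.

    Hypothetical helper for illustration only.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    seen = 0
    for chunk in chunks:
        seen += len(chunk)
        # The decoder keeps any trailing partial character for the next call.
        yield decoder.decode(chunk), seen
    # final=True marks the end of input, so a truncated character
    # raises an error instead of staying in the buffer unnoticed.
    yield decoder.decode(b"", final=True), seen
```

A temporarily exhausted stream simply produces no chunk for a while; the decoder's buffered state carries over until more bytes arrive.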
Stephen J. Turnbull wrote:
Martin == Martin v Löwis [EMAIL PROTECTED] writes:
Martin I can't put these two paragraphs together. If you think
Martin that explicit is better than implicit, why do you not want
Martin to make different calls for the first chunk of a stream,
Martin and
On Apr 5, 2005, at 6:19 AM, M.-A. Lemburg wrote:
Note that the UTF-16 codec is strict w/r to the presence
of the BOM mark: you get a UnicodeError if a stream does
not start with a BOM mark. For the UTF-8-SIG codec, this
should probably be relaxed to not require the BOM.
I've actually been confused
Walter == Walter Dörwald [EMAIL PROTECTED] writes:
Walter Not really. In every encoding where a sequence of more
Walter than one byte maps to one Unicode character, you will
Walter always need some kind of buffering. If we remove the
Walter handling of initial BOMs from the
MAL == M [EMAIL PROTECTED] writes:
MAL The BOM (byte order mark) was a non-standard Microsoft
MAL invention to detect Unicode text data as such (MS always uses
MAL UTF-16-LE for Unicode text files).
The Japanese memopado (Notepad) uses UTF-8 signatures; it even adds
them to
Stephen J. Turnbull wrote:
So there is a standard for the UTF-8 signature, and I know of
applications which produce it. While I agree with you that Python's
codecs shouldn't produce it (by default), providing an option to strip
is a good idea.
I would personally like to see a utf-8-bom codec
Martin v. Löwis wrote:
Stephen J. Turnbull wrote:
So there is a standard for the UTF-8 signature, and I know of
applications which produce it. While I agree with you that Python's
codecs shouldn't produce it (by default), providing an option to strip
is a good idea.
I would personally
M.-A. Lemburg wrote:
[...]
With the UTF-8-SIG codec, it would apply to all operation modes of
the codec, whether stream-based or from strings. Whether or not to
use the codec would be the application's choice.
I'd suggest to use the same mode of operation as we have in
the UTF-16 codec: it removes
Stephen J. Turnbull wrote:
MAL == M [EMAIL PROTECTED] writes:
MAL The BOM (byte order mark) was a non-standard Microsoft
MAL invention to detect Unicode text data as such (MS always uses
MAL UTF-16-LE for Unicode text files).
The Japanese memopado (Notepad) uses UTF-8
Martin == Martin v Löwis [EMAIL PROTECTED] writes:
Martin Stephen J. Turnbull wrote:
However, this option should be part of the initialization of an
IO stream which produces Unicodes, _not_ an operation on
arbitrary internal strings (whether raw or Unicode).
Martin With
Walter Dörwald wrote:
M.-A. Lemburg wrote:
[...]
With the UTF-8-SIG codec, it would apply to all operation
modes of the codec, whether stream-based or from strings. Whether
or not to use the codec would be the application's choice.
I'd suggest to use the same mode of operation as we have in
On Apr 5, 2005, at 15:33, Walter Dörwald wrote:
The stateful decoder has a little problem: At least three bytes
have to be available from the stream until the StreamReader
decides whether these bytes are a BOM that has to be skipped.
This means that if the file only contains ab, the user will
On Tuesday 05 April 2005 15:53, Evan Jones wrote:
This functionality is provided by a flush() method on similar objects,
such as the zlib compression objects.
Or by close() on other objects (htmllib, HTMLParser, the SAX incremental
parser, etc.).
Too bad there's more than one way to do it.
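
For comparison, a short sketch of the flush() pattern on zlib compression objects mentioned above. The wrapper name `compress_stream` is hypothetical; `zlib.compressobj()`, `compress()`, and `flush()` are standard library calls.

```python
import zlib


def compress_stream(chunks):
    """Compress an iterable of byte chunks into one zlib stream.

    Hypothetical helper for illustration only.
    """
    compressor = zlib.compressobj()
    # compress() may hold some input back in an internal buffer.
    pieces = [compressor.compress(chunk) for chunk in chunks]
    # flush() tells the object that no more input is coming and
    # returns whatever compressed output is still pending.
    pieces.append(compressor.flush())
    return b"".join(pieces)
```

The same end-of-input signal is what a buffering decoder would need in order to release bytes it was holding while waiting to see whether they form a BOM.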
Walter Dörwald wrote:
The stateful decoder has a little problem: At least three bytes
have to be available from the stream until the StreamReader
decides whether these bytes are a BOM that has to be skipped.
This means that if the file only contains ab, the user will
never see these two
Martin v. Löwis wrote:
Walter Dörwald wrote:
The stateful decoder has a little problem: At least three bytes
have to be available from the stream until the StreamReader
decides whether these bytes are a BOM that has to be skipped.
This means that if the file only contains ab, the user will
Evan Jones wrote:
On Apr 5, 2005, at 15:33, Walter Dörwald wrote:
The stateful decoder has a little problem: At least three bytes
have to be available from the stream until the StreamReader
decides whether these bytes are a BOM that has to be skipped.
This means that if the file only contains
Walter Dörwald wrote:
There are situations where the byte stream might be temporarily
exhausted, e.g. an XML parser that tries to support the
IncrementalParser interface, or when you want to decode
encoded data piecewise, because you want to give a progress
report.
Yes, but these are not file-like
I recently rediscovered this strange behaviour in Python's Unicode
handling. I *think* it is a bug, but before I go and try to hack
together a patch, I figure I should run it by the experts here on
Python-Dev. If you understand Unicode, please let me know if there are
problems with making
Evan Jones wrote:
I recently rediscovered this strange behaviour in Python's Unicode
handling. I *think* it is a bug, but before I go and try to hack
together a patch, I figure I should run it by the experts here on
Python-Dev. If you understand Unicode, please let me know if there are
On Apr 1, 2005, at 15:19, M.-A. Lemburg wrote:
The BOM (byte order mark) was a non-standard Microsoft invention
to detect Unicode text data as such (MS always uses UTF-16-LE for
Unicode text files).
Well, its origins do not really matter since at this point the BOM is
firmly encoded in the