Re: [Python-Dev] Unicode byte order mark decoding

2005-04-07 Thread Nicholas Bastin
On Apr 7, 2005, at 5:07 AM, M.-A. Lemburg wrote: The current implementation of the utf-16 codecs makes for some irritating gymnastics to write the BOM into the file before reading it if it contains no BOM, which seems quite like a bug in the codec. The codec writes a BOM in the first call to
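
For readers following the thread, the "BOM on the first call only" behaviour can be observed with the `codecs` module in current CPython. This is a minimal sketch, not the 2005 code under discussion; the variable names are illustrative.

```python
import codecs

# An incremental UTF-16 encoder emits the BOM once, on its first call only.
encoder = codecs.getincrementalencoder("utf-16")()
first = encoder.encode("a")    # BOM followed by the code unit for "a"
second = encoder.encode("b")   # just the code unit for "b", no BOM

# codecs.BOM_UTF16 is the BOM in the platform's native byte order.
print(first.startswith(codecs.BOM_UTF16))
print(second.startswith(codecs.BOM_UTF16))
```

Because the BOM is tied to the first call, appending to an existing file with a fresh encoder would write a second BOM in the middle of the data.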

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-07 Thread M.-A. Lemburg
Nicholas Bastin wrote: On Apr 7, 2005, at 5:07 AM, M.-A. Lemburg wrote: The current implementation of the utf-16 codecs makes for some irritating gymnastics to write the BOM into the file before reading it if it contains no BOM, which seems quite like a bug in the codec. The codec

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-07 Thread Nicholas Bastin
On Apr 7, 2005, at 11:35 AM, M.-A. Lemburg wrote: Ok, but I don't really follow you here: you are suggesting to relax the current UTF-16 behavior and to start defaulting to UTF-16-BE if no BOM is present - that's most likely going to cause more problems than it seems to solve: namely complete

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-07 Thread Walter Dörwald
Nicholas Bastin wrote: On Apr 7, 2005, at 11:35 AM, M.-A. Lemburg wrote: [...] If you do have UTF-16 without a BOM mark it's much better to let a short function analyze the text by reading the first few bytes of the file and then make an educated guess based on the findings. You can then
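
One way such a "short function" might look is sketched below. The name `guess_utf16_byte_order` and the zero-byte heuristic are assumptions made for illustration; the thread does not specify how the guess should be made, and the heuristic is only reasonable for text dominated by Latin-range characters.

```python
import codecs

def guess_utf16_byte_order(sample: bytes) -> str:
    """Return 'utf-16-be' or 'utf-16-le' as a best guess (hypothetical helper)."""
    # A BOM, if present, settles the question.
    if sample.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    if sample.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    # Assumed heuristic: mostly-Latin text has a zero high byte per code unit.
    # In big-endian data the zeros sit at even offsets, in little-endian at odd.
    even_zeros = sample[0::2].count(0)
    odd_zeros = sample[1::2].count(0)
    # Ties fall back to big-endian.
    return "utf-16-le" if odd_zeros > even_zeros else "utf-16-be"
```

A caller would pass the first few hundred bytes of the file and then open it with the returned codec name. When a BOM is present, the caller still has to skip it, since the `utf-16-be` and `utf-16-le` codecs do not strip it.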

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-07 Thread Martin v. Löwis
Nicholas Bastin wrote: It would be nice if you could optionally specify that the codec would assume UTF-16BE if no BOM was present, and not raise UnicodeError in that case, which would preserve the current behaviour as well as allow users to ask for behaviour which conforms to the standard.
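
The requested behaviour can be approximated outside the codec. The sketch below is an illustration only: `decode_utf16_default_be` is a hypothetical helper, not an option that the codec discussed in the thread provides.

```python
import codecs

def decode_utf16_default_be(data: bytes, errors: str = "strict") -> str:
    """Decode UTF-16, honouring a BOM and assuming big-endian without one."""
    if data.startswith(codecs.BOM_UTF16_BE):
        # Skip the two BOM bytes and decode the rest as big-endian.
        return data[2:].decode("utf-16-be", errors)
    if data.startswith(codecs.BOM_UTF16_LE):
        # Skip the two BOM bytes and decode the rest as little-endian.
        return data[2:].decode("utf-16-le", errors)
    # No BOM: fall back to big-endian instead of raising.
    return data.decode("utf-16-be", errors)
```

This only handles complete byte strings; a stream-based variant would have to make the same decision once it has seen the first two bytes.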

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-06 Thread Martin v. Löwis
Stephen J. Turnbull wrote: Of course it must be supported. My point is that many strings (in my applications, all but those strings that result from slurping in a file or process output in one go -- example, not a statistically valid sample!) are not the beginning of what once was a stream. It

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-06 Thread Walter Dörwald
Martin v. Löwis wrote: Walter Dörwald wrote: There are situations where the byte stream might be temporarily exhausted, e.g. an XML parser that tries to support the IncrementalParser interface, or when you want to decode encoded data piecewise, because you want to give a progress report.

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-06 Thread Walter Dörwald
Stephen J. Turnbull wrote: Martin v. Löwis writes: I can't put these two paragraphs together. If you think that explicit is better than implicit, why do you not want to make different calls for the first chunk of a stream, and

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-06 Thread Nicholas Bastin
On Apr 5, 2005, at 6:19 AM, M.-A. Lemburg wrote: Note that the UTF-16 codec is strict w/r to the presence of the BOM mark: you get a UnicodeError if a stream does not start with a BOM mark. For the UTF-8-SIG codec, this should probably be relaxed to not require the BOM. I've actually been confused

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-06 Thread Stephen J. Turnbull
Walter Dörwald writes: Not really. In every encoding where a sequence of more than one byte maps to one Unicode character, you will always need some kind of buffering. If we remove the handling of initial BOMs from the
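
The buffering point can be illustrated with an incremental decoder from the `codecs` module in current CPython. This is a small sketch, not code from the thread: a two-byte UTF-8 sequence is fed in one byte at a time.

```python
import codecs

decoder = codecs.getincrementaldecoder("utf-8")()
encoded = "ö".encode("utf-8")          # two bytes: 0xC3 0xB6

part1 = decoder.decode(encoded[:1])    # incomplete sequence: held in the buffer
part2 = decoder.decode(encoded[1:])    # second byte completes the character

print(repr(part1), repr(part2))
```

The first call returns an empty string because the decoder cannot yet know which character the lead byte belongs to.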

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Stephen J. Turnbull
M.-A. Lemburg writes: The BOM (byte order mark) was a non-standard Microsoft invention to detect Unicode text data as such (MS always uses UTF-16-LE for Unicode text files). The Japanese memopado (Notepad) uses UTF-8 signatures; it even adds them to

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Martin v. Löwis
Stephen J. Turnbull wrote: So there is a standard for the UTF-8 signature, and I know of applications which produce it. While I agree with you that Python's codecs shouldn't produce it (by default), providing an option to strip is a good idea. I would personally like to see a utf-8-bom codec

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread M.-A. Lemburg
Martin v. Löwis wrote: Stephen J. Turnbull wrote: So there is a standard for the UTF-8 signature, and I know of applications which produce it. While I agree with you that Python's codecs shouldn't produce it (by default), providing an option to strip is a good idea. I would personally

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Walter Dörwald
M.-A. Lemburg wrote: [...] With the UTF-8-SIG codec, it would apply to all operation modes of the codec, whether stream-based or from strings. Whether or not to use the codec would be the application's choice. I'd suggest to use the same mode of operation as we have in the UTF-16 codec: it removes
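
Current Python ships a codec under the name `utf-8-sig`. The sketch below shows how it treats the signature on complete byte strings, compared with the plain `utf-8` codec; the variable names are illustrative.

```python
import codecs

data_with_sig = codecs.BOM_UTF8 + "abc".encode("utf-8")

plain = data_with_sig.decode("utf-8")          # signature survives as U+FEFF
stripped = data_with_sig.decode("utf-8-sig")   # signature is removed
no_sig = b"abc".decode("utf-8-sig")            # a missing signature is not an error
written = "abc".encode("utf-8-sig")            # encoding prepends the signature

print(repr(plain), repr(stripped), repr(no_sig), written)
```

Choosing the codec by name keeps the decision with the application, which matches the "application's choice" point made above.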

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread M.-A. Lemburg
Stephen J. Turnbull wrote: M.-A. Lemburg writes: The BOM (byte order mark) was a non-standard Microsoft invention to detect Unicode text data as such (MS always uses UTF-16-LE for Unicode text files). The Japanese memopado (Notepad) uses UTF-8

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Stephen J. Turnbull
Martin v. Löwis writes: Stephen J. Turnbull wrote: However, this option should be part of the initialization of an IO stream which produces Unicodes, _not_ an operation on arbitrary internal strings (whether raw or Unicode). With

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Walter Dörwald
Walter Dörwald wrote: M.-A. Lemburg wrote: [...] With the UTF-8-SIG codec, it would apply to all operation modes of the codec, whether stream-based or from strings. Whether or not to use the codec would be the application's choice. I'd suggest to use the same mode of operation as we have in

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Evan Jones
On Apr 5, 2005, at 15:33, Walter Dörwald wrote: The stateful decoder has a little problem: At least three bytes have to be available from the stream until the StreamReader decides whether these bytes are a BOM that has to be skipped. This means that if the file only contains ab, the user will

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Fred Drake
On Tuesday 05 April 2005 15:53, Evan Jones wrote: This functionality is provided by a flush() method on similar objects, such as the zlib compression objects. Or by close() on other objects (htmllib, HTMLParser, the SAX incremental parser, etc.). Too bad there's more than one way to do it.
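
The `flush()` convention mentioned for zlib can be sketched as follows. This only illustrates the zlib side of the analogy; it does not claim that the codec stream classes offer the same method.

```python
import zlib

compressor = zlib.compressobj()

# compress() may hold data back internally; flush() forces out whatever remains.
chunks = [compressor.compress(b"ab")]
chunks.append(compressor.flush())

compressed = b"".join(chunks)
restored = zlib.decompress(compressed)
print(restored)
```

Without the final `flush()`, the compressed stream would be incomplete and could not be decompressed back to the original bytes.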

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Martin v. Löwis
Walter Dörwald wrote: The stateful decoder has a little problem: At least three bytes have to be available from the stream until the StreamReader decides whether these bytes are a BOM that has to be skipped. This means that if the file only contains ab, the user will never see these two
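
The wait-for-three-bytes issue can be explored with the `utf-8-sig` incremental decoder in current CPython. This sketch reflects today's implementation, not the 2005 proposal: the decoder withholds output only while the bytes seen so far could still be the start of a signature.

```python
import codecs

# Case 1: the first two bytes of the signature arrive alone.
decoder = codecs.getincrementaldecoder("utf-8-sig")()
pending = decoder.decode(codecs.BOM_UTF8[:2])        # undecided: nothing returned
rest = decoder.decode(codecs.BOM_UTF8[2:] + b"ab")   # signature completed and skipped

# Case 2: a two-byte input that cannot be the start of a signature.
short_decoder = codecs.getincrementaldecoder("utf-8-sig")()
short = short_decoder.decode(b"ab")                  # returned without waiting

print(repr(pending), repr(rest), repr(short))
```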

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Walter Dörwald
Martin v. Löwis wrote: Walter Dörwald wrote: The stateful decoder has a little problem: At least three bytes have to be available from the stream until the StreamReader decides whether these bytes are a BOM that has to be skipped. This means that if the file only contains ab, the user will

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Walter Dörwald
Evan Jones wrote: On Apr 5, 2005, at 15:33, Walter Dörwald wrote: The stateful decoder has a little problem: At least three bytes have to be available from the stream until the StreamReader decides whether these bytes are a BOM that has to be skipped. This means that if the file only contains

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Martin v. Löwis
Walter Dörwald wrote: There are situations where the byte stream might be temporarily exhausted, e.g. an XML parser that tries to support the IncrementalParser interface, or when you want to decode encoded data piecewise, because you want to give a progress report. Yes, but these are not file-like

[Python-Dev] Unicode byte order mark decoding

2005-04-01 Thread Evan Jones
I recently rediscovered this strange behaviour in Python's Unicode handling. I *think* it is a bug, but before I go and try to hack together a patch, I figure I should run it by the experts here on Python-Dev. If you understand Unicode, please let me know if there are problems with making

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-01 Thread M.-A. Lemburg
Evan Jones wrote: I recently rediscovered this strange behaviour in Python's Unicode handling. I *think* it is a bug, but before I go and try to hack together a patch, I figure I should run it by the experts here on Python-Dev. If you understand Unicode, please let me know if there are

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-01 Thread Evan Jones
On Apr 1, 2005, at 15:19, M.-A. Lemburg wrote: The BOM (byte order mark) was a non-standard Microsoft invention to detect Unicode text data as such (MS always uses UTF-16-LE for Unicode text files). Well, its origins do not really matter since at this point the BOM is firmly encoded in the