Re: [Python-Dev] Unicode byte order mark decoding

2005-04-07 Thread Stephen J. Turnbull
> "MvL" == "Martin v. Löwis" <[EMAIL PROTECTED]> writes:

MvL> This would also support your usecase, and in a better way.
MvL> The Unicode assertion that UTF-16 is BE by default is void
MvL> these days - there is *always* a higher layer protocol, and
MvL> it more often than not specifies (perhaps not in English
MvL> words, but only in the source code of the generator) that the
MvL> default should be LE.

That is _not_ a protocol.  A protocol is a published specification,
not merely a frequent accident of implementation.  Anyway, both ISO
10646 and the Unicode standard consider that "internal use" and there
is no requirement at all placed on those data.  And such generators
typically take great advantage of that freedom---have you looked in a
.doc file recently?  Have you noticed how many different options
(previous implementations) of .doc are offered in the Import menu?

> "MAL" == "M.-A. Lemburg" <[EMAIL PROTECTED]> writes:

MAL> I've checked the various versions of the Unicode standard
MAL> docs: it seems that the quote you have was silently
MAL> introduced between 3.0 and 4.0.

Probably because ISO 10646 was _always_ BE until the standards were
unified.  But note that ISO 10646 standardizes only use as a
communications medium.  Neither ISO 10646 nor Unicode makes any
specification about internal usage.  Conformance in internal
processing is a matter of the programmer's convenience in producing
conforming output.

MAL> Python currently uses version 3.2.0 of the standard and I
MAL> don't think enough people are aware of the change in the
MAL> standard

There's only one (corporate) person that matters: Microsoft.

MAL> By the time we switch to 4.1 or later, we can then make the
MAL> change in the native UTF-16 codec as you requested.

While in principle I sympathize with Nick, pragmatically Microsoft is
unlikely to conform.  They will take the position that files created
by Windows are "internal" to the Windows environment, except where
explicitly intended for exchange with arbitrary platforms, and only
then will they conform.  As Martin points out, that is what really
matters for these defaults.  I think you should look to see what
Microsoft does.

MAL> Personally, I think that the Unicode consortium should not
MAL> have introduced a default for the UTF-16 encoding byte
MAL> order. Using big endian as default in a world where most
MAL> Unicode data is created on little endian machines is not very
MAL> realistic either.

It's not a default for the UTF-16 encoding byte order.  It's a default
for the UTF-16 encoding byte order _when UTF-16 is a communications
medium_.  Given that the generic network byte order is bigendian, I
think it would be insane to specify littleendian as Unicode's default.

With Unicode same as network, you specify UTF-16 strings internally as
an array of uint16_t, and when you put them on the wire (including
saving them to a file that might be put on the wire as octet-stream)
you apply htons(3) to it.  On reading, you apply ntohs(3) to it.  The
source code is portable, the file is portable.  How can you beat that?
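That round trip can be sketched in Python as well (my illustration, not from the thread; the `array` module stands in for C's `uint16_t` buffers, and `byteswap()` plays the role of htons(3)/ntohs(3) on little-endian hosts):

```python
import sys
from array import array

def to_wire(units):
    """Serialize native-order UTF-16 code units in network (big-endian) order."""
    a = array("H", units)           # array of uint16_t in native byte order
    if sys.byteorder == "little":   # htons(3): swap only on little-endian hosts
        a.byteswap()
    return a.tobytes()

def from_wire(data):
    """Read big-endian code units from the wire back into native order."""
    a = array("H", data)            # reinterpret the octets as uint16_t
    if sys.byteorder == "little":   # ntohs(3)
        a.byteswap()
    return list(a)

units = [0x0048, 0x0069]            # "Hi" as UTF-16 code units
assert to_wire(units) == b"\x00H\x00i"      # big-endian octets on the wire
assert from_wire(to_wire(units)) == units   # round trip is lossless
```

Either way, the source is portable because the swap is conditional on the host's byte order, while the wire format is fixed.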

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba, Tennodai 1-1-1, Tsukuba 305-8573 JAPAN
   Ask not how you can "do" free software business;
  ask what your business can "do for" free software.
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Unicode byte order mark decoding

2005-04-07 Thread M.-A. Lemburg
Martin v. Löwis wrote:
> Nicholas Bastin wrote:
> 
>>It would be nice if you could optionally specify that the codec would
>>assume UTF-16BE if no BOM was present, and not raise UnicodeError in
>>that case, which would preserve the current behaviour as well as allow
>>users to ask for behaviour which conforms to the standard.
> 
> 
> Alternatively, the UTF-16BE codec could support the BOM, and do
> UTF-16LE if the "other" BOM is found.

That would violate the Unicode standard - the BOM character
for UTF-16-LE and -BE must be interpreted as ZWNBSP.
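A quick illustration of that difference, in today's bytes/str spelling (my example): the explicit-endian codecs keep a leading U+FEFF as ordinary ZWNBSP data, while the BOM-sniffing UTF-16 codec consumes it as a signature.

```python
# Explicit byte order: a leading BOM is an ordinary ZWNBSP character.
assert b"\xff\xfeh\x00i\x00".decode("utf-16-le") == "\ufeffhi"
assert b"\xfe\xff\x00h\x00i".decode("utf-16-be") == "\ufeffhi"

# The generic utf-16 codec treats the same bytes as a signature and strips it.
assert b"\xff\xfeh\x00i\x00".decode("utf-16") == "hi"
```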

> This would also support your usecase, and in a better way. The
> Unicode assertion that UTF-16 is BE by default is void these
> days - there is *always* a higher layer protocol, and it more
> often than not specifies (perhaps not in English words, but
> only in the source code of the generator) that the default should
be LE.

I've checked the various versions of the Unicode standard
docs: it seems that the quote you have was silently introduced
between 3.0 and 4.0.

Python currently uses version 3.2.0 of the standard and I don't
think enough people are aware of the change in the standard to make
a case for dropping the exception raising in the case of the UTF-16
codec finding a stream without a BOM mark.

By the time we switch to 4.1 or later, we can then
make the change in the native UTF-16 codec as you
requested.

Personally, I think that the Unicode consortium should not
have introduced a default for the UTF-16 encoding byte
order. Using big endian as default in a world where most
Unicode data is created on little endian machines is not
very realistic either.

Note that the UTF-16 codec starts reading data in
the machine's native byte order and then learns a possibly
different byte order by looking for BOMs.

Implementing a codec which implements the 4.0 behavior
is easy, though.
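A minimal sketch of that 4.0-conformant behavior (my own sketch, written with today's bytes/str API rather than the codec machinery of the time): honor a BOM if one is present, otherwise assume big-endian as the standard specifies for the UTF-16 encoding scheme.

```python
def decode_utf16_default_be(data, errors="strict"):
    """Decode UTF-16 bytes; honor a BOM if present, otherwise
    assume big-endian, as Unicode 4.0 specifies for the BOM-less
    UTF-16 encoding scheme."""
    if data[:2] == b"\xff\xfe":
        return data[2:].decode("utf-16-le", errors)
    if data[:2] == b"\xfe\xff":
        return data[2:].decode("utf-16-be", errors)
    return data.decode("utf-16-be", errors)   # no BOM: default to BE

assert decode_utf16_default_be(b"\xff\xfe" + "hi".encode("utf-16-le")) == "hi"
assert decode_utf16_default_be("hi".encode("utf-16-be")) == "hi"
```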

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Apr 07 2005)
>>> Python/Zope Consulting and Support ...http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/


::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! 


Re: [Python-Dev] Unicode byte order mark decoding

2005-04-07 Thread Walter Dörwald
Walter Dörwald wrote:

> Nicholas Bastin wrote:
>
> It should be feasible to implement your own codec for that
> based on Lib/encodings/utf_16.py. Simply replace the line
> in StreamReader.decode():
>   raise UnicodeError,"UTF-16 stream does not start with BOM"
> with:
>   self.decode = codecs.utf_16_be_decode
> and you should be done.

Oops, this only works if you have a big endian system.
Otherwise you have to redecode the input with:
   codecs.utf_16_ex_decode(input, errors, 1, False)
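For reference (my addition): `codecs.utf_16_ex_decode` takes the byte order as its third argument (1 for big-endian, -1 for little-endian, 0 to detect a BOM) and returns the decoded text, the number of bytes consumed, and the byte order it ended up using.

```python
import codecs

data = "hi".encode("utf-16-be")   # BOM-less big-endian input
text, consumed, byteorder = codecs.utf_16_ex_decode(data, "strict", 1, True)
assert text == "hi" and consumed == len(data)
```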

Bye,
   Walter Dörwald





Re: [Python-Dev] Unicode byte order mark decoding

2005-04-07 Thread Martin v. Löwis
Nicholas Bastin wrote:
> It would be nice if you could optionally specify that the codec would
> assume UTF-16BE if no BOM was present, and not raise UnicodeError in
> that case, which would preserve the current behaviour as well as allow
> users to ask for behaviour which conforms to the standard.

Alternatively, the UTF-16BE codec could support the BOM, and do
UTF-16LE if the "other" BOM is found.

This would also support your usecase, and in a better way. The
Unicode assertion that UTF-16 is BE by default is void these
days - there is *always* a higher layer protocol, and it more
often than not specifies (perhaps not in English words, but
only in the source code of the generator) that the default should
be LE.

Regards,
Martin


Re: [Python-Dev] Unicode byte order mark decoding

2005-04-07 Thread Walter Dörwald
Nicholas Bastin wrote:

> On Apr 7, 2005, at 11:35 AM, M.-A. Lemburg wrote:
>
> [...]
>> If you do have UTF-16 without a BOM mark it's much better
>> to let a short function analyze the text by reading the first
>> few bytes of the file and then make an educated guess based
>> on the findings. You can then process the file using one
>> of the other codecs UTF-16-LE or -BE.
>
> This is about what we do now - we catch UnicodeError and
> then add a BOM  to the file, and read it again.  We know
> our files are UTF-16BE if they  don't have a BOM, as the
> files are written by code which observes the  spec.
> We can't use UTF-16BE all the time, because sometimes
> they're UTF-16LE, and in those cases the BOM is set.
>
> It would be nice if you could optionally specify that the
> codec would assume UTF-16BE if no BOM was present,
> and not raise UnicodeError in  that case, which would
> preserve the current behaviour as well as allow users
> to ask for behaviour which conforms to the standard.

It should be feasible to implement your own codec for that
based on Lib/encodings/utf_16.py. Simply replace the line
in StreamReader.decode():
   raise UnicodeError,"UTF-16 stream does not start with BOM"
with:
   self.decode = codecs.utf_16_be_decode
and you should be done.

> [...]

Bye,
   Walter Dörwald





Re: [Python-Dev] Unicode byte order mark decoding

2005-04-07 Thread Nicholas Bastin
On Apr 7, 2005, at 11:35 AM, M.-A. Lemburg wrote:
> Ok, but I don't really follow you here: you are suggesting to
> relax the current UTF-16 behavior and to start defaulting to
> UTF-16-BE if no BOM is present - that's most likely going to
> cause more problems than it seems to solve: namely complete
> garbage if the data turns out to be UTF-16-LE encoded and,
> what's worse, enters the application undetected.
The crux of my argument is that the spec declares that UTF-16 without a 
BOM is BE.  If the file is encoded in UTF-16LE and it doesn't have a 
BOM, it doesn't deserve to be processed correctly.  That being said, 
treating it as UTF-16BE if it's LE will result in a lot of invalid code 
points, so it should be obvious that something has gone wrong.
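A small demonstration of that failure mode (my illustration, not from the thread): ASCII-range text decoded with the swapped byte order comes out as conspicuous CJK-range characters rather than readable text, so the mistake is hard to miss, even though most of the resulting code points are technically valid.

```python
le_bytes = "Hello".encode("utf-16-le")   # BOM-less little-endian data
garbled = le_bytes.decode("utf-16-be")   # wrongly decoded as big-endian

assert le_bytes == b"H\x00e\x00l\x00l\x00o\x00"
assert garbled == "\u4800\u6500\u6c00\u6c00\u6f00"   # unmistakable garbage
```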

> If you do have UTF-16 without a BOM mark it's much better
> to let a short function analyze the text by reading the first
> few bytes of the file and then make an educated guess based
> on the findings. You can then process the file using one
> of the other codecs UTF-16-LE or -BE.
This is about what we do now - we catch UnicodeError and then add a BOM 
to the file, and read it again.  We know our files are UTF-16BE if they 
don't have a BOM, as the files are written by code which observes the 
spec.  We can't use UTF-16BE all the time, because sometimes they're 
UTF-16LE, and in those cases the BOM is set.

It would be nice if you could optionally specify that the codec would 
assume UTF-16BE if no BOM was present, and not raise UnicodeError in 
that case, which would preserve the current behaviour as well as allow 
users to ask for behaviour which conforms to the standard.

I'm not saying that you can't work around the issue now, what I'm 
saying is that you shouldn't *have* to - I think there is a reasonable 
expectation that the UTF-16 codec conforms to the spec, and if you 
wanted it to do something else, it is those users who should be forced 
to come up with a workaround.

--
Nick


Re: [Python-Dev] Unicode byte order mark decoding

2005-04-07 Thread M.-A. Lemburg
Nicholas Bastin wrote:
> 
> On Apr 7, 2005, at 5:07 AM, M.-A. Lemburg wrote:
> 
>>> The current implementation of the utf-16 codecs makes for some
>>> irritating gymnastics to write the BOM into the file before reading it
>>> if it contains no BOM, which seems quite like a bug in the codec.
>>
>>
>> The codec writes a BOM in the first call to .write() - it
>> doesn't write a BOM before reading from the file.
> 
> 
> Yes, see, I read a *lot* of UTF-16 that comes from other sources.  It's
> not a matter of writing with python and reading with python.

Ok, but I don't really follow you here: you are suggesting to
relax the current UTF-16 behavior and to start defaulting to
UTF-16-BE if no BOM is present - that's most likely going to
cause more problems than it seems to solve: namely complete
garbage if the data turns out to be UTF-16-LE encoded and,
what's worse, enters the application undetected.

If you do have UTF-16 without a BOM mark it's much better
to let a short function analyze the text by reading the first
few bytes of the file and then make an educated guess based
on the findings. You can then process the file using one
of the other codecs UTF-16-LE or -BE.
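Such a sniffing helper might look like this (a heuristic sketch of my own, not code from the thread). It leans on the fact that ASCII-heavy UTF-16 text places its NUL bytes at even offsets when big-endian and at odd offsets when little-endian:

```python
def guess_utf16_codec(data):
    """Guess the codec for UTF-16 data, preferring an explicit BOM and
    falling back to a NUL-byte-position heuristic for BOM-less input."""
    if data[:2] == b"\xfe\xff":
        return "utf-16-be"
    if data[:2] == b"\xff\xfe":
        return "utf-16-le"
    sample = data[:1024]
    even_nuls = sample[0::2].count(0)   # NULs at even offsets suggest BE
    odd_nuls = sample[1::2].count(0)    # NULs at odd offsets suggest LE
    # On a tie, fall back to big-endian, the standard's default.
    return "utf-16-be" if even_nuls >= odd_nuls else "utf-16-le"

assert guess_utf16_codec("spam".encode("utf-16-le")) == "utf-16-le"
assert guess_utf16_codec("spam".encode("utf-16-be")) == "utf-16-be"
```

The guess can then be passed straight to the matching explicit-endian codec for the actual decode.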

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Apr 07 2005)
>>> Python/Zope Consulting and Support ...http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/


::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! 


Re: [Python-Dev] Unicode byte order mark decoding

2005-04-07 Thread Nicholas Bastin
On Apr 7, 2005, at 5:07 AM, M.-A. Lemburg wrote:
>> The current implementation of the utf-16 codecs makes for some
>> irritating gymnastics to write the BOM into the file before reading it
>> if it contains no BOM, which seems quite like a bug in the codec.
>
> The codec writes a BOM in the first call to .write() - it
> doesn't write a BOM before reading from the file.
Yes, see, I read a *lot* of UTF-16 that comes from other sources.  It's 
not a matter of writing with python and reading with python.

--
Nick


Re: [Python-Dev] Unicode byte order mark decoding

2005-04-07 Thread M.-A. Lemburg
Nicholas Bastin wrote:
> 
> On Apr 5, 2005, at 6:19 AM, M.-A. Lemburg wrote:
> 
>> Note that the UTF-16 codec is strict w/r to the presence
>> of the BOM mark: you get a UnicodeError if a stream does
>> not start with a BOM mark. For the UTF-8-SIG codec, this
>> should probably be relaxed to not require the BOM.
> 
> 
> I've actually been confused about this point for quite some time now,
> but never had a chance to bring it up.  I do not understand why
> UnicodeError should be raised if there is no BOM.  I know that PEP-100
> says:
> 
> 'utf-16': 16-bit variable length encoding (little/big endian)
> 
> and:
> 
> Note: 'utf-16' should be implemented by using and requiring byte order
> marks (BOM) for file input/output.
> 
> But this appears to be in error, at least in the current unicode
> standard.  'utf-16', as defined by the unicode standard, is big-endian
> in the absence of a BOM:
> 
> ---
> 3.10.D42:  UTF-16 encoding scheme:
> ...
> * The UTF-16 encoding scheme may or may not begin with a BOM.  However,
> when there is no BOM, and in the absence of a higher-level protocol, the
> byte order of the UTF-16 encoding scheme is big-endian.
> ---

The problem is "in the absence of a higher level protocol": the
codec doesn't know anything about a protocol - it's the application
using the codec that knows which protocol gets used. It's a lot
safer to require the BOM for UTF-16 streams and raise an exception
to have the application decide whether to use UTF-16-BE or the
by far more common UTF-16-LE.

Unlike for the UTF-8 codec, the BOM for UTF-16 is a configuration
parameter, not merely a signature.

In terms of history, I don't recall whether your quote was
already in the standard at the time I wrote the PEP. You are the
first to have reported a problem with the current implementation
(which has been around since 2000), so I believe that application
writers are more comfortable with the way the UTF-16 codec
is currently implemented. Explicit is better than implicit :-)

> The current implementation of the utf-16 codecs makes for some
> irritating gymnastics to write the BOM into the file before reading it
> if it contains no BOM, which seems quite like a bug in the codec. 

The codec writes a BOM in the first call to .write() - it
doesn't write a BOM before reading from the file.

> I allow for the possibility that this was ambiguous in the standard when
> the PEP was written, but it is certainly not ambiguous now.

See above.

Thanks,
-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Apr 07 2005)
>>> Python/Zope Consulting and Support ...http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/


::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! 


Re: [Python-Dev] Unicode byte order mark decoding

2005-04-06 Thread Stephen J. Turnbull
> "Walter" == Walter Dörwald <[EMAIL PROTECTED]> writes:

Walter> Not really. In every encoding where a sequence of more
Walter> than one byte maps to one Unicode character, you will
Walter> always need some kind of buffering. If we remove the
Walter> handling of initial BOMs from the codecs (except for
Walter> UTF-16 where it is required), this wouldn't change any
Walter> buffering requirements.

Sure.  My point is that codecs should be stateful only to the extent
needed to assemble semantically meaningful units (ie, multioctet coded
characters).  In particular, they should not need to know about
location at the beginning, middle, or end of some stream---because in
the context of operating on a string they _can't_.

>> I don't know whether that's really feasible in the short
>> run---I suspect there may be a lot of stream-like modules that
   >> would need to be updated---but it would be saner in the long
>> run.

Walter> I'm not exactly sure, what you're proposing here. That all
Walter> codecs (even UTF-16) pass the BOM through and some other
Walter> infrastructure is responsible for dropping it?

Not exactly.  I think that at the lowest level codecs should not
implement complex mode-switching internally, but rather explicitly
abdicate responsibility to a more appropriate codec.

For example, autodetecting UTF-16 on input would be implemented by a
Python program that does something like

    data = stream.read()
    for detector in ["utf-16-signature", "utf-16-statistical"]:
        # for the UTF-16 detectors, OUT will always be u"" or None
        out, data, codec = data.decode(detector)
        if codec: break
    while codec:
        more_out, data, codec = data.decode(codec)
        out = out + more_out
    if data:
        # a real program would complain about it
        pass
    process(out)

where decode("utf-16-signature") would be implemented as

    def utf_16_signature_internal(data):
        if data[0:2] == "\xfe\xff":
            return (u"", data[2:], "utf-16-be")
        elif data[0:2] == "\xff\xfe":
            return (u"", data[2:], "utf-16-le")
        else:
            # note: data is undisturbed if the detector fails
            return (None, data, None)

The main point is that the detector is just a codec that stops when it
figures out what the next codec should be, touches only data that
would be incorrect to pass to the next codec, and leaves the data
alone if detection fails.  utf-16-signature only handles the BOM (if
present), and does not handle arbitrary "chunks" of data.  Instead, it
passes on the rest of the data (including the first chunk) to be
handled by the appropriate utf-16-?e codec.

I think that the temptation to encapsulate this logic in a utf-16
codec that "simplifies" things by calling the appropriate utf-16-?e
codec itself should be deprecated, but YMMV.  What I would really like
is for the above style to be easier to achieve than it currently is.

BTW, I appreciate your patience in exploring this; after Martin's
remark about different mental models I have to suspect this approach
is just somehow un-Pythonic, but fleshing it out this way I can see
how it will be useful in the context of a different project.

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba, Tennodai 1-1-1, Tsukuba 305-8573 JAPAN
   Ask not how you can "do" free software business;
  ask what your business can "do for" free software.


Re: [Python-Dev] Unicode byte order mark decoding

2005-04-06 Thread Nicholas Bastin
On Apr 5, 2005, at 6:19 AM, M.-A. Lemburg wrote:
> Note that the UTF-16 codec is strict w/r to the presence
> of the BOM mark: you get a UnicodeError if a stream does
> not start with a BOM mark. For the UTF-8-SIG codec, this
> should probably be relaxed to not require the BOM.
I've actually been confused about this point for quite some time now, 
but never had a chance to bring it up.  I do not understand why 
UnicodeError should be raised if there is no BOM.  I know that PEP-100 
says:

'utf-16': 16-bit variable length encoding (little/big endian)

and:
Note: 'utf-16' should be implemented by using and requiring byte order 
marks (BOM) for file input/output.

But this appears to be in error, at least in the current unicode 
standard.  'utf-16', as defined by the unicode standard, is big-endian 
in the absence of a BOM:

---
3.10.D42:  UTF-16 encoding scheme:
...
* The UTF-16 encoding scheme may or may not begin with a BOM.  However, 
when there is no BOM, and in the absence of a higher-level protocol, 
the byte order of the UTF-16 encoding scheme is big-endian.
---

The current implementation of the utf-16 codecs makes for some 
irritating gymnastics to write the BOM into the file before reading it 
if it contains no BOM, which seems quite like a bug in the codec.  I 
allow for the possibility that this was ambiguous in the standard when 
the PEP was written, but it is certainly not ambiguous now.

--
Nick


Re: [Python-Dev] Unicode byte order mark decoding

2005-04-06 Thread Martin v. Löwis
Stephen J. Turnbull wrote:
> Because the signature/BOM is not a chunk, it's a header.  Handling the
> signature/BOM is part of stream initialization, not translation, to my
> mind.
I'm sorry, but I'm losing track as to what precisely you are trying to
say. You seem to be using a mental model that is entirely different
from mine.
> The point is that explicitly using a stream shows that initialization
> (and finalization) matter.  The default can be BOM or not, as a
> pragmatic matter.  But then the stream data itself can be treated
> homogeneously, as implied by the notion of stream.
But what follows from that point? So it shows some kind of matter...
what does that mean for actual changes to Python API?
> I think it probably also would solve Walter's conundrum about
> buffering the signature/BOM if responsibility for that were moved out
> of the codecs and into the objects where signatures make sense.
> I don't know whether that's really feasible in the short run---I
> suspect there may be a lot of stream-like modules that would need to
> be updated---but it would be saner in the long run.
What is "that" which might be really feasible? To "solve Walter's
conundrum"? That "signatures make sense"?
So I can't really respond to your message in a meaningful way;
I just let it rest...
Regards,
Martin


Re: [Python-Dev] Unicode byte order mark decoding

2005-04-06 Thread Walter Dörwald
Stephen J. Turnbull wrote:
"Martin" == Martin v Löwis <[EMAIL PROTECTED]> writes:
Martin> I can't put these two paragraphs together. If you think
Martin> that explicit is better than implicit, why do you not want
Martin> to make different calls for the first chunk of a stream,
Martin> and the subsequent chunks?
Because the signature/BOM is not a chunk, it's a header.  Handling the
signature/BOM is part of stream initialization, not translation, to my
mind.
The point is that explicitly using a stream shows that initialization
(and finalization) matter.  The default can be BOM or not, as a
pragmatic matter.  But then the stream data itself can be treated
homogeneously, as implied by the notion of stream.
I think it probably also would solve Walter's conundrum about
buffering the signature/BOM if responsibility for that were moved out
of the codecs and into the objects where signatures make sense.
Not really. In every encoding where a sequence of more than one byte 
maps to one Unicode character, you will always need some kind of 
buffering. If we remove the handling of initial BOMs from the codecs 
(except for UTF-16 where it is required), this wouldn't change any 
buffering requirements.

> I don't know whether that's really feasible in the short run---I
> suspect there may be a lot of stream-like modules that would need to
> be updated---but it would be saner in the long run.
I'm not exactly sure, what you're proposing here. That all codecs (even 
UTF-16) pass the BOM through and some other infrastructure is 
responsible for dropping it?

[...]
Bye,
   Walter Dörwald


Re: [Python-Dev] Unicode byte order mark decoding

2005-04-06 Thread Stephen J. Turnbull
> "Martin" == Martin v Löwis <[EMAIL PROTECTED]> writes:

Martin> I can't put these two paragraphs together. If you think
Martin> that explicit is better than implicit, why do you not want
Martin> to make different calls for the first chunk of a stream,
Martin> and the subsequent chunks?

Because the signature/BOM is not a chunk, it's a header.  Handling the
signature/BOM is part of stream initialization, not translation, to my
mind.

The point is that explicitly using a stream shows that initialization
(and finalization) matter.  The default can be BOM or not, as a
pragmatic matter.  But then the stream data itself can be treated
homogeneously, as implied by the notion of stream.

I think it probably also would solve Walter's conundrum about
buffering the signature/BOM if responsibility for that were moved out
of the codecs and into the objects where signatures make sense.

I don't know whether that's really feasible in the short run---I
suspect there may be a lot of stream-like modules that would need to
be updated---but it would be saner in the long run.

>> Yes!  Exactly (except in reverse, we want to _read_ from the
>> slurped stream-as-string, not write to one)!  ... and there's
>> no need for a utf-8-sig codec for strings, since you can
>> support the usage in exactly this way.

Martin> However, if there is an utf-8-sig codec for streams, there
Martin> is currently no way of *preventing* this codec to also be
Martin> available for strings. The very same code is used for
Martin> streams and for strings, and automatically so.

And of course it should be.  But if it's not possible to move the -sig
facility out of the codecs into the streams, that would be a shame.  I
think we should encourage people to use streams where initialization or
finalization semantics are non-trivial, as they are with signatures.

But as long as both utf-8-we-dont-need-no-steenkin-sigs-in-strings and
utf-8-sig are available, I can program as I want to (and refer those
whose strings get cratered by stray BOMs to you).

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba, Tennodai 1-1-1, Tsukuba 305-8573 JAPAN
   Ask not how you can "do" free software business;
  ask what your business can "do for" free software.


Re: [Python-Dev] Unicode byte order mark decoding

2005-04-06 Thread Walter Dörwald
Martin v. Löwis wrote:
> Walter Dörwald wrote:
>> There are situations where the byte stream might be temporarily
>> exhausted, e.g. an XML parser that tries to support the
>> IncrementalParser interface, or when you want to decode
>> encoded data piecewise, because you want to give a progress
>> report.
>
> Yes, but these are not file-like objects.

True, on the outside there are no file-like objects. But the
IncrementalParser gets passed the XML bytes in chunks,
so it has to use a stateful decoder for decoding. Unfortunately
this means that it has to use a stream API. (See
http://www.python.org/sf/1101097 for a patch that somewhat
fixes that.)

(Another option would be to completely ignore the stateful API
and handcraft stateful decoding (or only support stateless
decoding), like most XML parsers for Python do now.)

> In the IncrementalParser,
> it is *not* the case that a read operation returns an empty
> string. Instead, the application repeatedly feeds data explicitly.

That's true, but the parser has to wrap this data into an object
that can be passed to the StreamReader constructor. (See the
Queue class in Lib/test/test_codecs.py for an example.)

> For a file-like object, returning "" indicates EOF.

Not necessarily. In the example above the IncrementalParser
gets fed a chunk of data, it stuffs this data into the Queue,
so that the StreamReader can decode it. Once the data
from the Queue is exhausted, there won't be any further
data until the user calls feed() on the IncrementalParser again.
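The Queue-plus-StreamReader arrangement described above can be sketched like this (a minimal stand-in, modeled loosely on the Queue class in Lib/test/test_codecs.py, in today's Python 3 spelling):

```python
import codecs

class Queue:
    """Minimal FIFO byte buffer with a file-like read/write interface."""
    def __init__(self):
        self._buf = b""
    def write(self, data):
        self._buf += data
    def read(self, size=-1):
        if size < 0:
            data, self._buf = self._buf, b""
        else:
            data, self._buf = self._buf[:size], self._buf[size:]
        return data

q = Queue()
reader = codecs.getreader("utf-8")(q)
q.write(b"gr\xc3")             # chunk ends mid-way through a UTF-8 sequence
assert reader.read() == "gr"   # the incomplete trailing byte stays buffered
q.write(b"\xbc\xc3\x9fe")      # the rest of the data is fed in later
assert reader.read() == "üße"  # ...and decodes together with the buffered byte
```

An empty `read()` result here just means "no more data fed yet", not EOF, which is exactly the mismatch with file-like semantics being discussed.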

Bye,
   Walter Dörwald





Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Martin v. Löwis
Stephen J. Turnbull wrote:
> Of course it must be supported.  My point is that many strings (in my
> applications, all but those strings that result from slurping in a
> file or process output in one go -- example, not a statistically valid
> sample!) are not the beginning of "what once was a stream".  It is
> error-prone (not to mention unaesthetic) to not make that distinction.
> "Explicit is better than implicit."
I can't put these two paragraphs together. If you think that explicit
is better than implicit, why do you not want to make different calls
for the first chunk of a stream, and the subsequent chunks?
>  >>> s=cStringIO.StringIO()
>  >>> s1=codecs.getwriter("utf-8")(s)
>  >>> s1.write(u"Hallo")
>  >>> s.getvalue()
> 'Hallo'
> Yes!  Exactly (except in reverse, we want to _read_ from the slurped
> stream-as-string, not write to one)!  ... and there's no need for a
> utf-8-sig codec for strings, since you can support the usage in
> exactly this way.
However, if there is an utf-8-sig codec for streams, there is currently
no way of *preventing* this codec to also be available for strings. The
very same code is used for streams and for strings, and automatically
so.
Regards,
Martin


Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Stephen J. Turnbull
> "Martin" == Martin v Löwis <[EMAIL PROTECTED]> writes:

Martin> So people do use the "decode-it-all" mode, where no
Martin> sequential access is necessary - yet the beginning of the
Martin> string is still the beginning of what once was a
Martin> stream. This case must be supported.

Of course it must be supported.  My point is that many strings (in my
applications, all but those strings that result from slurping in a
file or process output in one go -- example, not a statistically valid
sample!) are not the beginning of "what once was a stream".  It is
error-prone (not to mention unaesthetic) to not make that distinction.

"Explicit is better than implicit."

Martin> Whether or not to use the codec would be the application's
Martin> choice.

>> What I think should be provided is a stateful object
>> encapsulating the codec.  Ie, to avoid the need to write

>> out = chunk[0].encode("utf-8-sig") + chunk[1].encode("utf-8")

Martin> No. People who want streaming should use cStringIO, i.e.

 >>> s=cStringIO.StringIO()
 >>> s1=codecs.getwriter("utf-8")(s)
 >>> s1.write(u"Hallo")
 >>> s.getvalue()
'Hallo'

Yes!  Exactly (except in reverse, we want to _read_ from the slurped
stream-as-string, not write to one)!  ... and there's no need for a
utf-8-sig codec for strings, since you can support the usage in
exactly this way.
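In today's Python the slurped-string direction works just as Stephen describes, with io.BytesIO standing in for cStringIO (utf-8-sig is the codec this thread eventually produced, so this is a retrospective sketch, not code from the thread):

```python
import codecs
import io

# A byte string slurped in one go, signature included.
raw = b"\xef\xbb\xbfHallo"

# Wrap the string in a file-like object and read it through a
# decoding StreamReader, exactly as with a real stream.
reader = codecs.getreader("utf-8-sig")(io.BytesIO(raw))
text = reader.read()     # the signature is consumed, leaving "Hallo"
```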

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba  Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
   Ask not how you can "do" free software business;
  ask what your business can "do for" free software.


Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread "Martin v. Löwis"
Walter Dörwald wrote:
> There are situations where the byte stream might be temporarily
> exhausted, e.g. an XML parser that tries to support the
> IncrementalParser interface, or when you want to decode
> encoded data piecewise, because you want to give a progress
> report.
Yes, but these are not file-like objects. In the IncrementalParser,
it is *not* the case that a read operation returns an empty
string. Instead, the application repeatedly feeds data explicitly.
For a file-like object, returning "" indicates EOF.
Regards,
Martin


Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Walter Dörwald
Evan Jones sagte:
> On Apr 5, 2005, at 15:33, Walter Dörwald wrote:
>> The stateful decoder has a little problem: At least three bytes
>> have to be available from the stream until the StreamReader
>> decides whether these bytes are a BOM that has to be skipped.
>> This means that if the file only contains "ab", the user will
>> never see these two characters.
>
> Shouldn't the decoder be capable of doing a partial match and quitting  
> early? After all, "ab" is encoded in UTF8 as <61> <62> but the BOM is
> <ef> <bb> <bf>. If it did this type of partial matching, this issue
> would be avoided except in rare situations.
>
>> A solution for this would be to add an argument named final to
>> the decode and read methods that tells the decoder that the
>> stream has ended and the remaining buffered bytes have to be
>> handled now.
>
> This functionality is provided by a flush() method on similar objects,  such 
> as the zlib compression objects.

Theoretically the name is unimportant, but read(..., final=True) or flush()
or close() should subject the pending bytes to normal error handling and
must return the result of decoding these pending bytes just like the
other methods do. This would mean that we would have to implement
a decodeclose(), a readclose() and a readlineclose(). IMHO it would be
best to add this argument to decode, read and readline directly. But I'm
not sure, what this would mean for iterating through a StreamReader.
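The `final` argument Walter proposes is essentially what later landed in Python's incremental codec API as `IncrementalDecoder.decode(input, final=False)`. A sketch of that behaviour in modern Python:

```python
import codecs

dec = codecs.getincrementaldecoder("utf-8")()

# An incomplete multi-byte sequence is buffered, not an error...
pending = dec.decode(b"\xc3")      # returns "" and keeps the byte

# ...until the caller declares the stream finished, at which point the
# leftover byte is subjected to normal error handling.
try:
    dec.decode(b"", final=True)
    truncated = False
except UnicodeDecodeError:
    truncated = True
```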

Bye,
Walter Dörwald





Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Walter Dörwald
Martin v. Löwis sagte:
> Walter Dörwald wrote:
>> The stateful decoder has a little problem: At least three bytes
>> have to be available from the stream until the StreamReader
>> decides whether these bytes are a BOM that has to be skipped.
>> This means that if the file only contains "ab", the user will
>> never see these two characters.
>
> This can be improved, of course: If the first byte is "a",
> it most definitely is *not* an UTF-8 signature.
>
> So we only need a second byte for the characters between U+F000
> and U+FFFF, and a third byte only for the characters
> U+FEC0...U+FEFF. But with the first byte being \xef, we need
> three bytes *anyway*, so we can always decide with the first
> byte only whether we need to wait for three bytes.

OK, I've updated the patch so that the first bytes will only be kept
in the buffer if they are a prefix of the BOM.
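This buffering strategy is exactly what today's utf-8-sig incremental decoder implements: bytes are held back only while they are still a prefix of the BOM (shown here with the modern API as a retrospective illustration):

```python
import codecs

# A first byte that cannot begin the signature is emitted immediately.
d1 = codecs.getincrementaldecoder("utf-8-sig")()
immediate = d1.decode(b"ab")       # "ab" comes through at once

# A BOM prefix is buffered until enough bytes arrive to decide.
d2 = codecs.getincrementaldecoder("utf-8-sig")()
held = d2.decode(b"\xef")          # still ambiguous: could be a BOM
rest = d2.decode(b"\xbb\xbfab")    # full signature found and skipped
```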

>> A solution for this would be to add an argument named final to
>> the decode and read methods that tells the decoder that the
>> stream has ended and the remaining buffered bytes have to be
>> handled now.
>
> Shouldn't an empty read from the underlying stream be taken
> as an EOF?

There are situations where the byte stream might be temporarily
exhausted, e.g. an XML parser that tries to support the
IncrementalParser interface, or when you want to decode
encoded data piecewise, because you want to give a progress
report.

Bye,
   Walter Dörwald





Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Martin v. Löwis
Walter Dörwald wrote:
> The stateful decoder has a little problem: At least three bytes
> have to be available from the stream until the StreamReader
> decides whether these bytes are a BOM that has to be skipped.
> This means that if the file only contains "ab", the user will
> never see these two characters.
This can be improved, of course: If the first byte is "a", it most
definitely is *not* an UTF-8 signature.
So we only need a second byte for the characters between U+F000
and U+FFFF, and a third byte only for the characters
U+FEC0...U+FEFF. But with the first byte being \xef, we need
three bytes *anyway*, so we can always decide with the first
byte only whether we need to wait for three bytes.
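Martin's byte arithmetic can be checked directly: 0xEF is the lead byte for exactly the range U+F000..U+FFFF, and the prefix EF BB narrows it to U+FEC0..U+FEFF (a quick verification sketch in modern Python):

```python
# The lead byte of a three-byte UTF-8 sequence is 1110xxxx; for 0xEF
# the four x bits are all ones, covering U+F000 through U+FFFF.
assert "\uf000".encode("utf-8")[0] == 0xEF
assert "\uffff".encode("utf-8")[0] == 0xEF
assert "\uefff".encode("utf-8")[0] == 0xEE   # just below the range

# The two-byte prefix EF BB pins the code point to U+FEC0..U+FEFF,
# and the full signature U+FEFF encodes as EF BB BF.
assert "\ufec0".encode("utf-8")[:2] == b"\xef\xbb"
assert "\ufeff".encode("utf-8") == b"\xef\xbb\xbf"
```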
A solution for this would be to add an argument named final to
the decode and read methods that tells the decoder that the
stream has ended and the remaining buffered bytes have to be
handled now.
Shouldn't an empty read from the underlying stream be taken
as an EOF?
Regards,
Martin


Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Fred Drake
On Tuesday 05 April 2005 15:53, Evan Jones wrote:
 > This functionality is provided by a flush() method on similar objects,
 > such as the zlib compression objects.

Or by close() on other objects (htmllib, HTMLParser, the SAX incremental 
parser, etc.).

Too bad there's more than one way to do it.  :-(


  -Fred

-- 
Fred L. Drake, Jr.


Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Evan Jones
On Apr 5, 2005, at 15:33, Walter Dörwald wrote:
> The stateful decoder has a little problem: At least three bytes
> have to be available from the stream until the StreamReader
> decides whether these bytes are a BOM that has to be skipped.
> This means that if the file only contains "ab", the user will
> never see these two characters.
Shouldn't the decoder be capable of doing a partial match and quitting 
early? After all, "ab" is encoded in UTF8 as <61> <62> but the BOM is
<ef> <bb> <bf>. If it did this type of partial matching, this issue
would be avoided except in rare situations.

> A solution for this would be to add an argument named final to
> the decode and read methods that tells the decoder that the
> stream has ended and the remaining buffered bytes have to be
> handled now.
This functionality is provided by a flush() method on similar objects, 
such as the zlib compression objects.

Evan Jones


Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Walter Dörwald
Walter Dörwald sagte:

> M.-A. Lemburg wrote:
>
>>> [...]
>>>With the UTF-8-SIG codec, it would apply to all operation
>>> modes of the codec, whether stream-based or from strings. Whether
>>>or not to use the codec would be the application's choice.
>>
>> I'd suggest to use the same mode of operation as we have in
>> the UTF-16 codec: it removes the BOM mark on the first call
>> to the StreamReader .decode() method and writes a BOM mark
>> on the first call to .encode() on a StreamWriter.
>>
>> Note that the UTF-16 codec is strict w/r to the presence
>> of the BOM mark: you get a UnicodeError if a stream does
>> not start with a BOM mark. For the UTF-8-SIG codec, this
>> should probably be relaxed to not require the BOM.
>
> I've started writing such a codec. Making the BOM optional
> on decoding definitely simplifies the implementation.

OK, here is the patch: http://www.python.org/sf/1177307

The stateful decoder has a little problem: At least three bytes
have to be available from the stream until the StreamReader
decides whether these bytes are a BOM that has to be skipped.
This means that if the file only contains "ab", the user will
never see these two characters.

A solution for this would be to add an argument named final to
the decode and read methods that tells the decoder that the
stream has ended and the remaining buffered bytes have to be
handled now.

Bye,
   Walter Dörwald





Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Martin v. Löwis
Stephen J. Turnbull wrote:
> Martin> With the UTF-8-SIG codec, it would apply to all operation
> Martin> modes of the codec, whether stream-based or from strings.
> I had in mind the ability to treat a string as a stream.
Hmm. A string is not a stream, but it could be the contents of a stream.
A typical application of codecs goes like this:
data = stream.read()
[analyze data, e.g. by checking whether there is encoding= in 
So people do use the "decode-it-all" mode, where no sequential access
is necessary - yet the beginning of the string is still the beginning of
what once was a stream. This case must be supported.
> Martin> Whether or not to use the codec would be the application's
> Martin> choice.
> What I think should be provided is a stateful object encapsulating the
> codec.  Ie, to avoid the need to write
> out = chunk[0].encode("utf-8-sig") + chunk[1].encode("utf-8")
No. People who want streaming should use cStringIO, i.e.
>>> s=cStringIO.StringIO()
>>> s1=codecs.getwriter("utf-8")(s)
>>> s1.write(u"Hallo")
>>> s.getvalue()
'Hallo'
Regards,
Martin


Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Stephen J. Turnbull
>>"MAL" == "M.-A. Lemburg" <[EMAIL PROTECTED]> writes:

MAL> Stephen J. Turnbull wrote:

>> The Japanese "memopado" (Notepad) uses UTF-8 signatures; it
>> even adds them to existing UTF-8 files lacking them.

MAL> Is that a MS application ? AFAIK, notepad, wordpad and MS
MAL> Office always use UTF-16-LE + BOM when saving text as "Unicode
MAL> text".

Yes, it is an MS application.  I'll have to borrow somebody's box to
check, but IIRC UTF-8 is the native "text" encoding for Japanese now.
(Japanized applications generally behave differently from everything
else, as there are so many "standards" for encoding Japanese.)

MAL> The UTF-16 stream codecs implement this logic.

MAL> The UTF-16 encode and decode functions will however always
MAL> strip the BOM mark from the beginning of a string.

MAL> If the application doesn't want this stripping to happen, it
MAL> should use the UTF-16-LE or -BE codec resp.

That sounds like it would work fine almost all the time.  If it
doesn't it's straightforward to work around, and certainly would be
more convenient for the non-standards-geek programmer.


-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba  Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
   Ask not how you can "do" free software business;
  ask what your business can "do for" free software.


Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Stephen J. Turnbull
> "Martin" == Martin v Löwis <[EMAIL PROTECTED]> writes:

Martin> Stephen J. Turnbull wrote:

>> However, this option should be part of the initialization of an
>> IO stream which produces Unicodes, _not_ an operation on
>> arbitrary internal strings (whether raw or Unicode).

Martin> With the UTF-8-SIG codec, it would apply to all operation
Martin> modes of the codec, whether stream-based or from strings.

I had in mind the ability to treat a string as a stream.

Martin> Whether or not to use the codec would be the application's
Martin> choice.

What I think should be provided is a stateful object encapsulating the
codec.  Ie, to avoid the need to write

out = chunk[0].encode("utf-8-sig") + chunk[1].encode("utf-8")



-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba  Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
   Ask not how you can "do" free software business;
  ask what your business can "do for" free software.


Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread M.-A. Lemburg
Stephen J. Turnbull wrote:
>>"MAL" == "M.-A. Lemburg" <[EMAIL PROTECTED]> writes:
> 
> 
> MAL> The BOM (byte order mark) was a non-standard Microsoft
> MAL> invention to detect Unicode text data as such (MS always uses
> MAL> UTF-16-LE for Unicode text files).
> 
> The Japanese "memopado" (Notepad) uses UTF-8 signatures; it even adds
> them to existing UTF-8 files lacking them.

Is that a MS application ? AFAIK, notepad, wordpad and MS Office
always use UTF-16-LE + BOM when saving text as "Unicode text".

> MAL> -1; there's no standard for UTF-8 BOMs - adding it to the
> MAL> codecs module was probably a mistake to begin with. You
> MAL> usually only get UTF-8 files with BOM marks as the result of
> MAL> recoding UTF-16 files into UTF-8.
> 
> There is a standard for UTF-8 _signatures_, however.  I don't have the
> most recent version of the ISO-10646 standard, but Amendment 2 (which
> defined UTF-8 for ISO-10646) specifically added the UTF-8 signature to
> Annex F of that standard.  Evan quotes Version 4 of the Unicode
> standard, which explicitly defines the UTF-8 signature.

Ok, as signature the BOM does make some sense - whether to
strip signatures from a document is a good idea or not
is a different matter, though.

Here's the Unicode Cons. FAQ on the subject:

http://www.unicode.org/faq/utf_bom.html#22

They also explicitly warn about adding BOMs to UTF-8 data
since it can break applications and protocols that do not
expect such a signature.
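The hazard the FAQ warns about is easy to reproduce (shown with the modern codec names; utf-8-sig is the codec this thread eventually produced): a signature written by one side becomes a stray U+FEFF for any consumer that decodes with plain UTF-8.

```python
import codecs

data = "Hallo".encode("utf-8-sig")          # the writer adds the signature
assert data.startswith(codecs.BOM_UTF8)     # data begins with EF BB BF

# A consumer that is not expecting a signature sees a stray character:
naive = data.decode("utf-8")                # keeps a leading U+FEFF
aware = data.decode("utf-8-sig")            # strips the signature
```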

> So there is a standard for the UTF-8 signature, and I know of
> applications which produce it.  While I agree with you that Python's
> codecs shouldn't produce it (by default), providing an option to strip
> is a good idea.
> 
> However, this option should be part of the initialization of an IO
> stream which produces Unicodes, _not_ an operation on arbitrary
> internal strings (whether raw or Unicode).

Right.

> MAL> BTW, how do you know that s came from the start of a file and
> MAL> not from slicing some already loaded file somewhere in the
> MAL> middle ?
> 
> The programmer or the application might, but Python's codecs don't.
> The point is that this is also true of rawstrings that happen to
> contain UTF-16 or UTF-32 data.  The UTF-16 ("auto-endian") codec
> shouldn't strip leading BOMs either, unless it has been told it has
> the beginning of the string.

The UTF-16 stream codecs implement this logic.

The UTF-16 encode and decode functions will however always strip
the BOM mark from the beginning of a string.

If the application doesn't want this stripping to happen,
it should use the UTF-16-LE or -BE codec resp.
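The distinction MAL draws can be checked directly (modern spelling; the behaviour is unchanged since this thread):

```python
# The generic "utf-16" decoder consumes a leading BOM to learn the byte
# order; the endian-specific decoders treat the same bytes as data.
le_bytes = b"\xff\xfeH\x00"                 # LE BOM followed by "H"

stripped = le_bytes.decode("utf-16")        # BOM consumed, yields "H"
kept = le_bytes.decode("utf-16-le")         # BOM kept as U+FEFF
```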

> MAL> Evan Jones wrote:
> 
> >> This is *not* a valid Unicode character. The Unicode
> >> specification (version 4, section 15.8) says the following
> >> about non-characters:
> >> 
> >>> Applications are free to use any of these noncharacter code
> >>> points internally but should never attempt to exchange
> >>> them. If a noncharacter is received in open interchange, an
> >>> application is not required to interpret it in any way. It is
> >>> good practice, however, to recognize it as a noncharacter and
> >>> to take appropriate action, such as removing it from the
> >>> text. Note that Unicode conformance freely allows the removal
> >>> of these characters. (See C10 in Section 3.2, Conformance
> >>> Requirements.)
> >> 
> >> My interpretation of the specification means that Python should
> 
> The specification _permits_ silent removal; it does not recommend.
> 
> >> silently remove the character, resulting in a zero length
> >> Unicode string.  Similarly, both of the following lines should
> >> also result in a zero length Unicode string:
> 
> >>  >>> '\xff\xfe\xfe\xff'.decode( "utf16" )
> >> u'\ufffe'
> >>  >>> '\xff\xfe\xff\xff'.decode( "utf16" )
> >> u'\uffff'
> 
> I strongly disagree; these decisions should be left to a higher layer.
> In the case of specified UTFs, the codecs should simply invert the UTF
> to Python's internal encoding.
> 
> MAL> Hmm, wouldn't it be better to raise an error ? After all, a
> MAL> reversed BOM mark in the stream looks a lot like you're
> MAL> trying to decode a UTF-16 stream assuming the wrong byte
> MAL> order ?!
> 
> +1 on (optionally) raising an error. 

The advantage of raising an error is that the application
can deal with the situation in whatever way seems fit (by
registering a special error handler or by simply using
"ignore" or "replace").

I agree that much of this lies outside the scope of codecs
and should be handled at an application or protocol level.

> -1 on removing it or anything
> like that, unless under control of the application (ie, the program
> written in Python, not Python itself).  It's far too easy for software
> to generate broken Unicode streams[1], and the choice of how to deal
> with those should be with the application, not with the implementation
> language.

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Walter Dörwald
M.-A. Lemburg wrote:
>> [...]
>> With the UTF-8-SIG codec, it would apply to all operation modes of
>> the codec, whether stream-based or from strings. Whether or not to
>> use the codec would be the application's choice.
> I'd suggest to use the same mode of operation as we have in
> the UTF-16 codec: it removes the BOM mark on the first call
> to the StreamReader .decode() method and writes a BOM mark
> on the first call to .encode() on a StreamWriter.
> Note that the UTF-16 codec is strict w/r to the presence
> of the BOM mark: you get a UnicodeError if a stream does
> not start with a BOM mark. For the UTF-8-SIG codec, this
> should probably be relaxed to not require the BOM.
I've started writing such a codec. Making the BOM optional on decoding 
definitely simplifies the implementation.

Bye,
   Walter Dörwald


Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread M.-A. Lemburg
Martin v. Löwis wrote:
> Stephen J. Turnbull wrote:
> 
>> So there is a standard for the UTF-8 signature, and I know of
>> applications which produce it.  While I agree with you that Python's
>> codecs shouldn't produce it (by default), providing an option to strip
>> is a good idea.
> 
> I would personally like to see an "utf-8-bom" codec (perhaps better
> named "utf-8-sig"), which strips the BOM on reading (if present)
> and generates it on writing.

+1.

>> However, this option should be part of the initialization of an IO
>> stream which produces Unicodes, _not_ an operation on arbitrary
>> internal strings (whether raw or Unicode).
> 
> 
> With the UTF-8-SIG codec, it would apply to all operation modes of
> the codec, whether stream-based or from strings. Whether or not to
> use the codec would be the application's choice.

I'd suggest to use the same mode of operation as we have in
the UTF-16 codec: it removes the BOM mark on the first call
to the StreamReader .decode() method and writes a BOM mark
on the first call to .encode() on a StreamWriter.

Note that the UTF-16 codec is strict w/r to the presence
of the BOM mark: you get a UnicodeError if a stream does
not start with a BOM mark. For the UTF-8-SIG codec, this
should probably be relaxed to not require the BOM.
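This first-call behaviour is what the eventual utf-8-sig codec adopted on the writer side: the signature is emitted once, at the start of the stream, never again (shown with the modern API as a retrospective sketch):

```python
import codecs
import io

buf = io.BytesIO()
writer = codecs.getwriter("utf-8-sig")(buf)
writer.write("Hallo")
writer.write(" Welt")                # no second signature is written
out = buf.getvalue()                 # EF BB BF followed by the text
```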

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Apr 05 2005)
>>> Python/Zope Consulting and Support ...http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/


::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! 


Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread "Martin v. Löwis"
Stephen J. Turnbull wrote:
> So there is a standard for the UTF-8 signature, and I know of
> applications which produce it.  While I agree with you that Python's
> codecs shouldn't produce it (by default), providing an option to strip
> is a good idea.
I would personally like to see an "utf-8-bom" codec (perhaps better
named "utf-8-sig"), which strips the BOM on reading (if present)
and generates it on writing.
> However, this option should be part of the initialization of an IO
> stream which produces Unicodes, _not_ an operation on arbitrary
> internal strings (whether raw or Unicode).
With the UTF-8-SIG codec, it would apply to all operation modes of
the codec, whether stream-based or from strings. Whether or not to
use the codec would be the application's choice.
Regards,
Martin


Re: [Python-Dev] Unicode byte order mark decoding

2005-04-04 Thread Stephen J. Turnbull
> "MAL" == "M.-A. Lemburg" <[EMAIL PROTECTED]> writes:

MAL> The BOM (byte order mark) was a non-standard Microsoft
MAL> invention to detect Unicode text data as such (MS always uses
MAL> UTF-16-LE for Unicode text files).

The Japanese "memopado" (Notepad) uses UTF-8 signatures; it even adds
them to existing UTF-8 files lacking them.

MAL> -1; there's no standard for UTF-8 BOMs - adding it to the
MAL> codecs module was probably a mistake to begin with. You
MAL> usually only get UTF-8 files with BOM marks as the result of
MAL> recoding UTF-16 files into UTF-8.

There is a standard for UTF-8 _signatures_, however.  I don't have the
most recent version of the ISO-10646 standard, but Amendment 2 (which
defined UTF-8 for ISO-10646) specifically added the UTF-8 signature to
Annex F of that standard.  Evan quotes Version 4 of the Unicode
standard, which explicitly defines the UTF-8 signature.

So there is a standard for the UTF-8 signature, and I know of
applications which produce it.  While I agree with you that Python's
codecs shouldn't produce it (by default), providing an option to strip
is a good idea.

However, this option should be part of the initialization of an IO
stream which produces Unicodes, _not_ an operation on arbitrary
internal strings (whether raw or Unicode).

MAL> BTW, how do you know that s came from the start of a file and
MAL> not from slicing some already loaded file somewhere in the
MAL> middle ?

The programmer or the application might, but Python's codecs don't.
The point is that this is also true of rawstrings that happen to
contain UTF-16 or UTF-32 data.  The UTF-16 ("auto-endian") codec
shouldn't strip leading BOMs either, unless it has been told it has
the beginning of the string.

MAL> Evan Jones wrote:

>> This is *not* a valid Unicode character. The Unicode
>> specification (version 4, section 15.8) says the following
>> about non-characters:
>> 
>>> Applications are free to use any of these noncharacter code
>>> points internally but should never attempt to exchange
>>> them. If a noncharacter is received in open interchange, an
>>> application is not required to interpret it in any way. It is
>>> good practice, however, to recognize it as a noncharacter and
>>> to take appropriate action, such as removing it from the
>>> text. Note that Unicode conformance freely allows the removal
>>> of these characters. (See C10 in Section 3.2, Conformance
>>> Requirements.)
>> 
>> My interpretation of the specification means that Python should

The specification _permits_ silent removal; it does not recommend.

>> silently remove the character, resulting in a zero length
>> Unicode string.  Similarly, both of the following lines should
>> also result in a zero length Unicode string:

>>  >>> '\xff\xfe\xfe\xff'.decode( "utf16" )
>> u'\ufffe'
>>  >>> '\xff\xfe\xff\xff'.decode( "utf16" )
>> u'\uffff'

I strongly disagree; these decisions should be left to a higher layer.
In the case of specified UTFs, the codecs should simply invert the UTF
to Python's internal encoding.

MAL> Hmm, wouldn't it be better to raise an error ? After all, a
MAL> reversed BOM mark in the stream looks a lot like you're
MAL> trying to decode a UTF-16 stream assuming the wrong byte
MAL> order ?!

+1 on (optionally) raising an error.  -1 on removing it or anything
like that, unless under control of the application (ie, the program
written in Python, not Python itself).  It's far too easy for software
to generate broken Unicode streams[1], and the choice of how to deal
with those should be with the application, not with the implementation
language.



Footnotes: 
[1]  An egregious example was the Outlook Express distributed with
early Win2k betas, which produced MIME bodies with apparent
Content-Type: text/html; charset=utf-16, but the HTML tags and
newlines were 7-bit ASCII!

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba  Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
   Ask not how you can "do" free software business;
  ask what your business can "do for" free software.


Re: [Python-Dev] Unicode byte order mark decoding

2005-04-01 Thread Evan Jones
On Apr 1, 2005, at 15:19, M.-A. Lemburg wrote:
> The BOM (byte order mark) was a non-standard Microsoft invention
> to detect Unicode text data as such (MS always uses UTF-16-LE for
> Unicode text files).
Well, its origins do not really matter since at this point the BOM is 
firmly encoded in the Unicode standard. It seems to me that it is in 
everyone's best interest to support it.

> It is not needed for the UTF-8 because that format doesn't rely on
> the byte order and the BOM character at the beginning of a stream is
> a legitimate ZWNBSP (zero width non breakable space) code point.
You are correct: it is a legitimate character. However, its use as a 
ZWNBSP character has been deprecated:

The overloading of semantics for this code point has caused problems 
for programs and protocols. The new character U+2060 WORD JOINER has 
the same semantics in all cases as U+FEFF, except that it cannot be 
used as a signature. Implementers are strongly encouraged to use word 
joiner in those circumstances whenever word joining semantics is 
intended.
Also, the Unicode specification is ambiguous on what an implementation 
should do about a leading ZWNBSP that is encoded in UTF-16. Like I 
mentioned, if you look at the Unicode standard, version 4, section 
15.9, it says:

2. Unmarked Character Set. In some circumstances, the character set 
information for a stream of coded characters (such as a file) is not 
available. The only information available is that the stream contains 
text, but the precise character set is not known.
This seems to indicate that it is permitted to strip the BOM from the 
beginning of UTF-8 text.

> -1; there's no standard for UTF-8 BOMs - adding it to the
> codecs module was probably a mistake to begin with. You usually
> only get UTF-8 files with BOM marks as the result of recoding
> UTF-16 files into UTF-8.
This is clearly incorrect. The UTF-8 BOM is specified in the Unicode 
standard version 4, section 15.9:

In UTF-8, the BOM corresponds to the byte sequence <EF BB BF>.
I normally find files with UTF-8 BOMs from many Windows applications 
when you save a text file as UTF8. I think that Notepad or WordPad does 
this, for example. I think UltraEdit also does the same thing. I know 
that Scintilla definitely does.

>> At the very least, it would be nice to add a note about this to the
>> documentation, and possibly add this example function that implements
>> the "UTF-8 or ASCII?" logic.
> Well, I'd say that's a very English way of dealing with encoded
> text ;-)
Please note I am saying only that something like this may want to be 
considered for addition to the documentation, and not to the Python 
standard library. This example function more closely replicates the 
logic that is used on those Windows applications when opening ".txt" 
files. It uses the default locale if there is no BOM:

import codecs

def autodecode( s ):
    if s.startswith( codecs.BOM_UTF8 ):
        # The byte string s is UTF-8; the BOM decodes to u'\ufeff',
        # so drop the first character after decoding.
        out = s.decode( "utf8" )
        return out[1:]
    else:
        return s.decode()
> BTW, how do you know that s came from the start of a file
> and not from slicing some already loaded file somewhere
> in the middle ?
Well, the same argument could be applied to the UTF-16 decoder: how does
it know that the string came from the start of a file, and not from
slicing some already loaded file? The standard states that:

In the UTF-16 encoding scheme, U+FEFF at the very beginning of a file 
or stream explicitly signals the byte order.
So it is perfectly permissible to perform this type of processing if 
you consider a string to be equivalent to a stream.
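As an illustration of that stream-equivalence point, the "utf-16" codec already treats a leading BOM in a byte string exactly as the standard describes for a stream. A quick sketch (Python 3 spelling):

```python
import codecs

text = u"hi"
le = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
be = codecs.BOM_UTF16_BE + text.encode("utf-16-be")

# The "utf-16" decoder reads the BOM, selects the byte order it
# signals, and removes the BOM from the decoded result in both cases.
assert le.decode("utf-16") == u"hi"
assert be.decode("utf-16") == u"hi"
```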

My interpretation of the specification is that Python should silently 
remove the character, resulting in a zero length Unicode string.
Hmm, wouldn't it be better to raise an error ? After all,
a reversed BOM mark in the stream looks a lot like you're
trying to decode a UTF-16 stream assuming the wrong
byte order ?!
Well, either one is possible; however, the Unicode standard suggests, 
but does not require, silently removing them:

It is good practice, however, to recognize it as a noncharacter and to 
take appropriate action, such as removing it from the text. Note that 
Unicode conformance freely allows the removal of these characters.
I would prefer silently removing them in str.decode(), since I believe 
in "be strict in what you emit, but liberal in what you accept." I 
think this should apply only to str.decode(). Any other attempt to 
create non-characters, such as unichr( 0xFFFF ), *should* raise an 
exception, because clearly the programmer is making a mistake.
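The behaviour under discussion can be reproduced directly; a small sketch of the current codec behaviour, in Python 3 spelling (where the noncharacter still passes through silently):

```python
import codecs

# Decoding a reversed BOM yields the noncharacter U+FFFE; the codec
# passes it through silently rather than removing it or raising.
assert codecs.BOM_UTF16_BE.decode("utf-16-le") == u"\ufffe"

# The same thing happens after a real BOM: the BOM is consumed,
# and the following reversed BOM survives as U+FFFE.
assert b"\xff\xfe\xfe\xff".decode("utf-16") == u"\ufffe"
```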

Other than that: +1 on fixing this case.
Cool!
Evan Jones
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Unicode byte order mark decoding

2005-04-01 Thread M.-A. Lemburg
Evan Jones wrote:
> I recently rediscovered this strange behaviour in Python's Unicode
> handling. I *think* it is a bug, but before I go and try to hack
> together a patch, I figure I should run it by the experts here on
> Python-Dev. If you understand Unicode, please let me know if there are
> problems with making these minor changes.
> 
> 
> >>> import codecs
> >>> codecs.BOM_UTF8.decode( "utf8" )
> u'\ufeff'
> >>> codecs.BOM_UTF16.decode( "utf16" )
> u''
> 
> Why does the UTF-16 decoder discard the BOM, while the UTF-8 decoder
> turns it into a character? 

The BOM (byte order mark) was a non-standard Microsoft invention
to detect Unicode text data as such (MS always uses UTF-16-LE for
Unicode text files).

It is not needed for the UTF-8 because that format doesn't rely on
the byte order and the BOM character at the beginning of a stream is
a legitimate ZWNBSP (zero width no-break space) code point.

The "utf-16" codec detects and removes the mark, while the
two others "utf-16-le" (little endian byte order) and "utf-16-be"
(big endian byte order) don't.
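The difference between the three codecs can be seen directly; a minimal sketch (Python 3 spelling):

```python
import codecs

data = codecs.BOM_UTF16_LE + u"x".encode("utf-16-le")

# "utf-16" detects and consumes the BOM; the endian-specific codec
# leaves it in the decoded text as U+FEFF (ZWNBSP).
assert data.decode("utf-16") == u"x"
assert data.decode("utf-16-le") == u"\ufeffx"
```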

> The UTF-16 decoder contains logic to
> correctly handle the BOM. It even handles byte swapping, if necessary. I
> propose that the UTF-8 decoder should have the same logic: it should
> remove the BOM if it is detected at the beginning of a string. 

-1; there's no standard for UTF-8 BOMs - adding it to the
codecs module was probably a mistake to begin with. You usually
only get UTF-8 files with BOM marks as the result of recoding
UTF-16 files into UTF-8.

> This will
> remove a bit of manual work for Python programs that deal with UTF-8
> files created on Windows, which frequently have the BOM at the
> beginning. The Unicode standard is unclear about how it should be
> handled (version 4, section 15.9):
> 
>> Although there are never any questions of byte order with UTF-8 text,
>> this sequence can serve as signature for UTF-8 encoded text where the
>> character set is unmarked. [...] Systems that use the byte order mark
>> must recognize when an initial U+FEFF signals the byte order. In those
>> cases, it is not part of the textual content and should be removed
>> before processing, because otherwise it may be mistaken for a
>> legitimate zero width no-break space.
> 
> 
> At the very least, it would be nice to add a note about this to the
> documentation, and possibly add this example function that implements
> the "UTF-8 or ASCII?" logic:
> 
> def autodecode( s ):
>     if s.startswith( codecs.BOM_UTF8 ):
>         # The byte string s is UTF-8
>         out = s.decode( "utf8" )
>         return out[1:]
>     else: return s.decode( "ascii" )

Well, I'd say that's a very English way of dealing with encoded
text ;-)

BTW, how do you know that s came from the start of a file
and not from slicing some already loaded file somewhere
in the middle ?

> As a second issue, the UTF-16LE and UTF-16BE encoders almost do the
> right thing: They turn the BOM into a character, just like the Unicode
> specification says they should.
> 
> >>> codecs.BOM_UTF16_LE.decode( "utf-16le" )
> u'\ufeff'
> >>> codecs.BOM_UTF16_BE.decode( "utf-16be" )
> u'\ufeff'
> 
> However, they also *incorrectly* handle the reversed byte order mark:
> 
> >>> codecs.BOM_UTF16_BE.decode( "utf-16le" )
> u'\ufffe'
> 
> This is *not* a valid Unicode character. The Unicode specification
> (version 4, section 15.8) says the following about non-characters:
> 
>> Applications are free to use any of these noncharacter code points
>> internally but should never attempt to exchange them. If a
>> noncharacter is received in open interchange, an application is not
>> required to interpret it in any way. It is good practice, however, to
>> recognize it as a noncharacter and to take appropriate action, such as
>> removing it from the text. Note that Unicode conformance freely allows
>> the removal of these characters. (See C10 in Section 3.2, Conformance
>> Requirements.)
> 
> 
> My interpretation of the specification means that Python should silently
> remove the character, resulting in a zero length Unicode string.
> Similarly, both of the following lines should also result in a zero
> length Unicode string:
> 
> >>> '\xff\xfe\xfe\xff'.decode( "utf16" )
> u'\ufffe'
> >>> '\xff\xfe\xff\xff'.decode( "utf16" )
> u'\uffff'

Hmm, wouldn't it be better to raise an error ? After all,
a reversed BOM mark in the stream looks a lot like you're
trying to decode a UTF-16 stream assuming the wrong
byte order ?!

Other than that: +1 on fixing this case.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Apr 01 2005)
>>> Python/Zope Consulting and Support ...http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/


::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! 

[Python-Dev] Unicode byte order mark decoding

2005-04-01 Thread Evan Jones
I recently rediscovered this strange behaviour in Python's Unicode 
handling. I *think* it is a bug, but before I go and try to hack 
together a patch, I figure I should run it by the experts here on 
Python-Dev. If you understand Unicode, please let me know if there are 
problems with making these minor changes.

>>> import codecs
>>> codecs.BOM_UTF8.decode( "utf8" )
u'\ufeff'
>>> codecs.BOM_UTF16.decode( "utf16" )
u''
Why does the UTF-16 decoder discard the BOM, while the UTF-8 decoder 
turns it into a character? The UTF-16 decoder contains logic to 
correctly handle the BOM. It even handles byte swapping, if necessary. 
I propose that the UTF-8 decoder should have the same logic: it should 
remove the BOM if it is detected at the beginning of a string. This 
will remove a bit of manual work for Python programs that deal with 
UTF-8 files created on Windows, which frequently have the BOM at the 
beginning. The Unicode standard is unclear about how it should be 
handled (version 4, section 15.9):

Although there are never any questions of byte order with UTF-8 text, 
this sequence can serve as signature for UTF-8 encoded text where the 
character set is unmarked. [...] Systems that use the byte order mark 
must recognize when an initial U+FEFF signals the byte order. In those 
cases, it is not part of the textual content and should be removed 
before processing, because otherwise it may be mistaken for a 
legitimate zero width no-break space.
At the very least, it would be nice to add a note about this to the 
documentation, and possibly add this example function that implements 
the "UTF-8 or ASCII?" logic:

def autodecode( s ):
    if s.startswith( codecs.BOM_UTF8 ):
        # The byte string s is UTF-8
        out = s.decode( "utf8" )
        return out[1:]
    else: return s.decode( "ascii" )
As a second issue, the UTF-16LE and UTF-16BE encoders almost do the 
right thing: They turn the BOM into a character, just like the Unicode 
specification says they should.

>>> codecs.BOM_UTF16_LE.decode( "utf-16le" )
u'\ufeff'
>>> codecs.BOM_UTF16_BE.decode( "utf-16be" )
u'\ufeff'
However, they also *incorrectly* handle the reversed byte order mark:
>>> codecs.BOM_UTF16_BE.decode( "utf-16le" )
u'\ufffe'
This is *not* a valid Unicode character. The Unicode specification 
(version 4, section 15.8) says the following about non-characters:

Applications are free to use any of these noncharacter code points 
internally but should never attempt to exchange them. If a 
noncharacter is received in open interchange, an application is not 
required to interpret it in any way. It is good practice, however, to 
recognize it as a noncharacter and to take appropriate action, such as 
removing it from the text. Note that Unicode conformance freely allows 
the removal of these characters. (See C10 in Section 3.2, Conformance 
Requirements.)
My interpretation of the specification means that Python should 
silently remove the character, resulting in a zero length Unicode 
string. Similarly, both of the following lines should also result in a 
zero length Unicode string:

>>> '\xff\xfe\xfe\xff'.decode( "utf16" )
u'\ufffe'
>>> '\xff\xfe\xff\xff'.decode( "utf16" )
u'\uffff'
Thanks for your feedback,
Evan Jones