Alan Wood's Unicode Resources is moving

2002-04-10 Thread Alan Wood
My collection of test pages and of surveys of fonts and programs is becoming too popular for my ISP's free Web space, so I am moving it to a proper URL on a faster server. The new address is: http://www.alanwood.net/unicode/ Please update any links or bookmarks you may have for the old

Discrepancy in ch03.pdf?

2002-04-10 Thread Anton Tagunov
Hello, experts! Every time I read the following passage in http://www.unicode.org/unicode/uni2book/ch03.pdf I get confused: - A single abstract character may correspond to more than one code value - ... - Multiple code values may be required to represent a single abstract character. For

Re: Discrepancy in ch03.pdf?

2002-04-10 Thread Doug Ewell
Антон Тагунов [EMAIL PROTECTED] wrote regarding Definition D5: Every time I read the following passage in http://www.unicode.org/unicode/uni2book/ch03.pdf I get confused: - A single abstract character may correspond to more than one code value - ... - Multiple code values may be

RE: Default endianness of Unicode, or not

2002-04-10 Thread Yves Arrouye
The last time I read the Unicode standard, UTF-16 was big endian unless a BOM was present, and that's what I expected from a UTF-16 converter. Conformance requirement C2 (TUS 3.0, p. 37) says: [And many other good references where TUS does *not* say that :)] OK, maybe in 2.0, or I made

Re: MS/Unix BOM FAQ again (small fix)

2002-04-10 Thread Markus Scherer
The reason for ICU's UTF-16 converter not trying to auto-detect the BOM is that this seems to be something that the _application_ has to decide, not the _converter_ that the application instantiates. This converter name is (currently) only a convenience alias for use the UTF-16 byte

RE: MS/Unix BOM FAQ again (small fix)

2002-04-10 Thread Yves Arrouye
The reason for ICU's UTF-16 converter not trying to auto-detect the BOM is that this seems to be something that the _application_ has to decide, not the _converter_ that the application instantiates. This converter name is (currently) only a convenience alias for use the UTF-16 byte

Re: Discrepancy in ch03.pdf?

2002-04-10 Thread Kenneth Whistler
Антон Тагунов [EMAIL PROTECTED] wrote regarding Definition D5: Every time I read the following passage in http://www.unicode.org/unicode/uni2book/ch03.pdf I get confused: - A single abstract character may correspond to more than one code value - ... - Multiple code values may

Re: MS/Unix BOM FAQ again (small fix)

2002-04-10 Thread Markus Scherer
Rick Cameron wrote: So the original statement was correct. If the file starts with FF FE, it must be a little-endian encoding; but you can't tell whether it's UTF-16 or UTF-32. If you know that it's UTF-16 and you just try to figure out the byte order, then FF FE is unambiguous. If you

Re: MS/Unix BOM FAQ again (small fix)

2002-04-10 Thread Mark Davis
So the original statement was correct. If the file starts with FF FE, it must be a little-endian encoding; but you can't tell whether it's UTF-16 or UTF-32. The original statement was: A Unicode text file beginning with FEFF is big-endian, and a file beginning with FFFE (not a legal

RE: MS/Unix BOM FAQ again (small fix)

2002-04-10 Thread jarkko . hietaniemi
If you look for any Unicode signature, then you look for FF FE 00 00 (UTF-32LE) before you check for FF FE (UTF-16LE). FF FE 00 00 could be the UTF-32LE BOM, but it could also be a UTF-16LE BOM followed by a UTF-16 U+0000. Yes, the NULL is usually not thought of as text, but there's no knowing
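The check order described in this message can be sketched in Python. This is a minimal illustration, not code from the thread: the function name `sniff_signature` and the table `SIGNATURES` are hypothetical, and the only library pieces used are the BOM constants from the standard `codecs` module.

```python
import codecs

# Longest signatures first: the UTF-32LE BOM (FF FE 00 00) begins with the
# UTF-16LE BOM (FF FE), so testing UTF-16LE first would hide UTF-32LE.
SIGNATURES = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]


def sniff_signature(data: bytes):
    """Return the name of the first matching signature, or None."""
    for bom, name in SIGNATURES:
        if data.startswith(bom):
            return name
    return None


# The ambiguity raised in the message: a UTF-16LE BOM followed by U+0000
# is byte-for-byte identical to the UTF-32LE BOM.
ambiguous = "\ufeff\x00".encode("utf-16-le")
print(ambiguous == codecs.BOM_UTF32_LE)
print(sniff_signature(ambiguous))
```

A sniffer like this can only guess. When the caller already knows the text is UTF-16, it should skip the UTF-32 entries, because FF FE is then unambiguous.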

RE: Default endianness of Unicode, or not

2002-04-10 Thread Kenneth Whistler
Yves wrote, in response to Doug: The last time I read the Unicode standard UTF-16 was big endian unless a BOM was present, and that's what I expected from a UTF-16 converter. Conformance requirement C2 (TUS 3.0, p. 37) says: The Unicode Standard does not specify any order of

Re: MS/Unix BOM FAQ again (small fix)

2002-04-10 Thread Mark Davis
Here is what I think the FAQ ought to say: Suppose you know that the text is Unicode. - Unicode can be represented in a number of different forms (UTFs) - some of them *may* start with a BOM (a byte sequence that would correspond to U+FEFF). - some cannot (in that case, a byte sequence that

RE: Default endianness of Unicode, or not

2002-04-10 Thread Yves Arrouye
D43 UTF-16 character encoding scheme: the Unicode CES that serializes a UTF-16 code unit sequence as a byte sequence in either big-endian or little-endian format. * In UTF-16 (the CES), the UTF-16 code unit sequence 004D 0430 4E8C D800 DF02 is serialized as FE FF 00 4D
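The serialization in D43 can be reproduced with a short Python sketch. The helper `serialize_utf16` and its parameters are hypothetical names chosen for illustration. The quoted text breaks off after `FE FF 00 4D`, so the remaining bytes shown by the sketch are computed by the code, not taken from the message.

```python
import struct

# The code unit sequence quoted in D43; D800 DF02 is a surrogate pair.
CODE_UNITS = [0x004D, 0x0430, 0x4E8C, 0xD800, 0xDF02]


def serialize_utf16(units, big_endian=True, with_bom=True):
    """Pack 16-bit code units into bytes, optionally prefixed by U+FEFF."""
    order = ">" if big_endian else "<"
    if with_bom:
        units = [0xFEFF] + list(units)
    # "H" is an unsigned 16-bit integer; the prefix selects the byte order.
    return struct.pack(f"{order}{len(units)}H", *units)


print(serialize_utf16(CODE_UNITS).hex(" ").upper())
print(serialize_utf16(CODE_UNITS, big_endian=False).hex(" ").upper())
```

The first line printed starts with `FE FF 00 4D`, matching the big-endian form quoted above. The second shows the same code units with each byte pair swapped and `FF FE` as the signature.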

RE: Default endianness of Unicode, or not

2002-04-10 Thread Yves Arrouye
And of course, I have been complaining about the behavior of ICU's UTF-16 converter, but glibc's makes the same assumption that UTF-16 is in the local endianness: gabier% echo hello | uconv -t utf-16be | iconv -f utf-16 -t ascii iconv: illegal input sequence at position 0 gabier% So fixing one but

RE: Default endianness of Unicode, or not

2002-04-10 Thread Kenneth Whistler
Yves, So same semantics as before. Yep. The editorial committee wouldn't be doing its job right if it were changing the semantics of the standard. The intent here is to rewrite everything so that the semantics intended all along will finally be revealed to everyone! It really is a little like

RE: Default endianness of Unicode, or not

2002-04-10 Thread Yves Arrouye
So same semantics as before. Yep. The editorial committee wouldn't be doing its job right if it were changing the semantics of the standard. Agreed! Is there any mention that the non-BOM byte sequence is most significant byte first anywhere else? You know, for the newbies? Joshua 1.8