date:20020410

Alan Wood's Unicode Resources is moving

2002-04-10 Thread Alan Wood

My collection of test pages and of surveys of fonts and programs is becoming too popular for my ISP's free Web space, so I am moving it to a proper URL on a faster server. The new address is: http://www.alanwood.net/unicode/ Please update any links or bookmarks you may have for the old

Discrepancy in ch03.pdf?

2002-04-10 Thread Anton Tagunov

Hello, experts! Every time I read the following passage in http://www.unicode.org/unicode/uni2book/ch03.pdf I get confused: - A single abstract character may correspond to more than one code value - ... - Multiple code values may be required to represent a single abstract character. For

Re: Discrepancy in ch03.pdf?

2002-04-10 Thread Doug Ewell

Антон Тагунов [EMAIL PROTECTED] wrote regarding Definition D5: Every time I read the following passage in http://www.unicode.org/unicode/uni2book/ch03.pdf I get confused: - A single abstract character may correspond to more than one code value - ... - Multiple code values may be

RE: Default endianness of Unicode, or not

2002-04-10 Thread Yves Arrouye

The last time I read the Unicode standard UTF-16 was big endian unless a BOM was present, and that's what I expected from a UTF-16 converter. Conformance requirement C2 (TUS 3.0, p. 37) says: [And other many good references where TUS does *not* say that :)] OK, maybe in 2.0, or I made

Re: MS/Unix BOM FAQ again (small fix)

2002-04-10 Thread Markus Scherer

The reason for ICU's UTF-16 converter not trying to auto-detect the BOM is that this seems to be something that the _application_ has to decide, not the _converter_ that the application instantiates. This converter name is (currently) only a convenience alias for use the UTF-16 byte

RE: MS/Unix BOM FAQ again (small fix)

2002-04-10 Thread Yves Arrouye

The reason for ICU's UTF-16 converter not trying to auto-detect the BOM is that this seems to be something that the _application_ has to decide, not the _converter_ that the application instantiates. This converter name is (currently) only a convenience alias for use the UTF-16 byte

Re: Discrepancy in ch03.pdf?

2002-04-10 Thread Kenneth Whistler

Антон Тагунов [EMAIL PROTECTED] wrote regarding Definition D5: Every time I read the following passage in http://www.unicode.org/unicode/uni2book/ch03.pdf I get confused: - A single abstract character may correspond to more than one code value - ... - Multiple code values may

Re: MS/Unix BOM FAQ again (small fix)

2002-04-10 Thread Markus Scherer

Rick Cameron wrote: So the original statement was correct. If the file starts with FF FE, it must be a little-endian encoding; but you can't tell whether it's UTF-16 or UTF-32. If you know that it's UTF-16 and you just try to figure out the byte order, then FF FE is unambiguous. If you

Re: MS/Unix BOM FAQ again (small fix)

2002-04-10 Thread Mark Davis

So the original statement was correct. If the file starts with FF FE, it must be a little-endian encoding; but you can't tell whether it's UTF-16 or UTF-32. The original statement was: A Unicode text file beginning with FEFF is big-endian, and a file beginning with FFFE (not a legal

RE: MS/Unix BOM FAQ again (small fix)

2002-04-10 Thread jarkko . hietaniemi

If you look for any Unicode signature, then you look for FF FE 00 00 (UTF-32LE) before you check for FF FE (UTF-16LE). FF FE 00 00 could be the UTF-32LE BOM, but it could also be UTF-16LE BOM followed by a UTF-16 U+0000. Yes, the NULL is usually not thought of as text, but there's no knowing
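The check order described in this message can be sketched in Python. This is only an illustration, not code from the thread: the function name `sniff_bom` and the returned labels are assumptions, and the BOM constants come from Python's standard `codecs` module.

```python
import codecs

# Longest signatures first, so FF FE 00 00 (UTF-32LE) is tested
# before its two-byte prefix FF FE (UTF-16LE).
SIGNATURES = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]


def sniff_bom(data: bytes):
    """Return a label for the leading Unicode signature, or None.

    Hypothetical helper: this is a guess based on the first bytes only,
    not proof of the encoding.
    """
    for bom, name in SIGNATURES:
        if data.startswith(bom):
            return name
    return None


# The ambiguity raised above: the same four bytes decode cleanly both ways.
ambiguous = b"\xff\xfe\x00\x00"
as_utf32 = ambiguous.decode("utf-32")  # BOM only, no characters
as_utf16 = ambiguous.decode("utf-16")  # BOM followed by U+0000
```

With this ordering `sniff_bom(ambiguous)` reports UTF-32LE, yet the same bytes are also valid UTF-16LE text consisting of a single U+0000, so a signature check alone cannot settle the question.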

RE: Default endianness of Unicode, or not

2002-04-10 Thread Kenneth Whistler

Yves wrote, in response to Doug: The last time I read the Unicode standard UTF-16 was big endian unless a BOM was present, and that's what I expected from a UTF-16 converter. Conformance requirement C2 (TUS 3.0, p. 37) says: The Unicode Standard does not specify any order of

Re: MS/Unix BOM FAQ again (small fix)

2002-04-10 Thread Mark Davis

Here is what I think the FAQ ought to say: Suppose you know that the text is Unicode. - Unicode can be represented in a number of different forms (UTFs) - some of them *may* start with a BOM (a byte sequence that would correspond to U+FEFF). - some cannot (in that case, a byte sequence that

RE: Default endianness of Unicode, or not

2002-04-10 Thread Yves Arrouye

D43 UTF-16 character encoding scheme: the Unicode CES that serializes a UTF-16 code unit sequence as a byte sequence in either big-endian or little-endian format. * In UTF-16 (the CES), the UTF-16 code unit sequence 004D 0430 4E8C D800 DF02 is serialized as FE FF 00 4D
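As a rough illustration of the serialization in D43, the following Python sketch writes the same code unit sequence in both byte orders. The helper `serialize_utf16` and its parameters are hypothetical names chosen for this example, not part of the standard or of any library.

```python
# The code unit sequence from the D43 example; D800 DF02 is the
# surrogate pair for U+10302.
CODE_UNITS = [0x004D, 0x0430, 0x4E8C, 0xD800, 0xDF02]


def serialize_utf16(units, big_endian=True, with_bom=True):
    """Serialize 16-bit code units to bytes (hypothetical helper).

    When with_bom is true, U+FEFF is written first in the same byte
    order as the code units that follow.
    """
    order = "big" if big_endian else "little"
    out = bytearray()
    if with_bom:
        out += (0xFEFF).to_bytes(2, order)
    for unit in units:
        out += unit.to_bytes(2, order)
    return bytes(out)


big = serialize_utf16(CODE_UNITS)                       # starts FE FF 00 4D
little = serialize_utf16(CODE_UNITS, big_endian=False)  # starts FF FE 4D 00
```

Both byte sequences decode to the same text with Python's BOM-aware `utf-16` codec, which is the point of the signature: it tells the reader which of the two serializations it is looking at.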

RE: Default endianness of Unicode, or not

2002-04-10 Thread Yves Arrouye

And of course, I have been complaining about ICU's UTF-16 converter behavior, but glibc's makes the same assumption that UTF-16 is in the local endianness:

gabier% echo hello | uconv -t utf-16be | iconv -f utf-16 -t ascii
iconv: illegal input sequence at position 0
gabier%

So fixing one but

RE: Default endianness of Unicode, or not

2002-04-10 Thread Kenneth Whistler

Yves, So same semantics as before. Yep. The editorial committee wouldn't be doing its job right if it were changing the semantics of the standard. The intent here is to rewrite everything so that the semantics intended all along will finally be revealed to everyone! It really is a little like

RE: Default endianness of Unicode, or not

2002-04-10 Thread Yves Arrouye

So same semantics as before. Yep. The editorial committee wouldn't be doing its job right if it were changing the semantics of the standard. Agreed! Is there any mention that the non-BOM byte sequence is most significant byte first anywhere else? You know, for the newbies? Joshua 1.8

