My collection of test pages and of surveys of fonts and programs is becoming
too popular for my ISP's free Web space, so I am moving it to a proper URL
on a faster server. The new address is:
http://www.alanwood.net/unicode/
Please update any links or bookmarks you may have for the old
Hello, experts!
Every time I read the following passage in
http://www.unicode.org/unicode/uni2book/ch03.pdf
I get confused:
- A single abstract character may correspond to more than one code
value - ...
- Multiple code values may be required to represent a single abstract
character. For
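A small illustration of the two bullets, as a sketch using Python's standard unicodedata module. The choice of the letter Å is an assumption made for illustration only; it is not taken from the quoted passage.

```python
import unicodedata

# One abstract character, more than one code value:
# U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE and U+212B ANGSTROM SIGN.
a_ring = "\u00C5"
angstrom = "\u212B"

# Multiple code values for a single abstract character:
# U+0041 followed by U+030A COMBINING RING ABOVE.
decomposed = "A\u030A"

def nfc(s):
    """Normalize to NFC so that equivalent sequences compare equal."""
    return unicodedata.normalize("NFC", s)

# All three spellings normalize to the same string.
same_after_nfc = nfc(a_ring) == nfc(angstrom) == nfc(decomposed)
print([f"U+{ord(c):04X}" for c in decomposed], same_after_nfc)
```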
Антон Тагунов [EMAIL PROTECTED] wrote regarding Definition D5:
Every time I read the following passage in
http://www.unicode.org/unicode/uni2book/ch03.pdf
I get confused:
- A single abstract character may correspond to more than one code
value - ...
- Multiple code values may be
The last time I read the Unicode standard UTF-16 was big endian
unless a BOM was present, and that's what I expected from a UTF-16
converter.
Conformance requirement C2 (TUS 3.0, p. 37) says:
[And many other good references where TUS does *not* say that :)]
OK, maybe in 2.0, or I made
The reason for ICU's UTF-16 converter not trying to auto-detect the BOM is that this
seems to be something that the _application_ has to decide, not the _converter_ that
the application instantiates.
This converter name is (currently) only a convenience alias for use the UTF-16 byte
Rick Cameron wrote:
So the original statement was correct. If the file starts with FF FE, it
must be a little-endian encoding; but you can't tell whether it's UTF-16 or
UTF-32.
If you know that it's UTF-16 and you just try to figure out the byte order, then FF FE
is unambiguous.
If you
So the original statement was correct. If the file starts with FF
FE,
it must be a little-endian encoding; but you can't tell whether it's
UTF-16 or UTF-32.
The original statement was:
A Unicode text file beginning with FEFF is
big-endian, and a file beginning with FFFE (not a legal
If you look for any Unicode signature, then you look for FF
FE 00 00 (UTF-32LE) before you check for FF FE (UTF-16LE).
FF FE 00 00 could be the UTF-32LE BOM, but it could also be UTF-16LE BOM
followed by a UTF-16 U+0000. Yes, the NULL is usually not thought of as text,
but there's no knowing
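The ordering described here can be sketched as follows. This is only an illustration: the table SIGNATURES and the function sniff_signature are hypothetical names, not part of ICU or any other library. It also shows why FF FE 00 00 stays ambiguous.

```python
# Signatures are listed so that the longer UTF-32 patterns are tested
# before the UTF-16 patterns they begin with.
SIGNATURES = [
    (b"\xFF\xFE\x00\x00", "UTF-32LE"),
    (b"\x00\x00\xFE\xFF", "UTF-32BE"),
    (b"\xEF\xBB\xBF", "UTF-8"),
    (b"\xFF\xFE", "UTF-16LE"),
    (b"\xFE\xFF", "UTF-16BE"),
]

def sniff_signature(data: bytes):
    """Return the name of the first matching signature, or None."""
    for signature, name in SIGNATURES:
        if data.startswith(signature):
            return name
    return None

# The ambiguity: a UTF-16LE BOM followed by U+0000 yields the same
# four bytes as a UTF-32LE BOM.
utf16_bom_then_nul = "\ufeff\x00".encode("utf-16-le")
utf32_bom = "\ufeff".encode("utf-32-le")
print(utf16_bom_then_nul == utf32_bom, sniff_signature(utf16_bom_then_nul))
```

With this ordering the ambiguous input is reported as UTF-32LE, so a caller that already knows the data is UTF-16 should not use a general sniffer like this one.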
Yves wrote, in response to Doug:
The last time I read the Unicode standard UTF-16 was big endian
unless a BOM was present, and that's what I expected from a UTF-16
converter.
Conformance requirement C2 (TUS 3.0, p. 37) says:
The Unicode Standard does not specify any order of
Here is what I think the FAQ ought to say:
Suppose you know that the text is Unicode.
- Unicode can be represented in a number of different forms (UTFs)
- some of them *may* start with a BOM (a byte sequence that would
correspond to U+FEFF).
- some cannot (in that case, a byte sequence that
D43 UTF-16 character encoding scheme: the Unicode
CES that serializes a UTF-16 code unit sequence as a byte sequence
in either big-endian or little-endian format.
* In UTF-16 (the CES), the UTF-16 code unit sequence
004D 0430 4E8C D800 DF02 is serialized as
FE FF 00 4D
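The serialization of that code unit sequence can be worked through mechanically. The sketch below uses Python's struct module; the function name serialize_utf16 and its parameters are hypothetical, chosen only for this illustration.

```python
import struct

def serialize_utf16(code_units, byte_order="big", with_bom=True):
    """Serialize 16-bit code units as bytes, optionally prefixed by U+FEFF."""
    fmt = ">H" if byte_order == "big" else "<H"
    units = ([0xFEFF] if with_bom else []) + list(code_units)
    return b"".join(struct.pack(fmt, unit) for unit in units)

units = [0x004D, 0x0430, 0x4E8C, 0xD800, 0xDF02]

# Big-endian with BOM, little-endian with BOM, and big-endian without BOM.
for order, bom in (("big", True), ("little", True), ("big", False)):
    print(serialize_utf16(units, order, bom).hex(" ").upper())
```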
And of course, I have been complaining about ICU's UTF-16 converter
behavior, but glibc's one makes the same assumption that UTF-16 is in the
local endianness:
gabier% echo hello | uconv -t utf-16be | iconv -f utf-16 -t ascii
iconv: illegal input sequence at position 0
gabier%
So fixing one but
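For comparison, here is a minimal sketch of a decoder that honours a BOM when one is present and otherwise assumes big-endian rather than the local endianness. The function name decode_utf16 is hypothetical, and the big-endian default reflects the reading argued for in this thread, not the behavior of ICU or glibc.

```python
import codecs

def decode_utf16(data: bytes) -> str:
    """Decode UTF-16 bytes; with no BOM, assume big-endian (an assumption)."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    # No BOM: fall back to big-endian instead of the machine's byte order.
    return data.decode("utf-16-be")

# Same input as the shell example above: "hello\n" as big-endian, no BOM.
print(repr(decode_utf16("hello\n".encode("utf-16-be"))))
```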
Yves,
So same semantics as before.
Yep. The editorial committee wouldn't be doing its job right
if it were changing the semantics of the standard. The intent
here is to rewrite everything so that the semantics intended
all along will finally be revealed to everyone!
It really is a little like
So same semantics as before.
Yep. The editorial committee wouldn't be doing its job right
if it were changing the semantics of the standard.
Agreed! Is there any mention that the non-BOM byte sequence is most
significant byte first anywhere else? You know, for the newbies?
Joshua 1.8