Re: Unicode and chunk expressions

Dar Scott Tue, 17 May 2005 15:52:19 -0700


On May 17, 2005, at 3:25 PM, Richard Gaskin wrote:

Forgive my ignorance, but how can UTF8 be used with two-byte systems like Chinese? I was under the impression those had to be UTF16.

Unicode is universal in that characters from many languages, language families, and special-use domains are all mapped onto the same numerical space. Unless you need to import or export files in some particular encoding format, you don't need specialized encoding methods.

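Each Unicode character has a number that fits easily in 32 bits (the defined range, U+0000 to U+10FFFF, actually needs only 21). Almost all the ones you are likely to use are in the lower 16 bits. The number associated with the character (not the glyph) is the code point.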
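As a rough illustration of code points, here is a small Python sketch (Python rather than Transcript, purely for demonstration). The sample characters are arbitrary choices and not from the original discussion.

```python
# Code points are plain integers; ord() returns the one for a character.
# Sample characters (an assumption for illustration):
#   "A", e-acute, a Chinese character, and an emoji outside the BMP.
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]

code_points = {ch: ord(ch) for ch in samples}

for cp in code_points.values():
    # "Lower 16 bits" means the code point is at most 0xFFFF.
    fits_in_16_bits = cp <= 0xFFFF
    print(f"U+{cp:04X} fits in 16 bits: {fits_in_16_bits}")
```

Only the last sample falls outside the lower 16 bits; the Chinese character does not.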

The representation of a sequence of characters as a sequence of 32-bit, 16-bit or 8-bit values is called an encoding form. It does not lose information. It just packs it. The encoding form is what you would consider when working in the computer. Those encoding forms are UTF-32, UTF-16 and UTF-8. Note that the byte order within each value is not specified.

However, those byte-orders have to be specified if these are viewed as bytes (or Transcript chars) or you are writing to a file. There you need UTF-32BE (big endian), UTF-32LE, UTF-16BE, UTF-16LE and UTF-8. The order is not needed on UTF-8. These are called encoding schemes.
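To make the difference between the encoding schemes concrete, here is a hedged Python sketch that serializes the same two characters several ways. The sample text is an assumption chosen for illustration; the codec names are Python's, not Transcript's.

```python
# One ASCII letter followed by one Chinese character (U+4E2D).
text = "A\u4e2d"

utf8 = text.encode("utf-8")          # 41 E4 B8 AD  (no byte order involved)
utf16_be = text.encode("utf-16-be")  # 00 41 4E 2D
utf16_le = text.encode("utf-16-le")  # 41 00 2D 4E
utf32_be = text.encode("utf-32-be")  # 00 00 00 41 00 00 4E 2D
utf32_le = text.encode("utf-32-le")  # 41 00 00 00 2D 4E 00 00

for name, data in [("UTF-8", utf8), ("UTF-16BE", utf16_be),
                   ("UTF-16LE", utf16_le), ("UTF-32BE", utf32_be),
                   ("UTF-32LE", utf32_le)]:
    print(f"{name:9s} {data.hex(' ')}")
```

Every one of these decodes back to the same two characters; only the byte layout differs.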

All Unicode characters can be packed into either UTF-8 or UTF-16.

For UTF-16, the only complication is the rare (and reasonably ignored) characters outside the BMP, the Basic Multilingual Plane. Those are handled by special pairs of 16-bit values, called surrogate pairs.
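As a rough illustration of how a character outside the BMP ends up as two 16-bit values, here is a Python sketch. The helper name utf16_units is hypothetical, and the sample characters are assumptions chosen for the example.

```python
def utf16_units(text):
    """Return the 16-bit code units of text in UTF-16 (hypothetical helper)."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

# A character inside the BMP takes a single 16-bit unit.
bmp_units = utf16_units("\u4e2d")

# A character outside the BMP (U+1F600) takes two units:
# a high surrogate (D800..DBFF) followed by a low surrogate (DC00..DFFF).
astral_units = utf16_units("\U0001F600")

print([f"{u:04X}" for u in bmp_units])
print([f"{u:04X}" for u in astral_units])
```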

For UTF-8 the encoding is very clever. All characters in the ASCII code range (7 bits) are represented by single bytes with the high bit zero. All others are represented by a sequence of bytes in which the high two bits are 11 in the first byte and 10 in every following byte. Also, it is possible to determine the number of bytes for that character from the first byte. You can read this backwards, too, so if Transcript goes to UTF-8, you can still get char -1.
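The two properties described here (length from the first byte, and reading backwards) can be sketched in Python as follows. The function names seq_len and last_char are hypothetical, and the code assumes well-formed UTF-8 input; it is an illustration, not how Transcript does or would do it.

```python
def seq_len(first_byte):
    """Number of bytes in a UTF-8 sequence, judged from its first byte."""
    if first_byte < 0x80:            # 0xxxxxxx: ASCII, one byte
        return 1
    if first_byte >> 5 == 0b110:     # 110xxxxx: two-byte sequence
        return 2
    if first_byte >> 4 == 0b1110:    # 1110xxxx: three-byte sequence
        return 3
    if first_byte >> 3 == 0b11110:   # 11110xxx: four-byte sequence
        return 4
    raise ValueError("not a valid first byte (continuation or invalid)")

def last_char(data):
    """Return the final character of well-formed UTF-8 bytes by scanning backwards."""
    i = len(data) - 1
    # Continuation bytes look like 10xxxxxx; step back until we pass them.
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return data[i:].decode("utf-8")

sample = "ab\u4e2d".encode("utf-8")   # 61 62 E4 B8 AD
print(seq_len(sample[0]), seq_len(sample[2]))
print(f"U+{ord(last_char(sample)):04X}")
```

The backwards scan never has to look at more than four bytes, which is what makes something like char -1 cheap.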

Since all the characters outside the ASCII range are represented by two to four bytes, each with the high bit set, you can never get a false lf or space or comma. Also, '=' only considers ASCII letters when ignoring case, so you never get any false letter-case conversions in a comparison. "is a number" works with the usual Transcript numerals. UTF-8 contains no null bytes unless the text contains a null character, so you can use it as a key to an array.
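Here is a hedged Python sketch of the "no false comma" point: splitting raw UTF-8 bytes on an ASCII comma is safe, while the same byte-level trick can go wrong on UTF-16. The item list and the character U+012C are assumptions picked to show the effect.

```python
# Three items, two of them containing non-ASCII characters.
items = ["\u4e2d\u6587", "caf\u00e9", "plain"]
data = ",".join(items).encode("utf-8")

# Split on the comma byte without decoding first.
parts = data.split(b",")
decoded = [p.decode("utf-8") for p in parts]

# Every byte belonging to a non-ASCII character has its high bit set,
# so none of them can be mistaken for a comma, space, tab or lf.
non_ascii_bytes = "\u4e2d\u6587\u00e9".encode("utf-8")
all_high = all(b >= 0x80 for b in non_ascii_bytes)

# Contrast: U+012C in UTF-16BE is the bytes 01 2C, and 2C is an ASCII comma.
false_comma_utf16 = b"," in "\u012c".encode("utf-16-be")
false_comma_utf8 = b"," in "\u012c".encode("utf-8")

print(decoded == items, all_high, false_comma_utf16, false_comma_utf8)
```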

There may be ways folks will fool you by putting a dot over a comma or space (if possible), but usually the comma and space work just the way you expect. Oh, I forgot to say that tab and lf are part of the ASCII range.

I don't know how "word" treats characters with the high bit set, but I bet it treats them as just more non-whitespace characters, so those should work in words, even when some of those codes stand for special space characters.

I would expect the compiler is the same way, so a special editor can compile unicode string constants into UTF-8.

UTF-8 is a "language" in uniDecode and uniEncode, so you can convert easily.

Note that when I mention UTF-16, the normal form we get from "the unicodeText", I always emphasize "host-order", though that is redundant in a sense. The order depends on the OS. Because we can access those one byte at a time, we must then know that one is UTF-16BE and another might be UTF-16LE.
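The host-order point can be illustrated with a small Python sketch. This shows Python's behavior, not Transcript's: Python's plain "utf-16" codec writes in the machine's native order and prepends a byte order mark, which is an assumption that does not carry over to "the unicodeText".

```python
import sys

text = "\u4e2d"  # sample character, chosen for illustration

# Pick the explicit scheme that matches this machine's byte order.
native = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"
host_order = text.encode(native)

# Python's plain "utf-16" codec uses native order and prepends a 2-byte BOM.
with_bom = text.encode("utf-16")

print(sys.byteorder, host_order.hex(" "), with_bom.hex(" "))
```

Read one byte at a time, host_order is 2D 4E on a little-endian machine and 4E 2D on a big-endian one, which is why the order has to be known before poking at individual bytes.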

I think it is handicapping to think of "wide characters" or "two-byte systems".

Dar

--
**********************************************
    DSC (Dar Scott Consulting & Dar's Lab)
    http://www.swcp.com/dsc/
    Programming and software
**********************************************

_______________________________________________
use-revolution mailing list
[email protected]
http://lists.runrev.com/mailman/listinfo/use-revolution
