RE: japanese xml

Peter_Constable Thu, 30 Aug 2001 22:30:27 -0700

Marco:

>> Furthermore, Viranga's context appears to be XML, in which
>> case it *is* possible to encode *all* Unicode code points
>> using EUC (or ISO-8859-1 or ASCII or ...)
>
>Yes, yes. XML documents can represent characters in at least two ways:

>2) By representing them with numeric references in the form "&#1234;" etc...

>In the context of Unicode and, more generally, plain-text encoding "to
>encode" means only point 1 above, and "&1234;" is just a six-character
>string. BTW, this is also the interpretation of tools (text editor, etc.)
>used to manipulate XML files -- so it is not a pointless distinction for
>someone working in XML.
>
>Point 2, in Unicode speech, is defined as a "higher level protocol",

I agree with you, but on the other hand, suppose we define UTF-NCR8:

Unicode        bit        code      code      code      code      code
scalar value   pattern    unit 1    unit 2    unit 3    unit 4    unit 5

0020 - 0027    00wwwwww   00100110  00100011  00110011  0011xxxx  00111011
               where xxxx = wwwwww - 11110 (binary)

0028 - 0031    00wwwwww   00100110  00100011  00110100  0011xxxx  00111011
               where xxxx = wwwwww - 101000 (binary)

0032 - 003b    00wwwwww   00100110  00100011  00110101  0011xxxx  00111011
               where xxxx = wwwwww - 110010 (binary)

etc., but with a handful of exceptions, such as

U+0026: 00100110 01100001 01101101 01110000 00111011

U+003C: 00100110 01101100 01110100 00111011
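
For anyone who would rather read code than bit patterns, here is a rough sketch of the same mapping in Python. The function name utf_ncr8_encode and the EXCEPTIONS table are made up for illustration; nothing here is a registered encoding. It assumes every scalar value is written as a decimal numeric character reference in ASCII, apart from the two exceptions listed above.

```python
# Illustrative sketch only: "UTF-NCR8" is a thought experiment, and the
# names utf_ncr8_encode and EXCEPTIONS are hypothetical.

# The two exceptions given in the text use named entities instead of NCRs.
EXCEPTIONS = {0x26: b"&amp;", 0x3C: b"&lt;"}


def utf_ncr8_encode(text: str) -> bytes:
    """Map each scalar value to the ASCII code units of '&#<decimal>;'."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp in EXCEPTIONS:
            out += EXCEPTIONS[cp]
        else:
            # '&' '#' <decimal digits> ';', one 8-bit code unit per character
            out += f"&#{cp};".encode("ascii")
    return bytes(out)
```

With this sketch, U+0028 (decimal 40) comes out as the five code units 00100110 00100011 00110100 00110000 00111011, which matches the second row of the table.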

We can also define UTF-NCR16 in just the same way, except that the code units are 16-bit, zero-extended equivalents of the UTF-NCR8 code units. One of the interesting aspects of these encodings is that XML parsers understand them without requiring that the charset be declared, just like UTF-8 and UTF-16.
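
The parser claim is easy to try out. Below is a small sketch, assuming Python's standard xml.etree.ElementTree as the XML parser; the helper names to_ncrs and roundtrip are hypothetical. It wraps the character data in a one-element document with no XML declaration, serializes it either as 8-bit ASCII code units or as 16-bit code units (Python's "utf-16" codec, which adds a byte order mark), and lets the parser recover the original text.

```python
import xml.etree.ElementTree as ET


def to_ncrs(text: str) -> str:
    """Rewrite character data as decimal NCRs, with '&' and '<' as named entities."""
    special = {"&": "&amp;", "<": "&lt;"}
    return "".join(special.get(ch, f"&#{ord(ch)};") for ch in text)


def roundtrip(text: str, codec: str) -> str:
    """Serialize an NCR-only document with the given codec and parse it back."""
    # Only the character data is rewritten; the <r> tags are ordinary markup.
    document = ("<r>" + to_ncrs(text) + "</r>").encode(codec)
    # No encoding declaration is supplied; the parser must work it out itself.
    return ET.fromstring(document).text


sample = "日本語 & <XML>"
print(roundtrip(sample, "ascii") == sample)
print(roundtrip(sample, "utf-16") == sample)
```

The 8-bit case works because the bytes are plain ASCII, which is also valid UTF-8; the 16-bit case relies on the parser autodetecting UTF-16 from the byte order mark.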

Now, if someone interpreted Misha to mean one of these encodings, then he would be talking about encoding in the same sense as you. :-)

Peter
