On Sun, 08 Jan 2012 01:37:14 +0100, NARUSE, Yui <[email protected]> wrote:
= Legacy multi-octet Chinese (traditional) encodings

Mozilla supports another Big5 variant, Big5-UAO.
http://bugs.ruby-lang.org/issues/1784

As part of the big5 encoding, right? It sounds like it's a good idea to adopt that. I don't think there's much concern about table size these days, though obviously the less complexity the better.


= Legacy multi-octet Japanese encodings

The jis code point for a given number is: ...
The jis0208 index for a given octet is:

I have doubts about this description, so let me explain the concept behind JIS X 0208.

The most important point is that JIS X 0208 is defined in the context of ISO 2022.
It targets the ISO/IEC 2022 double-byte 94-character set,
which means its code space is 94 x 94.
http://en.wikipedia.org/wiki/JIS_X_0208

At the top level, there are kuten numbers.
"ku" is the row, expressed by the first byte of the double-byte code.
"ten" is the cell, expressed by the second byte of the double-byte code.
So a kuten number identifies a code point.
Both ku and ten are integers from 1 to 94.
For example, the kuten number of Hiragana Letter A is 04-01.

ISO-2022-JP, EUC-JP, and Shift_JIS map a kuten number to bytes.
ISO-2022-JP's double bytes are:
 first:  ku  + 0x20
 second: ten + 0x20
EUC-JP's double bytes are:
 first:  ku  + 0xA0
 second: ten + 0xA0
Shift_JIS's double bytes are (using integer division):
 first:  if    1 <= ku <= 62 then (ku-1) / 2 + 0x81
         elif 63 <= ku <= 94 then (ku-1) / 2 + 0xC1
 second: if ku is odd
           if    1 <= ten <= 63 then ten + 0x3F
           elif 64 <= ten <= 94 then ten + 0x40
         elif ku is even then ten + 0x9E


So theoretically, we should make a conversion table between
kuten numbers and Unicode scalar values.

But as you know, "JIS X 0208" in the web context should really be Windows Code Page 932,
which is Microsoft's extension.
http://msdn.microsoft.com/en-us/goglobal/cc305152
It is defined in terms of Shift_JIS.

The jis0212 index for a given octet is:

As written in Bugzilla@Mozilla Bug 600715, IE doesn't support JIS X 0212.
https://bugzilla.mozilla.org/show_bug.cgi?id=600715
How to treat JIS X 0212 in this Encoding spec will be a problem.

Yeah, so currently I (roughly) use Gecko's approach to the Japanese encodings, including how it puts both 0208 and 0212 in a single longish array. But maybe I should instead write it down the way Unicode.org does, with each double-octet sequence mapping to a Unicode character. Suggestions welcome.

With respect to 0212, it's not that hard to support, and given how long it has been deployed this way, I think it's probably safer to keep it.


== iso-2022-jp
=== The to Unicode algorithm
==== Based on iso-2022-jp state
===== ASCII state
====== Based on octet:
======= Otherwise
If the fatal flag is set, return failure.
Otherwise, emit the fallback code point.

Just FYI, IE and Opera show these bytes as halfwidth Katakana.
If the octet is greater than 0xA0 and less than 0xE0, the value is octet + 0xFEC0.
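
As a small sketch of that rule in Python (the function name is hypothetical, and returning None for other octets is my own choice for illustration, not what any browser does):

```python
def halfwidth_katakana(octet):
    """Map an octet in 0xA1..0xDF to its halfwidth katakana code point."""
    if 0xA0 < octet < 0xE0:
        return chr(octet + 0xFEC0)
    return None  # outside the range described above


print(hex(ord(halfwidth_katakana(0xA1))))  # 0xff61
print(hex(ord(halfwidth_katakana(0xDF))))  # 0xff9f
```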

Moreover, IE accepts any shift_jis characters here.
It seems that IE uses the same converter for both iso-2022-jp and shift_jis.

I have filed a bug on Opera to become stricter, like WebKit/Gecko. If there is evidence that this approach is wrong, though, we can turn it around.


--
Anne van Kesteren
http://annevankesteren.nl/
