On Sun, 08 Jan 2012 01:37:14 +0100, NARUSE, Yui <[email protected]> wrote:
= Legacy multi-octet Chinese (traditional) encodings
Mozilla supports another Big5 variant, Big5-UAO.
http://bugs.ruby-lang.org/issues/1784
As part of the big5 encoding, right? It sounds like it's a good idea to
adopt that. I don't think there's much concern about table size these
days, though obviously the less complexity the better.
= Legacy multi-octet Japanese encodings
The jis code point for a given number is: ...
The jis0208 index for a given octet is:
I wonder about this description.
I should explain the concept of JIS X 0208.
The most important thing is that JIS X 0208 is defined in the context of
ISO/IEC 2022.
It is designed as an ISO/IEC 2022 double-byte 94-character set.
That means its code space is 94 x 94.
http://en.wikipedia.org/wiki/JIS_X_0208
At the top of that page are the kuten numbers.
"ku" is the row, expressed by the first byte of the double-byte code.
"ten" is the cell, expressed by the second byte of the double-byte code.
So a kuten number identifies a code point.
Both ku and ten are integers from 1 to 94.
For example, the kuten number of HIRAGANA LETTER SMALL A (U+3041) is 04-01.
ISO-2022-JP, EUC-JP, and Shift_JIS map a kuten number to bytes.
ISO-2022-JP's double bytes are:
first: ku + 0x20
second: ten + 0x20
EUC-JP's double bytes are:
first: ku + 0xA0
second: ten + 0xA0
Shift_JIS's double bytes are:
first: if 1 <= ku <= 62 then (ku-1) / 2 + 0x81
       elif 63 <= ku <= 94 then (ku-1) / 2 + 0xC1
       (integer division)
second: if ku is odd
          if 1 <= ten <= 63 then ten + 0x3F
          elif 64 <= ten <= 94 then ten + 0x40
        elif ku is even then ten + 0x9E
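
A minimal Python sketch of the three kuten-to-bytes mappings described above. The function names are hypothetical and chosen only for illustration. For Shift_JIS it assumes the usual row split: odd ku takes the trail range 0x40-0x7E and 0x80-0x9E, even ku takes 0x9F-0xFC. It covers plain JIS X 0208 only, not the Windows Code Page 932 extensions.

```python
def _check(ku, ten):
    # Both ku (row) and ten (cell) must be integers from 1 to 94.
    if not (1 <= ku <= 94 and 1 <= ten <= 94):
        raise ValueError("ku and ten must be in 1..94")


def kuten_to_iso2022jp(ku, ten):
    """Two bytes used inside the JIS X 0208 state (escape sequences omitted)."""
    _check(ku, ten)
    return bytes([ku + 0x20, ten + 0x20])


def kuten_to_eucjp(ku, ten):
    _check(ku, ten)
    return bytes([ku + 0xA0, ten + 0xA0])


def kuten_to_shift_jis(ku, ten):
    _check(ku, ten)
    # Two rows share one lead byte; rows 63 and up skip over 0xA0-0xDF.
    lead = (ku - 1) // 2 + (0x81 if ku <= 62 else 0xC1)
    if ku % 2 == 1:
        # Odd row: lower half of the trail range, skipping 0x7F.
        trail = ten + (0x3F if ten <= 63 else 0x40)
    else:
        # Even row: upper half of the trail range.
        trail = ten + 0x9E
    return bytes([lead, trail])


# Kuten 04-01 (U+3041) in each encoding.
print(kuten_to_iso2022jp(4, 1).hex())
print(kuten_to_eucjp(4, 1).hex())
print(kuten_to_shift_jis(4, 1).hex())
```
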
So theoretically, we should make a conversion table between
kuten numbers and Unicode scalar values.
But as you know, "JIS X 0208" in the web context should really be Windows
Code Page 932, Microsoft's extension of it.
http://msdn.microsoft.com/en-us/goglobal/cc305152
It is defined in terms of Shift_JIS.
The jis0212 index for a given octet is:
As written in Bugzilla@Mozilla Bug 600715, IE doesn't support JIS X 0212.
https://bugzilla.mozilla.org/show_bug.cgi?id=600715
How to treat JIS X 0212 in this Encoding spec will be a problem.
Yeah so currently I used Gecko's approach (roughly) towards Japanese
encodings, including how they put both 0208 and 0212 in a single longish
array. But maybe instead I should write it down as it has been done by
Unicode.org, with double-octet sequence mapping to a Unicode character.
Suggestions welcome.
With respect to 0212, it's not that hard to support it and given how long
it has been deployed this way it's probably safer to keep it there I think.
== iso-2022-jp
=== The to Unicode algorithm
==== Based on iso-2022-jp state
===== ASCII state
====== Based on octet:
======= Otherwise
If the fatal flag is set, return failure.
Otherwise, emit the fallback code point.
Just FYI, IE and Opera show these bytes as Katakana.
If octet is greater than 0xA0 and less than 0xE0, value is octet +
0xFEC0.
Moreover, IE shows any shift_jis character here.
It seems that IE uses the same converter for both iso-2022-jp and shift_jis.
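
The quoted rule for octets between 0xA0 and 0xE0 (value is octet + 0xFEC0) maps single bytes onto the halfwidth katakana block. A small sketch of that arithmetic, with a hypothetical function name, assuming octets outside the range are handled elsewhere:

```python
def halfwidth_katakana(octet):
    """Return the character for an octet in 0xA1..0xDF, or None otherwise."""
    if 0xA0 < octet < 0xE0:
        # 0xA1 + 0xFEC0 == 0xFF61, the start of the halfwidth katakana block.
        return chr(octet + 0xFEC0)
    return None


for octet in (0xA1, 0xB1, 0xDF):
    ch = halfwidth_katakana(octet)
    print(f"0x{octet:02X} -> U+{ord(ch):04X}")
```
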
I have filed a bug on Opera to become more strict like Webkit/Gecko. If
there is some evidence that approach is wrong though, we can turn it
around.
--
Anne van Kesteren
http://annevankesteren.nl/