Re: [whatwg] Encodings and the web

Anne van Kesteren Sun, 08 Jan 2012 06:32:59 -0800

On Sun, 08 Jan 2012 01:37:14 +0100, NARUSE, Yui <[email protected]> wrote:

= Legacy multi-octet Chinese (traditional) encodings


Mozilla supports another Big5 variant, Big5-UAO.
http://bugs.ruby-lang.org/issues/1784

As part of the big5 encoding, right? It sounds like it's a good idea to adopt that. I don't think there's much concern about table size these days, though obviously the less complexity the better.

= Legacy multi-octet Japanese encodings

The jis code point for a given number is: ...
The jis0208 index for a given octet is:


I wonder about this description.
I should explain the concept of JIS X 0208.

The most important thing is that JIS X 0208 is defined in the context of ISO/IEC 2022.

Its target is an ISO/IEC 2022 double-byte 94-character set.
This means its code space is 94 x 94.
http://en.wikipedia.org/wiki/JIS_X_0208

At the top level, there are kuten numbers.
"ku" is the row, expressed by the first byte of the double-byte code.
"ten" is the cell, expressed by the second byte of the double-byte code.
So a kuten number expresses a code point.
Both ku and ten are integers from 1 to 94.
For example, the kuten number of Hiragana letter small A (U+3041) is 04-01.

ISO-2022-JP, EUC-JP, and Shift_JIS map a kuten number to bytes.
ISO-2022-JP's double bytes are:
 first:  ku  + 0x20
 second: ten + 0x20
EUC-JP's double bytes are:
 first:  ku  + 0xA0
 second: ten + 0xA0
Shift_JIS's double bytes are:
 first:  if    1 <= ku <= 62 then (ku-1) / 2 + 0x81
         elif 63 <= ku <= 94 then (ku-1) / 2 + 0xC1
         (where "/" is integer division)
 second: if ku is odd
           if    1 <= ten <= 63 then ten + 0x3F
           elif 64 <= ten <= 94 then ten + 0x40
         elif ku is even then ten + 0x9E
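
The three byte mappings can be sketched in Python as follows. This is only an illustrative sketch: `kuten_to_bytes` is a hypothetical helper name, the ISO-2022-JP escape sequences that switch character sets are left out, and the Shift_JIS second byte follows the usual Shift_JIS layout (an odd ku uses the lower trail-byte ranges split on ten, an even ku uses ten + 0x9E).

```python
def kuten_to_bytes(ku, ten, encoding):
    """Map a JIS X 0208 kuten number to its two bytes (hypothetical helper)."""
    if not (1 <= ku <= 94 and 1 <= ten <= 94):
        raise ValueError("ku and ten must both be in 1..94")

    if encoding == "iso-2022-jp":
        # Escape sequences are deliberately not emitted here.
        return bytes([ku + 0x20, ten + 0x20])

    if encoding == "euc-jp":
        return bytes([ku + 0xA0, ten + 0xA0])

    if encoding == "shift_jis":
        # Two rows share one lead byte, hence the integer division.
        first = (ku - 1) // 2 + (0x81 if ku <= 62 else 0xC1)
        if ku % 2 == 1:
            # Odd row: lower half of the trail-byte range, skipping 0x7F.
            second = ten + (0x3F if ten <= 63 else 0x40)
        else:
            # Even row: upper half of the trail-byte range.
            second = ten + 0x9E
        return bytes([first, second])

    raise ValueError("unsupported encoding: " + encoding)


for name in ("iso-2022-jp", "euc-jp", "shift_jis"):
    print(name, kuten_to_bytes(4, 1, name).hex())
```

For kuten 04-01 this gives 0x24 0x21, 0xA4 0xA1, and 0x82 0x9F respectively.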


So theoretically, we should make a conversion table between
kuten numbers and Unicode scalar values.

But as you know, "JIS X 0208" in the web context should be Windows Code Page 932, as extended by Microsoft.
http://msdn.microsoft.com/en-us/goglobal/cc305152
It is defined in terms of Shift_JIS.

The jis0212 index for a given octet is:


As written in Bugzilla@Mozilla Bug 600715, IE doesn't support JIS X 0212.
https://bugzilla.mozilla.org/show_bug.cgi?id=600715
How to treat JIS X 0212 in this Encoding spec will be a problem.

Yeah so currently I used Gecko's approach (roughly) towards Japanese encodings, including how they put both 0208 and 0212 in a single longish array. But maybe instead I should write it down as it has been done by Unicode.org, with double-octet sequence mapping to a Unicode character. Suggestions welcome.
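
To make the two table layouts concrete, here is a small Python sketch. It is not Gecko's or Unicode.org's actual data layout: the three sample cells, the pointer formula `(ku - 1) * 94 + (ten - 1)`, the choice of EUC-JP bytes as keys, and all names are assumptions made purely for illustration.

```python
# Toy data: three JIS X 0208 cells (ku, ten) and their Unicode code points.
SAMPLE = {(4, 1): 0x3041, (4, 2): 0x3042, (16, 1): 0x4E9C}

# Layout 1: one flat array, indexed by a pointer computed from ku and ten.
FLAT = [None] * (94 * 94)
for (ku, ten), code_point in SAMPLE.items():
    FLAT[(ku - 1) * 94 + (ten - 1)] = code_point

# Layout 2: a table keyed directly by the double-octet sequence
# (EUC-JP bytes are used as the key here, as an assumption).
BY_OCTETS = {
    bytes([ku + 0xA0, ten + 0xA0]): code_point
    for (ku, ten), code_point in SAMPLE.items()
}


def decode_flat(lead, trail):
    """Look up an EUC-JP byte pair through the flat array."""
    ku, ten = lead - 0xA0, trail - 0xA0
    if not (1 <= ku <= 94 and 1 <= ten <= 94):
        return None
    code_point = FLAT[(ku - 1) * 94 + (ten - 1)]
    return None if code_point is None else chr(code_point)


def decode_by_octets(lead, trail):
    """Look up an EUC-JP byte pair through the octet-keyed table."""
    code_point = BY_OCTETS.get(bytes([lead, trail]))
    return None if code_point is None else chr(code_point)


print(decode_flat(0xA4, 0xA2), decode_by_octets(0xA4, 0xA2))
```

Both lookups return the same character; the difference is only whether the byte-to-pointer arithmetic lives in the algorithm or is baked into the table keys.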

With respect to 0212, it's not that hard to support it and given how long it has been deployed this way it's probably safer to keep it there I think.

== iso-2022-jp
=== The to Unicode algorithm
==== Based on iso-2022-jp state
===== ASCII state
====== Based on octet:
======= Otherwise

If the fatal flag is set, return failure.
Otherwise, emit the fallback code point.


Just FYI, IE and Opera show these bytes as Katakana.

If octet is greater than 0xA0 and less than 0xE0, value is octet + 0xFEC0.
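
As a quick illustration of that offset, the Python sketch below maps an octet in the range 0xA1 to 0xDF onto the halfwidth katakana block that starts at U+FF61. The function name is hypothetical, and returning `None` for other octets is an assumption of this sketch, not something the draft specifies.

```python
def halfwidth_katakana(octet):
    """Return the halfwidth katakana character for an octet, or None."""
    if 0xA0 < octet < 0xE0:
        # 0xA1 + 0xFEC0 == 0xFF61, the first halfwidth katakana code point.
        return chr(octet + 0xFEC0)
    return None


for octet in (0xA1, 0xB1, 0xDF):
    print(hex(octet), hex(ord(halfwidth_katakana(octet))))
```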


Moreover, IE shows any shift_jis characters here.
It seems that IE uses the same converter for both iso-2022-jp and shift_jis.

I have filed a bug on Opera to become more strict like Webkit/Gecko. If there is some evidence that approach is wrong though, we can turn it around.



--
Anne van Kesteren
http://annevankesteren.nl/
