Re: [whatwg] Encodings and the web
On Sun, 08 Jan 2012 01:37:14 +0100, NARUSE, Yui nar...@airemix.jp wrote:
> = Legacy multi-octet Chinese (traditional) encodings
>
> Mozilla supports another Big5 variant, Big5-UAO.
> http://bugs.ruby-lang.org/issues/1784

As part of the big5 encoding, right? It sounds like it's a good idea to adopt that. I don't think there's much concern about table size these days, though obviously the less complexity the better.

> = Legacy multi-octet Japanese encodings
>
>> The jis code point for a given number is: ...
>> The jis0208 index for a given octet is:
>
> I wonder about this description. I should explain the concept of JIS X 0208.
> The most important thing is that JIS X 0208 is defined in the context of ISO 2022. Its target is an ISO/IEC 2022 double-byte 94-character set, which means its code space is 94 x 94.
> http://en.wikipedia.org/wiki/JIS_X_0208
>
> At the top, there are kuten numbers. ku is the row, expressed by the first byte of the double-byte code. ten is the cell, expressed by the second byte of the double-byte code. So a kuten number expresses a code point. Both ku and ten are integers from 1 to 94. For example, the kuten number of Hiragana letter small A (ぁ) is 04-01.
>
> ISO-2022-JP, EUC-JP, and Shift_JIS map a kuten number to bytes.
>
> ISO-2022-JP's double bytes are:
>   first: ku + 0x20
>   second: ten + 0x20
>
> EUC-JP's double bytes are:
>   first: ku + 0xA0
>   second: ten + 0xA0
>
> Shift_JIS's double bytes are (with integer division):
>   first:
>     if 1 <= ku <= 62 then (ku-1) / 2 + 0x81
>     elif 63 <= ku <= 94 then (ku-1) / 2 + 0xC1
>   second:
>     if ku is odd
>       if 1 <= ten <= 63 then ten + 0x3F
>       elif 64 <= ten <= 94 then ten + 0x40
>     elif ku is even then ten + 0x9E
>
> So theoretically, we should make a conversion table between kuten numbers and Unicode scalar values. But as you know, JIS X 0208 in the web context should be Windows Code Page 932, as extended by Microsoft.
> http://msdn.microsoft.com/en-us/goglobal/cc305152
> It is defined in terms of Shift_JIS.
>
>> The jis0212 index for a given octet is:
>
> As written in Bugzilla@Mozilla Bug 600715, IE doesn't support JIS X 0212.
> https://bugzilla.mozilla.org/show_bug.cgi?id=600715
> How to treat X0212 in this Encoding spec will be a problem.

Yeah so currently I used Gecko's approach (roughly) towards Japanese encodings, including how they put both 0208 and 0212 in a single longish array. But maybe instead I should write it down as it has been done by Unicode.org, with a double-octet sequence mapping to a Unicode character. Suggestions welcome. With respect to 0212, it's not that hard to support it and given how long it has been deployed this way it's probably safer to keep it there I think.

> == iso-2022-jp
> === The to Unicode algorithm
>
>> Based on iso-2022-jp state
>> = ASCII state
>> == Based on octet:
>> === Otherwise
>> If the fatal flag is set, return failure. Otherwise, emit the fallback code point.
>
> Just FYI, IE and Opera show these bytes as Katakana: if octet is greater than 0xA0 and less than 0xE0, value is octet + 0xFEC0.
> Moreover IE shows any shift_jis characters here. It seems that IE uses the same converter for both iso-2022-jp and shift_jis.

I have filed a bug on Opera to become more strict like Webkit/Gecko. If there is some evidence that approach is wrong though, we can turn it around.

--
Anne van Kesteren
http://annevankesteren.nl/
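
The strict behaviour quoted from the draft ("If the fatal flag is set, return failure. Otherwise, emit the fallback code point.") can be sketched as below. This is only an illustration, not the draft's algorithm: the function name, the `DecodeFailure` exception, and the choice of U+FFFD as the fallback code point are assumptions, and escape-sequence handling in the ASCII state is left out entirely.

```python
# Assumption: the "fallback code point" is U+FFFD REPLACEMENT CHARACTER.
FALLBACK_CODE_POINT = "\uFFFD"


class DecodeFailure(Exception):
    """Hypothetical way of signalling "return failure"."""


def decode_ascii_state_octet(octet: int, fatal: bool = False) -> str:
    """Decode one octet in the iso-2022-jp ASCII state (ESC handling omitted)."""
    if 0x00 <= octet <= 0x7F:
        # Plain ASCII octets map to the same code point.
        return chr(octet)
    # The "Otherwise" branch: any octet with the high bit set.
    if fatal:
        raise DecodeFailure(f"unexpected octet 0x{octet:02X} in ASCII state")
    return FALLBACK_CODE_POINT
```

With `fatal` left unset, an octet such as 0xB1 yields the fallback code point instead of the Katakana that IE and Opera are reported to show.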
Re: [whatwg] Encodings and the web
(2012/01/08 23:32), Anne van Kesteren wrote:
> On Sun, 08 Jan 2012 01:37:14 +0100, NARUSE, Yui nar...@airemix.jp wrote:
>> = Legacy multi-octet Chinese (traditional) encodings
>>
>> Mozilla supports another Big5 variant, Big5-UAO.
>> http://bugs.ruby-lang.org/issues/1784
>
> As part of the big5 encoding, right? It sounds like it's a good idea to adopt that. I don't think there's much concern about table size these days, though obviously the less complexity the better.

CC to the original reporter. Could you help us understand the current situation in Taiwan?

>> == iso-2022-jp
>> === The to Unicode algorithm
>>
>>> Based on iso-2022-jp state
>>> = ASCII state
>>> == Based on octet:
>>> === Otherwise
>>> If the fatal flag is set, return failure. Otherwise, emit the fallback code point.
>>
>> Just FYI, IE and Opera show these bytes as Katakana: if octet is greater than 0xA0 and less than 0xE0, value is octet + 0xFEC0.
>> Moreover IE shows any shift_jis characters here. It seems that IE uses the same converter for both iso-2022-jp and shift_jis.
>
> I have filed a bug on Opera to become more strict like Webkit/Gecko. If there is some evidence that approach is wrong though, we can turn it around.

There is an old variant of ISO-2022-JP called JIS8. JIS8 was used before RFC 1468 was written and is still used in some areas, for example bank-to-bank information exchange. The 8 in JIS8 refers to using 8-bit bytes to express Katakana, which is exactly the behaviour described above.

So I can't say that it is a bug in Opera at this time. It depends on how many sites use such 8-bit Katakana.

--
NARUSE, Yui nar...@airemix.jp
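
The 8-bit Katakana rule mentioned in this message (octets greater than 0xA0 and less than 0xE0 map to octet + 0xFEC0) is simple enough to check by hand; a minimal sketch follows. The function name is made up for illustration, and the code only shows the arithmetic, not how any browser implements it.

```python
from typing import Optional


def eight_bit_katakana(octet: int) -> Optional[str]:
    """Map a JIS8-style 8-bit octet to a halfwidth Katakana character.

    Returns None for octets outside the range 0xA1..0xDF.
    """
    if 0xA0 < octet < 0xE0:
        # 0xA1 + 0xFEC0 = U+FF61 and 0xDF + 0xFEC0 = U+FF9F, so the range
        # lands on the Unicode halfwidth Katakana block.
        return chr(octet + 0xFEC0)
    return None
```

For example, octet 0xB1 gives U+FF71 (halfwidth Katakana letter A), which is also what a Shift_JIS decoder produces for that single byte.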
Re: [whatwg] Encodings and the web
Hi, thank you for the quick reply,

(2012/01/09 0:38), Lin Jen-Shin (godfat) wrote:
> On Sun, Jan 8, 2012 at 11:20 PM, NARUSE, Yui nar...@airemix.jp wrote:
>> (2012/01/08 23:32), Anne van Kesteren wrote:
>>> On Sun, 08 Jan 2012 01:37:14 +0100, NARUSE, Yui nar...@airemix.jp wrote:
>>>> = Legacy multi-octet Chinese (traditional) encodings
>>>>
>>>> Mozilla supports another Big5 variant, Big5-UAO.
>>>> http://bugs.ruby-lang.org/issues/1784
>>>
>>> As part of the big5 encoding, right? It sounds like it's a good idea to adopt that. I don't think there's much concern about table size these days, though obviously the less complexity the better.
>>
>> CC to the original reporter. Could you help us understand the current situation in Taiwan?
>
> I am not sure what I can do here, but I would try my best to coordinate if there's anything I could help with. So what are we trying to solve here, again?

This thread continues from
http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2011-December/034241.html
and discusses a spec about encodings on the web:
http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html

I'm interested in whether web browsers other than Mozilla should implement Big5-UAO or not.

Thanks,

--
NARUSE, Yui nar...@airemix.jp
Re: [whatwg] Encodings and the web
(2012/01/07 0:38), Anne van Kesteren wrote:
> On Thu, 22 Dec 2011 15:33:35 +0100, L. David Baron dba...@dbaron.org wrote:
>> This seems like one of those areas where it may be substantially easier to figure out what implementations do by looking at their code than by reverse-engineering, at least for the implementations whose code is available publicly. Gecko's code lives in http://mxr.mozilla.org/mozilla-central/source/intl/uconv/ . There are others who know it substantially better, but I or others could probably answer questions you have about how it works and how to understand it. I'm not the right person for pointers to other implementations, though.
>
> Thanks, I'm doing a combination of code inspection, reverse engineering (especially for edge cases), and applying some lessons we learned (e.g. non-greedy error handling). So far I defined the to Unicode algorithms for hz-gb-2312, euc-jp, iso-2022-jp, and shift_jis.

= Legacy multi-octet Chinese (traditional) encodings

Mozilla supports another Big5 variant, Big5-UAO.
http://bugs.ruby-lang.org/issues/1784

= Legacy multi-octet Japanese encodings

> The jis code point for a given number is: ...
> The jis0208 index for a given octet is:

I wonder about this description. I should explain the concept of JIS X 0208.
The most important thing is that JIS X 0208 is defined in the context of ISO 2022. Its target is an ISO/IEC 2022 double-byte 94-character set, which means its code space is 94 x 94.
http://en.wikipedia.org/wiki/JIS_X_0208

At the top, there are kuten numbers. ku is the row, expressed by the first byte of the double-byte code. ten is the cell, expressed by the second byte of the double-byte code. So a kuten number expresses a code point. Both ku and ten are integers from 1 to 94. For example, the kuten number of Hiragana letter small A (ぁ) is 04-01.

ISO-2022-JP, EUC-JP, and Shift_JIS map a kuten number to bytes.

ISO-2022-JP's double bytes are:
  first: ku + 0x20
  second: ten + 0x20

EUC-JP's double bytes are:
  first: ku + 0xA0
  second: ten + 0xA0

Shift_JIS's double bytes are (with integer division):
  first:
    if 1 <= ku <= 62 then (ku-1) / 2 + 0x81
    elif 63 <= ku <= 94 then (ku-1) / 2 + 0xC1
  second:
    if ku is odd
      if 1 <= ten <= 63 then ten + 0x3F
      elif 64 <= ten <= 94 then ten + 0x40
    elif ku is even then ten + 0x9E

So theoretically, we should make a conversion table between kuten numbers and Unicode scalar values. But as you know, JIS X 0208 in the web context should be Windows Code Page 932, as extended by Microsoft.
http://msdn.microsoft.com/en-us/goglobal/cc305152
It is defined in terms of Shift_JIS.

> The jis0212 index for a given octet is:

As written in Bugzilla@Mozilla Bug 600715, IE doesn't support JIS X 0212.
https://bugzilla.mozilla.org/show_bug.cgi?id=600715
How to treat X0212 in this Encoding spec will be a problem.

== iso-2022-jp
=== The to Unicode algorithm

> Based on iso-2022-jp state
> = ASCII state
> == Based on octet:
> === Otherwise
> If the fatal flag is set, return failure. Otherwise, emit the fallback code point.

Just FYI, IE and Opera show these bytes as Katakana: if octet is greater than 0xA0 and less than 0xE0, value is octet + 0xFEC0.
Moreover IE shows any shift_jis characters here. It seems that IE uses the same converter for both iso-2022-jp and shift_jis.

--
NARUSE, Yui nar...@airemix.jp
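
The kuten-to-bytes rules in this message can be sketched in Python as follows. The function names are invented for this illustration and it is not code from any browser. The Shift_JIS branch uses integer division and the usual arrangement in which odd rows take ten + 0x3F or ten + 0x40 and even rows take ten + 0x9E. Only the arithmetic for the 94 x 94 JIS X 0208 code space is covered; the Code Page 932 extensions are not.

```python
def _check_kuten(ku: int, ten: int) -> None:
    # Both ku (row) and ten (cell) range over 1..94.
    if not (1 <= ku <= 94 and 1 <= ten <= 94):
        raise ValueError(f"kuten {ku:02d}-{ten:02d} is outside the 94 x 94 code space")


def kuten_to_iso2022jp(ku: int, ten: int) -> bytes:
    """Two bytes as used inside ISO-2022-JP's JIS X 0208 state (no escape sequence)."""
    _check_kuten(ku, ten)
    return bytes([ku + 0x20, ten + 0x20])


def kuten_to_eucjp(ku: int, ten: int) -> bytes:
    _check_kuten(ku, ten)
    return bytes([ku + 0xA0, ten + 0xA0])


def kuten_to_shift_jis(ku: int, ten: int) -> bytes:
    _check_kuten(ku, ten)
    # Two rows share one lead byte; rows 63..94 jump past the 0xA0..0xDF gap.
    first = (ku - 1) // 2 + (0x81 if ku <= 62 else 0xC1)
    if ku % 2 == 1:
        # Odd row: trail bytes 0x40..0x7E and 0x80..0x9E (0x7F is skipped).
        second = ten + (0x3F if ten <= 63 else 0x40)
    else:
        # Even row: trail bytes 0x9F..0xFC.
        second = ten + 0x9E
    return bytes([first, second])
```

For kuten 04-01 this gives 0x24 0x21 in ISO-2022-JP, 0xA4 0xA1 in EUC-JP, and 0x82 0x9F in Shift_JIS.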
Re: [whatwg] Encodings and the web
On Thu, 22 Dec 2011 15:33:35 +0100, L. David Baron dba...@dbaron.org wrote:
> This seems like one of those areas where it may be substantially easier to figure out what implementations do by looking at their code than by reverse-engineering, at least for the implementations whose code is available publicly. Gecko's code lives in http://mxr.mozilla.org/mozilla-central/source/intl/uconv/ . There are others who know it substantially better, but I or others could probably answer questions you have about how it works and how to understand it. I'm not the right person for pointers to other implementations, though.

Thanks, I'm doing a combination of code inspection, reverse engineering (especially for edge cases), and applying some lessons we learned (e.g. non-greedy error handling). So far I defined the to Unicode algorithms for hz-gb-2312, euc-jp, iso-2022-jp, and shift_jis.

http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html

Feedback welcome!

--
Anne van Kesteren
http://annevankesteren.nl/
Re: [whatwg] Encodings and the web
On Tuesday 2011-12-20 12:01 +0100, Anne van Kesteren wrote:
> If you are interested in helping out testing (and reverse engineering) multi-octet encodings please let me know. Any other input is much appreciated as well.

This seems like one of those areas where it may be substantially easier to figure out what implementations do by looking at their code than by reverse-engineering, at least for the implementations whose code is available publicly. Gecko's code lives in http://mxr.mozilla.org/mozilla-central/source/intl/uconv/ . There are others who know it substantially better, but I or others could probably answer questions you have about how it works and how to understand it. I'm not the right person for pointers to other implementations, though.

-David

--
𝄞 L. David Baron http://dbaron.org/ 𝄂
𝄢 Mozilla http://www.mozilla.org/ 𝄂
Re: [whatwg] Encodings and the web
On Wed, 21 Dec 2011 04:40:10 +0100, Mark Callow callow_m...@hicorp.co.jp wrote:
> On 20/12/2011 20:01, Anne van Kesteren wrote:
>> [3]http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html
>
> This is a great start. A few comments:
>
> It seems weird to use Windows' names rather than the iso names as the official encoding names. E.g., I expected iso-8859-1 to be the encoding and windows-1252 to be one of the labels.

Since the actual encoding used is closer to windows-1252 it seemed more accurate to me to do it the other way around (though for shift_jis I have not done that as everyone calls windows-31j shift_jis). It does affect what document.characterSet returns though so maybe we should switch it.

> Notes still says multi-octet encodings aren't listed at all. Perhaps I am misinterpreting what list of encodings refers to.

Oops, removed that. (Though not all multi-octet encodings are listed yet.)

> Including tables for all the multi-octet encodings is going to be a big task and create a very long document. Such tables may be better placed in linked documents rather than the main body.

Yeah I think we have to do that for some encodings. Others, such as UTF-8 and UTF-16, can probably be defined inline.

--
Anne van Kesteren
http://annevankesteren.nl/
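
The name-versus-label distinction discussed in this message can be pictured as a lookup table in which several labels resolve to one encoding name. The sketch below is an assumption-laden illustration, not the draft's table: the `LABELS` contents are a tiny hypothetical excerpt (treating windows-31j as a label of shift_jis is a guess), and matching by trimming surrounding whitespace and ignoring ASCII case is assumed rather than taken from the thread.

```python
from typing import Optional

# Hypothetical excerpt: label -> encoding name.
LABELS = {
    "windows-1252": "windows-1252",
    "iso-8859-1": "windows-1252",  # the iso name is a label, not the name
    "shift_jis": "shift_jis",
    "windows-31j": "shift_jis",    # assumed label; the name stays shift_jis
}


def encoding_for_label(label: str) -> Optional[str]:
    """Return the encoding name for a label, or None if the label is unknown."""
    # Assumed matching rule: trim surrounding whitespace, compare case-insensitively.
    return LABELS.get(label.strip().lower())
```

Under this arrangement a page labelled `ISO-8859-1` is decoded as windows-1252, and something like document.characterSet would report the name rather than the label, which is the side effect raised above.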
[whatwg] Encodings and the web
Hi,

When doing research into encodings as implemented by popular user agents I have found the current standards lacking. In particular:

* More encodings in the registry than needed for the web
* Error handling for encodings is undefined (can lead to XSS exploits, also gives interoperability problems)
* Often encodings are implemented differently from the standard

A year ago I did some research into encodings[1], and in more detail into single-octet encodings[2], and I have now taken that further into starting to define a standard[3] for encodings as they are to be implemented by user agents. The current scope is roughly defining the encodings, their labels and name, and how you match a label.

The goal is to unify encoding handling across user agents for the web so legacy pages can be interpreted correctly (i.e. as expected by users).

If you are interested in helping out testing (and reverse engineering) multi-octet encodings please let me know. Any other input is much appreciated as well.

(I emailed this separately to ietf-charsets.)

Kind regards,

[1]http://wiki.whatwg.org/wiki/Web_Encodings
[2]http://annevankesteren.nl/2010/12/encodings-labels-tested
[3]http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html

--
Anne van Kesteren
http://annevankesteren.nl/
Re: [whatwg] Encodings and the web
On 20/12/2011 20:01, Anne van Kesteren wrote:
> [3]http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html

This is a great start. A few comments:

It seems weird to use Windows' names rather than the iso names as the official encoding names. E.g., I expected iso-8859-1 to be the encoding and windows-1252 to be one of the labels.

Notes still says multi-octet encodings aren't listed at all. Perhaps I am misinterpreting what list of encodings refers to.

Including tables for all the multi-octet encodings is going to be a big task and create a very long document. Such tables may be better placed in linked documents rather than the main body.

Regards

-Mark