Re: [whatwg] Encodings and the web

2012-01-08 Thread Anne van Kesteren

On Sun, 08 Jan 2012 01:37:14 +0100, NARUSE, Yui nar...@airemix.jp wrote:

= Legacy multi-octet Chinese (traditional) encodings

Mozilla supports another Big5 variant, Big5-UAO.
http://bugs.ruby-lang.org/issues/1784


As part of the big5 encoding, right? It sounds like it's a good idea to  
adopt that. I don't think there's much concern about table size these  
days, though obviously the less complexity the better.




= Legacy multi-octet Japanese encodings


The jis code point for a given number is: ...
The jis0208 index for a given octet is:


I wonder about this description.
I should explain the concept of JIS X 0208.

The most important point is that JIS X 0208 is defined in the context of
ISO 2022.

It is an ISO/IEC 2022 double-byte 94-character set,
which means its code space is 94 x 94.
http://en.wikipedia.org/wiki/JIS_X_0208

At the top level there are kuten numbers.
ku is the row, expressed by the first byte of the double-byte code.
ten is the cell, expressed by the second byte of the double-byte code.
So a kuten number identifies a code point.
Both ku and ten are integers from 1 to 94.
For example, Hiragana small A (U+3041) has kuten number 04-01.

ISO-2022-JP, EUC-JP, and Shift_JIS map a kuten number to bytes.
ISO-2022-JP's double bytes are:
 first:  ku  + 0x20
 second: ten + 0x20
EUC-JP's double bytes are:
 first:  ku  + 0xA0
 second: ten + 0xA0
Shift_JIS's double bytes are:
 first:  if    1 <= ku <= 62 then (ku-1) / 2 + 0x81
         elif 63 <= ku <= 94 then (ku-1) / 2 + 0xC1
 second: if ku is odd
           if    1 <= ten <= 63 then ten + 0x3F
           elif 64 <= ten <= 94 then ten + 0x40
         elif ku is even then ten + 0x9E
 (where / is integer division)


So theoretically, we should make a conversion table between
kuten numbers and Unicode scalar values.

But as you know, JIS X 0208 in the web context should really be Windows Code
Page 932, as extended by Microsoft.
http://msdn.microsoft.com/en-us/goglobal/cc305152
It is defined in terms of Shift_JIS.


The jis0212 index for a given octet is:


As written in Bugzilla@Mozilla Bug 600715, IE doesn't support JIS X 0212.
https://bugzilla.mozilla.org/show_bug.cgi?id=600715
How to treat JIS X 0212 in this Encoding spec will be a problem.


Yeah so currently I used Gecko's approach (roughly) towards Japanese  
encodings, including how they put both 0208 and 0212 in a single longish  
array. But maybe instead I should write it down as it has been done by  
Unicode.org, with double-octet sequence mapping to a Unicode character.  
Suggestions welcome.


With respect to 0212, it's not that hard to support it and given how long  
it has been deployed this way it's probably safer to keep it there I think.




== iso-2022-jp
=== The to Unicode algorithm
==== Based on iso-2022-jp state
===== ASCII state
====== Based on octet:
======= Otherwise

If the fatal flag is set, return failure.
Otherwise, emit the fallback code point.


Just FYI, IE and Opera show these bytes as Katakana.
If the octet is greater than 0xA0 and less than 0xE0, the value is
octet + 0xFEC0.


Moreover, IE accepts any shift_jis characters here.
It seems that IE uses the same converter for both iso-2022-jp and shift_jis.


I have filed a bug on Opera to become more strict like Webkit/Gecko. If  
there is some evidence that approach is wrong though, we can turn it  
around.



--
Anne van Kesteren
http://annevankesteren.nl/


Re: [whatwg] Encodings and the web

2012-01-08 Thread NARUSE, Yui
(2012/01/08 23:32), Anne van Kesteren wrote:
 On Sun, 08 Jan 2012 01:37:14 +0100, NARUSE, Yui nar...@airemix.jp wrote:
 = Legacy multi-octet Chinese (traditional) encodings

 Mozilla supports another Big5 variant, Big5-UAO.
 http://bugs.ruby-lang.org/issues/1784
 
 As part of the big5 encoding, right? It sounds like it's a good idea to adopt 
 that. I don't think there's much concern about table size these days, though 
 obviously the less complexity the better.

CC to the original reporter.
Could you help by describing the current situation in Taiwan?

 == iso-2022-jp
 === The to Unicode algorithm
 ==== Based on iso-2022-jp state
 ===== ASCII state
 ====== Based on octet:
 ======= Otherwise
 If the fatal flag is set, return failure.
 Otherwise, emit the fallback code point.

 Just FYI, IE and Opera show these bytes as Katakana.
 If octet is greater than 0xA0 and less than 0xE0, value is octet + 0xFEC0.

 Moreover, IE accepts any shift_jis characters here.
 It seems that IE uses the same converter for both iso-2022-jp and shift_jis.
 
 I have filed a bug on Opera to become more strict like Webkit/Gecko. If there 
 is some evidence that approach is wrong though, we can turn it around.

There is an old variant of ISO-2022-JP called JIS8.
JIS8 was used before RFC 1468 was written, and is still used in some areas,
for example bank-to-bank information exchange.
The 8 in JIS8 refers to the 8-bit bytes used to express Katakana, as described above.

So I can't say at this time that it is a bug in Opera.
It depends on how many sites use such 8-bit Katakana.

-- 
NARUSE, Yui  nar...@airemix.jp


Re: [whatwg] Encodings and the web

2012-01-08 Thread NARUSE, Yui
Hi,

Thank you for the quick reply.

(2012/01/09 0:38), Lin Jen-Shin (godfat) wrote:
 On Sun, Jan 8, 2012 at 11:20 PM, NARUSE, Yui nar...@airemix.jp wrote:
 (2012/01/08 23:32), Anne van Kesteren wrote:
 On Sun, 08 Jan 2012 01:37:14 +0100, NARUSE, Yui nar...@airemix.jp wrote:
 = Legacy multi-octet Chinese (traditional) encodings

 Mozilla supports another Big5 variant, Big5-UAO.
 http://bugs.ruby-lang.org/issues/1784

 As part of the big5 encoding, right? It sounds like it's a good idea to 
 adopt that. I don't think there's much concern about table size these days, 
 though obviously the less complexity the better.

 CC to the original reporter.
 Could you help by describing the current situation in Taiwan?
 
 I am not sure what I can do here, but I would try my best to
 coordinate if there's anything I could help.
 
 So what are we trying to solve here, again?

This thread started at
http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2011-December/034241.html

It is discussing a spec for encodings on the web:
http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html

I'm interested in whether web browsers other than Mozilla should implement
Big5-UAO or not.

Thanks,

-- 
NARUSE, Yui  nar...@airemix.jp


Re: [whatwg] Encodings and the web

2012-01-07 Thread NARUSE, Yui
(2012/01/07 0:38), Anne van Kesteren wrote:
 On Thu, 22 Dec 2011 15:33:35 +0100, L. David Baron dba...@dbaron.org wrote:
 This seems like one of those areas where it may be substantially
 easier to figure out what implementations do by looking at their
 code than by reverse-engineering, at least for the implementations
 whose code is available publicly.

 Gecko's code lives in
 http://mxr.mozilla.org/mozilla-central/source/intl/uconv/ .  There
 are others who know it substantially better, but I or others could
 probably answer questions you have about how it works and how to
 understand it.

 I'm not the right person for pointers to other implementations,
 though.
 
 Thanks, I'm doing a combination of code inspection, reverse engineering 
 (especially for edge cases), and applying some lessons we learned (e.g. 
 non-greedy error handling).
 
 So far I defined the to Unicode algorithms for hz-gb-2312, euc-jp, 
 iso-2022-jp, and shift_jis.

= Legacy multi-octet Chinese (traditional) encodings

Mozilla supports another Big5 variant, Big5-UAO.
http://bugs.ruby-lang.org/issues/1784

= Legacy multi-octet Japanese encodings

 The jis code point for a given number is: ...
 The jis0208 index for a given octet is:

I wonder about this description.
I should explain the concept of JIS X 0208.

The most important point is that JIS X 0208 is defined in the context of ISO 2022.
It is an ISO/IEC 2022 double-byte 94-character set,
which means its code space is 94 x 94.
http://en.wikipedia.org/wiki/JIS_X_0208

At the top level there are kuten numbers.
ku is the row, expressed by the first byte of the double-byte code.
ten is the cell, expressed by the second byte of the double-byte code.
So a kuten number identifies a code point.
Both ku and ten are integers from 1 to 94.
For example, Hiragana small A (U+3041) has kuten number 04-01.

ISO-2022-JP, EUC-JP, and Shift_JIS map a kuten number to bytes.
ISO-2022-JP's double bytes are:
 first:  ku  + 0x20
 second: ten + 0x20
EUC-JP's double bytes are:
 first:  ku  + 0xA0
 second: ten + 0xA0
Shift_JIS's double bytes are:
 first:  if    1 <= ku <= 62 then (ku-1) / 2 + 0x81
         elif 63 <= ku <= 94 then (ku-1) / 2 + 0xC1
 second: if ku is odd
           if    1 <= ten <= 63 then ten + 0x3F
           elif 64 <= ten <= 94 then ten + 0x40
         elif ku is even then ten + 0x9E
 (where / is integer division)
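
As a rough illustration of the three kuten-to-bytes mappings, here is a minimal
Python sketch. The function names are made up for this example and are not
taken from the Encoding spec or from any browser. For Shift_JIS it follows the
usual rule that the second byte depends on whether ku is odd or even and, for
odd ku, on whether ten is at most 63. It only covers the plain 94 x 94
JIS X 0208 code space, not the Code Page 932 extensions.

```python
def _check_kuten(ku: int, ten: int) -> None:
    # Both ku (row) and ten (cell) must lie in the 94 x 94 code space.
    if not (1 <= ku <= 94 and 1 <= ten <= 94):
        raise ValueError("ku and ten must both be in the range 1..94")


def kuten_to_iso2022jp(ku: int, ten: int) -> bytes:
    """Two bytes as they appear after the ESC $ B escape sequence."""
    _check_kuten(ku, ten)
    return bytes([ku + 0x20, ten + 0x20])


def kuten_to_euc_jp(ku: int, ten: int) -> bytes:
    """Two bytes of the EUC-JP form (both bytes have the high bit set)."""
    _check_kuten(ku, ten)
    return bytes([ku + 0xA0, ten + 0xA0])


def kuten_to_shift_jis(ku: int, ten: int) -> bytes:
    """Two bytes of the Shift_JIS form."""
    _check_kuten(ku, ten)
    # Two rows share one lead byte; rows 63..94 jump to the 0xE0 range.
    first = (ku - 1) // 2 + (0x81 if ku <= 62 else 0xC1)
    if ku % 2 == 1:
        # Odd row: trail bytes 0x40..0x7E and 0x80..0x9E (0x7F is skipped).
        second = ten + (0x3F if ten <= 63 else 0x40)
    else:
        # Even row: trail bytes 0x9F..0xFC.
        second = ten + 0x9E
    return bytes([first, second])
```

For kuten 04-01 this gives b'$!' for ISO-2022-JP, b'\xa4\xa1' for EUC-JP and
b'\x82\x9f' for Shift_JIS.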


So theoretically, we should make a conversion table between
kuten numbers and Unicode scalar values.

But as you know, JIS X 0208 in the web context should really be Windows Code Page 932,
as extended by Microsoft.
http://msdn.microsoft.com/en-us/goglobal/cc305152
It is defined in terms of Shift_JIS.

 The jis0212 index for a given octet is:

As written in Bugzilla@Mozilla Bug 600715, IE doesn't support JIS X 0212.
https://bugzilla.mozilla.org/show_bug.cgi?id=600715
How to treat JIS X 0212 in this Encoding spec will be a problem.

== iso-2022-jp
=== The to Unicode algorithm
==== Based on iso-2022-jp state
===== ASCII state
====== Based on octet:
======= Otherwise
 If the fatal flag is set, return failure.
 Otherwise, emit the fallback code point.

Just FYI, IE and Opera show these bytes as Katakana.
If octet is greater than 0xA0 and less than 0xE0, value is octet + 0xFEC0.
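
The lenient behaviour described here can be sketched in Python as follows.
This is only an illustration of the octet + 0xFEC0 rule; the function name is
hypothetical and it does not claim to reproduce what IE or Opera actually do
internally.

```python
from typing import Optional


def eight_bit_katakana(octet: int) -> Optional[int]:
    """Map an 8-bit octet to a half-width Katakana code point, if any.

    Octets strictly between 0xA0 and 0xE0 map to U+FF61..U+FF9F.
    Any other octet is not covered by this rule and yields None.
    """
    if 0xA0 < octet < 0xE0:
        return octet + 0xFEC0
    return None
```

For instance, octet 0xB1 maps to U+FF71 (half-width Katakana A), while 0xA0
and 0xE0 fall outside the range and give None.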

Moreover, IE accepts any shift_jis characters here.
It seems that IE uses the same converter for both iso-2022-jp and shift_jis.

-- 
NARUSE, Yui  nar...@airemix.jp


Re: [whatwg] Encodings and the web

2012-01-06 Thread Anne van Kesteren
On Thu, 22 Dec 2011 15:33:35 +0100, L. David Baron dba...@dbaron.org  
wrote:

This seems like one of those areas where it may be substantially
easier to figure out what implementations do by looking at their
code than by reverse-engineering, at least for the implementations
whose code is available publicly.

Gecko's code lives in
http://mxr.mozilla.org/mozilla-central/source/intl/uconv/ .  There
are others who know it substantially better, but I or others could
probably answer questions you have about how it works and how to
understand it.

I'm not the right person for pointers to other implementations,
though.


Thanks, I'm doing a combination of code inspection, reverse engineering  
(especially for edge cases), and applying some lessons we learned (e.g.  
non-greedy error handling).


So far I defined the to Unicode algorithms for hz-gb-2312, euc-jp,  
iso-2022-jp, and shift_jis.


http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html

Feedback welcome!


--
Anne van Kesteren
http://annevankesteren.nl/


Re: [whatwg] Encodings and the web

2011-12-22 Thread L. David Baron
On Tuesday 2011-12-20 12:01 +0100, Anne van Kesteren wrote:
 If you are interested in helping out testing (and reverse engineering)
 multi-octet encodings please let me know. Any other input is much
 appreciated as well.

This seems like one of those areas where it may be substantially
easier to figure out what implementations do by looking at their
code than by reverse-engineering, at least for the implementations
whose code is available publicly.

Gecko's code lives in
http://mxr.mozilla.org/mozilla-central/source/intl/uconv/ .  There
are others who know it substantially better, but I or others could
probably answer questions you have about how it works and how to
understand it.

I'm not the right person for pointers to other implementations,
though.

-David

-- 
𝄞   L. David Baron http://dbaron.org/   𝄂
𝄢   Mozilla   http://www.mozilla.org/   𝄂


Re: [whatwg] Encodings and the web

2011-12-21 Thread Anne van Kesteren
On Wed, 21 Dec 2011 04:40:10 +0100, Mark Callow callow_m...@hicorp.co.jp  
wrote:

On 20/12/2011 20:01, Anne van Kesteren wrote:


[3]http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html


This is a great start. A few comments

It seems weird to use Windows' names rather than the iso names as the
official encoding names. E.g., I expected iso-8859-1 to be the encoding
and windows-1252 to be one of the labels.


Since the actual encoding used is closer to windows-1252 it seemed more  
accurate to me to do it the other way around (though for shift_jis I have  
not done that as everyone calls windows-31j shift_jis). It does affect  
what document.characterSet returns though so maybe we should switch it.
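
To illustrate the difference between the two encodings, here is a small Python
sketch using Python's own codecs ("cp1252" and "latin-1"), which are assumed
here to stand in for windows-1252 and strict iso-8859-1; they are not the
tables from the draft. The two only disagree on bytes 0x80 to 0x9F.

```python
# Bytes in the 0x80..0x9F range, where the two encodings differ.
data = bytes([0x80, 0x93, 0x94])

# windows-1252 assigns printable characters: euro sign and curly quotes.
as_windows_1252 = data.decode("cp1252")

# Strict iso-8859-1 maps the same bytes to C1 control characters.
as_iso_8859_1 = data.decode("latin-1")

print([hex(ord(c)) for c in as_windows_1252])  # ['0x20ac', '0x201c', '0x201d']
print([hex(ord(c)) for c in as_iso_8859_1])    # ['0x80', '0x93', '0x94']
```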




Notes still says multi-octet encodings aren't listed at all. Perhaps I
am misinterpreting what list of encodings refers to.


Oops, removed that. (Though not all multi-octet encodings are listed yet.)



Including tables for all the multi-octet encodings is going to be a big
task and create a very long document.  Such tables may be better placed
in linked documents rather than the main body.


Yeah I think we have to do that for some encodings. Others, such as UTF-8  
and UTF-16, can probably be defined inline.



--
Anne van Kesteren
http://annevankesteren.nl/


[whatwg] Encodings and the web

2011-12-20 Thread Anne van Kesteren

Hi,

When doing research into encodings as implemented by popular user agents I
have found the current standards lacking. In particular:

   * More encodings in the registry than needed for the web
   * Error handling for encodings is undefined (can lead to XSS exploits,
 also gives interoperability problems)
   * Often encodings are implemented differently from the standard

A year ago I did some research into encodings[1], and in more detail into
single-octet encodings[2], and I have now taken that further by starting
to define a standard[3] for encodings as they are to be implemented by
user agents. The current scope is roughly defining the encodings, their
labels and name, and how you match a label.

The goal is to unify encoding handling across user agents for the web so
legacy pages can be interpreted correctly (i.e. as expected by users).

If you are interested in helping out testing (and reverse engineering)
multi-octet encodings please let me know. Any other input is much
appreciated as well.

(I emailed this separately to ietf-charsets.)

Kind regards,


[1]http://wiki.whatwg.org/wiki/Web_Encodings
[2]http://annevankesteren.nl/2010/12/encodings-labels-tested
[3]http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html


--
Anne van Kesteren
http://annevankesteren.nl/


Re: [whatwg] Encodings and the web

2011-12-20 Thread Mark Callow

On 20/12/2011 20:01, Anne van Kesteren wrote:

 [3]http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html

This is a great start. A few comments

It seems weird to use Windows' names rather than the iso names as the
official encoding names. E.g., I expected iso-8859-1 to be the encoding
and windows-1252 to be one of the labels.

Notes still says multi-octet encodings aren't listed at all. Perhaps I
am misinterpreting what list of encodings refers to.

Including tables for all the multi-octet encodings is going to be a big
task and create a very long document.  Such tables may be better placed
in linked documents rather than the main body.

Regards

-Mark