Re: [whatwg] A comment to character encoding declaration

2008-05-22 Thread Ian Hickson
On Thu, 22 May 2008, Henri Sivonen wrote:
> On May 22, 2008, at 12:23, Ian Hickson wrote:
> > 
> >   EUC-KR -> Windows-949
> >   KS_C_5601-1987 -> Windows-949
> 
> FWIW, x-windows-949 would be more correct given the current IANA situation.

Should I just change the spec to strip leading "x-"s? That would deal 
with our Big5 problem too, as well as:

> The list is missing [...] x-iso-8859-11
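
For illustration, the "strip leading x-" idea could be sketched roughly like this in Python. The name strip_x_prefixes is hypothetical, and stripping the prefix repeatedly (which x-x-big5 would need) is an assumption here, not something the spec says.

```python
def strip_x_prefixes(label: str) -> str:
    """Lowercase an encoding label and drop any leading "x-" prefixes.

    Hypothetical helper: repeated stripping is an assumption, chosen so
    that "x-x-big5" collapses to "big5".
    """
    name = label.strip().lower()
    while name.startswith("x-"):
        name = name[2:]
    return name


for label in ["x-windows-949", "x-iso-8859-11", "x-x-big5", "Big5"]:
    print(label, "->", strip_x_prefixes(label))
```

A real implementation would presumably still need an alias lookup after stripping; this only shows the prefix handling.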


> After pondering the usefulness of conformance errors in this area, I'm 
> inclined to think that there should be no particular errors when 
> encoding name aliasing happens. This means that I would even suggest 
> removing the C1 range bytes as errors when ISO-8859-1 turns into 
> Windows-1252. My rationale is that the cost/benefit characteristics of 
> reporting theoretical wrongness in this area are unfavorable.

See earlier mail today on this topic.

-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'


Re: [whatwg] A comment to character encoding declaration

2008-05-22 Thread Henri Sivonen

On May 22, 2008, at 12:23, Ian Hickson wrote:

>   EUC-KR -> Windows-949
>   KS_C_5601-1987 -> Windows-949

FWIW, x-windows-949 would be more correct given the current IANA
situation.

The list is missing tis-620, x-iso-8859-11 and iso-8859-11 which
should turn into x-windows-874.

> Let me know if you have any more information, e.g. an exact list of
> what should be a conformance error in each of those cases.

After pondering the usefulness of conformance errors in this area, I'm
inclined to think that there should be no particular errors when
encoding name aliasing happens. This means that I would even suggest
removing the C1 range bytes as errors when ISO-8859-1 turns into
Windows-1252. My rationale is that the cost/benefit characteristics of
reporting theoretical wrongness in this area are unfavorable.
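
To make the ISO-8859-1 versus Windows-1252 point concrete, here is a minimal Python sketch. It assumes Python's built-in latin-1 and cp1252 codecs are close enough stand-ins for the two encodings. The helper name c1_bytes and the sample bytes are made up for illustration.

```python
def c1_bytes(data: bytes) -> list[int]:
    """Return the byte values in data that fall in the C1 range 0x80-0x9F."""
    return [b for b in data if 0x80 <= b <= 0x9F]


# Hypothetical sample: curly quotes as bytes 0x93 and 0x94.
page = b"\x93quoted\x94"

# Decoded as ISO-8859-1, the bytes become C1 control characters.
print(repr(page.decode("latin-1")))
# Decoded as Windows-1252, the same bytes become typographic quotes.
print(repr(page.decode("cp1252")))
# These are the bytes a checker would flag if C1 bytes counted as errors.
print([hex(b) for b in c1_bytes(page)])
```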


--
Henri Sivonen
[EMAIL PROTECTED]
http://hsivonen.iki.fi/




Re: [whatwg] A comment to character encoding declaration

2008-05-22 Thread Ian Hickson
On Fri, 7 Mar 2008, Alexey Proskuryakov wrote:
> On Mar 3, 2008, at 6:11 PM, Jjgod Jiang wrote:
> > [...] I think we can suggest that clients simply treat encodings like 
> > these as their largest superset, for instance, treat GB2312 as 
> > GB18030.
> 
> In my testing, it appears that IE 7 and Firefox 2 do treat GBK as an 
> equivalent of GB2312, but this cannot be said about GB18030. In 
> particular, 0x80 and 0xA2E3 are treated differently.

On Wed, 19 Mar 2008, Henri Sivonen wrote:
> 
> According to source code[1], WebKit trunk also changes GB_2312-80 to 
> GBK. Gecko aliases gb_2312-80 to GB2312 (due to FrontPage output 
> according to source comment).
> 
> Also, WebKit changes KS_C_5601-1987 and EUC-KR to windows-949-2000. 
> Gecko aliases[2] KS_C_5601-1987 to x-windows-949 (due to FrontPage 
> output according to source comment). However, Gecko doesn't use its 
> alias mechanism to alias EUC-KR to windows-949. I haven't tested if 
> EUC-KR is treated equivalently to windows-949 by other means.
> 
> Yet another weird alias tidbit supported both by Gecko and WebKit source 
> as well as Googling the subject: Looks like x-x-big5 needs to be an 
> alias for Big5 due to FrontPage output.
> 
> [1] 
> http://trac.webkit.org/projects/webkit/browser/trunk/WebCore/platform/text/TextCodecICU.cpp#L90
> [2] 
> http://mxr.mozilla.org/seamonkey/source/intl/uconv/src/charsetalias.properties#335

So what I'm reading from the above (and other similar e-mails not quoted 
above) is that we should introduce the following mappings:

   GB2312 -> GBK
   GB_2312-80 -> GBK
   EUC-KR -> Windows-949
   KS_C_5601-1987 -> Windows-949
   x-x-big5 -> Big5

Is that correct?

I've added this to the spec. Let me know if you have any more information, 
e.g. an exact list of what should be a conformance error in each of those 
cases. Also, if you have any useful references for GB2312 and Big5, let me 
know, I couldn't find anything to reference for them.
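
As a rough illustration only, the mappings listed above could be expressed as a lookup table. This is a sketch in Python, not spec text: LABEL_OVERRIDES and effective_encoding are hypothetical names, and matching labels case-insensitively is an assumption.

```python
# Hypothetical override table mirroring the mappings listed above.
# Keys are lowercased labels as declared by a page; values are the
# encodings a client would actually decode with.
LABEL_OVERRIDES = {
    "gb2312": "GBK",
    "gb_2312-80": "GBK",
    "euc-kr": "Windows-949",
    "ks_c_5601-1987": "Windows-949",
    "x-x-big5": "Big5",
}


def effective_encoding(declared: str) -> str:
    """Return the encoding to decode with for a declared label."""
    key = declared.strip().lower()
    # Labels without an override are returned unchanged.
    return LABEL_OVERRIDES.get(key, declared.strip())


print(effective_encoding("GB2312"))
print(effective_encoding("UTF-8"))
```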

-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'


Re: [whatwg] A comment to character encoding declaration

2008-03-19 Thread Henri Sivonen

On Mar 7, 2008, at 10:12, Jjgod Jiang wrote:

> On Fri, 7 Mar 2008, Alexey Proskuryakov wrote:
> > In my testing, it appears that IE 7 and Firefox 2 do treat GBK as
> > an equivalent of GB2312, but this cannot be said about GB18030. In
> > particular, 0x80 and 0xA2E3 are treated differently.
>
> Yep, I missed that point in my previous post, my fault. Yes, they
> should be treated differently. So I guess my request should be changed
> to only treat GB2312 as GBK.


According to source code[1], WebKit trunk also changes GB_2312-80 to  
GBK. Gecko aliases gb_2312-80 to GB2312 (due to FrontPage output  
according to source comment).


Also, WebKit changes KS_C_5601-1987 and EUC-KR to windows-949-2000.  
Gecko aliases[2] KS_C_5601-1987 to x-windows-949 (due to FrontPage  
output according to source comment). However, Gecko doesn't use its  
alias mechanism to alias EUC-KR to windows-949. I haven't tested if  
EUC-KR is treated equivalently to windows-949 by other means.


Yet another weird alias tidbit supported both by Gecko and WebKit  
source as well as Googling the subject:
Looks like x-x-big5 needs to be an alias for Big5 due to FrontPage  
output.


[1] 
http://trac.webkit.org/projects/webkit/browser/trunk/WebCore/platform/text/TextCodecICU.cpp#L90
[2] 
http://mxr.mozilla.org/seamonkey/source/intl/uconv/src/charsetalias.properties#335
--
Henri Sivonen
[EMAIL PROTECTED]
http://hsivonen.iki.fi/




Re: [whatwg] A comment to character encoding declaration

2008-03-07 Thread Jjgod Jiang

Hi Alexey,

On Fri, 7 Mar 2008, Alexey Proskuryakov wrote:
> In my testing, it appears that IE 7 and Firefox 2 do treat GBK as
> an equivalent of GB2312, but this cannot be said about GB18030. In
> particular, 0x80 and 0xA2E3 are treated differently.


Yep, I missed that point in my previous post, my fault. Yes, they
should be treated differently. So I guess my request should be changed
to only treat GB2312 as GBK.


> See:
>
> What differences are you seeing between Firefox and WebKit? It
> seems that the behavior may be a bit more tricky than just treating
> all encodings from GBK family as GB18030.


On Safari 3.0.4, only 0x80 is recognized as the euro sign in gbk.html,
and only 0xA2E3 is recognized as the euro sign in gb18030.html. But on
Firefox 3.0 (Gecko/2008030604 nightly build), both 0x80 and 0xA2E3
are recognized as the euro sign in gb18030.html. So there seem to be
some inconsistencies here, and I think you're right: simply treating
all GBK family encodings as GB18030 is not a good idea.
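
The same kind of probe can be sketched outside a browser. The Python snippet below feeds the two byte sequences discussed above to a list of decoders and records what comes back. It uses Python's own gbk and gb18030 codecs, which are an assumption standing in for a browser's decoders and may not agree with Safari or Firefox. The names EURO_CANDIDATES and probe_euro_bytes are hypothetical.

```python
# The two byte sequences that were treated differently in testing.
EURO_CANDIDATES = {"0x80": b"\x80", "0xA2E3": b"\xa2\xe3"}


def probe_euro_bytes(codecs_to_try):
    """Decode each candidate byte sequence with each codec.

    Returns {codec: {label: decoded text, or None if decoding failed}}.
    """
    results = {}
    for codec in codecs_to_try:
        row = {}
        for label, raw in EURO_CANDIDATES.items():
            try:
                row[label] = raw.decode(codec)
            except UnicodeDecodeError:
                row[label] = None
        results[codec] = row
    return results


for codec, row in probe_euro_bytes(["gbk", "gb18030"]).items():
    print(codec, row)
```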

- Jiang


Re: [whatwg] A comment to character encoding declaration

2008-03-06 Thread Alexey Proskuryakov


On Mar 3, 2008, at 6:11 PM, Jjgod Jiang wrote:


> in their header, yet they might use characters that are in GBK but
> not in GB2312. So, I think we can suggest that clients simply
> treat encodings like these as their largest superset, for
> instance, treat GB2312 as GB18030.
>
> BTW, browsers like Firefox seem to handle such cases well
> already, but Safari/WebKit does not seem to.



  In my testing, it appears that IE 7 and Firefox 2 do treat GBK as  
an equivalent of GB2312, but this cannot be said about GB18030. In  
particular, 0x80 and 0xA2E3 are treated differently.


  See:



  What differences are you seeing between Firefox and WebKit? It  
seems that the behavior may be a bit more tricky than just treating  
all encodings from GBK family as GB18030.


- WBR, Alexey Proskuryakov



Re: [whatwg] A comment to character encoding declaration

2008-03-05 Thread Philip Taylor
On 03/03/2008, Jjgod Jiang <[EMAIL PROTECTED]> wrote:
>  During the development of CJK information processing, many
>  text encodings ended up as strict subsets of others; for
>  example, GB2312 is a subset of GBK, and GBK is a subset of
>  GB18030. For compatibility purposes, a lot of web pages use
>  a character encoding declaration like this:
>
>  
>
>  in their header, yet they might use characters that are in GBK but
>  not in GB2312. So, I think we can suggest that clients simply
>  treat encodings like these as their largest superset, for
>  instance, treat GB2312 as GB18030.

Out of 130K pages from dmoz.org, I see 760 which are declared as
gb2312 (by HTTP Content-Type, , etc).

Of those 760, 120 cause decoding errors in ICU4J when treated as
gb2312. 8 cause errors when treated as gbk, and the same 8 cause
errors as gb18030.

Those 8 are:
http://www.bigm.com.cn/dinosaur/anecdote/
http://www.ccpc.edu.cn
http://www.gdoverseaschn.com.cn/
http://www.jgbr.com.cn
http://www.liechebuluo.com
http://www.netbro.com.cn
http://www.tkdts.com
http://www.wuxi-accp.com/
and I haven't tried working out why they are causing errors.

The 120 are listed at
. I don't know how
many are really using gb18030, and how many are not actually gb* but
happen to be decoded without errors because they use compatible byte
sequences; but it does look like gb2312 is a fairly significant
problem if it's not treated as gbk/gb18030, so it would be helpful to
suggest/require it to be processed specially.
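
A much smaller version of this kind of measurement can be sketched in Python. This is not the ICU4J setup used above: it relies on Python's built-in gb2312, gbk and gb18030 codecs, the function name count_decode_errors is hypothetical, and the two sample "pages" are made up.

```python
def count_decode_errors(pages, codecs_to_try=("gb2312", "gbk", "gb18030")):
    """Count how many byte strings fail strict decoding under each codec."""
    failures = {codec: 0 for codec in codecs_to_try}
    for raw in pages:
        for codec in codecs_to_try:
            try:
                raw.decode(codec)  # strict mode raises on invalid bytes
            except UnicodeDecodeError:
                failures[codec] += 1
    return failures


# Made-up samples: one stays within GB2312, one uses a GBK-only character.
pages = ["中文".encode("gb2312"), "朱镕基".encode("gbk")]
print(count_decode_errors(pages))
```

A clean decode does not prove the page is really in that encoding, which is the same caveat as for the byte sequences mentioned above.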

-- 
Philip Taylor
[EMAIL PROTECTED]


[whatwg] A comment to character encoding declaration

2008-03-03 Thread Jjgod Jiang

Hi,

This is a comment on the "character encoding declaration"
section of the HTML 5 spec:

http://www.w3.org/html/wg/html5/#character1

During the development of CJK information processing, many
text encodings ended up as strict subsets of others; for
example, GB2312 is a subset of GBK, and GBK is a subset of
GB18030. For compatibility purposes, a lot of web pages use
a character encoding declaration like this:



in their header, yet they might use characters that are in GBK but
not in GB2312. So, I think we can suggest that clients simply
treat encodings like these as their largest superset, for
instance, treat GB2312 as GB18030.

BTW, browsers like Firefox seem to handle such cases well
already, but Safari/WebKit does not seem to.

Regards,
Jiang