Re: [whatwg] Encoding: big5 and big5-hkscs

Øistein E . Andersen Sat, 07 Apr 2012 07:05:20 -0700

On Fri Apr 6 14:03:22 PDT 2012, Philip Jägenstedt <philipj at opera.com> wrote:

> So, <http://people.opera.com/philipj/2012/04/06/big5-foolip.txt> is the  
> mapping I suggest, with 18594 defined mappings and 1188 U+FFFD.

(Second byte 0xA1 appears as 0x7F in the mapping file.)

Your table is very similar to my idea of an ideal Big5-HKSCS decoder, which 
follows Unihan, official Big5-HKSCS mappings and real implementations closely.  
My version is documented in detail at 
<http://coq.no/character-tables/chinese-traditional/en> and summarised at 
<http://coq.no/character-tables/charset5/en>.

Only 2 mappings differ, viz, C6CD (U+2F33 v. U+5E7A) and F9FE (U+FFED v. 
U+2593), which is quite reassuring given that we have worked independently and 
used somewhat different approaches.  The two divergent mappings and related 
issues are discussed below.

***

> [in summary:]
> C6CF => U+5EF4 廴 [v.] U+2F35 ⼵
> C6D3 => U+65E0 无 [v.] U+2F46 ⽆
> C6D5 => U+7676 癶 [v.] U+2F68 ⽨
> C6D7 => U+96B6 隶 [v.] U+2FAA ⾪
> C6DE => U+3003 〃[v.] U+F6EE [PUA]
> C6DF => U+4EDD 仝 [v.]  U+F6EF [PUA]

To this list can be added:

C6CD => U+5E7A v. U+2F33

These seven characters are all part of the E-Ten 1 extension [1] to Big5, which 
is included in all implementations of Big5 (with or without HK extensions) that 
I have come across in browsers .  The official Big5-HKSCS table includes the 
E-Ten 1 extension as well, but the seven characters listed above appear 
elsewhere in the HKSCS extensions and are handled specially to avoid encoding 
the same character more than once.

        [1] <http://coq.no/character-tables/eten1.pdf>  
<http://coq.no/character-tables/eten1.js>

Five are Kangxi radicals and encoded twice in Unicode (once as radicals, once 
as normal Han characters).  The official Big5-HKSCS table maps C6CD to the 
Unicode Kangxi radical U+2F33 and does not list the remaining four codepoints 
at all.  U+2F33 is the only Unicode Kangxi radical included in the official 
Big5-HKSCS table.  As you have noticed, some Big5-HKSCS implementations follow 
this idea for the remaining four as well.  It seems better to follow non-HK 
Big5 implementations here and map all five to normal Unicode Han characters.

For the last two characters, U+3003 and U+4EDD, there is only one possible 
Unicode mapping, so duplicates are impossible to avoid (without using PUA 
characters).  The official Big5-HKSCS table does not map C6DE and C6DF to 
anything.

Suggested change:  map C6CD to U+5E7A.

***

On Fri Apr 6 06:42:26 PDT 2012, Philip Jägenstedt <philipj at opera.com> wrote:

> Also, a single mapping fails the Big5-contra[di]ction test:
> 
> F9FE =>
> opera-hk: U+FFED ￭
> firefox: U+2593 ▓
> chrome: U+2593 ▓
> firefox-hk: U+2593 ▓
> opera: U+2593 ▓
> chrome-hk: U+FFED ￭
> internetexplorer: U+2593 ▓
> hkscs-2008: <U+FFED> ￭
> 
> I'd say that we should go with U+FFED here, since that's what the 
> [HKSCS-2008] spec  
> says and it's visually close anyway.

Given that the goal is to define a unified Big5 (non-HK) and Big5-HKSCS 
encoding and that this seems to be a case of the HK standard going against 
everything and everyone else, perhaps more weight should be given to existing 
specifications and (non-HK-specific) implementations.

Suggested change:  map F9FE to U+2593

***

Duplicates and reverse mappings:

big5-foolip.txt currently provides two different codepoints for 100 Unicode 
characters.

6 are mentioned above.  84 result from compatibility mappings defined in the 
official HKSCS-2008 specification, cf. [2].  This leaves 10:

        [2] <http://coq.no/character-tables/o-h-comp.pdf> 
<http://coq.no/character-tables/o-h-comp.js>

U+5341 (十, 'ten') and U+5345 (卅 'thirty') are encoded twice in Big5, once as 
numerals and once as standard Han characters.  (U+5344 卄 'twenty' is only 
encoded once in Big5, but was added to HKSCS and is now one of the 84 
compatibility mappings.)

According to Lunde, the four codepoints F9FA--F9FD (ETen-2 extension) are 
supposed to encode double-stroked circle segments which appear to be missing 
from Unicode (I am not sure whether they have ever been proposed for 
inclusion).  They are currently mapped to single-stroked circle segments 
instead, but those are already encoded at A2E7, A2A1--A2A3 (original Big5).

The four codepoints F9F9, F9E9, F9EB, F9EA (ETen-2 extension) encode 
line-drawing characters with a double horizontal line.  These appear to be 
encoded at A2A4--A2A7 (original Big5) already, and it is not clear to me 
whether the characters at A2A4--A2A7 are supposed to be different or whether 
ETen chose to encode them again to have a full set of line-drawing characters 
in one location.

Suggested reverse mappings:

C6CF <= U+5EF4
C6D3 <= U+65E0
C6D5 <= U+7676
C6D7 <= U+96B6
C6DE <= U+3003
C6DF <= U+4EDD
C6CD <= 5E7A (if the mapping of C6CD is changed)
Rationale:  Only these mappings will work for non-HK Big5 implementations, and 
these characters appear to be important not only in Hong Kong.

A451 <= U+5341
A4CA <= U+5345

A27E <= U+256D
A2A1<= U+256E
A2A3 <= U+256F
A2A2 <= U+2570

F9F9 <= U+2550
F9E9 <= U+255E
F9EB <= U+2561
F9EA <= U+256A

(The 84 compatibility mappings should obviously only be used to decode and 
never as reverse mappings.)

***

On Fri Apr 6 14:03:22 PDT 2012, Philip Jägenstedt <philipj at opera.com> wrote:

> There are 29 mappings to U+003F (?) in IE that no other browser has.

Are you referring to the ones at A3E2--A3FE?  IE decodes (or used to decode) 
the control pictures at A3C0--A3E0 as C0 control characters in plain text, but 
replace(s) them with question marks in HTML.  It looks like this treatment has 
been extended to the the remaining A3xx codepoints (after the euro), perhaps 
without a good reason.

> The remaining mappings are to PUA or U+FFFD in all browsers [...]. Mapping  
> these to U+FFFD unless anyone finds pages using these byte sequences seems  
> the only sane option.

Agreed.  Do any of these ever render in a meaningful way (e.g., in IE on a 
Windows machine with HK locale and appropriate HKSCS PUA fonts)?

The following 22 codepoints are 'reserved for backwards compatibility' in the 
HKSCS-2008 standard, but no Unicode mappings are provided:

9EAC
9EC4
9EF4
9F4E
9FAD
9FB1
9FC0
9FC8
9FDA
9FE6
9FEA
9FEF
A054
A057
A05A
A062
A072
A0A5
A0AD
A0AF
A0D3
A0E1

I assume some systems will render at least these as potentially meaningful Han 
characters.

-- 
Øistein E. Andersen

Re: [whatwg] Encoding: big5 and big5-hkscs

Reply via email to