Re: [whatwg] Encoding: big5 and big5-hkscs

Philip Jägenstedt Sun, 08 Apr 2012 10:04:16 -0700

On Sat, 07 Apr 2012 16:04:55 +0200, Øistein E. Andersen <[email protected]> wrote:

On Fri Apr 6 14:03:22 PDT 2012, Philip Jägenstedt <philipj at opera.com> wrote:

So, <http://people.opera.com/philipj/2012/04/06/big5-foolip.txt> is the
mapping I suggest, with 18594 defined mappings and 1188 U+FFFD.


(Second byte 0xA1 appears as 0x7F in the mapping file.)

Oops, I blame Anne's get_bytes() which had an off-by-one error. This is the corrected version:


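# RANGE is not defined in this excerpt. The value below (the 63 trail
# bytes 0x40..0x7E) is an assumption; it gives 157 cells per lead byte.
RANGE = 0x7E - 0x40 + 1

def get_bytes(index):
    row = 0xFE-0xA1 + RANGE + 1  # cells per lead byte
    lead = (index // row) + 0x81
    cell = index % row
    # cells below RANGE map to 0x40..0x7E, the rest to 0xA1..0xFE
    trail = (cell + 0xA1 - RANGE) if cell >= RANGE else cell + 0x40
    return (lead, trail)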

I've also updated big5-foolip.txt with this fix.
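
As a quick sanity check of the corrected function, the sketch below pairs it with an inverse and round-trips a few indices around the trail-byte boundary. The value of RANGE (63, i.e. trail bytes 0x40..0x7E) and the helper get_index() are assumptions made for illustration; neither appears in the message.

```python
# Assumed: RANGE counts the low trail bytes 0x40..0x7E.
RANGE = 0x7E - 0x40 + 1  # 63


def get_bytes(index):
    """Map a Big5 table index to a (lead, trail) byte pair."""
    row = 0xFE - 0xA1 + RANGE + 1  # 157 cells per lead byte
    lead = (index // row) + 0x81
    cell = index % row
    trail = (cell + 0xA1 - RANGE) if cell >= RANGE else cell + 0x40
    return (lead, trail)


def get_index(lead, trail):
    """Hypothetical inverse of get_bytes(), used only for checking."""
    row = 0xFE - 0xA1 + RANGE + 1
    cell = trail - 0x40 if trail <= 0x7E else trail - 0xA1 + RANGE
    return (lead - 0x81) * row + cell


if __name__ == "__main__":
    # Indices on both sides of the 0x7E/0xA1 gap and of the row boundary.
    for index in (0, 62, 63, 156, 157):
        lead, trail = get_bytes(index)
        assert get_index(lead, trail) == index
        print(index, "%02X%02X" % (lead, trail))
```

With this RANGE, cell 63 lands on trail byte 0xA1 rather than 0x7F, which is the case noted in the parenthetical remark above.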

Your table is very similar to my idea of an ideal Big5-HKSCS decoder, which follows Unihan, official Big5-HKSCS mappings and real implementations closely. My version is documented in detail at <http://coq.no/character-tables/chinese-traditional/en> and summarised at <http://coq.no/character-tables/charset5/en>.
Only 2 mappings differ, viz. C6CD (U+2F33 v. U+5E7A) and F9FE (U+FFED v. U+2593), which is quite reassuring given that we have worked independently and used somewhat different approaches. The two divergent mappings and related issues are discussed below.

Yes, it's very encouraging indeed that we got this close independently and with different methods!

[in summary:]
C6CF => U+5EF4 廴 [v.] U+2F35 ⼵
C6D3 => U+65E0 无 [v.] U+2F46 ⽆
C6D5 => U+7676 癶 [v.] U+2F68 ⽨
C6D7 => U+96B6 隶 [v.] U+2FAA ⾪
C6DE => U+3003 〃 [v.] U+F6EE [PUA]
C6DF => U+4EDD 仝 [v.] U+F6EF [PUA]
To this list can be added:

C6CD => U+5E7A v. U+2F33
These seven characters are all part of the E-Ten 1 extension [1] to Big5, which is included in all implementations of Big5 (with or without HK extensions) that I have come across in browsers. The official Big5-HKSCS table includes the E-Ten 1 extension as well, but the seven characters listed above appear elsewhere in the HKSCS extensions and are handled specially to avoid encoding the same character more than once.
[1] <http://coq.no/character-tables/eten1.pdf> <http://coq.no/character-tables/eten1.js>

What is the source for the mappings in eten1.pdf? I assume that E-Ten was originally just some Big5 fonts with no defined mappings to Unicode?

Five are Kangxi radicals and encoded twice in Unicode (once as radicals, once as normal Han characters). The official Big5-HKSCS table maps C6CD to the Unicode Kangxi radical U+2F33 and does not list the remaining four codepoints at all. U+2F33 is the only Unicode Kangxi radical included in the official Big5-HKSCS table. As you have noticed, some Big5-HKSCS implementations follow this idea for the remaining four as well. It seems better to follow non-HK Big5 implementations here and map all five to normal Unicode Han characters.
For the last two characters, U+3003 and U+4EDD, there is only one possible Unicode mapping, so duplicates are impossible to avoid (without using PUA characters). The official Big5-HKSCS table does not map C6DE and C6DF to anything.
Suggested change: map C6CD to U+5E7A.


These are the existing mappings:

C6CD =>
opera-hk: U+2F33 ⼳
firefox: U+5E7A 幺
chrome: U+F6DD 
firefox-hk: U+5E7A 幺
opera: U+2F33 ⼳
chrome-hk: U+2F33 ⼳
internetexplorer: U+F6DD 
hkscs-2008: <U+2F33> ⼳

At least on the Web, this isn't a question of HK vs non-HK mappings. Other than Firefox, which (de-facto) specs or implementations use U+5E7A?

Possibly, one could argue that U+2F33 normalizes (NFKC) to U+5E7A, but it's not the only hanzi in HKSCS-2008 that normalizes into something else:


8BC3 => <U+2F878> 屮 => <U+5C6E> 屮
8BF8 => <U+F907> 龜 => <U+9F9C> 龜
8EFD => <U+2F994> 芳 => <U+82B3> 芳
8FA8 => <U+2F9B2> 䕫 => <U+456B> 䕫
8FF0 => <U+2F9D4> 貫 => <U+8CAB> 貫
C6CD => <U+2F33> ⼳ => <U+5E7A> 幺
957A => <U+2F9BC> 蜨 => <U+8728> 蜨
9874 => <U+2F825> 勇 => <U+52C7> 勇
9AC8 => <U+2F83B> 吆 => <U+5406> 吆
9C52 => <U+2F8CD> 晉 => <U+6649> 晉
A047 => <U+2F840> 咢 => <U+54A2> 咢
FC48 => <U+2F894> 弢 => <U+5F22> 弢
FC77 => <U+2F8A6> 慈 => <U+6148> 慈

I'm not sure what the conclusion is...
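
The normalization claims in the list above can be re-checked with Python's unicodedata module. This is only a sketch: the table holds three of the entries quoted above, nfkc_changes() is a hypothetical helper name, and the results depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

# A few (Big5 bytes -> code point) pairs copied from the list above.
SAMPLES = {0x8BF8: 0xF907, 0xC6CD: 0x2F33, 0x8BC3: 0x2F878}


def nfkc_changes(table):
    """Return {big5: (code point, NFKC string)} for entries NFKC alters."""
    changed = {}
    for big5, cp in table.items():
        normalized = unicodedata.normalize("NFKC", chr(cp))
        if normalized != chr(cp):
            changed[big5] = (cp, normalized)
    return changed


if __name__ == "__main__":
    for big5, (cp, norm) in sorted(nfkc_changes(SAMPLES).items()):
        targets = " ".join("U+%04X" % ord(c) for c in norm)
        print("%04X => U+%04X => %s" % (big5, cp, targets))
```

Running the same function over a full decoder table would list every mapping whose target is not stable under NFKC, which is one way to reproduce a list like the one above.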

On Fri Apr 6 06:42:26 PDT 2012, Philip Jägenstedt <philipj at opera.com> wrote:
Also, a single mapping fails the Big5-contra[di]ction test:

F9FE =>
opera-hk: U+FFED ￭
firefox: U+2593 ▓
chrome: U+2593 ▓
firefox-hk: U+2593 ▓
opera: U+2593 ▓
chrome-hk: U+FFED ￭
internetexplorer: U+2593 ▓
hkscs-2008: <U+FFED> ￭
I'd say that we should go with U+FFED here, since that's what the [HKSCS-2008] spec says and it's visually close anyway.
Given that the goal is to define a unified Big5 (non-HK) and Big5-HKSCS encoding and that this seems to be a case of the HK standard going against everything and everyone else, perhaps more weight should be given to existing specifications and (non-HK-specific) implementations.
Suggested change: map F9FE to U+2593

This is the only mapping where IE maps something other than PUA or "?" that my mapping doesn't agree on, so I don't object to changing it. Still, it would be very interesting to know why HKSCS-2008 changed it. Do you know?

Duplicates and reverse mappings:
big5-foolip.txt currently provides two different codepoints for 100 Unicode characters.
6 are mentioned above. 84 result from compatibility mappings defined in the official HKSCS-2008 specification, cf. [2]. This leaves 10:
[2] <http://coq.no/character-tables/o-h-comp.pdf> <http://coq.no/character-tables/o-h-comp.js>
U+5341 (十, 'ten') and U+5345 (卅, 'thirty') are encoded twice in Big5, once as numerals and once as standard Han characters. (U+5344 卄 'twenty' is only encoded once in Big5, but was added to HKSCS and is now one of the 84 compatibility mappings.)
According to Lunde, the four codepoints F9FA--F9FD (ETen-2 extension) are supposed to encode double-stroked circle segments which appear to be missing from Unicode (I am not sure whether they have ever been proposed for inclusion). They are currently mapped to single-stroked circle segments instead, but those are already encoded at A27E, A2A1--A2A3 (original Big5).
The four codepoints F9F9, F9E9, F9EB, F9EA (ETen-2 extension) encode line-drawing characters with a double horizontal line. These appear to be encoded at A2A4--A2A7 (original Big5) already, and it is not clear to me whether the characters at A2A4--A2A7 are supposed to be different or whether ETen chose to encode them again to have a full set of line-drawing characters in one location.
Suggested reverse mappings:

C6CF <= U+5EF4
C6D3 <= U+65E0
C6D5 <= U+7676
C6D7 <= U+96B6
C6DE <= U+3003
C6DF <= U+4EDD
C6CD <= U+5E7A (if the mapping of C6CD is changed)
Rationale: Only these mappings will work for non-HK Big5 implementations, and these characters appear to be important not only in Hong Kong.
A451 <= U+5341
A4CA <= U+5345

A27E <= U+256D
A2A1 <= U+256E
A2A3 <= U+256F
A2A2 <= U+2570

F9F9 <= U+2550
F9E9 <= U+255E
F9EB <= U+2561
F9EA <= U+256A
(The 84 compatibility mappings should obviously only be used to decode and never as reverse mappings.)
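
One way to picture the suggested reverse mappings is to invert a decoder table and let an explicit preference list settle the duplicates. The sketch below is hypothetical: build_encoder(), its parameters and the default tie-break (lowest Big5 value wins) are assumptions, and the four-entry decoder only reuses byte values quoted in this thread rather than a real table.

```python
def build_encoder(decoder, preferred=(), decode_only=()):
    """Invert a {big5: code point} decoder into {code point: big5}.

    preferred   -- Big5 values that win when a code point has duplicates
    decode_only -- Big5 values that must never be produced by the encoder
    """
    preferred = set(preferred)
    decode_only = set(decode_only)
    encoder = {}
    for big5 in sorted(decoder):
        if big5 in decode_only:
            continue
        cp = decoder[big5]
        # First (lowest) value wins unless a later one is explicitly preferred.
        if cp not in encoder or big5 in preferred:
            encoder[cp] = big5
    return encoder


if __name__ == "__main__":
    # Two duplicated characters, using byte values quoted in this thread.
    decoder = {0xA1B2: 0x3003, 0xC6DE: 0x3003,
               0xA2A4: 0x2550, 0xF9F9: 0x2550}
    encoder = build_encoder(decoder, preferred=[0xC6DE, 0xF9F9])
    for cp, big5 in sorted(encoder.items()):
        print("%04X <= U+%04X" % (big5, cp))
```

The decode_only set is where the 84 compatibility mappings would go, so that they are accepted on input but never emitted.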

Anne, how do you plan to define encoders for tables with duplicate mappings? Have you collected data for what browsers currently do?

In any event, it clearly needs to be defined what to do for these 100 codepoints that have multiple mappings to Big5. I extended my Python script to find these 100 duplicates and to check what Python did for 'big5', falling back to 'big5-hkscs'. This is what it produced:


8FB6 <= U+880F
90C4 <= U+96B6
91BE <= U+9F17
9242 <= U+8503
9361 <= U+5F0C
9455 <= U+7250
947A <= U+7468
96EE <= U+701E
9975 <= U+732A
9CE4 <= U+975D
9DEF <= U+5605
9DFB <= U+5ED0
A05F <= U+936E
A0D4 <= U+89A9
A0DC <= U+60A4
A1B2 <= U+3003
A259 <= U+5159
A25A <= U+515B
A25B <= U+515E
A25C <= U+515D
A260 <= U+74E9
A261 <= U+7CCE
A27E <= U+256D
A2A1 <= U+256E
A2A2 <= U+2570
A2A3 <= U+256F
A2A4 <= U+2550
A2A5 <= U+255E
A2A6 <= U+256A
A2A7 <= U+2561
A2CD <= U+5344
A451 <= U+5341
A4CA <= U+5345
A55D <= U+5305
A7FB <= U+675E
A9E4 <= U+62D0
A9F0 <= U+62CE
AACC <= U+8005
ABEC <= U+6062
ADC5 <= U+5029
ADEB <= U+537F
AFB0 <= U+79E3
B05F <= U+8D77
B0B0 <= U+507D
B3A3 <= U+90FD
B440 <= U+5A77
B4B8 <= U+6674
B4E4 <= U+6E2F
B4FC <= U+6E1D
B54E <= U+716E
B5AE <= U+7B51
B5D7 <= U+83C1
B7EC <= U+745C
B9B0 <= U+50ED
BAE6 <= U+7BB8
BAFC <= U+7DD2
BCB5 <= U+6490
BF47 <= U+6FB6
BFA6 <= U+7E1D
BFAE <= U+8028
BFCC <= U+89A6
C052 <= U+975C
C0E7 <= U+71DF
C554 <= U+97FF
C5F7 <= U+77D7
C95C <= U+5C10
C969 <= U+4EDD
C9DB <= U+5E75
C9FC <= U+6C4A
CA52 <= U+9097
CB58 <= U+6C9C
CDE7 <= U+4FBB
CFF1 <= U+7809
D0C0 <= U+91D4
D256 <= U+6D67
D4D1 <= U+5A67
D8F4 <= U+5F58
DB5D <= U+83CF
DB79 <= U+840F
DC52 <= U+9104
DE72 <= U+7162
DECD <= U+75F9
E07C <= U+8F0B
E3C8 <= U+84A8
E6AB <= U+7479
E6D0 <= U+799B
E8CD <= U+99D6
E959 <= U+5B28
EBC9 <= U+8F36
EDCA <= U+7C06
EFF9 <= U+7201
F1E3 <= U+9F16
F5E8 <= U+7E87
F86D <= U+9DF0
F9C4 <= U+9B2E
F9D7 <= U+92B9
FBFD <= U+5EF4
FCD3 <= U+65E0
FD64 <= U+60DE
FEC1 <= U+7676
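
The procedure behind this list can be sketched roughly as follows. This is not the original script: find_duplicates() and python_choice() are hypothetical names, the decoder is assumed to be a plain {big5: code point} dict, and the codecs used are the ones Python ships as 'big5' and 'big5hkscs'.

```python
from collections import defaultdict


def find_duplicates(decoder):
    """Return {code point: [big5, ...]} for code points mapped more than once."""
    by_cp = defaultdict(list)
    for big5, cp in decoder.items():
        by_cp[cp].append(big5)
    return {cp: sorted(values) for cp, values in by_cp.items()
            if len(values) > 1}


def python_choice(cp, codecs=("big5", "big5hkscs")):
    """Encode cp with the first codec that accepts it; None if none does."""
    for name in codecs:
        try:
            data = chr(cp).encode(name)
        except UnicodeEncodeError:
            continue  # fall back to the next codec
        return int.from_bytes(data, "big")
    return None


if __name__ == "__main__":
    # Tiny stand-in for the real table, using values quoted in this thread.
    decoder = {0xA2A4: 0x2550, 0xF9F9: 0x2550, 0xA440: 0x4E00}
    for cp in sorted(find_duplicates(decoder)):
        choice = python_choice(cp)
        if choice is not None:
            print("%04X <= U+%04X" % (choice, cp))
```

Trying 'big5' first means the HKSCS codec is consulted only for characters that plain Big5 cannot encode.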

These are the ones where you (Øistein) disagree:

C6CF <= U+5EF4
C6D3 <= U+65E0
C6D5 <= U+7676
C6D7 <= U+96B6

AFAICT this has nothing to do with compatibility mappings, so what's the reason for this?

F9E9 <= U+255E
F9EA <= U+256A
F9EB <= U+2561
F9F9 <= U+2550


Python's big5-hkscs agrees, but Python's big5 does this instead:

A2A5 <= U+255E
A2A6 <= U+256A
A2A7 <= U+2561
A2A4 <= U+2550

It seems safer to go with the big5 mappings, but checking what browsers do would be helpful.


How about the rest of my generated list, is that fine?

On Fri Apr 6 14:03:22 PDT 2012, Philip Jägenstedt <philipj at opera.com> wrote:
There are 29 mappings to U+003F (?) in IE that no other browser has.
Are you referring to the ones at A3E2--A3FE? IE decodes (or used to decode) the control pictures at A3C0--A3E0 as C0 control characters in plain text, but replace(s) them with question marks in HTML. It looks like this treatment has been extended to the remaining A3xx codepoints (after the euro), perhaps without a good reason.


Yes, that's the range. I think we should leave these undefined.

The remaining mappings are to PUA or U+FFFD in all browsers [...]. Mapping these to U+FFFD unless anyone finds pages using these byte sequences seems the only sane option.
Agreed. Do any of these ever render in a meaningful way (e.g., in IE on a Windows machine with HK locale and appropriate HKSCS PUA fonts)?
The following 22 codepoints are 'reserved for backwards compatibility' in the HKSCS-2008 standard, but no Unicode mappings are provided:
9EAC
9EC4
9EF4
9F4E
9FAD
9FB1
9FC0
9FC8
9FDA
9FE6
9FEA
9FEF
A054
A057
A05A
A062
A072
A0A5
A0AD
A0AF
A0D3
A0E1
I assume some systems will render at least these as potentially meaningful Han characters.

I generated <http://people.opera.com/philipj/2012/04/08/big5-undefined-ie.txt> and had a look using various Chinese fonts in Windows 7. It looks like most fonts have a copy of the printable ASCII characters in U+F020 through U+F07E, and what looks like parts of windows-1252 or latin-1 up to U+F0FF.

Exactly the 22 codepoints you list *are* Han characters in the MingLiu_HKSCS font, see <http://people.opera.com/philipj/2012/04/08/big5-mingliu-hkscs.png>. Presumably they were not in Unicode when HKSCS-2008 was defined, but if they have been added since I think we should simply map them. Unfortunately, I haven't been able to find them by searching by radicals in the Unihan database...


--
Philip Jägenstedt
Core Developer
Opera Software
