On Sat, 07 Apr 2012 16:04:55 +0200, Øistein E. Andersen <[email protected]>
wrote:
On Fri Apr 6 14:03:22 PDT 2012, Philip Jägenstedt <philipj at opera.com>
wrote:
So, <http://people.opera.com/philipj/2012/04/06/big5-foolip.txt> is the
mapping I suggest, with 18594 defined mappings and 1188 U+FFFD.
(Second byte 0xA1 appears as 0x7F in the mapping file.)
Oops, I blame Anne's get_bytes() which had an off-by-one error. This is
the corrected version:
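def get_bytes(index):
    # Number of cells per lead byte.
    row = 0xFE-0xA1 + RANGE + 1
    lead = (index // row) + 0x81
    cell = index % row
    # The first RANGE cells map to trail bytes starting at 0x40,
    # the remaining cells to 0xA1..0xFE.
    trail = (cell + 0xA1 - RANGE) if cell >= RANGE else cell + 0x40
    return (lead, trail)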
I've also updated big5-foolip.txt with this fix.
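As a sanity check on the corrected function, here is a minimal sketch of its inverse together with a round trip. It assumes RANGE is 63, the number of trail bytes in 0x40..0x7E (the snippet above does not show its definition). get_index() is a hypothetical helper name, not part of Anne's script.

```python
# Assumption: RANGE is the count of trail bytes in 0x40..0x7E.
RANGE = 0x7E - 0x40 + 1           # 63
ROW = 0xFE - 0xA1 + RANGE + 1     # 157 cells per lead byte


def get_bytes(index):
    """Map a table index to a (lead, trail) byte pair."""
    lead = index // ROW + 0x81
    cell = index % ROW
    trail = cell + 0xA1 - RANGE if cell >= RANGE else cell + 0x40
    return (lead, trail)


def get_index(lead, trail):
    """Hypothetical inverse of get_bytes()."""
    if 0x40 <= trail <= 0x7E:
        cell = trail - 0x40
    elif 0xA1 <= trail <= 0xFE:
        cell = trail - 0xA1 + RANGE
    else:
        raise ValueError("invalid trail byte: 0x%02X" % trail)
    return (lead - 0x81) * ROW + cell


# Every index for lead bytes 0x81..0xFE should survive a round trip.
for i in range((0xFE - 0x81 + 1) * ROW):
    assert get_index(*get_bytes(i)) == i
```

The boundary between the two trail ranges is where an off-by-one would show up: under the assumption above, index 62 should give trail 0x7E and index 63 should give trail 0xA1.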
Your table is very similar to my idea of an ideal Big5-HKSCS decoder,
which follows Unihan, official Big5-HKSCS mappings and real
implementations closely. My version is documented in detail at
<http://coq.no/character-tables/chinese-traditional/en> and summarised
at <http://coq.no/character-tables/charset5/en>.
Only 2 mappings differ, viz. C6CD (U+2F33 v. U+5E7A) and F9FE (U+FFED v.
U+2593), which is quite reassuring given that we have worked
independently and used somewhat different approaches. The two divergent
mappings and related issues are discussed below.
Yes, it's very encouraging indeed that we got this close independently and
with different methods!
[in summary:]
C6CF => U+5EF4 廴 [v.] U+2F35 ⼵
C6D3 => U+65E0 无 [v.] U+2F46 ⽆
C6D5 => U+7676 癶 [v.] U+2F68 ⽨
C6D7 => U+96B6 隶 [v.] U+2FAA ⾪
C6DE => U+3003 〃 [v.] U+F6EE [PUA]
C6DF => U+4EDD 仝 [v.] U+F6EF [PUA]
To this list can be added:
C6CD => U+5E7A v. U+2F33
These seven characters are all part of the E-Ten 1 extension [1] to
Big5, which is included in all implementations of Big5 (with or without
HK extensions) that I have come across in browsers. The official
Big5-HKSCS table includes the E-Ten 1 extension as well, but the seven
characters listed above appear elsewhere in the HKSCS extensions and are
handled specially to avoid encoding the same character more than once.
[1] <http://coq.no/character-tables/eten1.pdf>
<http://coq.no/character-tables/eten1.js>
What is the source for the mappings in eten1.pdf? I assume that E-Ten was
originally just some Big5 fonts with no defined mappings to Unicode?
Five are Kangxi radicals and encoded twice in Unicode (once as radicals,
once as normal Han characters). The official Big5-HKSCS table maps C6CD
to the Unicode Kangxi radical U+2F33 and does not list the remaining
four codepoints at all. U+2F33 is the only Unicode Kangxi radical
included in the official Big5-HKSCS table. As you have noticed, some
Big5-HKSCS implementations follow this idea for the remaining four as
well. It seems better to follow non-HK Big5 implementations here and
map all five to normal Unicode Han characters.
For the last two characters, U+3003 and U+4EDD, there is only one
possible Unicode mapping, so duplicates are impossible to avoid (without
using PUA characters). The official Big5-HKSCS table does not map C6DE
and C6DF to anything.
Suggested change: map C6CD to U+5E7A.
These are the existing mappings:
C6CD =>
opera-hk: U+2F33 ⼳
firefox: U+5E7A 幺
chrome: U+F6DD
firefox-hk: U+5E7A 幺
opera: U+2F33 ⼳
chrome-hk: U+2F33 ⼳
internetexplorer: U+F6DD
hkscs-2008: <U+2F33> ⼳
At least on the Web, this isn't a question of HK vs non-HK mappings. Other
than Firefox, which (de-facto) specs or implementations use U+5E7A?
Possibly, one could argue that U+2F33 normalizes (NFKC) to U+5E7A, but
it's not the only hanzi in HKSCS-2008 that normalizes into something else:
8BC3 => <U+2F878> 屮 => <U+5C6E> 屮
8BF8 => <U+F907> 龜 => <U+9F9C> 龜
8EFD => <U+2F994> 芳 => <U+82B3> 芳
8FA8 => <U+2F9B2> 䕫 => <U+456B> 䕫
8FF0 => <U+2F9D4> 貫 => <U+8CAB> 貫
C6CD => <U+2F33> ⼳ => <U+5E7A> 幺
957A => <U+2F9BC> 蜨 => <U+8728> 蜨
9874 => <U+2F825> 勇 => <U+52C7> 勇
9AC8 => <U+2F83B> 吆 => <U+5406> 吆
9C52 => <U+2F8CD> 晉 => <U+6649> 晉
A047 => <U+2F840> 咢 => <U+54A2> 咢
FC48 => <U+2F894> 弢 => <U+5F22> 弢
FC77 => <U+2F8A6> 慈 => <U+6148> 慈
I'm not sure what the conclusion is...
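The normalization claims in the list above can be re-checked with Python's unicodedata module. This is a minimal sketch, assuming Python 3. nfkc_target() is a hypothetical helper, and the loop only repeats a few of the HKSCS-2008 code points quoted above.

```python
import unicodedata


def nfkc_target(cp):
    """Return the code points that cp becomes under NFKC, or None if unchanged."""
    original = chr(cp)
    normalized = unicodedata.normalize("NFKC", original)
    if normalized == original:
        return None
    return [ord(c) for c in normalized]


# (Big5-HKSCS code, HKSCS-2008 Unicode mapping), taken from the list above.
for code, cp in [(0xC6CD, 0x2F33), (0x8BF8, 0xF907), (0x8BC3, 0x2F878)]:
    target = nfkc_target(cp)
    if target is not None:
        shown = " ".join("U+%04X" % t for t in target)
        print("%04X => U+%04X => %s" % (code, cp, shown))
```

Running the same check over a full mapping table would show whether the 13 entries listed are the only ones affected.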
On Fri Apr 6 06:42:26 PDT 2012, Philip Jägenstedt <philipj at opera.com>
wrote:
Also, a single mapping fails the Big5-contra[di]ction test:
F9FE =>
opera-hk: U+FFED ■
firefox: U+2593 ▓
chrome: U+2593 ▓
firefox-hk: U+2593 ▓
opera: U+2593 ▓
chrome-hk: U+FFED ■
internetexplorer: U+2593 ▓
hkscs-2008: <U+FFED> ■
I'd say that we should go with U+FFED here, since that's what the
[HKSCS-2008] spec says and it's visually close anyway.
Given that the goal is to define a unified Big5 (non-HK) and Big5-HKSCS
encoding and that this seems to be a case of the HK standard going
against everything and everyone else, perhaps more weight should be
given to existing specifications and (non-HK-specific) implementations.
Suggested change: map F9FE to U+2593
This is the only mapping where IE maps something other than PUA or "?"
that my mapping doesn't agree with, so I don't object to changing it.
Still, it would be very interesting to know why HKSCS-2008 changed it.
Do you know?
Duplicates and reverse mappings:
big5-foolip.txt currently provides two different codepoints for 100
Unicode characters.
6 are mentioned above. 84 result from compatibility mappings defined in
the official HKSCS-2008 specification, cf. [2]. This leaves 10:
[2] <http://coq.no/character-tables/o-h-comp.pdf>
<http://coq.no/character-tables/o-h-comp.js>
U+5341 (十, 'ten') and U+5345 (卅, 'thirty') are encoded twice in Big5,
once as numerals and once as standard Han characters. (U+5344 卄
'twenty' is only encoded once in Big5, but was added to HKSCS and is now
one of the 84 compatibility mappings.)
According to Lunde, the four codepoints F9FA--F9FD (ETen-2 extension)
are supposed to encode double-stroked circle segments which appear to be
missing from Unicode (I am not sure whether they have ever been proposed
for inclusion). They are currently mapped to single-stroked circle
segments instead, but those are already encoded at A27E, A2A1--A2A3
(original Big5).
The four codepoints F9F9, F9E9, F9EB, F9EA (ETen-2 extension) encode
line-drawing characters with a double horizontal line. These appear to
be encoded at A2A4--A2A7 (original Big5) already, and it is not clear to
me whether the characters at A2A4--A2A7 are supposed to be different or
whether ETen chose to encode them again to have a full set of
line-drawing characters in one location.
Suggested reverse mappings:
C6CF <= U+5EF4
C6D3 <= U+65E0
C6D5 <= U+7676
C6D7 <= U+96B6
C6DE <= U+3003
C6DF <= U+4EDD
C6CD <= U+5E7A (if the mapping of C6CD is changed)
Rationale: Only these mappings will work for non-HK Big5
implementations, and these characters appear to be important not only in
Hong Kong.
A451 <= U+5341
A4CA <= U+5345
A27E <= U+256D
A2A1 <= U+256E
A2A3 <= U+256F
A2A2 <= U+2570
F9F9 <= U+2550
F9E9 <= U+255E
F9EB <= U+2561
F9EA <= U+256A
(The 84 compatibility mappings should obviously only be used to decode
and never as reverse mappings.)
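One possible way to express such reverse mappings is sketched below. The default of picking the lowest Big5 code is only an assumption made for illustration, and build_encode_table(), preferred and decode_only are hypothetical names. The table contains just four entries taken from this thread.

```python
def build_encode_table(decode_table, preferred=None, decode_only=frozenset()):
    """Derive a code point -> Big5 code table from a Big5 code -> code point table.

    decode_only: Big5 codes that must never be produced by the encoder.
    preferred:   explicit code point -> Big5 code choices for duplicates.
    """
    encode = {}
    for code in sorted(decode_table):
        cp = decode_table[code]
        if code in decode_only or cp == 0xFFFD:
            continue
        # Illustrative default: the lowest Big5 code wins.
        encode.setdefault(cp, code)
    for cp, code in (preferred or {}).items():
        # An explicit choice must agree with the decoder.
        if decode_table.get(code) != cp:
            raise ValueError("%04X does not decode to U+%04X" % (code, cp))
        encode[cp] = code
    return encode


# Tiny excerpt: two characters that each have two Big5 codes.
decode_table = {0xA2A4: 0x2550, 0xF9F9: 0x2550, 0xA1B2: 0x3003, 0xC6DE: 0x3003}
suggested = {0x2550: 0xF9F9, 0x3003: 0xC6DE}

for cp, code in sorted(build_encode_table(decode_table, suggested).items()):
    print("%04X <= U+%04X" % (code, cp))
```

Passing the Big5 codes of the 84 compatibility mappings as decode_only would keep them out of the encoder while leaving the decoder untouched.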
Anne, how do you plan to define encoders for tables with duplicate
mappings? Have you collected data for what browsers currently do?
In any event, it clearly needs to be defined what to do for these 100 code
points that have multiple mappings to Big5. I extended my Python script to
find these 100 duplicates and to check what Python did for 'big5', falling
back to 'big5-hkscs'. This is what it produced:
8FB6 <= U+880F
90C4 <= U+96B6
91BE <= U+9F17
9242 <= U+8503
9361 <= U+5F0C
9455 <= U+7250
947A <= U+7468
96EE <= U+701E
9975 <= U+732A
9CE4 <= U+975D
9DEF <= U+5605
9DFB <= U+5ED0
A05F <= U+936E
A0D4 <= U+89A9
A0DC <= U+60A4
A1B2 <= U+3003
A259 <= U+5159
A25A <= U+515B
A25B <= U+515E
A25C <= U+515D
A260 <= U+74E9
A261 <= U+7CCE
A27E <= U+256D
A2A1 <= U+256E
A2A2 <= U+2570
A2A3 <= U+256F
A2A4 <= U+2550
A2A5 <= U+255E
A2A6 <= U+256A
A2A7 <= U+2561
A2CD <= U+5344
A451 <= U+5341
A4CA <= U+5345
A55D <= U+5305
A7FB <= U+675E
A9E4 <= U+62D0
A9F0 <= U+62CE
AACC <= U+8005
ABEC <= U+6062
ADC5 <= U+5029
ADEB <= U+537F
AFB0 <= U+79E3
B05F <= U+8D77
B0B0 <= U+507D
B3A3 <= U+90FD
B440 <= U+5A77
B4B8 <= U+6674
B4E4 <= U+6E2F
B4FC <= U+6E1D
B54E <= U+716E
B5AE <= U+7B51
B5D7 <= U+83C1
B7EC <= U+745C
B9B0 <= U+50ED
BAE6 <= U+7BB8
BAFC <= U+7DD2
BCB5 <= U+6490
BF47 <= U+6FB6
BFA6 <= U+7E1D
BFAE <= U+8028
BFCC <= U+89A6
C052 <= U+975C
C0E7 <= U+71DF
C554 <= U+97FF
C5F7 <= U+77D7
C95C <= U+5C10
C969 <= U+4EDD
C9DB <= U+5E75
C9FC <= U+6C4A
CA52 <= U+9097
CB58 <= U+6C9C
CDE7 <= U+4FBB
CFF1 <= U+7809
D0C0 <= U+91D4
D256 <= U+6D67
D4D1 <= U+5A67
D8F4 <= U+5F58
DB5D <= U+83CF
DB79 <= U+840F
DC52 <= U+9104
DE72 <= U+7162
DECD <= U+75F9
E07C <= U+8F0B
E3C8 <= U+84A8
E6AB <= U+7479
E6D0 <= U+799B
E8CD <= U+99D6
E959 <= U+5B28
EBC9 <= U+8F36
EDCA <= U+7C06
EFF9 <= U+7201
F1E3 <= U+9F16
F5E8 <= U+7E87
F86D <= U+9DF0
F9C4 <= U+9B2E
F9D7 <= U+92B9
FBFD <= U+5EF4
FCD3 <= U+65E0
FD64 <= U+60DE
FEC1 <= U+7676
These are the ones where you (Øistein) disagree:
C6CF <= U+5EF4
C6D3 <= U+65E0
C6D5 <= U+7676
C6D7 <= U+96B6
AFAICT this has nothing to do with compatibility mappings, so what's the
reason for this?
F9E9 <= U+255E
F9EA <= U+256A
F9EB <= U+2561
F9F9 <= U+2550
Python's big5-hkscs agrees, but Python's big5 does this instead:
A2A5 <= U+255E
A2A6 <= U+256A
A2A7 <= U+2561
A2A4 <= U+2550
It seems safer to go with the big5 mappings, but checking what browsers do
would be helpful.
How about the rest of my generated list, is that fine?
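For reference, the duplicate check described above can be sketched roughly as follows. This is not the actual script: it assumes Python 3, find_duplicates() and python_choice() are hypothetical names, and the table is a three-entry excerpt rather than big5-foolip.txt.

```python
from collections import defaultdict


def find_duplicates(decode_table):
    """Return {code point: [Big5 codes]} for code points with several codes."""
    by_cp = defaultdict(list)
    for code, cp in decode_table.items():
        by_cp[cp].append(code)
    return {cp: sorted(codes) for cp, codes in by_cp.items() if len(codes) > 1}


def python_choice(cp):
    """Big5 code that Python encodes cp to: 'big5' first, then 'big5hkscs'."""
    ch = chr(cp)
    for codec in ("big5", "big5hkscs"):
        try:
            data = ch.encode(codec)
        except UnicodeEncodeError:
            continue
        if len(data) == 2:
            return int.from_bytes(data, "big")
    return None


# Excerpt only: U+2550 has two Big5 codes here, U+5341 has one.
table = {0xA2A4: 0x2550, 0xF9F9: 0x2550, 0xA451: 0x5341}

for cp, codes in sorted(find_duplicates(table).items()):
    choice = python_choice(cp)
    shown = "none" if choice is None else "%04X" % choice
    print("U+%04X: candidates %s, Python picks %s"
          % (cp, " ".join("%04X" % c for c in codes), shown))
```

The result depends on the codec tables shipped with the Python version in use, so the output may not match the list above exactly.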
On Fri Apr 6 14:03:22 PDT 2012, Philip Jägenstedt <philipj at opera.com>
wrote:
There are 29 mappings to U+003F (?) in IE that no other browser has.
Are you referring to the ones at A3E2--A3FE? IE decodes (or used to
decode) the control pictures at A3C0--A3E0 as C0 control characters in
plain text, but replace(s) them with question marks in HTML. It looks
like this treatment has been extended to the remaining A3xx
codepoints (after the euro), perhaps without a good reason.
Yes, that's the range. I think we should leave these undefined.
The remaining mappings are to PUA or U+FFFD in all browsers [...]. Mapping
these to U+FFFD unless anyone finds pages using these byte sequences seems
the only sane option.
Agreed. Do any of these ever render in a meaningful way (e.g., in IE on
a Windows machine with HK locale and appropriate HKSCS PUA fonts)?
The following 22 codepoints are 'reserved for backwards compatibility'
in the HKSCS-2008 standard, but no Unicode mappings are provided:
9EAC
9EC4
9EF4
9F4E
9FAD
9FB1
9FC0
9FC8
9FDA
9FE6
9FEA
9FEF
A054
A057
A05A
A062
A072
A0A5
A0AD
A0AF
A0D3
A0E1
I assume some systems will render at least these as potentially
meaningful Han characters.
I generated
<http://people.opera.com/philipj/2012/04/08/big5-undefined-ie.txt> and had
a look using various Chinese fonts in Windows 7. It looks like most fonts
have a copy of the printable ASCII characters in U+F020 through U+F07E,
and what looks like parts of windows-1252 or latin-1 up to U+F0FF.
Exactly the 22 codepoints you list *are* Han characters in the
MingLiu_HKSCS font, see
<http://people.opera.com/philipj/2012/04/08/big5-mingliu-hkscs.png>.
Presumably they were not in Unicode when HKSCS-2008 was defined, but if
they have been added since, I think we should simply map them.
Unfortunately, I haven't been able to find them by searching by radicals
in the Unihan database...
--
Philip Jägenstedt
Core Developer
Opera Software