On Wed, 28 Mar 2012 15:36:35 +0200, Anne van Kesteren <ann...@opera.com>
wrote:
On Wed, 28 Mar 2012 12:18:41 +0200, Anne van Kesteren <ann...@opera.com>
wrote:
I'm not sure what to do with big5 and big5-hkscs. After generating all
possible byte sequences (lead bytes 0x81 to 0xFE, trail bytes 0x40 to
0x7E and 0xA1 to 0xFE) and getting the code points for those in various
browsers there does not seem to be that much interoperability.
http://html5.org/temp/big5.json has all the code points for Internet
Explorer ("internetexplorer", same for big5 and hkscs), Firefox
("firefox" and "firefox-hk"), Opera ("opera" and "opera-hk"), and
Chrome ("chrome" and "chrome-hk"). "internetexplorer" and "chrome" are
quite close, the rest is a little further apart.
Some help as to how best to proceed would be appreciated.
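As a rough illustration of the enumeration described above, the following Python sketch generates every two-byte sequence in the quoted lead and trail byte ranges. The function name is hypothetical and this is not the script that produced big5.json; it only shows how the ranges combine into the 19782 candidates cited later in the thread.

```python
def big5_byte_sequences():
    """Yield every two-byte sequence in the ranges quoted in the thread."""
    lead_bytes = range(0x81, 0xFE + 1)                    # 126 lead bytes
    trail_bytes = (list(range(0x40, 0x7E + 1))            # 63 trail bytes
                   + list(range(0xA1, 0xFE + 1)))         # 94 trail bytes
    for lead in lead_bytes:
        for trail in trail_bytes:
            yield bytes([lead, trail])


sequences = list(big5_byte_sequences())
print(len(sequences))  # 126 * 157 = 19782
```

Each sequence could then be decoded in a browser (or with a codec) to record the resulting code point.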
To give some more context, IE treats big5 and big5-hkscs identically. Out
of the total 19782 code points, 6217 of them map to the Private Use Area
(PUA) in IE. Chrome does the same for big5, but has a different mapping
for big5-hkscs. To deal with HKSCS Microsoft brought out this patch:
http://www.microsoft.com/hk/hkscs/ Basically people living in the Hong
Kong area are expected to have that installed and therefore the PUA code
points map to different glyphs. I'm not sure what the situation is like
on Mac or Linux, but given the market share statistics I saw the market
is pretty heavily dominated by Microsoft.
Gecko seems to use a combination of things as documented in
https://bugzilla.mozilla.org/show_bug.cgi?id=310299 though it is unclear
how successful that approach is.
There are also various threads online such as
http://www.google.com/support/forum/p/Chrome/thread?tid=466c210af3fb6d08
that seem to indicate "pages in the Hong Kong area" are not using the
big5-hkscs label and therefore rely on what IE and Chrome do for big5
and rely on users having the compatible fonts.
Making big5 and big5-hkscs aliases sounds like a good idea, on the
assumption that big5-hkscs is a pure extension of Big5.
To make this more concrete, here are a few fairly common characters that I
think are in big5-hkscs but not in big5, their Unicode code point and byte
representation in big5-hkscs when converted using Python:
啫 U+556B '\x94\xdc'
嗰 U+55F0 '\x9d\xf5'
嘅 U+5605 '\x9d\xef'
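The conversion above can be reproduced roughly as follows in Python 3, using the standard library codecs "big5hkscs" and "big5". The helper name encode_or_none is hypothetical. Python's tables are only one implementation and say nothing about what any browser does.

```python
def encode_or_none(char, codec):
    """Return the encoded bytes, or None if the codec cannot encode char."""
    try:
        return char.encode(codec)
    except UnicodeEncodeError:
        return None


for char in "啫嗰嘅":
    print(
        "U+%04X" % ord(char),
        encode_or_none(char, "big5hkscs"),  # bytes under Python's HKSCS table
        encode_or_none(char, "big5"),       # None if absent from Python's Big5 table
    )
```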
I'm not sure how to use big5.json, so perhaps you can tell me what these
map to in various browsers? If they all map to the same code points,
examples of byte sequences that differ between browsers would be interesting.
It seems fairly obvious that the most sane solution would be to just use a
more correct mapping that doesn't involve the PUA, but:
1. What is the compatible subset of all browsers?
2. Does that subset include anything mapping to the PUA?
3. Do Hong Kong or Taiwan sites depend on charCodeAt returning values in
the PUA?
4. Would hacks be needed on the font-loading side if browsers started
using a more correct mapping?
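Questions 1 and 2 could be approached with something like the Python sketch below. It assumes each browser's table has already been loaded into a dict from a byte-sequence key to a code point. That layout is an assumption, not the documented structure of big5.json. The toy values are invented placeholders, not real browser data.

```python
def is_private_use(code_point):
    """True for the BMP Private Use Area and the two supplementary PUA planes."""
    return (0xE000 <= code_point <= 0xF8FF
            or 0xF0000 <= code_point <= 0xFFFFD
            or 0x100000 <= code_point <= 0x10FFFD)


def compatible_subset(mappings):
    """Keep only byte sequences that every browser maps to the same code point.

    mappings: {browser_name: {byte_sequence_key: code_point}} (assumed layout).
    """
    tables = list(mappings.values())
    first = tables[0]
    shared_keys = set(first).intersection(*tables[1:])
    return {key: first[key] for key in shared_keys
            if all(table[key] == first[key] for table in tables)}


# Invented placeholder data, for illustration only.
toy = {
    "browser_a": {"A440": 0x4E00, "9DEF": 0xE2B8, "8140": 0xEEB8},
    "browser_b": {"A440": 0x4E00, "9DEF": 0x5605, "8140": 0xEEB8},
}

agreed = compatible_subset(toy)
agreed_pua = sorted(key for key, cp in agreed.items() if is_private_use(cp))
print(sorted(agreed), agreed_pua)
```

Questions 3 and 4 concern site and font behaviour and cannot be answered from the mapping tables alone.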
--
Philip Jägenstedt
Core Developer
Opera Software