On 12 Apr 2012, at 08:26, Philip Jägenstedt wrote:
>>> Possibly, one could argue that U+2F33 normalizes (NFKC) to U+5E7A, but it's
>>> not the only hanzi in HKSCS-2008 that normalizes into something else:
>
> That the characters in the above list look slightly different is really a
> font issue, they are canonically equivalent in Unicode and therefore the
> same, AFAICT.
Sorry, you are right about that, of course. U+2F33 and U+5E7A are not
canonically equivalent, and I just assumed that was the case for the others as
well without thinking.
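
For what it's worth, the difference between the two kinds of equivalence can be checked with Python's unicodedata module. A minimal sketch, assuming Python 3 (`describe` is only an illustrative helper name): NFC should leave U+2F33 alone because it has only a compatibility decomposition, whereas NFKC folds it into U+5E7A.

```python
import unicodedata

def describe(ch):
    """Return (NFC form, NFKC form, raw decomposition field) for one character."""
    return (
        unicodedata.normalize('NFC', ch),
        unicodedata.normalize('NFKC', ch),
        unicodedata.decomposition(ch),
    )

# U+2F33 KANGXI RADICAL SHORT THREAD
nfc, nfkc, decomp = describe('\u2f33')
print(ascii(nfc), ascii(nfkc), decomp)
```

A decomposition field that starts with a tag such as `<compat>` marks a compatibility mapping; a canonical mapping has no tag.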
> U+2F33 is indeed the only Kangxi Radical (2F00-2FDF) mapped by HKSCS-2008 and
> I agree that it's weird. However [...], I'm not really comfortable with
> fixing bugs in HKSCS-2008, at least not based only on agreement by two
> Northern Europeans like us... If users or implementors from Hong Kong or
> Taiwan also speak up for U+5E7A, then I will not object.
I certainly agree with that sentiment.
>>>>> F9FE =>
> [...]
> U+FFED decomposes to U+25A0 which could perhaps be more appropriate,
Yes, except that A1BD maps to U+25A0.
> but I suggest sticking with U+FFED and recommending people to use UTF-8 if
> they want some particular square shape.
That makes sense. Cf. Python again for a less web-centric point of view:
>>> b'\xf9\xfe'.decode('big5-hkscs')
u'\uffed'
>>> b'\xf9\xfe'.decode('cp950')
u'\u2593'
>>> b'\xf9\xfe'.decode('big5')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'big5' codec can't decode bytes in position 0-1: illegal multibyte sequence
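
Regarding the U+25A0 alternative discussed above, the clash with A1BD can be checked directly. A small sketch, assuming Python 3 and its bundled codec tables (which need not match what browsers do): U+FFED folds to U+25A0 under NFKC, but A1BD already decodes to that character, so mapping F9FE there as well would leave two byte sequences for one code point.

```python
import unicodedata

# Compatibility folding of U+FFED HALFWIDTH BLACK SQUARE
folded = unicodedata.normalize('NFKC', '\uffed')

# What the existing Big5 code point A1BD decodes to
existing = b'\xa1\xbd'.decode('big5-hkscs')

print(ascii(folded), ascii(existing), folded == existing)

# Encoding can only ever pick one byte sequence for the character
print('\u25a0'.encode('big5-hkscs'))
```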
>> Does this imply that Python's big5 (non-HK) implementation does not include
>> the corresponding E-Ten 2 (forward) mappings for decoding either?
>
> So says python3:
>
>>>> b'\xf9\xe9'.decode('big5')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'big5' codec can't decode bytes in position 0-1: illegal multibyte sequence
>>>> b'\xf9\xe9'.decode('big5-hkscs')
> '╞'
Python also says:
>>> b'\xf9\xe9'.decode('cp950')
u'\u255e'
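
To compare the three codecs in one go, something along these lines should do. It is only a sketch, assuming Python 3; `decode_all` is a hypothetical helper, not part of any library, and it reports an undecodable sequence as None instead of raising.

```python
def decode_all(raw, codecs=('big5', 'cp950', 'big5-hkscs')):
    """Map each codec name to the decoded text, or None if decoding fails."""
    results = {}
    for name in codecs:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results

# The two byte pairs discussed in this thread
for raw in (b'\xf9\xfe', b'\xf9\xe9'):
    print(raw, {name: ascii(text) for name, text in decode_all(raw).items()})
```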
> Are there any sites that use these line drawing characters that would be
> fixed by this? If not, I'm quite willing to accept the historical accidents
> and move on :)
Probably not many. Still, it seems safe to fix these four mappings if the
characters are ever added to Unicode.
Øistein E. Andersen