On 12 Apr 2012, at 08:26, Philip Jägenstedt wrote:

>>> Possibly, one could argue that U+2F33 normalizes (NFKC) to U+5E7A, but it's 
>>> not the only hanzi in HKSCS-2008 that normalizes into something else:
> 
> That the characters in the above list look slightly different is really a 
> font issue, they are canonically equivalent in Unicode and therefore the 
> same, AFAICT.

Sorry, you are right about that, of course.  U+2F33 and U+5E7A are only 
compatibility-equivalent, not canonically equivalent, and I assumed without 
thinking that the same held for the others as well.
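
For what it is worth, the distinction can be checked with Python's 
unicodedata module.  The following is only a rough sketch, assuming Python 3; 
the helper name equivalence() is made up for illustration, and U+F900 is just 
a generic example of a canonical singleton, not a character taken from the 
list discussed above:

```python
import unicodedata

def equivalence(a, b):
    """Classify how two strings relate under Unicode normalization.

    Hypothetical helper, not part of any library.
    """
    if a == b:
        return "identical"
    # Canonical equivalence: same result after NFC.
    if unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b):
        return "canonical"
    # Compatibility equivalence: same result only after NFKC.
    if unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b):
        return "compatibility"
    return "distinct"

# Kangxi radical vs. the unified ideograph it maps to under NFKC.
print(equivalence("\u2f33", "\u5e7a"))
# Generic canonical singleton, for contrast (illustration only).
print(equivalence("\uf900", "\u8c48"))
# Halfwidth black square vs. black square.
print(equivalence("\uffed", "\u25a0"))
```

The results depend on the Unicode data bundled with the interpreter, so this 
is a sanity check rather than an authoritative statement about HKSCS-2008.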

> U+2F33 is indeed the only Kangxi Radical (2F00-2FDF) mapped by HKSCS-2008 and 
> I agree that it's weird. However [...], I'm not really comfortable with 
> fixing bugs in HKSCS-2008, at least not based only on agreement by two 
> Northern Europeans like us... If users or implementors from Hong Kong or 
> Taiwan also speak up for U+5E7A, then I will not object.

I certainly agree with that sentiment.

>>>>> F9FE =>
> [...]
> U+FFED decomposes to U+25A0 which could perhaps be more appropriate,

Yes, except that A1BD maps to U+25A0.

> but I suggest sticking with U+FFED and recommending people to use UTF-8 if 
> they want some particular square shape.

That makes sense.  Cf. Python again for a less web-centric point of view:

>>> b'\xf9\xfe'.decode('big5-hkscs')
u'\uffed'
>>> b'\xf9\xfe'.decode('cp950')
u'\u2593'
>>> b'\xf9\xfe'.decode('big5')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'big5' codec can't decode bytes in position 0-1: illegal 
multibyte sequence

>> Does this imply that Python's big5 (non-HK) implementation does not include 
>> the corresponding E-Ten 2 (forward) mappings for decoding either?
> 
> So says python3:
> 
>>>> b'\xf9\xe9'.decode('big5')
> Traceback (most recent call last):
>  File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'big5' codec can't decode bytes in position 0-1: illegal 
> multibyte sequence
>>>> b'\xf9\xe9'.decode('big5-hkscs')
> '╞'

Python also says:

>>> b'\xf9\xe9'.decode('cp950')
u'\u255e'
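
The sessions above can be folded into one small script that tries each codec 
on the same bytes.  This is a minimal sketch, assuming Python 3 and its 
bundled 'big5', 'big5-hkscs' and 'cp950' codecs; the function name 
decode_with() is invented here, and the results may differ between Python 
versions, so no particular output is promised:

```python
def decode_with(raw, codec_names=("big5", "big5-hkscs", "cp950")):
    """Decode the same bytes with several codecs.

    Returns a dict mapping codec name to the decoded string, or to None
    when the codec rejects the byte sequence.  Hypothetical helper.
    """
    results = {}
    for name in codec_names:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results

# The two byte sequences discussed in this thread.
for raw in (b"\xf9\xfe", b"\xf9\xe9"):
    for name, text in decode_with(raw).items():
        if text is None:
            shown = "undecodable"
        else:
            # A sequence may decode to more than one code point.
            shown = " ".join("U+%04X" % ord(ch) for ch in text)
        print(raw.hex(), name, shown)
```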

> Are there any sites that use these line drawing characters that would be 
> fixed by this? If not, I'm quite willing to accept the historical accidents 
> and move on :)

Probably not many.  Still, it seems safe to fix these four mappings if the 
characters are ever added to Unicode.

Øistein E. Andersen
