Multiple encoding used with Unihan database

z-test Tue, 17 Apr 2012 08:04:24 -0700

Good morning!

I frequently consult the Unihan database for detailed information about Japanese and Chinese characters, and I have noticed that at least some pages mix two encodings. Although the main encoding is UTF-8 (as one would expect on the Unihan site), certain characters on those pages are encoded in ISO-8859-1, which makes them unreadable until one forces a change of encoding.


I just looked at these pages:

http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=58b3 (character: 墳)
http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=5893 (character: 墓)

The wrongly encoded characters appear in the Hanyu Pinyin column: the accented letters are encoded in ISO-8859-1 rather than UTF-8, and they only become legible if one changes the encoding setting to ISO-8859-1 (which, of course, renders much of the rest of the page unusable):


kHanyuPinyin 10485.060:fén,fèn
kHanyuPinyin 10470.090:mù
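
To illustrate the symptom, here is a minimal Python sketch. It assumes the page bytes are mostly UTF-8 with stray ISO-8859-1 bytes in the Pinyin field; the names `latin1_fallback` and `decode_mixed` and the sample bytes are made up for illustration and are not taken from the Unihan site's code.

```python
import codecs


def latin1_fallback(err):
    """Error handler: decode bytes that are not valid UTF-8 as ISO-8859-1."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # Return the replacement text and the position at which to resume decoding.
    return bad.decode("iso-8859-1"), err.end


# Hypothetical handler name, registered once for use with bytes.decode().
codecs.register_error("latin1_fallback", latin1_fallback)


def decode_mixed(raw: bytes) -> str:
    """Decode as UTF-8, falling back to ISO-8859-1 for invalid bytes."""
    return raw.decode("utf-8", errors="latin1_fallback")


# Simulated page fragment (an assumption, not real server output):
# the Han character is UTF-8, the Pinyin reading is ISO-8859-1.
raw = "墳 ".encode("utf-8") + "kHanyuPinyin 10485.060:fén,fèn".encode("iso-8859-1")
print(decode_mixed(raw))
```

This only works as a reading aid: an ISO-8859-1 byte run that happens to form valid UTF-8 would be decoded the wrong way, so the proper fix would still be to serve the whole page in UTF-8.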

I suspect that the providers of this information would like to see all of it encoded in UTF-8, and that the current encoding scheme is just an accident. :-)


Thank you for your time!
