Whilst writing a CJK pinyin lookup utility over the weekend I noticed that for
some CJK ideographs in the Unihan database that have multiple Mandarin readings,
the secondary reading(s) have been wrongly associated with adjacent or nearby
ideographs. For example:

U+543E  kMandarin       WU2 YA5
U+5440  kMandarin       YA1
-- YA5 is another reading for U+5440, but is not a reading for U+543E

U+54F5  kMandarin       BA1 HNG5
U+54FC  kMandarin       HENG1
-- HNG5 is another reading for U+54FC, but is not a reading for U+54F5

U+963E  kMandarin       A3 A4 A5 E1 E3 LING3
U+963F  kMandarin       A1
-- A3 A4 A5 E1 E3 are all readings for U+963F, but none are readings for U+963E

U+97A1  kMandarin       ENG1 LA5
U+97A5  kMandarin       YI4
-- ENG1 is a reading for U+97A5, but is not a reading for U+97A1 (in fact U+97A5
is the only character in Xiandai Hanyu Cidian that has a reading of ENG)

I know that the Mandarin readings given in the Unihan database are informative
only and may not necessarily correspond to the expectations of any given user
(as we have seen before on this list), but the problem I have noticed is not
that the readings are wrong or dubious per se, but rather that the correct
readings have been assigned to the wrong ideographs. This seems to be a problem
only for ideographs that have multiple Mandarin readings, and it affects only
the non-primary readings (even though the readings in each entry are sorted
alphabetically, so the primary reading is not necessarily listed first).
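
A rough way to screen for this pattern automatically is sketched below in
Python. It is only a heuristic, not the method I used. It assumes the
whitespace-separated "U+XXXX kMandarin READINGS" line format quoted in this
message. The function names and the window of 8 code points are my own
hypothetical choices. It flags a reading on a multi-reading ideograph when a
nearby ideograph has a reading with the same syllable (ignoring tone). It will
miss cases such as HNG5/HENG1 and ENG1/YI4, and it will also flag legitimate
neighbours that happen to share a syllable, so every hit needs checking against
a dictionary.

```python
import re

# Assumed line format, as quoted in this message:
#   U+543E  kMandarin       WU2 YA5
LINE_RE = re.compile(r"^U\+([0-9A-F]{4,6})\s+kMandarin\s+(.+)$")


def parse_kmandarin(lines):
    """Return {code point: [readings]} for every kMandarin line."""
    readings = {}
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            readings[int(m.group(1), 16)] = m.group(2).split()
    return readings


def syllable(reading):
    """Strip the trailing tone digit, e.g. 'YA5' -> 'YA'."""
    return reading.rstrip("012345")


def suspects(readings, window=8):
    """List (code point, reading, neighbour) candidates for manual review.

    A reading is a candidate when its ideograph has two or more readings
    and some other ideograph within `window` code points has a reading
    with the same toneless syllable.
    """
    found = []
    for cp in sorted(readings):
        own = readings[cp]
        if len(own) < 2:
            continue  # single-reading entries are not affected
        for r in own:
            for other in range(cp - window, cp + window + 1):
                if other == cp or other not in readings:
                    continue
                if syllable(r) in {syllable(x) for x in readings[other]}:
                    found.append((cp, r, other))
    return found


SAMPLE = [
    "U+543E\tkMandarin\tWU2 YA5",
    "U+5440\tkMandarin\tYA1",
    "U+54F5\tkMandarin\tBA1 HNG5",
    "U+54FC\tkMandarin\tHENG1",
]

if __name__ == "__main__":
    for cp, reading, other in suspects(parse_kmandarin(SAMPLE)):
        print(f"U+{cp:04X}: {reading} may belong to U+{other:04X}")
```

On the four sample lines this reports only YA5 on U+543E as a candidate for
U+5440. The HNG5 case is not caught because HNG and HENG differ as strings.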

Having noticed the obvious examples given above, I decided to go through the
Unihan entries for the CJK Unified Ideographs block to see if there were any
more examples of misassociated Mandarin readings. I gave up after going through
the entries for just the first five rows of the CJK Unified Ideographs block,
which appear to have at least ten misassociated variant or secondary Mandarin
readings:

U+4E0C  kMandarin       FOU1 FOU3 JI1
U+4E0D  kMandarin       BU4
-- FOU1 and FOU3 are secondary readings for U+4E0D, not U+4E0C

U+4E15  kMandarin       LIANG3 LIANG4 PI1
U+4E12  kMandarin       CHOU3
-- LIANG3 and LIANG4 are variant readings for U+4E12, not U+4E15

U+4E22  kMandarin       DIU1 LIANG4
U+4E21  kMandarin       LIANG3
-- LIANG4 is a secondary reading for U+4E21, not U+4E22

U+4E25  kMandarin       BANG4 YAN2
U+4E26  kMandarin       BING4
-- BANG4 is a secondary reading for U+4E26, not U+4E25

U+4E2B  kMandarin       YA1 ZHONG4
U+4E2D  kMandarin       ZHONG1
-- ZHONG4 is a secondary reading for U+4E2D, not U+4E2B

U+4E33  kMandarin       CHAN3 LIN4
U+4E34  kMandarin       LIN2
-- LIN4 is a secondary reading for U+4E34, not U+4E33

U+4E3B  kMandarin       LI2 ZHU3
U+4E3D  kMandarin       LI4
-- LI2 is a secondary reading for U+4E3D, not U+4E3B

U+4E3E  kMandarin       JU3 NÜE4 TUO1 ZHE4
-- I'm not sure where the readings NÜE4, TUO1 and ZHE4 belong, but I'm pretty
certain it isn't U+4E3E

U+4E49  kMandarin       WU4 YI4
U+4E4C  kMandarin       WU1
-- WU4 is a secondary reading for U+4E4C, not U+4E49

U+4E4F  kMandarin       FA2 LUO4 YAO4 YUE4
U+4E50  kMandarin       LE4
-- LUO4, YAO4 and YUE4 are all secondary readings for U+4E50, not U+4E4F

From the number of misplaced Mandarin readings in this small sample (10 out of
80 characters) it would seem to me that the problem is probably endemic
throughout the CJK Unified Ideographs block (I think that CJK-A is OK, but I
haven't looked carefully enough to be sure).

Is it possible to regenerate the Unihan database with the correct secondary
Mandarin readings?

Andrew
