Hi there, I'm going through NormalizationTest.txt in the 6.3.0d1 database, and I ran across this line:
0061 0305 0315 0300 05AE 0062;0061 05AE 0305 0300 0315 0062;0061 05AE 0305 0300 0315 0062;0061 05AE 0305 0300 0315 0062;0061 05AE 0305 0300 0315 0062; # (a◌̅◌̕◌̀◌֮b; a◌֮◌̅◌̀◌̕b; a◌֮◌̅◌̀◌̕b; a◌֮◌̅◌̀◌̕b; a◌֮◌̅◌̀◌̕b; ) LATIN SMALL LETTER A, COMBINING OVERLINE, COMBINING COMMA ABOVE RIGHT, COMBINING GRAVE ACCENT, HEBREW ACCENT ZINOR, LATIN SMALL LETTER B

The relevant parts for my question are:

Source: 0061 0305 0315 0300 05AE 0062
NFD: 0061 05AE 0305 0300 0315 0062
NFC: 0061 05AE 0305 0300 0315 0062

I agree with the NFD decomposition result, but the NFC one seems wrong to me. If you look at rule D117 in the Unicode Spec http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf (I couldn't find the spec for 6.3 -- hopefully 6.2 is close enough), it gives the algorithm for NFC composition. The way I interpret it, this is how the composition proceeds:

Starting with the NFD decomposition string, we retrieve the combining classes for each character from the UnicodeData.txt file:

0061 - 0
05AE - 228
0305 - 230
0300 - 230
0315 - 232
0062 - 0

You start at the first character after the starter (0061, with ccc=0), which is 05AE. There is no primary composition for the sequence 0061 05AE, so you move on.

Looking at 0305, it is not blocked from 0061, so check the primary composition for 0061 0305. There is none for that either, so move on.

Looking at 0300, it is also not blocked from 0061, so check the primary composition for 0061 0300. There is a primary composition for that sequence, 00E0, so replace the starter with that, delete the 0300, and continue. The string looks like this now:

00E0 - 0
05AE - 228
0305 - 230
0315 - 232
0062 - 0

Checking 0315 and 0062, they are not blocked, but there is no composition with 00E0, so the algorithm ends with the result:

00E0 05AE 0305 0315 0062

This disagrees with what it says in the normalization tests file as listed above.

The question is, did I misunderstand the algorithm, or is this perhaps a bug in the data file?

Thanks,
Edwin
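
P.S. In case it helps anyone reproduce this: a minimal sketch of how the hand trace above could be cross-checked against an independent implementation, here Python's standard unicodedata module. This assumes the Unicode data bundled with the interpreter agrees with 6.3.0d1 for these six characters, which I have not verified; the helper names hex_points and combining_classes are just illustrative.

```python
import unicodedata

# Source sequence from the test line quoted above.
SOURCE = "\u0061\u0305\u0315\u0300\u05AE\u0062"


def hex_points(text):
    """Format a string as space-separated 4-digit hex code points."""
    return " ".join(f"{ord(ch):04X}" for ch in text)


def combining_classes(text):
    """Pair each code point (as hex) with its canonical combining class."""
    return [(f"{ord(ch):04X}", unicodedata.combining(ch)) for ch in text]


nfd = unicodedata.normalize("NFD", SOURCE)
nfc = unicodedata.normalize("NFC", SOURCE)

# The bundled data version may differ from 6.3.0d1 (assumption: the
# relevant properties of these characters are the same in both).
print("Unicode data version:", unicodedata.unidata_version)

print("NFD:", hex_points(nfd))
for code_point, ccc in combining_classes(nfd):
    print(f"  {code_point} - {ccc}")

print("NFC:", hex_points(nfc))
```

Comparing the printed NFC line with both the test file and the hand-traced result should show which of the two the library agrees with.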

