It is not a 'fatal flaw'. NFD makes no pretensions to represent the most 'natural' ordering for any given language. Out of all the possible canonically equivalent sequences, it is simply a specific, well-defined, unique representation that is fully decomposed.
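To make "unique representation" concrete, here is a minimal sketch in Python using the standard unicodedata module. It takes five canonically equivalent spellings of U+1EAC (the letter discussed in the quoted message) and checks that NFD maps all of them to one sequence. The names `nfd`, `code_points` and `spellings` are made up for this illustration.

```python
import unicodedata


def nfd(text):
    """Return the fully decomposed, canonically ordered form of text."""
    return unicodedata.normalize("NFD", text)


def code_points(text):
    """Render a string as U+XXXX code points for display."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# Five canonically equivalent spellings of
# LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND DOT BELOW.
spellings = {
    "fully precomposed": "\u1EAC",
    "A-circumflex + dot below": "\u00C2\u0323",
    "A-dot-below + circumflex": "\u1EA0\u0302",
    "A, circumflex, dot below": "A\u0302\u0323",
    "A, dot below, circumflex": "A\u0323\u0302",
}

for label, text in spellings.items():
    print(f"{label:26} {code_points(text):22} -> {code_points(nfd(text))}")

# All five inputs collapse to a single NFD sequence.
assert len({nfd(text) for text in spellings.values()}) == 1
```

Every row ends in U+0041 U+0323 U+0302. That sequence is not claimed to be the order a Vietnamese writer would choose; it is only the one agreed-upon representative of the equivalence class.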
The issue of canonical equivalence itself is that the circumflex and dot-below can come in any order and have precisely the same appearance, *and* that we could not predict the 'natural' order for any given language.

Mark
—————
Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο πάντα — Ὁμήρου Μαργίτῃ
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]
http://www.macchiato.com

----- Original Message -----
From: <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Tuesday, January 29, 2002 22:51
Subject: Re: Unicode Search Engines

> In a message dated 2002-01-28 7:37:48 Pacific Standard Time,
> [EMAIL PROTECTED] writes:
>
> > I would like to add:
> > How do they handle normalization?
> > In Vietnam, many characters can be represented in several different ways:
> > (1) fully precomposed (NFC)
> > (2) base character and modifier precomposed, tonal mark combining
> > (3) base character, then modifier, then tonal mark
> > (4) like (3), but modifier and tonal mark sorted (NFD)
> > Do the search engines do any normalization, before indexing a page?
> > Are queries normalized before running the search?
>
> I'm not sure what sort of normalization might be performed by search engines,
> but I want to examine the Vietnamese decomposition aspect for a moment.
>
> If you have a Vietnamese vowel with both modifier and tone mark, say LATIN
> CAPITAL LETTER A WITH CIRCUMFLEX AND ACUTE, then you can represent this in
> Unicode in at least three ways:
>
> (1) fully precomposed (NFC) -- that is, U+1EA4
> (2) base character and modifier precomposed, tonal mark combining -- that is,
> U+00C2 U+0301
> (3) base character, then modifier, then tonal mark -- that is, U+0041 U+0302
> U+0301
>
> So far, so good. But then we have:
>
> > (4) like (3), but modifier and tonal mark sorted (NFD)
>
> If "sorting" the diacritical marks in NFD results in rearranging the two
> diacritical marks -- in this case, U+0041 U+0301 U+0302 -- then in terms of
> Vietnamese orthography, the NFD form may not really be a legitimate way of
> representing the Vietnamese letter.
>
> For example, U+1EAC LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND DOT BELOW is,
> in Vietnamese, a circumflexed A to which a tone mark (dot below) has been
> added. It is not a dotted-below A to which a circumflex has been added. Yet
> because of the canonical combining classes of the two diacriticals (230 for
> COMBINING CIRCUMFLEX ACCENT, 220 for COMBINING DOT BELOW), the latter is how
> the character will be decomposed.
>
> In theory, there is actually a case 5: base character and tonal mark
> precomposed, modifier combining. In terms of Vietnamese orthography, this is
> just as illegitimate as case 4 (NFD), but most software that processes
> Vietnamese text will probably never encounter it. But it will have to handle
> the NFD case.
>
> If I were on some other mailing lists I could think of, I would claim that
> this is a fatal flaw in the design of Unicode Normalization Form D. It's
> not, but it is a sticky problem that needs to be dealt with when dealing with
> Vietnamese text.
>
> -Doug Ewell
> Fullerton, California
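
As a rough illustration of the two points raised in the quoted message (the combining classes, and whether pages and queries get normalized before matching), here is a small Python sketch using the standard unicodedata module. The helpers `search_key` and `find` are hypothetical and do not describe how any actual search engine behaves; plain substring matching on NFD text is only an assumption made to keep the example short.

```python
import unicodedata

CIRCUMFLEX = "\u0302"  # COMBINING CIRCUMFLEX ACCENT
DOT_BELOW = "\u0323"   # COMBINING DOT BELOW
ACUTE = "\u0301"       # COMBINING ACUTE ACCENT

# Canonical combining classes drive the NFD ordering of marks:
# lower classes sort first, and marks of equal class keep their order.
for mark in (CIRCUMFLEX, DOT_BELOW, ACUTE):
    print(unicodedata.name(mark), unicodedata.combining(mark))


def search_key(text):
    """Hypothetical helper: map text to one canonical form (NFD here)."""
    return unicodedata.normalize("NFD", text)


def find(query, page):
    """Naive match: normalize both sides, then compare as substrings."""
    return search_key(query) in search_key(page)


# The page uses precomposed U+1EC7; the query is typed as base letter,
# circumflex, dot below. Without normalization these would not match.
page = "Vi\u1EC7t Nam"
query = "Vie\u0302\u0323t"
print(find(query, page))

# Circumflex and acute share class 230, so NFD leaves their order alone:
# A + acute + circumflex is NOT canonically equivalent to U+1EA4.
print(search_key("A\u0301\u0302") == search_key("\u1EA4"))
```

The dot below (class 220) sorts ahead of the circumflex (class 230), which is why the decomposition of U+1EAC puts the tone mark first. The circumflex and acute are both class 230, so the sequence U+0041 U+0301 U+0302 is a different string from U+1EA4, not a reordered form of it. Normalizing both the indexed text and the query to the same form, whichever form is chosen, is what makes the equivalent spellings match.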

