Yes, thanks.

Mark
—————
Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο πάντα — Ὁμήρου Μαργίτῃ
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]

http://www.macchiato.com

----- Original Message -----
From: <[EMAIL PROTECTED]>
To: "Mark Davis" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Wednesday, January 30, 2002 07:48
Subject: Re: Unicode Search Engines

> On 30/01/2002 15:30:06 Mark Davis wrote:
> > It is not a 'fatal flaw'. NFD makes to pretensions to represent the
>
> I imagine that "to" -> "no".
>
> Misha
>
> > most 'natural' ordering for any given language. Out of all the
> > possible canonically equivalent sequences, it is simply a specific,
> > well-defined, unique representation that is fully decomposed.
> >
> > The issue of canonical equivalence itself is that the circumflex
> > and dot-below can come in any order and have precisely the same
> > appearance, *and* that we could not predict the 'natural' order for
> > any given language.
> >
> > Mark
> >
> > ----- Original Message -----
> > From: <[EMAIL PROTECTED]>
> > To: <[EMAIL PROTECTED]>
> > Cc: <[EMAIL PROTECTED]>
> > Sent: Tuesday, January 29, 2002 22:51
> > Subject: Re: Unicode Search Engines
> >
> > > In a message dated 2002-01-28 7:37:48 Pacific Standard Time,
> > > [EMAIL PROTECTED] writes:
> > >
> > > > I would like to add:
> > > > How do they handle normalization?
> > > > In Vietnam, many characters can be represented in several
> > > > different ways:
> > > > (1) fully precomposed (NFC)
> > > > (2) base character and modifier precomposed, tonal mark combining
> > > > (3) base character, then modifier, then tonal mark
> > > > (4) like (3), but modifier and tonal mark sorted (NFD)
> > > > Do the search engines do any normalization, before indexing a page?
> > > > Are queries normalized before running the search?
> > >
> > > I'm not sure what sort of normalization might be performed by search
> > > engines, but I want to examine the Vietnamese decomposition aspect
> > > for a moment.
> > >
> > > If you have a Vietnamese vowel with both modifier and tone mark, say
> > > LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND ACUTE, then you can
> > > represent this in Unicode in at least three ways:
> > >
> > > (1) fully precomposed (NFC) -- that is, U+1EA4
> > > (2) base character and modifier precomposed, tonal mark combining --
> > >     that is, U+00C2 U+0301
> > > (3) base character, then modifier, then tonal mark -- that is,
> > >     U+0041 U+0302 U+0301
> > >
> > > So far, so good. But then we have:
> > >
> > > > (4) like (3), but modifier and tonal mark sorted (NFD)
> > >
> > > If "sorting" the diacritical marks in NFD results in rearranging the
> > > two diacritical marks -- in this case, U+0041 U+0301 U+0302 -- then
> > > in terms of Vietnamese orthography, the NFD form may not really be a
> > > legitimate way of representing the Vietnamese letter.
> > >
> > > For example, U+1EAC LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND DOT
> > > BELOW is, in Vietnamese, a circumflexed A to which a tone mark (dot
> > > below) has been added. It is not a dotted-below A to which a
> > > circumflex has been added. Yet because of the canonical combining
> > > classes of the two diacriticals (230 for COMBINING CIRCUMFLEX ACCENT,
> > > 220 for COMBINING DOT BELOW), the latter is how the character will
> > > be decomposed.
> > >
> > > In theory, there is actually a case 5: base character and tonal mark
> > > precomposed, modifier combining. In terms of Vietnamese orthography,
> > > this is just as illegitimate as case 4 (NFD), but most software that
> > > processes Vietnamese text will probably never encounter it. But it
> > > will have to handle the NFD case.
> > >
> > > If I were on some other mailing lists I could think of, I would
> > > claim that this is a fatal flaw in the design of Unicode
> > > Normalization Form D. It's not, but it is a sticky problem that
> > > needs to be dealt with when dealing with Vietnamese text.
> > >
> > > -Doug Ewell
> > > Fullerton, California
>
> ----------------------------------------------------------------
> Visit our Internet site at http://www.reuters.com
>
> Any views expressed in this message are those of the individual
> sender, except where the sender specifically states them to be
> the views of Reuters Ltd.
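
For anyone who wants to check the cases discussed in this thread, here is a minimal sketch using Python's standard unicodedata module. It lists the five spellings of U+1EAC (cases 1 to 5 above), prints the NFD form and the combining classes, and shows one way a search engine might fold index terms and queries to a common form. The names SPELLINGS, codepoints and search_key are made up for illustration; nothing here describes what any actual search engine does.

```python
import unicodedata

# Five canonically equivalent spellings of U+1EAC
# LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND DOT BELOW (cases 1-5 in the thread).
SPELLINGS = {
    "1 fully precomposed (NFC)":         "\u1EAC",
    "2 base+modifier precomposed, tone": "\u00C2\u0323",
    "3 base, modifier, tone":            "A\u0302\u0323",
    "4 base, tone, modifier (NFD)":      "A\u0323\u0302",
    "5 base+tone precomposed, modifier": "\u1EA0\u0302",
}


def codepoints(text):
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


def search_key(text):
    """Hypothetical helper: fold text to NFC so that an index term and a
    query compare equal whichever spelling was typed. NFD would work too;
    the only requirement is to use the same form on both sides."""
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    for label, spelling in SPELLINGS.items():
        nfd = unicodedata.normalize("NFD", spelling)
        print(f"{label:36} {codepoints(spelling):22} NFD: {codepoints(nfd)}")

    # Canonical combining classes decide the order of marks in NFD.
    for mark in ("\u0302", "\u0323", "\u0301"):
        print(unicodedata.name(mark), unicodedata.combining(mark))

    # All five spellings collapse to a single search key.
    print(len({search_key(s) for s in SPELLINGS.values()}))
```

Because COMBINING DOT BELOW has class 220 and COMBINING CIRCUMFLEX ACCENT has class 230, every spelling decomposes to U+0041 U+0323 U+0302. For U+1EA4 the circumflex and acute both have class 230, so NFD keeps them in the order U+0041 U+0302 U+0301 and does not rearrange them.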

