In a message dated 2001-04-16 9:19:36 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

>  Is there an existing set of recommendations for dealing with this
>  problem (multiple legal compositions) in search and search-like
>  applications?  Specifically, if there are multiple legal ways to represent a
>  character, how should the character be stored, should search text be
>  preprocessed, etc.?  Pointers, anyone?

The UTF-8 Corrigendum that went into effect with (or shortly before?) Unicode 
3.0.1 clarified that only one UTF-8 sequence -- the shortest one -- is 
acceptable for any given Unicode character.  This is now part of Unicode 3.1, 
so check Unicode Standard Annex #27 at 
http://www.unicode.org/unicode/reports/tr27/ .
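
As a rough illustration of the shortest-form rule, here is a small Python 
sketch (the language and the helper name decodes_as_utf8 are my own choices 
for the example, not anything from the annex).  It assumes Python 3's 
strict "utf-8" codec, which rejects overlong sequences such as C0 AF, a 
two-byte padding of U+002F.

```python
def decodes_as_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


shortest = b"\x2f"      # U+002F SOLIDUS in its one-byte (shortest) form
overlong = b"\xc0\xaf"  # the same code point padded out to two bytes

print(decodes_as_utf8(shortest))  # accepted
print(decodes_as_utf8(overlong))  # rejected: not the shortest form
```

A search application that validates its input this way should never see 
two different byte sequences for the same code point.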

Otherwise, this sounds like it falls into the domain of normalization forms, 
and for that you can check Unicode Standard Annex #15 at 
http://www.unicode.org/unicode/reports/tr15/ .
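
To make the normalization idea concrete, here is a minimal Python sketch 
using the standard unicodedata module.  The helper name 
normalize_for_search and the choice of NFC as the default are assumptions 
for the example only; the point is simply that stored text and query text 
go through the same normalization form before they are compared.

```python
import unicodedata


def normalize_for_search(text: str, form: str = "NFC") -> str:
    """Normalize text so canonically equivalent strings compare equal."""
    return unicodedata.normalize(form, text)


precomposed = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT

# The raw strings differ even though they represent the same character.
print(precomposed == decomposed)

# After normalizing both sides to the same form, they match.
print(normalize_for_search(precomposed) == normalize_for_search(decomposed))
```

Whether to normalize at storage time, at query time, or both is a design 
decision; the sketch only shows that both sides need the same form.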

-Doug Ewell
 Fullerton, California
