Thank you everyone for your input.

The use case is that customers want to integrate data from our enterprise 
solution into their ASCII-based downstream systems, so all accents need to be 
removed.

Ilay's "Transliteration on Passport" doc is very useful. We can use that as a 
basis to map special transliteration cases before normalizing and removing 
accents.
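
A minimal sketch of that order of operations, assuming Python's standard 
unicodedata module: apply language-specific replacements first, then 
decompose with NFD and drop the combining marks. The SPECIAL_CASES table and 
the to_ascii name are hypothetical and only cover the examples from this 
thread (the capitalized forms are my assumption); a real table would be 
built from the passport transliteration doc.

```python
import unicodedata

# Hypothetical override table, limited to the cases raised in this thread.
# The capitalized mappings ("Å" -> "Aa", etc.) are an assumption.
SPECIAL_CASES = {
    "da": {"å": "aa", "Å": "Aa"},
    "de": {"ä": "ae", "ö": "oe", "ü": "ue",
           "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"},
}


def to_ascii(text, lang=None):
    """Apply language-specific mappings, then strip remaining accents."""
    # Compose first so that "u" + U+0308 matches the precomposed key "ü".
    text = unicodedata.normalize("NFC", text)
    table = SPECIAL_CASES.get(lang, {})
    text = "".join(table.get(ch, ch) for ch in text)
    # Decompose and drop combining marks (the generic accent removal).
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

With this sketch, to_ascii("Müller", "de") gives "Mueller", while 
to_ascii("Müller") with no language falls back to "Muller". Letters that have 
no canonical decomposition (for example "ø" or "ß") pass through unchanged, 
so they would also need explicit table entries.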

Jennifer

From: Markus Scherer <[email protected]>
Date: Monday, November 4, 2013 11:54 AM
To: Jennifer Wong <[email protected]>
Cc: "[email protected]" <[email protected]>
Subject: Re: How to remove accents while conforming to language standards?

Hi Jennifer,

On Fri, Nov 1, 2013 at 8:37 AM, Jennifer Wong <[email protected]> wrote:
I would like to ask for advice on removing accents from characters. While the 
normalization process is straightforward (NFD, remove accents), it does not 
take special cases into account. For example, in Danish, "å" should be mapped 
to "aa", not "a". Likewise, in German, "ä", "ö", "ü" should be mapped to "ae", 
"oe" and "ue" respectively, not "a", "o", "u". Are there common practices on 
how to handle these special cases? Thank you.

Can you describe what your use case is?

One possible area that appears not to have been discussed yet is sorting of 
strings and full-text search (as in ctrl-F in a browser or word processor). If 
you are after those, then please look for "unicode collation" and "cldr 
collation". The ICU libraries<http://userguide.icu-project.org/collation> might 
also help.

Best regards,
markus
