On Mon, 4 Nov 2013 19:00:17 +0000 Jennifer Wong <[email protected]> wrote:
> Thank you everyone for your input.
>
> The use case is that customers want to integrate data from our
> enterprise solution to their ASCII-based downstream systems. Thus all
> accents need to be removed.

Have you confirmed that they are using ASCII rather than say, Latin-1?
Some people call Latin-1 ASCII!

> Ilay's "Transliteration on Passport" doc is very useful. We can use
> that as a basis to map special transliteration cases before
> normalizing and removing accents.

Have you checked how they are currently handling accents? Do you need
to be even more brutal in places and strip out apostrophes? An
O'Sullivan at my place of work had to accept the mangling of his
surname to Osullivan!

How are you constraining the input repertoire? Stripping diacritics
won't deal with U+0131 LATIN SMALL LETTER DOTLESS I, and would make a
mess of the usually incorrect <U+0131, U+0307 COMBINING DOT ABOVE>.

Richard.
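
To illustrate the point about U+0131, here is a minimal sketch in
Python. It assumes "stripping diacritics" means normalizing to NFD and
dropping nonspacing marks (general category Mn); the thread does not
say how the stripping is actually implemented, and the function names
strip_diacritics and is_ascii are hypothetical.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Assumed approach: decompose to NFD, then drop nonspacing marks (Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def is_ascii(text: str) -> bool:
    """True only if every code point is in the ASCII range U+0000..U+007F."""
    return all(ord(ch) < 128 for ch in text)


if __name__ == "__main__":
    samples = [
        "caf\u00e9",       # e with acute: decomposes, so the accent is stripped
        "\u0131",          # dotless i: has no decomposition, so it survives
        "\u0131\u0307",    # dotless i + combining dot above: the dot is dropped
    ]
    for sample in samples:
        result = strip_diacritics(sample)
        codes = " ".join(f"U+{ord(ch):04X}" for ch in result)
        print(f"{sample!a} -> {codes} (ASCII: {is_ascii(result)})")
```

Under these assumptions the accented e reduces to plain "e", but both
dotless-i inputs still come out as U+0131, so a final ASCII check (or
an explicit mapping table applied before stripping) is still needed.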

