On Fri, 1 Nov 2013 15:37:22 +0000 Jennifer Wong <[email protected]> wrote:
> I would like to ask for advice on removing accents from characters.

Don't do it.

> While the normalization process is straight forward (NFD, remove
> accents), it does not take into account of special cases. For
> example, Danish, "å" should be mapped to "aa", not "a". Likewise, in
> German, "ä" "ö" "ü" should be mapped to "ae", "oe" and "ue"
> respectively, not "a", "e", "u". Are there common practices on how to
> handle these special cases? Thank you.

There are numerous ASCIIfication conventions, generally of limited
extent. For example, while the Romanian telegraphic convention would
turn a squiggle below into a 'z', one ASCIIfication of Sanskrit 'ç'
would use 's' followed by apostrophe and the academically dominant
method, the Harvard-Kyoto convention, would use 'z'.

It may be worth mentioning that combining marks can be of equal rank
with the base characters. Stripping the vowel marks from text in an
Indic script is as acceptable as stripping the vowels from English.

Also, I find it hard to believe that anyone but a Tamil would consider
a consonant-vowel combination in an Indic script a single character. I
was intensely annoyed to find LibreOffice treating a <consonant,
virama, consonant> combination as a single character for editing
purposes; I had to resort to a regular expression search and replace
operation to insert a space after the virama. It could be worse - on
Ubuntu 12.04 gnome-terminal and xterm, typing <THAI CHARACTER BO
BAIMAI, THAI CHARACTER SARA I, rubout> results in no net character
input!

Richard.
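
For concreteness, the difference between the naive "NFD, remove accents"
approach and a convention-specific mapping can be sketched roughly as
follows in Python. This is only an illustration under stated
assumptions: the OVERRIDES table, the language tags and the function
names are hypothetical and far from exhaustive, and which convention is
appropriate depends entirely on the language and the purpose.

```python
import unicodedata

# Hypothetical, deliberately tiny per-language tables (an assumption for
# illustration only; real conventions cover far more characters).
OVERRIDES = {
    "da": {"å": "aa", "Å": "Aa"},
    "de": {"ä": "ae", "ö": "oe", "ü": "ue",
           "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"},
}

def strip_marks(text):
    """Naive approach: decompose to NFD, then drop every combining mark
    (general categories Mn, Mc and Me)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed
                   if not unicodedata.category(ch).startswith("M"))

def asciify(text, lang=None):
    """Apply the language-specific table first, then fall back to the
    naive mark stripping for whatever is left."""
    # Compose first so the precomposed keys in the table can match.
    text = unicodedata.normalize("NFC", text)
    table = OVERRIDES.get(lang, {})
    text = "".join(table.get(ch, ch) for ch in text)
    return strip_marks(text)

print(strip_marks("Århus"))       # naive result, without the Danish rule
print(asciify("Århus", "da"))     # with the Danish table
print(asciify("Müller", "de"))    # with the German table
print(strip_marks("\u0915\u093f"))  # Devanagari KA + VOWEL SIGN I: the vowel is lost
```

The last call shows why stripping marks is not harmless outside the
Latin script: the Devanagari vowel sign is a combining mark, so the
naive filter deletes the vowel itself rather than a decoration.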

