Quoting Hallvard B Furuseth <[EMAIL PROTECTED]>:

> I need a function which converts Latin Unicode characters to the closest
> equivalent ASCII characters, e.g. "é" -> "e".
>
> Before I reinvent the wheel, does any public domain or GPL code for this
> already exist?
>
> If not,
> for the most part I expect I can make the mapping from the character
> names, e.g. ignore 'WITH ACUTE' in 'LATIN CAPITAL LETTER O WITH ACUTE'
> in <ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt>.
> Punctuation and other non-letters will be worse, but they are less
> important to me anyway.
>
1. Produce the NFD normalisation of the text.
2. Remove all characters with a non-zero combining class.
3. Some non-ASCII characters may remain, particularly those from
   non-Latin scripts. Some of these can be handled nicely, but others
   may require you to raise an exception or output a replacement
   character.

This can be done efficiently with a streaming processor if the source
text is large.

You may want to use NFKD rather than NFD. NFKD would, for example,
convert the trademark symbol to "TM" and superscript 2 to "2". This
would allow you to convert more characters, but the loss of semantics
may be problematic depending on your application.

Specialised handling of some characters is also possible; for instance,
you could convert the trademark sign to "(TM)" to avoid confusion. This
wouldn't be possible with an existing normalisation API alone, though
if the number of characters handled specially is small, you could do it
in a first pass.

-- 
Jon Hanna                   | Toys and books
<http://www.hackcraft.net/> | for hospitals:
                            | <http://santa.boards.ie>
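
For illustration, the three steps above, together with the NFKD option and a first pass for specially handled characters, might look roughly like this in Python. This is only a sketch using the standard `unicodedata` module; the function name `to_ascii`, the `SPECIAL` table, and the choice of "?" as the replacement character are assumptions made for the example, not part of any existing API.

```python
import unicodedata

# Hypothetical table of characters to handle specially in a first pass,
# before normalisation gets a chance to rewrite them.
SPECIAL = {
    "\u2122": "(TM)",  # TRADE MARK SIGN
}


def to_ascii(text, compat=False, replacement="?"):
    """Approximate text in ASCII by decomposing and dropping combining marks."""
    # First pass: characters that need specialised handling.
    for src, dst in SPECIAL.items():
        text = text.replace(src, dst)

    # Step 1: decompose. NFKD additionally folds compatibility characters
    # such as superscript digits, at the cost of some semantics.
    form = "NFKD" if compat else "NFD"
    decomposed = unicodedata.normalize(form, text)

    out = []
    for ch in decomposed:
        # Step 2: drop characters with a non-zero combining class.
        if unicodedata.combining(ch) != 0:
            continue
        # Step 3: whatever is still outside ASCII becomes the replacement
        # character (raising an exception here is the other option).
        out.append(ch if ord(ch) < 128 else replacement)
    return "".join(out)
```

With this sketch, `to_ascii("café")` gives `"cafe"`, and `to_ascii("x²", compat=True)` gives `"x2"`. Letters such as "ø" have no canonical decomposition, so they fall through to step 3 and would need an entry in the special-case table if you want something better than the replacement character.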

