Quoting Hallvard B Furuseth <[EMAIL PROTECTED]>:

> I need a function which converts Latin Unicode characters to the closest
> equivalent ASCII characters, e.g. "é" -> "e".
> 
> Before I reinvent the wheel, does any public domain or GPL code for this
> already exist?
> 
> If not,
> for the most part I expect I can make the mapping from the character
> names, e.g. ignore 'WITH ACUTE' in 'LATIN CAPITAL LETTER O WITH ACUTE'
> in <ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt>.
> Punctuation and other non-letters will be worse, but they are less
> important to me anyway.
> 

1. Produce the NFD normalisation of the text.
2. Remove all characters with a non-zero combining class.
3. Some non-ASCII characters may remain, particularly those from non-Latin 
scripts. Some of these can be handled nicely, but others may require you to 
raise an exception or output a replacement character.
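
As a rough illustration, the three steps could be sketched in Python with the 
standard unicodedata module. The function name to_ascii and its replacement 
parameter are made up for this example, and the choice between substituting a 
character and raising an exception is only one possible policy for step 3.

```python
import unicodedata

def to_ascii(text, replacement="?"):
    """Approximate text with ASCII by dropping combining marks.

    to_ascii and replacement are illustrative names. Pass
    replacement=None to raise an exception instead of substituting.
    """
    # Step 1: canonical decomposition, e.g. U+00E9 -> "e" + U+0301.
    decomposed = unicodedata.normalize("NFD", text)
    out = []
    for ch in decomposed:
        # Step 2: drop characters with a non-zero combining class.
        if unicodedata.combining(ch) != 0:
            continue
        # Step 3: deal with whatever is still outside ASCII.
        if ord(ch) > 127:
            if replacement is None:
                raise ValueError("no ASCII equivalent for %r" % ch)
            ch = replacement
        out.append(ch)
    return "".join(out)
```

With this sketch, to_ascii("\u00e9") gives "e". A letter such as U+00F8 (o with 
stroke) has no canonical decomposition, so it falls through to step 3 and 
would need its own mapping if a plain "o" is wanted.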

This can be done efficiently with a streaming processor if the size of the 
source text is large.
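
A minimal sketch of the streaming variant, assuming the input and output are 
text-mode file objects (so each chunk is already decoded into whole 
characters). The name strip_marks_stream and the chunk size are arbitrary. 
Processing chunk by chunk is safe here because decomposition works on one 
character at a time, and the reordering that NFD performs only moves combining 
marks, all of which are discarded anyway.

```python
import unicodedata

def strip_marks_stream(reader, writer, chunk_size=64 * 1024):
    """Copy reader to writer, NFD-normalising and dropping combining marks.

    reader and writer are assumed to be text-mode file-like objects.
    """
    while True:
        chunk = reader.read(chunk_size)
        if not chunk:
            break
        decomposed = unicodedata.normalize("NFD", chunk)
        # Keep only characters whose combining class is zero.
        writer.write("".join(ch for ch in decomposed
                             if unicodedata.combining(ch) == 0))
```

For a real file this might be called with two objects returned by open(), 
each opened with an explicit encoding. Any characters that are still 
non-ASCII after this pass would need the same handling as in step 3.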

You may want to use NFKD rather than NFD. NFKD would, for example, convert the 
trademark symbol to "TM" and superscript 2 to "2". This would allow you to 
convert more characters, but the loss of semantics may be problematic depending 
on your application. Specialised handling of some characters is possible; for 
instance, you could convert the trademark sign to "(TM)" to avoid confusion. Of 
course, this wouldn't be possible with an existing normalisation API alone, 
though if the number of characters handled specially is small it would be 
possible to do that in a first pass.
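
The first-pass idea might look something like the following in Python. The 
SPECIAL table and the name to_ascii_compat are invented for the example, and 
the table holds only the one trademark mapping mentioned above; a real 
application would decide for itself which characters deserve special 
treatment.

```python
import unicodedata

# Hypothetical table of characters handled before normalisation.
SPECIAL = {"\u2122": "(TM)"}  # TRADE MARK SIGN

def to_ascii_compat(text, special=SPECIAL):
    """Apply special-case mappings, then NFKD, then drop combining marks."""
    # First pass: replace the specially handled characters.
    first_pass = "".join(special.get(ch, ch) for ch in text)
    # Second pass: compatibility decomposition, e.g. U+00B2 -> "2".
    decomposed = unicodedata.normalize("NFKD", first_pass)
    return "".join(ch for ch in decomposed
                   if unicodedata.combining(ch) == 0)
```

With the table above, "Acme\u2122 x\u00b2" becomes "Acme(TM) x2". With an empty 
table the trademark sign is left to NFKD and comes out as a bare "TM".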

--
Jon Hanna                   | Toys and books
<http://www.hackcraft.net/> | for hospitals:
                            | <http://santa.boards.ie>
