Re: Utility to report and repair broken surrogate pairs in UTF-16 text

Jim Monty Wed, 03 Nov 2010 21:56:14 -0700

Björn Höhrmann wrote:
> The simple solution to that is a small state machine that you
> put each byte through...


Thank you very much for your suggestions, Björn.

From your reply as well as from your Web page titled "Flexible and
Economical UTF-8 Decoder" <http://bjoern.hoehrmann.de/utf-8/decoder/dfa/>,
it's obvious you're exactly the right C programmer to write just the
utility I'm looking for: a reporting and repair utility for corrupted
UTF-16 text. The purpose of the utility would be to fix UTF-16 text that
is mostly viable but nonetheless broken due to one or more noncharacters
or invalid surrogate-pair code units. The rationale for such a utility is
to make UTF-16 text that iconv, Perl and other software choke on viable
and usable.
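
A minimal sketch of the behavior described above, in Python rather than C,
may make the request more concrete. It is an illustration only, not an
existing tool. The names repair_utf16 and is_noncharacter are hypothetical,
and two things are assumptions rather than anything settled in this
thread: that the input has a known byte order, and that every offending
code unit should be replaced with U+FFFD.

```python
import struct

REPLACEMENT = 0xFFFD  # assumption: substitute U+FFFD for anything broken


def is_noncharacter(cp):
    # U+FDD0..U+FDEF, plus the last two code points of every plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def repair_utf16(data, byteorder="little"):
    """Return (repaired_bytes, problems) for UTF-16 bytes of known byte order.

    problems is a list of (code_unit_index, description) tuples.
    """
    fmt = "<" if byteorder == "little" else ">"
    n = len(data) // 2
    units = struct.unpack(f"{fmt}{n}H", data[: 2 * n])
    out, problems = [], []
    i = 0
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: valid only if a low surrogate follows.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                lo = units[i + 1]
                cp = 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00)
                if is_noncharacter(cp):
                    problems.append((i, "noncharacter U+%04X" % cp))
                    out.append(REPLACEMENT)
                else:
                    out.extend((u, lo))
                i += 2
                continue
            problems.append((i, "unpaired high surrogate 0x%04X" % u))
            out.append(REPLACEMENT)
        elif 0xDC00 <= u <= 0xDFFF:
            # Low surrogate with no preceding high surrogate.
            problems.append((i, "unpaired low surrogate 0x%04X" % u))
            out.append(REPLACEMENT)
        elif is_noncharacter(u):
            problems.append((i, "noncharacter U+%04X" % u))
            out.append(REPLACEMENT)
        else:
            out.append(u)
        i += 1
    if len(data) % 2:
        # An odd byte count means the last code unit was cut off.
        problems.append((n, "truncated trailing byte"))
        out.append(REPLACEMENT)
    return struct.pack(f"{fmt}{len(out)}H", *out), problems
```

Calling repair_utf16 on the raw bytes returns the repaired bytes together
with a list of what was replaced and at which code-unit index, which
covers both the reporting and the repair halves of the request.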

Unfortunately, I'm not a good enough programmer to write such a utility in C or 
even Perl, the language I know best. Is this a project that interests you, by 
chance?

I'm surprised I'm having difficulty finding an existing utility to repair
broken UTF-16 text. I thought this was something many programmers would
need, especially Web developers.

Thank you again for your thoughtful reply.

Jim Monty
