Doug Ewell wrote:

> Jim Monty <jim dot monty at yahoo dot com> wrote:
>
> > Is there a utility, preferably open source and written in C, that inspects
> > UTF-16/UTF-16BE/UTF-16LE text and identifies broken surrogate pairs and
> > illegal characters? Ideally, the utility can both report illegal code
> > units and "repair" them by replacing them with U+FFFD.
>
> What's an "illegal" character, for purposes of this exercise? Do you
> mean a noncharacter, or something else?

I mean the sixty-six code points Unicode reserves as noncharacters (e.g., U+FFFE).

http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Noncharacters

But I'm most keenly interested in a utility that detects broken UTF-16 surrogate pairs. By this, I mean a 16-bit code unit in the range from 0xD800 through 0xDBFF not immediately followed by a 16-bit code unit in the range from 0xDC00 through 0xDFFF, or a code unit in the range from 0xDC00 through 0xDFFF not immediately preceded by one in the range from 0xD800 through 0xDBFF.

http://en.wikipedia.org/wiki/UTF-16/UCS-2#Encoding_of_characters_outside_the_BMP

I need to repair broken UTF-16 text that some software (e.g., GNU iconv) and programming languages (e.g., Perl) choke on. See this discussion of the topic on PerlMonks.

http://www.perlmonks.org/?node_id=719833

Jim Monty
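
For concreteness, here is a rough sketch in C of the check and repair described above. It is not an existing utility. The function names (utf16_repair_surrogates, utf16_count_noncharacters) are made up for illustration, and the code assumes the text has already been read into an array of 16-bit code units in the machine's native byte order, so UTF-16BE or UTF-16LE input would need byte-swapping first as appropriate. It replaces each unpaired surrogate with U+FFFD in place and, separately, counts noncharacters without changing them.

```c
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT_CHAR 0xFFFDu

static int is_high_surrogate(uint16_t u) { return u >= 0xD800u && u <= 0xDBFFu; }
static int is_low_surrogate(uint16_t u)  { return u >= 0xDC00u && u <= 0xDFFFu; }

/* Nonzero for the 66 noncharacters: U+FDD0..U+FDEF, plus the last two
 * code points of each of the 17 planes (U+nFFFE and U+nFFFF). */
static int is_noncharacter(uint32_t cp)
{
    return (cp >= 0xFDD0u && cp <= 0xFDEFu) || (cp & 0xFFFEu) == 0xFFFEu;
}

/* Hypothetical helper: replace every unpaired surrogate in buf[0..len)
 * with U+FFFD, in place. Returns the number of code units replaced. */
size_t utf16_repair_surrogates(uint16_t *buf, size_t len)
{
    size_t fixed = 0;
    size_t i = 0;

    while (i < len) {
        if (is_high_surrogate(buf[i])) {
            if (i + 1 < len && is_low_surrogate(buf[i + 1])) {
                i += 2;                     /* well-formed pair, skip both */
                continue;
            }
            buf[i] = REPLACEMENT_CHAR;      /* high surrogate with no low */
            fixed++;
        } else if (is_low_surrogate(buf[i])) {
            buf[i] = REPLACEMENT_CHAR;      /* low surrogate with no high */
            fixed++;
        }
        i++;
    }
    return fixed;
}

/* Hypothetical helper: count noncharacters in buf[0..len). Unpaired
 * surrogates are skipped, so run the repair first if they matter. */
size_t utf16_count_noncharacters(const uint16_t *buf, size_t len)
{
    size_t count = 0;
    size_t i = 0;

    while (i < len) {
        uint32_t cp = buf[i];

        if (is_high_surrogate(buf[i]) && i + 1 < len &&
            is_low_surrogate(buf[i + 1])) {
            /* Combine the pair into a supplementary code point. */
            cp = 0x10000u + (((uint32_t)buf[i] - 0xD800u) << 10)
                          + ((uint32_t)buf[i + 1] - 0xDC00u);
            i += 2;
        } else {
            i++;
        }
        if (is_noncharacter(cp))
            count++;
    }
    return count;
}
```

Replacing one code unit with one code unit keeps the buffer length unchanged, which is why the repair can be done in place. Replacing a supplementary-plane noncharacter (a valid pair) with a single U+FFFD would shorten the text, so this sketch only reports those.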