----- Original Message -----
From: "Rick McGowan" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Wednesday, August 13, 2003 10:15 PM
Subject: Convert UTF code update

> Following on a recent bug report, and to fix problems with the last public
> release, I have recently updated the "Convert UTF" sample code on the
> Unicode web site. You can find the latest "alpha" code here:
>
> http://www.unicode.org/Public/ALPHA/CVTUTF-1-1/
>
> There are some changes in "ConvertUTF.c" to better catch illegal
> sequences, and a one-line change in "harness.c" to fix a buffer problem
> what was independently reported by a few people.
>
> If you're a developer and you have a chance to look at this code and try
> the harness, I would appreciate any error reports.

I just noted the following fragment in ConvertUTF16toUTF8():

    /* Figure out how many bytes the result will require */
    if (ch < (UTF32)0x80) {            bytesToWrite = 1;
    } else if (ch < (UTF32)0x800) {    bytesToWrite = 2;
    } else if (ch < (UTF32)0x10000) {  bytesToWrite = 3;
    } else if (ch < (UTF32)0x200000) { bytesToWrite = 4;
    } else {                           bytesToWrite = 2;
                                       ch = UNI_REPLACEMENT_CHAR;
    }

Shouldn't the line:

    } else if (ch < (UTF32)0x200000) { bytesToWrite = 4;

say instead:

    } else if (ch < (UTF32)0x110000) { bytesToWrite = 4;

so that it will produce legal UTF-8 (according to the isLegalUTF8 function), by not encoding beyond the first 17 planes of UCS-4 (i.e. the only currently legal UTF-32 codespace)?

For now the C fragment allows the first 32 planes of UCS-4 to be encoded in the legacy UTF-8 scheme (old RFC version), which goes beyond what UTF-16 can currently represent. As long as there is no way in UTF-16 to go beyond the first 17 planes of UCS-4, the extra planes should not be encodable there using the old UTF-8 rules.
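
To make the proposed bound concrete, here is a minimal standalone sketch of the length computation with the 0x110000 limit. The helper name utf8_length and its return-0-on-error convention are my own assumptions for illustration; they are not part of ConvertUTF.c, and the sketch does not deal with surrogate code points, which the real function handles before reaching this step.

```c
#include <stdint.h>

/* Hypothetical helper (not from ConvertUTF.c): number of UTF-8 bytes
 * needed to encode the code point ch, or 0 if ch lies beyond the 17
 * planes U+0000..U+10FFFF and so must not be encoded at all.
 * Surrogates (U+D800..U+DFFF) are deliberately not checked here. */
static int utf8_length(uint32_t ch)
{
    if (ch < 0x80u)     return 1;   /* U+0000..U+007F    */
    if (ch < 0x800u)    return 2;   /* U+0080..U+07FF    */
    if (ch < 0x10000u)  return 3;   /* U+0800..U+FFFF    */
    if (ch < 0x110000u) return 4;   /* U+10000..U+10FFFF */
    return 0;                       /* outside the legal codespace */
}
```

With this bound, utf8_length(0x10FFFF) gives 4 while utf8_length(0x110000) is rejected, whereas the 0x200000 test in the current fragment would still assign 4 bytes to the latter.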

