UTF-8s is reminiscent of a problem that I had installing a certain vendor's terminals. Each screen was about 2K of data. The terminal communications protocol broke the data into 128-byte blocks. Each block had a small header, and the terminal would wait for a response before the next block was sent. My client was trying to connect using an X.25 link over a satellite link. Each 128-byte block required two X.25 packets, and the response required another packet. X.25 also has its own pacing and response transmissions. Satellite links take about 1.4 seconds to get from one location to another. The actual calculations are complex, but as you can see the round trip for 1/16th of the data is about 3 seconds. The result was unusable.

One obvious programming problem is UTF-8s to UTF-32 conversion. But others are more subtle. The major problem is that, like the example above, UTF-8s encodes part of a character (one surrogate) as a character. You end up encoding a single character as two UTF-8 characters.

I am currently working on code that supports UTF-8, and I am implementing a function library for it. Take the example of xiu8_strtok. It in turn calls xiu8_strspn and xiu8_strpbrk. Each of these routines scans for delimiters using a set of delimiters in the form of a UTF-8 character string. If each surrogate is encoded separately, the scan will find a match on any character that shares either surrogate with one of the delimiters.

Now what happens? Suppose that when scanning for the start of the first token we skip delimiters and we get a match on the high surrogate of a pair but not the low surrogate. This means that we start our first token string in the middle of our first character. If the ending delimiter match is a low surrogate, we will replace it with a null and terminate the token string with another half character.
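To make the delimiter problem concrete, here is a minimal sketch in C. It is not the xiu8 library code; encode_utf8, encode_utf8s, seq_len, find_delim and false_match_offset are hypothetical names made up for this illustration, and the scanner assumes well-formed input. It encodes the text "a" followed by U+10001 and a delimiter set containing only U+10000, then scans the text one encoded sequence at a time, the way a strpbrk-style routine might.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Standard UTF-8 encoding of one code point; returns bytes written.
   (Hypothetical helper; does not reject surrogates, so that the
   UTF-8s encoder below can reuse it.) */
static size_t encode_utf8(unsigned long cp, unsigned char *out)
{
    assert(cp <= 0x10FFFFUL);
    if (cp < 0x80) { out[0] = (unsigned char)cp; return 1; }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000UL) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* UTF-8s style: a supplementary character becomes a surrogate pair,
   and each surrogate is written as its own 3-byte sequence. */
static size_t encode_utf8s(unsigned long cp, unsigned char *out)
{
    unsigned long v;
    size_t n;
    if (cp < 0x10000UL) return encode_utf8(cp, out);
    v = cp - 0x10000UL;
    n = encode_utf8(0xD800UL + (v >> 10), out);
    return n + encode_utf8(0xDC00UL + (v & 0x3FF), out + n);
}

/* Length of a sequence from its lead byte (assumes well-formed input). */
static size_t seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    return 4;
}

/* Offset of the first sequence in s that equals some sequence in
   delims, or -1 if there is none. */
static long find_delim(const unsigned char *s, size_t slen,
                       const unsigned char *delims, size_t dlen)
{
    size_t i = 0;
    while (i < slen) {
        size_t n = seq_len(s[i]);
        size_t j = 0;
        if (i + n > slen) break;          /* truncated input: stop */
        while (j < dlen) {
            size_t m = seq_len(delims[j]);
            if (m == n && memcmp(s + i, delims + j, n) == 0)
                return (long)i;
            j += m;
        }
        i += n;
    }
    return -1;
}

/* Text is "a" + U+10001, delimiter set is U+10000 only.
   The two characters differ, so a correct scan finds no delimiter. */
long false_match_offset(int use_utf8s)
{
    unsigned char text[16], delims[16];
    size_t tlen = 0, dlen;
    size_t (*enc)(unsigned long, unsigned char *) =
        use_utf8s ? encode_utf8s : encode_utf8;

    tlen += enc('a', text + tlen);
    tlen += enc(0x10001UL, text + tlen);
    dlen = enc(0x10000UL, delims);
    return find_delim(text, tlen, delims, dlen);
}
```

With standard UTF-8, false_match_offset(0) returns -1: U+10001 (F0 90 80 81) is not the delimiter U+10000 (F0 90 80 80). With the UTF-8s style encoding, false_match_offset(1) returns 1: both characters start with the high surrogate D800 (ED A0 80), so the scan reports a delimiter in the middle of a character that is not in the set at all.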
If I am chunking UTF-8 data into a buffer to convert it to UTF-16 and then translate it to a charset, I can break the UTF-8s data in the middle of a character, and even if my UTF-8s to UTF-16 converter works, I will have broken UTF-16 data. Functions like xiu8_CharNext, xiu8_CharCnt, xiu8_CharLen, etc. do not work.

I could go on, but further examples are redundant. You end up breaking so much code just for a sorting sequence, when comparing UTF-16 in Unicode code point sequence is so easy to write.

Carl
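As an illustration of how little code a UTF-16 comparison in code point order needs, here is one commonly used approach, sketched in C. The names fixup and utf16_cmp_cp_order are hypothetical, and the sketch assumes both strings are well-formed UTF-16 (no unpaired surrogates). The idea is to remap code units at the first difference so that surrogates sort above U+E000..U+FFFF, which is where the supplementary characters belong.

```c
#include <stddef.h>
#include <stdint.h>

/* Remap a code unit so that surrogates (D800..DFFF) sort after
   E000..FFFF; units below D800 are unchanged. */
static uint16_t fixup(uint16_t u)
{
    if (u >= 0xE000) return (uint16_t)(u - 0x0800); /* -> D800..F7FF */
    if (u >= 0xD800) return (uint16_t)(u + 0x2000); /* -> F800..FFFF */
    return u;
}

/* Compare two UTF-16 strings in code point order.
   Returns -1, 0 or 1. Assumes well-formed UTF-16. */
int utf16_cmp_cp_order(const uint16_t *a, size_t alen,
                       const uint16_t *b, size_t blen)
{
    size_t n = alen < blen ? alen : blen;
    size_t i;

    for (i = 0; i < n; i++) {
        if (a[i] != b[i]) {
            uint16_t x = fixup(a[i]);
            uint16_t y = fixup(b[i]);
            return x < y ? -1 : 1;
        }
    }
    if (alen == blen) return 0;
    return alen < blen ? -1 : 1;   /* a proper prefix sorts first */
}
```

For example, U+FFFF (the single unit FFFF) compares less than U+10000 (the pair D800 DC00) with this function, whereas a plain code unit comparison would put the surrogate pair first.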

