ICU has C header files with macros for code point handling in UTF-8/16 strings. See the utf8.h and utf16.h headers (together with utf.h) in ICU's source tree at source/common/unicode/.
http://oss.software.ibm.com/icu/download/ http://oss.software.ibm.com/cvs/icu/icu/source/common/unicode/
There is also a utf32.h header, but that is empty now. I redesigned the set of macros last year to simplify and improve them a bit.
Specifically, see below.
(Note that the UTF-8 macros [except for the "unsafe" ones] handle the complicated cases in functions that are called from inside the macros. See source/common/utf_impl.c . Safe UTF-8 handling requires a lot of error checks.)
askq1 askq1 wrote:
I want C/C++ code that will give me the UTF-8 byte sequence representing a given code point, the UTF-16 16-bit sequence representing a given code point, and the UTF-32 32-bit sequence representing a given code point.
e.g.
UTF8_Sequence CodePointToUTF8(Unichar codePoint)
Use U8_APPEND(). http://oss.software.ibm.com/icu/apiref/utf8_8h.html#a12
To read a code point from UTF-8, use U8_NEXT() http://oss.software.ibm.com/icu/apiref/utf8_8h.html#a10
or U8_GET() etc.
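For readers who want to see what such an append step amounts to, here is a minimal standalone sketch in plain C. It is not ICU's implementation and does not use the ICU headers; the function name encode_utf8 and its return convention are hypothetical, chosen only for illustration. It assumes that surrogate code points and values above 0x10ffff should be rejected.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of ICU): writes the UTF-8 form of code
 * point c into out, which must have room for 4 bytes. Returns the number
 * of bytes written, or 0 if c is a surrogate or above 0x10ffff. */
size_t encode_utf8(uint32_t c, uint8_t out[4])
{
    if (c <= 0x7f) {                      /* 1 byte: 0xxxxxxx */
        out[0] = (uint8_t)c;
        return 1;
    }
    if (c <= 0x7ff) {                     /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (uint8_t)(0xc0 | (c >> 6));
        out[1] = (uint8_t)(0x80 | (c & 0x3f));
        return 2;
    }
    if (c <= 0xffff) {                    /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        if (c >= 0xd800 && c <= 0xdfff) {
            return 0;                     /* surrogates are not encodable */
        }
        out[0] = (uint8_t)(0xe0 | (c >> 12));
        out[1] = (uint8_t)(0x80 | ((c >> 6) & 0x3f));
        out[2] = (uint8_t)(0x80 | (c & 0x3f));
        return 3;
    }
    if (c <= 0x10ffff) {                  /* 4 bytes: 11110xxx + 3 trail bytes */
        out[0] = (uint8_t)(0xf0 | (c >> 18));
        out[1] = (uint8_t)(0x80 | ((c >> 12) & 0x3f));
        out[2] = (uint8_t)(0x80 | ((c >> 6) & 0x3f));
        out[3] = (uint8_t)(0x80 | (c & 0x3f));
        return 4;
    }
    return 0;                             /* beyond the Unicode range */
}
```

For example, U+20AC (the euro sign) comes out as the three bytes E2 82 AC. Reading UTF-8 safely is the harder direction, which is why the ICU macros delegate the complicated cases to functions.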
UTF16_Sequence CodePointToUTF16(Unichar codePoint)
U16_APPEND() http://oss.software.ibm.com/icu/apiref/utf16_8h.html#a16
To read a code point from UTF-16, use U16_NEXT() http://oss.software.ibm.com/icu/apiref/utf16_8h.html#a16
or U16_GET() etc.
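The UTF-16 case can be sketched the same way in plain C. Again this is a standalone illustration, not ICU code; the name encode_utf16 is hypothetical. Rejecting lone surrogate code points is an assumption of this sketch, and the ICU macros may treat them differently.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of ICU): writes the UTF-16 form of code
 * point c into out, which must have room for 2 units. Returns the number
 * of 16-bit units written, or 0 if c is a surrogate or above 0x10ffff. */
size_t encode_utf16(uint32_t c, uint16_t out[2])
{
    if (c <= 0xffff) {
        if (c >= 0xd800 && c <= 0xdfff) {
            return 0;                     /* assumption: reject lone surrogates */
        }
        out[0] = (uint16_t)c;             /* BMP code point: one unit */
        return 1;
    }
    if (c <= 0x10ffff) {
        c -= 0x10000;                     /* 20 bits remain */
        out[0] = (uint16_t)(0xd800 | (c >> 10));    /* lead surrogate: top 10 bits */
        out[1] = (uint16_t)(0xdc00 | (c & 0x3ff));  /* trail surrogate: low 10 bits */
        return 2;
    }
    return 0;                             /* beyond the Unicode range */
}
```

For example, U+1F600 becomes the surrogate pair D83D DE00.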
UCS2_Sequence CodePointToUCS2(Unichar codePoint)
For UCS-2, the best strategy (in my opinion) is to treat it exactly the same as UTF-16. Most people mean UTF-16 when they talk about UCS-2 or generally about "16-bit Unicode".
If you do want to distinguish them anyway, then this is trivial:
if(0<=codePoint && codePoint<=0xffff) {
    cast codePoint to 16-bit type and emit;
} else {
    error;
}
Similarly, UTF-32 is trivial as well - it just stores each code point value in a 32-bit integer unit. Unicode code points are values 0..0x10ffff.
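The UCS-2 range check and the UTF-32 case might look like this in plain C. The names encode_ucs2 and encode_utf32 are hypothetical, and the sketch assumes the code point arrives as an unsigned 32-bit value, so the lower bound of 0 needs no explicit test.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: stores c as a single UCS-2 unit.
 * Returns false if c does not fit in 16 bits. */
bool encode_ucs2(uint32_t c, uint16_t *out)
{
    if (c > 0xffff) {
        return false;                     /* not representable in UCS-2 */
    }
    *out = (uint16_t)c;
    return true;
}

/* Hypothetical helper: stores c as a single UTF-32 unit.
 * Returns false if c is above the Unicode range 0..0x10ffff. */
bool encode_utf32(uint32_t c, uint32_t *out)
{
    if (c > 0x10ffff) {
        return false;
    }
    *out = c;                             /* the unit is the code point itself */
    return true;
}
```

Note that with a signed code point type the check needs both comparisons joined by &&, because a chained comparison like 0<=c<=0xffff does not test a range in C.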
See also http://oss.software.ibm.com/cvs/icu/~checkout~/icu/source/samples/ustring/ustring.cpp
I hope this helps - best regards, markus
-- Opinions expressed here may not reflect my company's positions unless otherwise noted.

