> I very strongly believe that the first option [assume that the string is utf8] is the right one as:
> 1) It's computationally cheaper
> 2) AMQP defines strings as UTF8 so it's not actually an unreasonable assumption to assume a std::string is UTF8 in a well designed interoperable application (which is the sort of behaviour we should be encouraging :-))
> 3) If the encoding fails an exception can be thrown - and as it really ought to be a UTF8 string quite rightly IMHO. Question though: does it actually fail during the encoding process, or is the encoding just wrong and thus risks confusing JMS etc. clients? Even if the latter, I suspect that the risk is modest - binary values in strings are the Devil's work :-)
> 4) "Unlike a java.lang.String, the c++ std::string does not imply textual data". Actually IMHO std::string really does *imply* textual data. I think it's very poor practice to use std::string on binary data; use a char*, a uint8_t*, or better yet a proper class to manipulate the actual type that is under consideration. I'd take a fairly dim view of my developers if they did that sort of thing without really good justification - that's the sort of thing that ends up making code unmaintainable in the long run (shall I get down off my high horse now :-))
In my opinion it is not so obvious, because as far as I know:

- AMQP allows UTF-8 or UTF-16 strings.
- Many C++ applications supporting Unicode store strings in std::wstring with UCS-2 encoding. Having a fixed character size of 2 bytes per code point allows for simple and efficient string manipulation. If required, conversions to/from UTF-8 are performed at the interfaces to the outside world. (BTW I think this is also the case for Java.)
- In C++ it is fairly common to use std::string as a container for binary data. I would not say it is wrong to do that.

I personally would say that in C++ there is no "default" character encoding. Defaulting to UTF-8 makes some sense because all 7-bit ASCII strings are valid UTF-8. But it may be dangerous to assume UTF-8 for all strings, and it would probably be safer to somehow force C++ programs to explicitly specify the encoding when reading and writing strings.

In Java, the default encoding is apparently UTF-8, but the Java client should still be able to accept strings encoded in UTF-16. I think that the Qpid client libraries should support implicit conversions between UTF-8 and UTF-16/UCS-2. I believe it is acceptable to support only the UCS-2 character set (Unicode's Basic Multilingual Plane) in the C++ client.
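To make the idea concrete, here is a minimal sketch of the kind of explicit, validating conversion I have in mind. The function name `utf8_to_ucs2` and the BMP-only restriction are my own illustration, not an existing Qpid API; it throws on malformed UTF-8 or on code points outside UCS-2, which matches the "exception on encoding failure" behaviour discussed above. (For brevity it does not reject overlong encodings, which a production decoder should.)

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode a UTF-8 encoded std::string into a UCS-2
// (BMP-only) std::u16string. Throws std::runtime_error on malformed input
// or on code points that do not fit in UCS-2.
std::u16string utf8_to_ucs2(const std::string& in) {
    std::u16string out;
    for (std::size_t i = 0; i < in.size(); ) {
        const unsigned char b = static_cast<unsigned char>(in[i]);
        std::uint32_t cp;   // decoded code point
        std::size_t len;    // length of this UTF-8 sequence in bytes
        if (b < 0x80)                { cp = b;        len = 1; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; len = 2; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; len = 3; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; len = 4; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (i + len > in.size())
            throw std::runtime_error("truncated UTF-8 sequence");

        // Accumulate the continuation bytes, 6 payload bits each.
        for (std::size_t j = 1; j < len; ++j) {
            const unsigned char c = static_cast<unsigned char>(in[i + j]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }

        // UCS-2 covers only the BMP; surrogate code points are not characters.
        if (cp > 0xFFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::runtime_error("code point outside UCS-2 / BMP");

        out.push_back(static_cast<char16_t>(cp));
        i += len;
    }
    return out;
}
```

The point of the sketch is only that the conversion is cheap and that validation and the BMP restriction fall out naturally, so "throw on bad input" is easy to provide at the library boundary.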
