[EMAIL PROTECTED] wrote:
> 
> Carl W. Brown <[EMAIL PROTECTED]> wrote:
> >In the case of strcmp the problem is that this won't even work on UCS-2.
> >It detects the end of string with a single byte 0x00.  You have to use a
> >special Unicode compare routine, and this routine needs to be fixed to
> >produce proper compares.
> 
> Sorry - read "wcscmp" for "strcmp".  4 years of Unicode coding and still
> slipping back to old function names!

Sorry, Carl's point is still a good one. wcscmp behaves the way you described
*only* on a box where wchar_t happens to be 16 bits wide (whether it is
UTF-16 or simply UCS-2 is irrelevant here). In particular, on a typical
gcc-driven box where wchar_t is 32 bits, it does *not* behave this way.
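
For the record, a minimal sketch to check this on a given box (the sizes
mentioned in the comment are the usual cases, not guarantees):

#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* MSVC on Win32: wchar_t is typically 2 bytes (UTF-16 code units).
     * gcc/glibc on Linux: wchar_t is typically 4 bytes (UTF-32).
     * wcscmp() therefore compares different units on each. */
    printf("sizeof(wchar_t) = %u bytes\n", (unsigned)sizeof(wchar_t));
    return 0;
}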

So, as Carl explained, either you use a custom Unicode compare function,
potentially under-optimized, and you make sure you are using some 16-bit
integer type for the data (and there are other pitfalls then, as I am sure
you know). Or you stick with wchar_t, but then you are really opening
another potential trap for later, one that can be summarized as requesting
_native_compiler_support_ for UTF-32... which will certainly be refused.
In other words, this may well mean that your "solutions" that rely on
wcscmp() are really impossible to port to the Linux/Unix-as-a-user-station
market.
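
To make the first option concrete, such a custom compare function might
look like this (a sketch only; utf16cmp is my own name, and uint16_t
assumes C99 <stdint.h> or an equivalent typedef):

#include <stdint.h>

/* Compare two NUL-terminated arrays of UTF-16 code units, unit by
 * unit, independently of the platform's wchar_t width. Note that
 * this still sorts in UTF-16 binary order, not code point order. */
int utf16cmp(const uint16_t *a, const uint16_t *b)
{
    while (*a && *a == *b) {
        ++a;
        ++b;
    }
    return (int)*a - (int)*b;
}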

This point may not be a problem for you right now, probably because the
overwhelming majority of your clients run on Win32. However, imagine what
may happen in the future, for example when memory costs less (it will) and
when surrogates become more common (this is more debatable): perhaps it
will then become more productive to use in-memory representations with
straight 32-bit characters, which do not have the difficulties inherent in
surrogates (counting characters for display, for example). On that day, or
some time after it becomes obvious, Microsoft will modify their compilers
to follow the Standard and make wchar_t 32 bits (or perhaps the standard
way to write GUIs will have evolved to be based on some evolution of gcc;
the net result is the same): on that very day, you will be doomed by your
dependency on the specific binary order of UTF-16...
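
To illustrate the counting difficulty just mentioned: with UTF-16 you have
to skip trailing surrogates, whereas with 32-bit characters the element
count is already the answer (a sketch; utf16_codepoints is a hypothetical
helper of mine):

#include <stddef.h>
#include <stdint.h>

/* Count code points in well-formed UTF-16 by not counting low
 * (trailing) surrogates, DC00-DFFF; each surrogate pair then
 * contributes exactly one to the total. */
size_t utf16_codepoints(const uint16_t *s)
{
    size_t n = 0;
    for (; *s; ++s)
        if (*s < 0xDC00 || *s > 0xDFFF)
            ++n;
    return n;
}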


Also, you refer to the scenario of a UTF-16 client against a UTF-8 server.
I do not understand why you could not ask the client to implement the
modification (that is, to run a modified version of wcscmp() that sorts
surrogates *after* the range E000-FFFD): this would be convenient, since
binary comparisons that use the sign of the result should be fairly
uncommon on the client side.
At least, I think so.
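
For what it is worth, the modification I have in mind is the classic
remapping that makes UTF-16 compare in code point (UTF-32) order; a
sketch, with names of my own:

#include <stdint.h>

/* Remap code units so that surrogates sort *after* the E000-FFFF
 * range: E000-FFFF -> D800-F7FF, and D800-DFFF -> F800-FFFF.
 * Comparing the remapped units gives code point order, since all
 * code points above FFFF are encoded with surrogate pairs. */
static uint16_t fixup(uint16_t c)
{
    if (c >= 0xE000)
        return (uint16_t)(c - 0x0800);
    if (c >= 0xD800)
        return (uint16_t)(c + 0x2000);
    return c;
}

int utf16cmp_codepoint_order(const uint16_t *a, const uint16_t *b)
{
    while (*a && *a == *b) {
        ++a;
        ++b;
    }
    return (int)fixup(*a) - (int)fixup(*b);
}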


Antoine
