RE: UTF-16 inside UTF-8

Philippe Verdy Tue, 02 Dec 2003 18:54:30 -0800

Frank Yung-Fong Tang writes:
> But how about the UTF-16 vs UCS4 battle?

Forget it: almost nobody uses UCS-4, except internally for string
processing at the character level. For whole strings, nearly everybody uses
UTF-16, as it performs better at a lower memory cost, and because UCS-4 is
not needed.
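
A minimal sketch of the memory argument, in Python. The sample string is an
arbitrary choice for illustration, and UTF-32 stands in for UCS-4 here:

```python
# Compare the storage cost of the same text in UTF-16 and UTF-32 (UCS-4).
# The sample mixes ASCII, CJK characters from the BMP, and one
# supplementary character (U+1D11E) that needs a surrogate pair in UTF-16.
sample = "Unicode 文字列 \U0001D11E"

# UTF-16: 2 bytes per BMP character, 4 bytes per supplementary character.
utf16_bytes = len(sample.encode("utf-16-le"))

# UTF-32: always 4 bytes per code point.
utf32_bytes = len(sample.encode("utf-32-le"))

print("UTF-16:", utf16_bytes, "bytes")
print("UTF-32:", utf32_bytes, "bytes")
```

The gap narrows only for text made up mostly of supplementary characters,
which is rare in practice.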


Handling surrogates found in strings is quite simple; in fact, they are
even simpler to detect and manage than MBCS-encoded strings in Asian 8-bit
applications. Today, MBCS 8-bit processing is performed by first
transforming the text into equivalent internal 16-bit code positions, or
sometimes by transcoding it to Unicode with UTF-16.
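
To illustrate how little is involved, here is a rough sketch in Python of
surrogate detection and pair decoding. The helper names (to_code_units,
is_high_surrogate, is_low_surrogate, combine) are made up for this example;
the ranges and the arithmetic are those of the UTF-16 encoding form:

```python
def to_code_units(text):
    """Return the UTF-16 code units of `text` as a list of integers."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]

def is_high_surrogate(unit):
    # High (leading) surrogates occupy U+D800..U+DBFF.
    return 0xD800 <= unit <= 0xDBFF

def is_low_surrogate(unit):
    # Low (trailing) surrogates occupy U+DC00..U+DFFF.
    return 0xDC00 <= unit <= 0xDFFF

def combine(high, low):
    """Combine a surrogate pair into a supplementary code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

units = to_code_units("a\U0001D11E")
for i, unit in enumerate(units):
    if is_high_surrogate(unit):
        kind = "high surrogate"
    elif is_low_surrogate(unit):
        kind = "low surrogate"
    else:
        kind = "BMP character"
    print(i, hex(unit), kind)
```

The two ranges do not overlap with each other or with any ordinary
character, so a single range test on one code unit is enough to classify
it.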

So I do think that applications that can handle East Asian DBCS 8-bit text
(EUC-*, ISO2022-*, JIS) can very easily be modified to work internally with
UTF-16. Interoperability of Unicode code points with these DBCS charsets is
excellent: the transcoding is unambiguous and bijective, does not need code
reordering, and consists of just a simple mapping table, now implemented in
all OSes localized for Asian markets.
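
A small sketch of such a table-driven round trip, using Python's built-in
"euc_jp" codec. The sample text is an arbitrary choice, and a single sample
only illustrates the idea; it does not establish the round-trip property
for every charset or vendor variant:

```python
# Start from legacy DBCS bytes (EUC-JP), as an older application would hold them.
euc_jp_bytes = "日本語テキスト".encode("euc_jp")

# Transcode to Unicode, then serialize as UTF-16 for internal processing.
text = euc_jp_bytes.decode("euc_jp")
utf16_bytes = text.encode("utf-16-le")

# Transcode back to EUC-JP and compare with the original bytes.
round_trip = utf16_bytes.decode("utf-16-le").encode("euc_jp")

print(len(euc_jp_bytes), len(utf16_bytes), round_trip == euc_jp_bytes)
```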

East Asian developers learned long ago how to cope with DBCS-encoded
strings. Now with UTF-16, handling surrogates found in strings is even
simpler, as UTF-16 allows bidirectional and random access to any position
in a string, which means additional performance and less tricky algorithms
for text processing...
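
A rough sketch of what random and backward access look like, in Python.
The function names (char_start, iter_backward) are hypothetical, and the
input is assumed to be well-formed UTF-16 held as a list of code units:

```python
def char_start(units, index):
    """Return the index of the first code unit of the character at `index`.

    Looking at one neighbouring unit is enough: a low surrogate preceded
    by a high surrogate is the second half of a pair.
    """
    if (index > 0
            and 0xDC00 <= units[index] <= 0xDFFF
            and 0xD800 <= units[index - 1] <= 0xDBFF):
        return index - 1
    return index

def iter_backward(units):
    """Yield each character's code units, walking from the end to the start."""
    i = len(units) - 1
    while i >= 0:
        start = char_start(units, i)
        yield units[start:i + 1]
        i = start - 1

# 'a', U+1D11E (as a surrogate pair), 'b'
units = [0x0061, 0xD834, 0xDD1E, 0x0062]

print(char_start(units, 2))  # landing on the low surrogate of the pair
for char_units in iter_backward(units):
    print([hex(u) for u in char_units])
```

With many DBCS encodings, by contrast, a trail byte can fall in the same
range as a lead byte, so finding a character boundary from an arbitrary
offset may require rescanning from an earlier known boundary.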


