Re: UTF-16 inside UTF-8

YTang0648 Wed, 05 Nov 2003 13:41:20 -0800

In a message dated 11/5/2003 11:15:44 AM Pacific Standard Time, [EMAIL PROTECTED] writes:

Frank Yung-Fong Tang <YTang0648 at aol dot com> wrote:

>> At the risk of upsetting the open-source faithful, that is just plain
>> lazy.
>
> I don't think you should call it "lazy". It is just "under
> construction" if such software is still in "alpha". How much software
> has such support in its "alpha" stage in your company?

My company is not the best example here; we're well behind the curve
when it comes to Unicode, and i18n/L10n generally.

That said, I think it would be much faster and less error-prone, for a
company adding Unicode support to a product for the first time, to
support the entire Unicode range from the outset, rather than supporting
just the BMP in the alpha stage and then "adding" support for
supplementary characters. For UTF-8 in particular, I can't imagine why
one would choose to implement the 1-, 2-, and 3-byte forms in one stage
and add the 4-byte forms in a later stage.

If you ever move a software implementation from supporting only single-byte charsets to supporting full Unicode 4.0, then you will be able to imagine it, especially if the project also has 20-100 people working on it who don't care much about Unicode or international support. I have been working on such projects for more than 10 years, and to me such a staged approach is very reasonable.

The reason is very simple. Usually the software needs to use something other than UTF-8 for internal processing. For example, Mozilla takes UTF-8 as input and converts it to UTF-16 for internal storage. UTF-8 is not ideal for some internal processing because, for operations such as "ToUpper" or "ToLower" (or collation, etc.), it is much easier to build a UCS-2 based upper/lower case table than a UTF-8 based one.
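
To make the table argument concrete, here is a minimal Python sketch. It is not Mozilla's code; the names UPPER_TABLE, to_upper_ucs2 and to_upper_utf8 are made up for this example, and the table is derived from Python's own str.upper() only to keep the sketch short. It shows that with 16-bit code units the case mapping is one array lookup per unit, while UTF-8 input has to be decoded before the same table can be used.

```python
# Hypothetical sketch: a case table indexed directly by 16-bit code unit.
# The table is built from Python's str.upper() for brevity. Entries whose
# uppercase form is not a single BMP character are left unchanged.

def _simple_upper(unit):
    up = chr(unit).upper()
    if len(up) == 1 and ord(up) < 0x10000:
        return ord(up)
    return unit

UPPER_TABLE = [_simple_upper(u) for u in range(0x10000)]

def to_upper_ucs2(units):
    """Uppercase a sequence of 16-bit code units: one lookup per unit."""
    return [UPPER_TABLE[u] for u in units]

def to_upper_utf8(data):
    """Same operation on UTF-8 bytes: decode, look up, re-encode.
    Code points above U+FFFF are passed through untouched."""
    out = []
    for ch in data.decode("utf-8"):
        cp = ord(ch)
        out.append(chr(UPPER_TABLE[cp]) if cp < 0x10000 else ch)
    return "".join(out).encode("utf-8")
```

The UTF-8 version works, but only by going through the fixed-width form first, which is the point being made above.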

Because of this, software that processes text probably doesn't want to use UTF-8 internally. It is OK for software that just stores the data or passes it along to use UTF-8 internally, but UTF-8 is not ideal as the internal format for software that processes the data.

The next reason is that the software may have some APIs that take or return a character index into a string. For example, suppose your software has APIs like the following:

int TheFirstCharacterInTheString(String, Character) returns the index of the first occurrence of the character in the String

string TheLeftSubString(String, Length) returns the leftmost "Length" characters.

Then UCS-2 or UCS-4 is easier to deal with, and UTF-8 or UTF-16 is much harder, because in UCS-2 or UCS-4 you can compute the memory requirement or offset from the character index, and vice versa, but in UTF-8 or UTF-16 you cannot. (To return an index or length, you basically need two sets of APIs: one that returns the number of "characters" and one that returns the "memory requirement", in case the caller needs to allocate the memory.)
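
The index/offset difference can be sketched in a few lines of Python. The function names below are hypothetical and the UTF-8 input is assumed to be well formed. With a fixed-width form the byte offset is a multiplication; with UTF-8 you have to scan from the start and count lead bytes.

```python
# Hypothetical sketch: character index -> byte offset.
# Assumes `data` is well-formed UTF-8.

def ucs4_offset_of_char(index):
    """Fixed width: every character takes 4 bytes, so this is O(1)."""
    return index * 4

def utf8_offset_of_char(data, index):
    """Variable width: walk the bytes and count character starts, O(n)."""
    seen = 0
    for offset, byte in enumerate(data):
        if byte & 0xC0 != 0x80:        # not a continuation byte
            if seen == index:
                return offset
            seen += 1
    if seen == index:
        return len(data)               # index one past the last character
    raise IndexError("character index out of range")

def utf8_char_count(data):
    """The 'number of characters' answer, as opposed to len(data) bytes."""
    return sum(1 for byte in data if byte & 0xC0 != 0x80)
```

For a string made of one 1-byte, one 2-byte, one 3-byte and one 4-byte character, len(data) is 10 while utf8_char_count(data) is 4, which is why a caller ends up needing both numbers.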

Because of this, it is much easier to use UCS-2 or UCS-4 in the API, or perhaps I should say in the private interfaces inside the software. However, using UCS-4 doubles the memory requirement compared to UCS-2, which already doubles the memory requirement compared to single-byte-only support (for some software, that means the previous version). Therefore, in the first version that moves to Unicode, it is easier to go from supporting only single-byte encodings to UTF-8 support that only goes up to 3 bytes.
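
A small Python sketch of the arithmetic, with made-up function names. It assumes the sample text fits in ISO-8859-1, and it uses Python's "utf-16-le" and "utf-32-le" codecs as stand-ins for UCS-2 and UCS-4 (for BMP-only text, UTF-16 and UCS-2 produce the same bytes). The second function shows why a 2-byte internal unit lines up with UTF-8 forms of up to 3 bytes: those forms cover exactly the code points below U+10000.

```python
# Hypothetical sketch: storage cost of the same text in each form.
# Assumes `text` contains only ISO-8859-1 characters.

def storage_sizes(text):
    return {
        "ISO-8859-1": len(text.encode("latin-1")),
        "UTF-8": len(text.encode("utf-8")),
        "UCS-2": len(text.encode("utf-16-le")),   # same bytes as UCS-2 for BMP text
        "UCS-4": len(text.encode("utf-32-le")),
    }

def utf8_length(code_point):
    """Number of bytes UTF-8 needs for one code point."""
    if code_point < 0x80:
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3        # the whole BMP fits in 1 to 3 bytes
    if code_point <= 0x10FFFF:
        return 4        # supplementary characters only
    raise ValueError("not a Unicode code point")
```

For a 4-character Latin-1 string the sizes come out as 4 bytes in ISO-8859-1, 8 in UCS-2 and 16 in UCS-4, which is the doubling described above.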

I am not saying this is the ideal case or that they should do that. I am just telling you what people will face and think when they move from an ISO-8859-1 only implementation to a pure Unicode implementation. A lot of the time, they need to deal with one thing per step.

Usually the staged approach is:

1. Change the internal data type from char to some other data type, probably a typedef uniChar.

If you ask for uniChar to be 4 bytes, you will hit a hard wall, die, and stop there. If you ask for uniChar to be 2 bytes, you will hit a wall, break both your head and the wall, and continue.

2. Add converters to convert ISO-8859-1 and UTF-8 from/to that uniChar.

3. Migrate all the code.

4. Talk to people about supporting UTF-16 or changing uniChar to 4 bytes, after you prove that steps 1-3 bring in a lot of value and do not cause too much of a performance/footprint issue.
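
As a rough illustration of step 2, here is a Python sketch of converters to and from a 16-bit unit. All names are hypothetical, a Python int in the range 0..0xFFFF stands in for uniChar, and the UTF-8 input is assumed to be well formed. The decoder handles only the 1- to 3-byte forms and maps each 4-byte form to U+FFFD, which is the kind of first-stage behavior under discussion, not a recommendation.

```python
# Hypothetical sketch of the step-2 converters. A "uniChar" here is a
# Python int in 0..0xFFFF. Assumes well-formed UTF-8 input.

REPLACEMENT = 0xFFFD

def latin1_to_unichars(data):
    """ISO-8859-1 bytes map one-to-one onto U+0000..U+00FF."""
    return list(data)

def unichars_to_latin1(units):
    """Units above 0xFF cannot be represented and become '?'."""
    return bytes(u if u < 0x100 else ord("?") for u in units)

def utf8_to_unichars(data):
    units = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:                                   # 1-byte form
            units.append(lead)
            i += 1
        elif lead < 0xE0:                                 # 2-byte form
            units.append(((lead & 0x1F) << 6) | (data[i + 1] & 0x3F))
            i += 2
        elif lead < 0xF0:                                 # 3-byte form
            units.append(((lead & 0x0F) << 12)
                         | ((data[i + 1] & 0x3F) << 6)
                         | (data[i + 2] & 0x3F))
            i += 3
        else:                                             # 4-byte form: not supported yet
            units.append(REPLACEMENT)
            i += 4
    return units

def unichars_to_utf8(units):
    out = bytearray()
    for u in units:
        if u < 0x80:
            out.append(u)
        elif u < 0x800:
            out += bytes([0xC0 | (u >> 6), 0x80 | (u & 0x3F)])
        else:
            out += bytes([0xE0 | (u >> 12), 0x80 | ((u >> 6) & 0x3F), 0x80 | (u & 0x3F)])
    return bytes(out)
```

In this sketch, step 4 would amount to replacing the last branch of utf8_to_unichars with code that emits a surrogate pair (if uniChar stays 2 bytes) or the full code point (if uniChar becomes 4 bytes).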

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/

==================================
Frank Yung-Fong Tang
System Architect, International Development, AOL Interactive Services
AIM:yungfongta mailto:[EMAIL PROTECTED] Tel:650-937-2913
Yahoo! Msg: frankyungfongtan

John 3:16 "For God so loved the world that he gave his one and only Son, that whoever believes in him shall not perish but have eternal life.

Does your software display Thai language text correctly for Thailand users?
-> Basic Concept of Thai Language linked from Frank Tang's Internationalization Secrets
Want to translate your English text to something Thailand users can understand ?
-> Try English-to-Thai machine translation at http://c3po.links.nectec.or.th/parsit/
