In a message dated 11/5/2003 11:15:44 AM Pacific Standard Time, [EMAIL PROTECTED] writes:
Frank Yung-Fong Tang <YTang0648 at aol dot com> wrote: If you have ever moved a software implementation from supporting only a single-byte charset to supporting full Unicode 4.0, then you will be able to imagine it, especially if the project also has 20-100 people working on it who don't care much about Unicode or international support. I have been working on such projects for more than 10 years, and to me it is very reasonable to take such a staging approach.
The reason is very simple. Usually the software needs to use something other than UTF-8 for internal processing. For example, Mozilla takes UTF-8 as input and converts it to UTF-16 for internal storage. UTF-8 is not ideal for some internal processing: for a "ToUpper" or "ToLower" operation (or collation, etc.), it is much easier to build a UCS-2-based ToUpper/ToLower table than a UTF-8-based one.
Because of this, software that processes text probably doesn't want to use UTF-8 internally. It is fine for software that just stores the data or passes it along to use UTF-8 internally, but UTF-8 is not an ideal internal format for software that processes the data.
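To make the ToUpper point concrete, here is a minimal sketch in C. The names uniChar, uni_to_upper and uni_str_to_upper are made up for illustration, and the three hard-coded ranges only stand in for a real case table, which would be generated from the Unicode character data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t uniChar;  /* hypothetical 2-byte UCS-2 code unit */

/* Simple one-to-one upper-casing for a few ranges only.
   These ranges are a stand-in for a full table; they are not complete. */
static uniChar uni_to_upper(uniChar c)
{
    if (c >= 0x0061 && c <= 0x007A)                 /* ASCII a-z */
        return (uniChar)(c - 0x20);
    if (c >= 0x00E0 && c <= 0x00FE && c != 0x00F7)  /* Latin-1 letters, skipping U+00F7 DIVISION SIGN */
        return (uniChar)(c - 0x20);
    if (c >= 0x0430 && c <= 0x044F)                 /* basic Cyrillic lowercase */
        return (uniChar)(c - 0x20);
    return c;                                       /* everything else is left unchanged */
}

/* Upper-case a buffer in place: one array element per character,
   and the length of the buffer never changes. */
static void uni_str_to_upper(uniChar *s, size_t len)
{
    assert(s != NULL || len == 0);
    for (size_t i = 0; i < len; i++)
        s[i] = uni_to_upper(s[i]);
}
```

With UTF-8 the same loop would first have to decode each multi-byte sequence, and the result can have a different byte length: U+0131 (dotless i) takes two bytes in UTF-8, while its uppercase U+0049 takes one.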
The next reason is that the software may have some APIs that take or return a character index into a string. For example, if your software has APIs like the following:
int TheFirstCharacterInTheString(String, Character) returns the index of the first occurrence of the character in the string
or
string TheLeftSubString(String, Length) returns the leftmost "Length" characters.
then UCS-2 or UCS-4 is easier to deal with, and UTF-8 or UTF-16 is much harder. In UCS-2 or UCS-4 you can work out the memory requirement or offset from the character index, and vice versa, but in UTF-8 or UTF-16 you cannot. (To return an index or a length, you basically need two sets of APIs: one that returns the number of "characters" and one that returns the memory requirement, in case the caller needs to allocate the memory.)
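A rough sketch in C of the difference between a character index and a memory offset. The function names are made up for illustration, and the UTF-8 helpers assume well-formed input.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* UCS-2: the byte offset of character `index` is plain arithmetic. */
static size_t ucs2_offset_of_index(size_t index)
{
    return index * sizeof(uint16_t);
}

/* UTF-8: count characters (code points) by counting every byte that
   is not a continuation byte (10xxxxxx). */
static size_t utf8_char_count(const unsigned char *s, size_t len)
{
    size_t n = 0;
    assert(s != NULL || len == 0);
    for (size_t i = 0; i < len; i++)
        if ((s[i] & 0xC0) != 0x80)
            n++;
    return n;
}

/* UTF-8: byte offset of character `index`. This has to scan from the
   start of the string. Returns `len` if the string holds `index` or
   fewer characters. */
static size_t utf8_offset_of_index(const unsigned char *s, size_t len,
                                   size_t index)
{
    assert(s != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) {   /* start of a new character */
            if (index == 0)
                return i;
            index--;
        }
    }
    return len;
}
```

So something like TheLeftSubString(String, 3) on UTF-8 data needs utf8_offset_of_index(s, len, 3) just to learn how many bytes to copy, where the UCS-2 version only multiplies by two.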
Because of this, it is much easier to use UCS-2 or UCS-4 in the API, or probably I should say in the private interfaces inside the software. However, using UCS-4 doubles the memory requirement compared to UCS-2, which already doubles the memory requirement compared to single-byte-only support (which, for some software, means the last version). Therefore, in the first version that moves to Unicode, it is easier to go from supporting only single-byte encodings to UTF-8 support that only goes up to 3 bytes.
I am not saying this is the ideal case or that they should do it. I am just telling you what people face and think about when they move from an ISO-8859-1-only implementation to a pure Unicode implementation. A lot of the time, they need to deal with one thing per step.
Usually the staging approach is:
1. Change the internal data type from char to some other data type, probably a typedef uniChar.
If you ask for uniChar to be 4 bytes, you will hit a hard wall, die and stop there. If you ask for uniChar to be 2 bytes, you will hit a wall, break both your head and the wall, and continue.
2. Add converters to convert ISO-8859-1 and UTF-8 from/to that uniChar.
3. Migrate all the code.
4. Talk to people about supporting UTF-16 or changing uniChar to 4 bytes, after you have proved that steps 1-3 bring a lot of value and do not cause too much of a performance/footprint problem.
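Steps 1 and 2 might look roughly like the following C sketch. Apart from uniChar, all the names are hypothetical. The UTF-8 converter deliberately accepts only 1- to 3-byte sequences, it does not reject overlong forms or surrogate values, and the caller is assumed to supply large enough output buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t uniChar;   /* step 1: a 2-byte internal character type */

/* Step 2: ISO-8859-1 to uniChar. Each Latin-1 byte value is its own
   code point. `dst` needs room for `len` elements. */
static size_t latin1_to_uni(const unsigned char *src, size_t len, uniChar *dst)
{
    assert(dst != NULL || len == 0);
    for (size_t i = 0; i < len; i++)
        dst[i] = src[i];
    return len;
}

/* Step 2: UTF-8 to uniChar, 1- to 3-byte sequences only. Returns the
   number of uniChars written, or (size_t)-1 on a 4-byte or truncated
   sequence. `dst` needs room for `len` elements. */
static size_t utf8_to_uni(const unsigned char *src, size_t len, uniChar *dst)
{
    size_t i = 0, n = 0;
    while (i < len) {
        unsigned char b = src[i];
        if (b < 0x80) {
            dst[n++] = b;
            i += 1;
        } else if ((b & 0xE0) == 0xC0 && i + 1 < len &&
                   (src[i + 1] & 0xC0) == 0x80) {
            dst[n++] = (uniChar)(((b & 0x1F) << 6) | (src[i + 1] & 0x3F));
            i += 2;
        } else if ((b & 0xF0) == 0xE0 && i + 2 < len &&
                   (src[i + 1] & 0xC0) == 0x80 &&
                   (src[i + 2] & 0xC0) == 0x80) {
            dst[n++] = (uniChar)(((b & 0x0F) << 12) |
                                 ((src[i + 1] & 0x3F) << 6) |
                                 (src[i + 2] & 0x3F));
            i += 3;
        } else {
            return (size_t)-1;   /* outside what this first stage supports */
        }
    }
    return n;
}

/* Step 2: uniChar to UTF-8. Each uniChar becomes 1 to 3 bytes, so
   `dst` needs room for 3 * len bytes. Every value is treated as a
   plain UCS-2 character; surrogates get no special handling. */
static size_t uni_to_utf8(const uniChar *src, size_t len, unsigned char *dst)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        uniChar c = src[i];
        if (c < 0x80) {
            dst[n++] = (unsigned char)c;
        } else if (c < 0x800) {
            dst[n++] = (unsigned char)(0xC0 | (c >> 6));
            dst[n++] = (unsigned char)(0x80 | (c & 0x3F));
        } else {
            dst[n++] = (unsigned char)(0xE0 | (c >> 12));
            dst[n++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            dst[n++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    return n;
}
```

Step 4 would then mean either teaching these converters about surrogate pairs and 4-byte UTF-8, or widening uniChar to 4 bytes.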
==================================
Frank Yung-Fong Tang
System Architect, International Development, AOL Interactive Services
AIM:yungfongta mailto:[EMAIL PROTECTED] Tel:650-937-2913
Yahoo! Msg: frankyungfongtan
John 3:16 "For God so loved the world that he gave his one and only Son, that whoever believes in him shall not perish but have eternal life."
Does your software display Thai language text correctly for Thailand users? -> Basic Concept of Thai Language, linked from Frank Tang's Internationalization Secrets
Want to translate your English text to something Thailand users can understand? -> Try English-to-Thai machine translation at http://c3po.links.nectec.or.th/parsit/

