Re: [lazarus] UTF-8 vs UTF-16 support
Quoting Luca Olivetti <[EMAIL PROTECTED]>:

> Mattias Gärtner wrote:
>
> > For most string operations, like computing the byte length or comparing
> > strings ASCII case insensitive, UTF-8 is 100% compatible.
>
> but not if you need char length, say limiting a text to 40 characters
> and indicating there that the text has been truncated with '..':
>
>   if length(s)>40 then s:=copy(s,1,38)+'..';
>
> or maybe faster
>
>   if length(s)>40 then
>   begin
>     s[39]:='.';
>     s[40]:='.';
>     setlength(s,40);
>   end;
>
> would break with utf-8 (and with utf-16 too if you use characters
> outside the bmp). There are probably utf-8 equivalents of the above, but
> old habits die hard

  if UTF8Length(s)>40 then s:=UTF8Copy(s,1,38)+'..';

> Maybe for internal processing utf-32 is better and only use utf-8 for
> input/output and/or interface with other systems?

:)

Speed: Depends on what you do: UTF-8, UTF-16, UTF-32
Memory: UTF-8 or UTF-16.
Compatibility: UTF-8 (VCL)
Easy coding: UTF-32

There is no absolute winner.

Mattias

_
To unsubscribe: mail [EMAIL PROTECTED] with "unsubscribe" as the Subject
archives at http://www.lazarus.freepascal.org/mailarchives
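A minimal sketch of the code-point-aware truncation discussed in the message above, written against plain FPC strings so that it compiles without the LCL. The helper names (IsContinuation, CodePointCount, PrefixBytes, Ellipsize) are made up for illustration and are not the LCLProc functions. The code assumes well-formed UTF-8 and MaxChars >= 2, and it counts code points, not visible characters.

```pascal
program Utf8Truncate;
{$mode objfpc}{$H+}

// True if B is a UTF-8 continuation byte (bit pattern 10xxxxxx).
function IsContinuation(B: Byte): Boolean;
begin
  Result := (B and $C0) = $80;
end;

// Code points = bytes that are not continuation bytes. O(n) scan.
function CodePointCount(const S: string): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if not IsContinuation(Ord(S[I])) then
      Inc(Result);
end;

// Byte length of the prefix holding the first N code points of S.
function PrefixBytes(const S: string; N: Integer): Integer;
var
  I, Seen: Integer;
begin
  Seen := 0;
  for I := 1 to Length(S) do
    if not IsContinuation(Ord(S[I])) then
    begin
      if Seen = N then
        Exit(I - 1);
      Inc(Seen);
    end;
  Result := Length(S);
end;

// Limit S to MaxChars code points, marking the cut with '..'.
function Ellipsize(const S: string; MaxChars: Integer): string;
begin
  if CodePointCount(S) > MaxChars then
    Result := Copy(S, 1, PrefixBytes(S, MaxChars - 2)) + '..'
  else
    Result := S;
end;

var
  S: string;
begin
  // U+00E4 is written as its two UTF-8 bytes so the source stays ASCII.
  S := 'gr'#$C3#$A4'tner';
  WriteLn(Length(S));                // byte length: 8
  WriteLn(CodePointCount(S));        // code points: 7
  WriteLn(Length(Ellipsize(S, 5)));  // 3 code points (4 bytes) + '..' = 6 bytes
end.
```

The cut is made on a code point boundary, so a multi-byte sequence is never split, which is the failure mode of the byte-indexed versions quoted above.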
Re: [lazarus] UTF-8 vs UTF-16 support
Mattias Gärtner wrote:

> For most string operations, like computing the byte length or comparing
> strings ASCII case insensitive, UTF-8 is 100% compatible.

but not if you need char length, say limiting a text to 40 characters
and indicating there that the text has been truncated with '..':

  if length(s)>40 then s:=copy(s,1,38)+'..';

or maybe faster

  if length(s)>40 then
  begin
    s[39]:='.';
    s[40]:='.';
    setlength(s,40);
  end;

would break with utf-8 (and with utf-16 too if you use characters
outside the bmp). There are probably utf-8 equivalents of the above, but
old habits die hard

Maybe for internal processing utf-32 is better and only use utf-8 for
input/output and/or interface with other systems?

Bye
--
Luca Olivetti
Wetron Automatización S.A. http://www.wetron.es/
Tel. +34 93 5883004 Fax +34 93 5883007
Re: [lazarus] UTF-8 vs UTF-16 support
Quoting Graeme Geldenhuys <[EMAIL PROTECTED]>:

> On 08/10/2007, Razvan Adrian Bogdan <[EMAIL PROTECTED]> wrote:
> > char would be nice too, maybe even implemented in FPC for UTF8string
> > such as Length(utf8string) or indexing utf8string[1] to return the
> > char not the byte as UTF32.
>
> In fpGUI I have a few helper functions for UTF-8 strings (Length,
> Copy, Delete, Insert, Pos etc). Some of the code I got from LCLProc
> unit and some written myself.
>
> Anybody know how I can access UTF-8 characters via an index? e.g.
> MyString[2] returns the string containing the 2nd character. I say
> returning a string, because a UTF-8 character can be between 1-4
> bytes so a Char type will not do.

If you want an array, then you can convert the string to UTF-32 or
create an array of PChar pointing to each character.
If you just want the n-th UTF-8 character, then you can use UTF8CharStart.
If you need the n-th visible character (including BIDI and combined
characters) then you must use functions from the iconv lib or the winapi.
Same for UTF-16 and UTF-32.
There is no encoding that gives random access to the visible characters
of a Unicode string.

Mattias
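A rough sketch of index-style access as asked for above: return the n-th code point of a UTF-8 string as a string. This is not the LCLProc UTF8CharStart function mentioned in the reply; SeqLen and Utf8CharAt are hypothetical names, and the code assumes well-formed UTF-8. It addresses code points only, not visible (combined) characters.

```pascal
program Utf8Index;
{$mode objfpc}{$H+}

// Byte length of a UTF-8 sequence, judged from its first byte.
// Returns 1 for stray continuation bytes so a scan always advances.
function SeqLen(Lead: Byte): Integer;
begin
  if Lead < $80 then Result := 1
  else if (Lead and $E0) = $C0 then Result := 2
  else if (Lead and $F0) = $E0 then Result := 3
  else if (Lead and $F8) = $F0 then Result := 4
  else Result := 1;
end;

// The N-th code point of S (1-based) as a string, or '' if S has
// fewer than N code points. The cost is O(N), not O(1).
function Utf8CharAt(const S: string; N: Integer): string;
var
  P, Len: Integer;
begin
  Result := '';
  if N < 1 then
    Exit;
  P := 1;
  while P <= Length(S) do
  begin
    Len := SeqLen(Ord(S[P]));
    Dec(N);
    if N = 0 then
      Exit(Copy(S, P, Len));
    Inc(P, Len);
  end;
end;

var
  S: string;
begin
  // 'a', U+20AC (three UTF-8 bytes), 'b'
  S := 'a'#$E2#$82#$AC'b';
  WriteLn(Length(Utf8CharAt(S, 2)));  // the 2nd code point occupies 3 bytes
  WriteLn(Utf8CharAt(S, 3));          // the 3rd code point is 'b'
end.
```

Because every lookup rescans from the start, code that needs many indexed accesses is usually better served by converting once to UTF-32 or by building the array of character start positions suggested in the reply.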
Re: [lazarus] UTF-8 vs UTF-16 support
On 08/10/2007, Razvan Adrian Bogdan <[EMAIL PROTECTED]> wrote:
> char would be nice too, maybe even implemented in FPC for UTF8string
> such as Length(utf8string) or indexing utf8string[1] to return the
> char not the byte as UTF32.

In fpGUI I have a few helper functions for UTF-8 strings (Length,
Copy, Delete, Insert, Pos etc). Some of the code I got from LCLProc
unit and some written myself.

Anybody know how I can access UTF-8 characters via an index? e.g.
MyString[2] returns the string containing the 2nd character. I say
returning a string, because a UTF-8 character can be between 1-4
bytes so a Char type will not do.

Regards,
  - Graeme -

___
fpGUI - a cross-platform Free Pascal GUI toolkit
http://opensoft.homeip.net/fpgui/
Re: [lazarus] UTF-8 vs UTF-16 support
Quoting Razvan Adrian Bogdan <[EMAIL PROTECTED]>:

> On 10/8/07, Luca Olivetti <[EMAIL PROTECTED]> wrote:
> > Luca Olivetti wrote:
> >
> > >> You have to go through the string for UTF-8 and UTF-16 encodings so
> > >> the advantages are at least questionable...
> > >
> > > Yes, but my (wrong) premise is that you could assume all characters are
> > > 2 bytes wide, so the Nth character would be at N*2 byte.
> >
> > BTW, using strings as arrays of char to get at individual characters is
> > risky business with utf-8.

It's the same with UTF-16 and with treating UTF-16 as UCS-2. UTF-32 is
almost there (some languages combine characters; I don't know the
relevance).

For most string operations, like computing the byte length or comparing
strings ASCII case insensitive, UTF-8 is 100% compatible.
Because of the UTF-8 encoding, you can even start in the middle of a
string and find out if the byte is the first, second, third or fourth
byte of a character.
So, existing algorithms don't need to change at all to work with UTF-8.
The same is true for UCS-2 code and UTF-16.

> > Or will be they converted to (pseudo)
> > properties and (slowly) do the (slow) right thing?
> > I also suppose that the functions in strutils are not utf-8 aware, so
> > what should we be using in its place?
>
> For single character processing UTF32 (4bytes) would be nice :), i
> think functions to count UTF8 chars inside a string and getting each
> char would be nice too, maybe even implemented in FPC for UTF8string
> such as Length(utf8string) or indexing utf8string[1] to return the
> char not the byte as UTF32.

See lcl/lclproc.pas and search for UTF8. Some of these functions already
exist in the RTL. The others may be moved eventually.

> Since FPC uses ANSI strings a lot, and most text is Latin-1 without
> any diacritics, using UTF8 in Lazarus is a good choice; if the right
> functions are provided it can be a great choice, unless apps become too
> slow.

In Lazarus most UTF-8 code is in synedit.
The synedit slowdown from ASCII to UTF-8 was hardly measurable, even
ignoring the fact that 90%-98% of the time is spent in the widgetset.

> Since the web uses mostly UTF8 for minimizing transferred data and also
> most databases for minimal storage size it becomes clear that UTF8 is
> a better choice if helper functions exist to assist with its
> management.

Mattias
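The self-synchronizing property described above (start in the middle of a string and find out which byte of a character you are on) can be sketched as follows. CharStart is a hypothetical helper, not an LCL or RTL function, and the code assumes well-formed UTF-8 and a BytePos within 1..Length(S).

```pascal
program Utf8Resync;
{$mode objfpc}{$H+}

// Walk back from BytePos to the first byte of the code point that
// contains it. Continuation bytes have the bit pattern 10xxxxxx,
// so at most three steps back are needed in well-formed UTF-8.
function CharStart(const S: string; BytePos: Integer): Integer;
begin
  Result := BytePos;
  while (Result > 1) and ((Ord(S[Result]) and $C0) = $80) do
    Dec(Result);
end;

var
  S: string;
  Start: Integer;
begin
  // Bytes: 1 = 'a', 2..4 = U+20AC, 5 = 'b'
  S := 'a'#$E2#$82#$AC'b';
  Start := CharStart(S, 4);
  WriteLn(Start);          // 2: the sequence containing byte 4 starts at byte 2
  WriteLn(4 - Start + 1);  // 3: byte 4 is the third byte of that sequence
  WriteLn(CharStart(S, 5)); // 5: 'b' is a single-byte sequence
end.
```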
Re: [lazarus] UTF-8 vs UTF-16 support
On 10/8/07, Luca Olivetti <[EMAIL PROTECTED]> wrote:
> Luca Olivetti wrote:
>
> >> You have to go through the string for UTF-8 and UTF-16 encodings so
> >> the advantages are at least questionable...
> >
> > Yes, but my (wrong) premise is that you could assume all characters are
> > 2 bytes wide, so the Nth character would be at N*2 byte.
>
> BTW, using strings as arrays of char to get at individual characters is
> risky business with utf-8. Or will be they converted to (pseudo)
> properties and (slowly) do the (slow) right thing?
> I also suppose that the functions in strutils are not utf-8 aware, so
> what should we be using in its place?

For single character processing UTF32 (4 bytes) would be nice :). I
think functions to count UTF8 chars inside a string and to get each
char would be nice too, maybe even implemented in FPC for UTF8string,
such as Length(utf8string) or indexing utf8string[1] to return the
char, not the byte, as UTF32.

Since FPC uses ANSI strings a lot, and most text is Latin-1 without
any diacritics, using UTF8 in Lazarus is a good choice; if the right
functions are provided it can be a great choice, unless apps become too
slow.

Since the web mostly uses UTF8 to minimize transferred data, and most
databases use it for minimal storage size, it becomes clear that UTF8 is
a better choice if helper functions exist to assist with its
management.

Razvan
Re: [lazarus] UTF-8 vs UTF-16 support
Luca Olivetti wrote:

>> You have to go through the string for UTF-8 and UTF-16 encodings so
>> the advantages are at least questionable...
>
> Yes, but my (wrong) premise is that you could assume all characters are
> 2 bytes wide, so the Nth character would be at N*2 byte.

BTW, using strings as arrays of char to get at individual characters is
risky business with utf-8. Or will they be converted to (pseudo)
properties and (slowly) do the (slow) right thing?
I also suppose that the functions in strutils are not utf-8 aware, so
what should we be using in their place?

Bye
--
Luca Olivetti
Re: [lazarus] UTF-8 vs UTF-16 support
Marco Ciampa wrote:
> On Fri, Oct 05, 2007 at 01:14:23PM +0200, Luca Olivetti wrote:
>> [EMAIL PROTECTED] wrote:
>>>> * WideString allows indexed "[]" accessing individual chars.
>>> This does not seem to be correct. I read that utf16 can be 4 byte long..
>>> Then calculation is needed sometimes...
>>
>> Unless you're dealing with klingon and ancient languages,
>
> Like Chinese? Just a billion people use it...not a real problem at all... :-\

I (wrongly) thought that Chinese was in the BMP :-(

>> I think you can assume that for 99.99% of currently spoken languages every
>> character will be exactly 2 bytes long.
>
> Wrong as I said before.
>
>> There's a risk of having some character with more than 2 bytes but it is
>> a small risk.
>> With utf-8 the risk is bigger, so you have always to traverse
>> the string if you need access to a specific character index.
>
> You have to go through the string for UTF-8 and UTF-16 encodings so
> the advantages are at least questionable...

Yes, but my (wrong) premise is that you could assume all characters are
2 bytes wide, so the Nth character would be at N*2 byte.

Bye
--
Luca Olivetti
Re: [lazarus] UTF-8 vs UTF-16 support
On Fri, Oct 05, 2007 at 01:14:23PM +0200, Luca Olivetti wrote:
> [EMAIL PROTECTED] wrote:
>>> * WideString allows indexed "[]" accessing individual chars.
>> This does not seem to be correct. I read that utf16 can be 4 byte long..
>> Then calculation is needed sometimes...
>
> Unless you're dealing with klingon and ancient languages,

Like Chinese? Just a billion people use it...not a real problem at all... :-\

> I think you can assume that for 99.99% of currently spoken languages every
> character will be exactly 2 bytes long.

Wrong as I said before.

> There's a risk of having some character with more than 2 bytes but it is
> a small risk.
> With utf-8 the risk is bigger, so you have always to traverse
> the string if you need access to a specific character index.

You have to go through the string for UTF-8 and UTF-16 encodings so
the advantages are at least questionable...

ciao
--
Marco Ciampa
Re: [lazarus] UTF-8 vs UTF-16 support
Hi,

I was surfing Wikipedia and I found a good reason not to use
UCS-2. It seems to be prohibited to distribute software in mainland
China that only partially supports the Chinese characters (as is
the case for UCS-2).

Source: http://en.wikipedia.org/wiki/GB18030

"In a move of historic significance for software supporting Unicode,
the PRC decided to mandate support of certain code points outside the
BMP. This means that software can no longer get away with treating
characters as 16 bit fixed width entities (UCS-2). Therefore they must
either process the data in a variable width format (such as UTF-8 or
UTF-16), which are the most common choices, or move to a larger fixed
width format (such as UCS-4 or UTF-32). Microsoft made the change from
UCS-2 to UTF-16 with Windows 2000."

Of course, if you don't plan on distributing software in China, this
is irrelevant, but a general purpose library needs to take this into
account.

--
Felipe Monteiro de Carvalho
Re: [lazarus] UTF-8 vs UTF-16 support
On 10/5/07, Luca Olivetti <[EMAIL PROTECTED]> wrote:
> Unless you're dealing with klingon and ancient languages, I think you
> can assume that for 99.99% of currently spoken languages every character
> will be exactly 2 bytes long.

You are forgetting about Chinese. Some billion people speak it =) And
you can't represent all Chinese characters with UCS-2.

--
Felipe Monteiro de Carvalho
Re: [lazarus] UTF-8 vs UTF-16 support
On Fri, 05 Oct 2007 13:14:23 +0200
Luca Olivetti <[EMAIL PROTECTED]> wrote:

> [EMAIL PROTECTED] wrote:
> > > * WideString allows indexed "[]" accessing individual chars.
> >
> > This does not seem to be correct. I read that utf16 can be 4 byte
> > long.. Then calculation is needed sometimes...
>
> Unless you're dealing with klingon and ancient languages, I think you
> can assume that for 99.99% of currently spoken languages every
> character will be exactly 2 bytes long. There's a risk of having some
> character with more than 2 bytes but it is a small risk.
> With utf-8 the risk is bigger, so you have always to traverse the
> string if you need access to a specific character index.

True. Programmers must decide whether their programs can take the risk,
not the LCL. And who knows how Unicode will change in the future?

Mattias
Re: [lazarus] UTF-8 vs UTF-16 support
[EMAIL PROTECTED] wrote:
> > * WideString allows indexed "[]" accessing individual chars.
>
> This does not seem to be correct. I read that utf16 can be 4 byte long..
> Then calculation is needed sometimes...

Unless you're dealing with klingon and ancient languages, I think you
can assume that for 99.99% of currently spoken languages every character
will be exactly 2 bytes long. There's a risk of having some character
with more than 2 bytes but it is a small risk.
With utf-8 the risk is bigger, so you have always to traverse the
string if you need access to a specific character index.

--
Luca
Re: [lazarus] UTF-8 vs UTF-16 support
On Fri, 5 Oct 2007 10:45:18 +0200
ik <[EMAIL PROTECTED]> wrote:

> On 10/5/07, Mattias Gaertner <[EMAIL PROTECTED]> wrote:
> > On Fri, 05 Oct 2007 16:00:41 +0800
> > Paul Ishenin <[EMAIL PROTECTED]> wrote:
> >
> > > Graeme Geldenhuys wrote:
> > > > Does this mean UTF-8 was chosen only because it is more
> > > > compatible with existing pascal programs? Any other reasons?
> > >
> > > Does UTF-16 cover all languages? As far as I know it has problems
> > > with Chinese and/or Japanese languages, while UTF-8 does not have
> > > such problems. Moreover, most software uses English as default
> > > language. UTF-8 encoded English words are still the same as
> > > non-encoded English words.
> > >
> > > Btw, I don't know other advantages.
> >
> > UTF-8, UTF-16 and UTF-32 are just different encodings for the same
> > unicode characterset.
> >
> > UTF-16 is often confused with UCS-2, which is indeed only 2-byte
> > characters and has the widestring advantage (length=#words). But
> > for the price, that it does not support all characters. That's why
> > M$ switched from UCS-2 to UTF-16 keeping the W functions, which may
> > be one of the main reasons for the confusion.
>
> As far as I know the Unicode organization no longer supports UCS-2
> and recommends that any implementation of such encoding be used as
> UTF-16.
>
> Another issue is that on UTF-8 I think that some of the languages
> such as Korean and Japanese do not include all of the symbols they
> require, but I'm not sure.
>
> I believe that all the encodings should be supported, and be used
> according to the way that the developers of the software will decide
> rather than to "force" them in choosing a specific encoding.

For compatibility, complexity and usability reasons the LCL should use
only one encoding. For example TControl.Caption is a string on all
platforms. There will be no CaptionW or CaptionA or CaptionUTF32,
because this would be more confusing than it would help.
Of course FPC/Laz provides converter functions for those preferring
widestring or UTF-16 or UTF-32.
The LCL is a set of visual components, so the speed cost of converting
the strings is hardly measurable against the cost of drawing the Unicode
characters on the screen. OTOH it can matter if you often traverse a
tree with ten thousand nodes.

Looking at the Lazarus code, the LCL encoding of UTF-8 was a good
choice, because the multibyte encoding is only important in synedit and
the LCL interfaces. With UTF-16, additional conversions would be needed
for all text file operations including codetools, which would slow
things down a lot.

Mattias
Re: [lazarus] UTF-8 vs UTF-16 support
Graeme Geldenhuys wrote:

> Hi,
>
> I asked a similar question in the MSEgui newsgroup as well. What was
> the reason for choosing to support UTF-8 instead of UTF-16?
>
> - Quoted Mattias from 6 months ago --
> The LCL will support UTF-8 and provide some extra functions for
> UTF-16, because UTF-8 is more compatible to existing pascal programs
> --- END --
>
> Does this mean UTF-8 was chosen only because it is more compatible
> with existing pascal programs? Any other reasons?
>
> These are the pro points I received for using UTF-16 in MSEgui.
>
> * It is faster to work with UTF-16 (and so WideString) encoded text
>   compared to UTF-8.
> * Easier to implement.
> * WideString allows indexed "[]" accessing individual chars.
> * Has predictable "length()" value. (not sure what they meant here)
> * Most widget toolkits and libraries have WideString API's already.
>   (Win32, Xft, Xlib etc..)
>
> Regards,
>   - Graeme -

> * WideString allows indexed "[]" accessing individual chars.

This does not seem to be correct. I read that a UTF-16 character can be
4 bytes long, so a calculation is sometimes needed.

Marton Papp
Re: [lazarus] UTF-8 vs UTF-16 support
Paul Ishenin wrote:

> Graeme Geldenhuys wrote:
> > Does this mean UTF-8 was chosen only because it is more compatible
> > with existing pascal programs? Any other reasons?
>
> Does UTF-16 cover all languages? As far as I know it has problems with
> Chinese and/or Japanese languages, while UTF-8 does not have such
> problems. Moreover, most software uses English as default language.
> UTF-8 encoded English words are still the same as non-encoded English
> words.
>
> Btw, I don't know other advantages.
>
> Best regards,
> Paul Ishenin.

As far as I have read, it does, because each character is encoded as
either 2 bytes or 4 bytes.

Marton Papp
Re: [lazarus] UTF-8 vs UTF-16 support
On 10/5/07, Mattias Gaertner <[EMAIL PROTECTED]> wrote:
> On Fri, 05 Oct 2007 16:00:41 +0800
> Paul Ishenin <[EMAIL PROTECTED]> wrote:
>
> > Graeme Geldenhuys wrote:
> > > Does this mean UTF-8 was chosen only because it is more compatible
> > > with existing pascal programs? Any other reasons?
> >
> > Does UTF-16 cover all languages? As far as I know it has problems
> > with Chinese and/or Japanese languages, while UTF-8 does not have
> > such problems. Moreover, most software uses English as default
> > language. UTF-8 encoded English words are still the same as
> > non-encoded English words.
> >
> > Btw, I don't know other advantages.
>
> UTF-8, UTF-16 and UTF-32 are just different encodings for the same
> unicode characterset.
>
> UTF-16 is often confused with UCS-2, which is indeed only 2-byte
> characters and has the widestring advantage (length=#words). But
> for the price, that it does not support all characters. That's why M$
> switched from UCS-2 to UTF-16 keeping the W functions, which may be one
> of the main reasons for the confusion.

As far as I know the Unicode organization no longer supports UCS-2
and recommends that any implementation of such encoding be used as
UTF-16.

Another issue is that on UTF-8 I think that some of the languages
such as Korean and Japanese do not include all of the symbols they
require, but I'm not sure.

I believe that all the encodings should be supported, and be used
according to the way that the developers of the software will decide
rather than to "force" them in choosing a specific encoding.

> Mattias

Ido
--
http://ik.homelinux.org/
Re: [lazarus] UTF-8 vs UTF-16 support
On Fri, 05 Oct 2007 16:00:41 +0800
Paul Ishenin <[EMAIL PROTECTED]> wrote:

> Graeme Geldenhuys wrote:
> > Does this mean UTF-8 was chosen only because it is more compatible
> > with existing pascal programs? Any other reasons?
>
> Does UTF-16 cover all languages? As far as I know it has problems with
> Chinese and/or Japanese languages, while UTF-8 does not have such
> problems. Moreover, most software uses English as default language.
> UTF-8 encoded English words are still the same as non-encoded English
> words.
>
> Btw, I don't know other advantages.

UTF-8, UTF-16 and UTF-32 are just different encodings for the same
Unicode character set.

UTF-16 is often confused with UCS-2, which indeed has only 2-byte
characters and has the widestring advantage (length=#words), but at the
price that it does not support all characters. That's why M$ switched
from UCS-2 to UTF-16 while keeping the W functions, which may be one of
the main reasons for the confusion.

Mattias
Re: [lazarus] UTF-8 vs UTF-16 support
On Fri, 5 Oct 2007 09:36:59 +0200 (CEST)
Michael Van Canneyt <[EMAIL PROTECTED]> wrote:

> On Fri, 5 Oct 2007, Graeme Geldenhuys wrote:
>
> > Hi,
> >
> > I asked a similar question in the MSEgui newsgroup as well. What
> > was the reason for choosing to support UTF-8 instead of UTF-16?
> >
> > - Quoted Mattias from 6 months ago --
> > The LCL will support UTF-8 and provide some extra functions for
> > UTF-16, because UTF-8 is more compatible to existing pascal programs
> > --- END --
> >
> > Does this mean UTF-8 was chosen only because it is more compatible
> > with existing pascal programs? Any other reasons?
>
> It uses less memory.
>
> > These are the pro points I received for using UTF-16 in MSEgui.
> >
> > * It is faster to work with UTF-16 (and so WideString) encoded text
> >   compared to UTF-8.
> > * Easier to implement.
> > * WideString allows indexed "[]" accessing individual chars.
> > * Has predictable "length()" value. (not sure what they meant here)
>
> It means BufferSize = Length*Sizeof(Widechar).

This works only for 'most' languages, so this trick can only be used
for specific applications. An LCL interface should support the full
encoding, which means it must calculate the length of UTF-16.

> On UTF-8, you need to calculate it.

@Graeme: google for UTF-8 UTF-16 comparison to find lots of arguments
for both sides.

Mattias
Re: [lazarus] UTF-8 vs UTF-16 support
Graeme Geldenhuys wrote:
> Does this mean UTF-8 was chosen only because it is more compatible
> with existing pascal programs? Any other reasons?

Does UTF-16 cover all languages? As far as I know it has problems with
Chinese and/or Japanese languages, while UTF-8 does not have such
problems. Moreover, most software uses English as default language.
UTF-8 encoded English words are still the same as non-encoded English
words.

Btw, I don't know other advantages.

Best regards,
Paul Ishenin.
Re: [lazarus] UTF-8 vs UTF-16 support
Michael Van Canneyt wrote:
> On Fri, 5 Oct 2007, Graeme Geldenhuys wrote:
>
>> Hi,
>>
>> I asked a similar question in the MSEgui newsgroup as well. What was
>> the reason for choosing to support UTF-8 instead of UTF-16?
>>
>> - Quoted Mattias from 6 months ago --
>> The LCL will support UTF-8 and provide some extra functions for
>> UTF-16, because UTF-8 is more compatible to existing pascal programs
>> --- END --
>>
>> Does this mean UTF-8 was chosen only because it is more compatible
>> with existing pascal programs? Any other reasons?
>
> It uses less memory.
>
>> These are the pro points I received for using UTF-16 in MSEgui.
>>
>> * It is faster to work with UTF-16 (and so WideString) encoded text
>>   compared to UTF-8.
>> * Easier to implement.
>> * WideString allows indexed "[]" accessing individual chars.
>> * Has predictable "length()" value. (not sure what they meant here)
>
> It means BufferSize = Length*Sizeof(Widechar).
>
> On UTF-8, you need to calculate it.

I think they mean numofchar(widestring) = bytes allocated / 2. For a
UTF8 string you need to parse it to get the length. So
length(widestring) is an O(1) operation, while length(UTF8String) is an
O(n) operation.

Vincent
Re: [lazarus] UTF-8 vs UTF-16 support
On Fri, 5 Oct 2007 09:27:59 +0200
"Graeme Geldenhuys" <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I asked a similar question in the MSEgui newsgroup as well. What was
> the reason for choosing to support UTF-8 instead of UTF-16?
>
> - Quoted Mattias from 6 months ago --
> The LCL will support UTF-8 and provide some extra functions for
> UTF-16, because UTF-8 is more compatible to existing pascal programs
> --- END --
>
> Does this mean UTF-8 was chosen only because it is more compatible
> with existing pascal programs? Any other reasons?
>
> These are the pro points I received for using UTF-16 in MSEgui.
>
> * It is faster to work with UTF-16 (and so WideString) encoded text
>   compared to UTF-8.
> * Easier to implement.
> * WideString allows indexed "[]" accessing individual chars.
> * Has predictable "length()" value. (not sure what they meant here)

This all assumes UTF-16 has only 2-byte characters, but there are
4-byte characters too.
The above is true for UTF-32.

> * Most widget toolkits and libraries have WideString API's already.
>   (Win32, Xft, Xlib etc..)

And all platforms have functions for UTF-8.

The main reason is: UTF-8 is more compatible to existing pascal
programs, because they use 'string', not widestring.

Mattias
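To illustrate the point above that UTF-16 also has 4-byte characters, here is a small sketch that compares Length() of a WideString (16-bit code units) with the number of code points it holds. Utf16CodePointCount is a hypothetical helper, not an RTL function, and it assumes the string contains only properly paired surrogates.

```pascal
program Utf16Count;
{$mode objfpc}{$H+}

// Count code points in a UTF-16 string. A code point outside the BMP
// is stored as a high surrogate ($D800..$DBFF) followed by a low
// surrogate ($DC00..$DFFF); the low half is skipped so that each
// pair is counted once.
function Utf16CodePointCount(const W: WideString): Integer;
var
  I: Integer;
  U: Word;
begin
  Result := 0;
  for I := 1 to Length(W) do
  begin
    U := Ord(W[I]);
    if (U < $DC00) or (U > $DFFF) then
      Inc(Result);
  end;
end;

var
  W: WideString;
begin
  SetLength(W, 3);
  W[1] := WideChar(Ord('a'));
  W[2] := WideChar($D834);  // high surrogate of U+1D11E
  W[3] := WideChar($DD1E);  // low surrogate of U+1D11E
  WriteLn(Length(W));               // 3 code units (6 bytes)
  WriteLn(Utf16CodePointCount(W));  // 2 code points
end.
```

So W[2] on its own is half a character, and Length(W) equals the character count only as long as the text stays inside the BMP, which is the UCS-2 assumption discussed in this thread.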
Re: [lazarus] UTF-8 vs UTF-16 support
On Fri, 5 Oct 2007, Graeme Geldenhuys wrote:

> Hi,
>
> I asked a similar question in the MSEgui newsgroup as well. What was
> the reason for choosing to support UTF-8 instead of UTF-16?
>
> - Quoted Mattias from 6 months ago --
> The LCL will support UTF-8 and provide some extra functions for UTF-16,
> because UTF-8 is more compatible to existing pascal programs
> --- END --
>
> Does this mean UTF-8 was chosen only because it is more compatible
> with existing pascal programs? Any other reasons?

It uses less memory.

> These are the pro points I received for using UTF-16 in MSEgui.
>
> * It is faster to work with UTF-16 (and so WideString) encoded text
>   compared to UTF-8.
> * Easier to implement.
> * WideString allows indexed "[]" accessing individual chars.
> * Has predictable "length()" value. (not sure what they meant here)

It means BufferSize = Length*Sizeof(Widechar).

On UTF-8, you need to calculate it.

Michael.