Re: [Lazarus] UTF8 RTL for Windows
On 11/25/2014 09:39 PM, Hans-Peter Diettrich wrote:

> The Delphi model already broke that claimed type safety, by omitting conversions of RawByteString results, for speed optimization. That's dangerous, because the compiler can *only* check the static type of string variables, but not the dynamic encoding of their contents.
>
> This was clear to me just after exploring and understanding encoded strings in Delphi. In FPC/Lazarus we now have a *chance* for simplifications and improvements, when the new features are used in the *right* way.

In that regard I just posted a set of questions about the FPC Unicode support wiki page on the fpc-devel mailing list. Please continue this discussion there.

> But many arguments and opinions presented in this thread indicate to me a still incomplete understanding and many misunderstandings, which I am currently trying to spot.

See the new wiki page http://wiki.freepascal.org/not_Delphi_compatible_enhancement_for_Unicode_Support

-Michael

--
___
Lazarus mailing list
Lazarus@lists.lazarus.freepascal.org
http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus
Re: [Lazarus] UTF8 RTL for Windows
On 11/24/2014 10:15 PM, Hans-Peter Diettrich wrote:

> I'm missing documentation for working safely (and efficiently) with such irregular strings, most probably none of the FPC (and Delphi) developers ever noticed how users are left alone with this problem :-(

Hmm. On the fpc-devel and lazarus-devel lists and in the German Lazarus Forum I have been involved in lots of long-winded threads on this issue. So the developers do listen to the users!

Unfortunately the design of Delphi seems not to be really nice. AFAIK details of its behavior changed between the first versions that offered NewStrings. This suggests that the design goal was not well defined in the first release, and that the facts it established unfortunately had to be stuck to in order to avoid breaking user code (which happened to my colleagues nonetheless).

Unfortunately fpc seems to need to follow whatever Delphi does.

Michael (due to come back with a dedicated thread soon).

-Michael
Re: [Lazarus] UTF8 RTL for Windows
Mattias Gaertner wrote:

> On Mon, 24 Nov 2014 22:53:44 +0100 Hans-Peter Diettrich drdiettri...@aol.com wrote:
>
>> Graeme Geldenhuys wrote:
>>
>>> How are ThousandSeparator and DecimalSeparator supposed to work in TFormatSettings? If you switched the RTL to UTF-8 or UTF-16 a Russian thousand separator (4-byte non-breaking white space character) for example will not fit into a Char type.
>>
>> The Char type is quite useless with Unicode,
>
> Correction: *This* Char type needs to be extended.

Please specify.

> Char in general is very useful.
>
>> at least if it has less than 3 bytes (4 for UTF-8). There exist many more flaws in the RTL/LCL, assuming that a character always fits into a Char (like the Pos overload...).
>
> There is a Pos overload for strings. Where is the flaw in Pos?

The flaw is the added overload with a Char parameter. Furthermore the Pos arguments should never be subject to automatic conversion, otherwise the returned index will be useless.

>> In the best case Char could be retyped into a string (substring),
>
> That would be wrong in 99.9% of the cases.

Please give at least one example.

DoDi
Re: [Lazarus] UTF8 RTL for Windows
Mattias Gaertner wrote:

> On Mon, 24 Nov 2014 22:15:29 +0100 Hans-Peter Diettrich drdiettri...@aol.com wrote:
>
>> [...] The Delphi (and FPC) encoding model allows for strings of different static (declared) and dynamic (true content) encoding, see the special handling of RawByteString (Wiki). Therefore it's not a good idea to simply *assume* that a string variable contains bytes of the declared encoding. In detail one should check or force the right dynamic encoding of every string variable, before searching for specific bytes (chars) in it. I'm missing documentation for working safely (and efficiently) with such irregular strings, most probably none of the FPC (and Delphi) developers ever noticed how users are left alone with this problem :-(
>
> Maybe I don't understand the question, but it seems to me this is documented where static and dynamic code page and RawByteString are explained.

More concrete questions: How can a user be sure that a string parameter in a subroutine has the specified encoding? How to check, how to fix if needed?

> http://wiki.freepascal.org/FPC_Unicode_support#Ansistring
> When a procedure requires a specific encoding it uses a specific String type. If it works with CP_ACP it uses String. If it needs UTF8 it uses UTF8String.

Such specifications are meaningless when the string parameters can have a different dynamic encoding :-(

Unicode Delphi works well as long as only one codepage (CP_ACP) is used, in addition to Unicode (UTF-16) strings. As soon as multiple codepages can be involved at the same time, the dynamic string encodings become almost random (observed in Delphi XE). FPC now already has multiple built-in codepage variables (DefaultSystemCodePage...), with possibly different values, so that the observed Delphi mess is inevitable, as long as RawByteString results (of e.g. standard string handling functions) are *not* converted when assigned to a string variable of some specific static encoding.

Unfortunately I have not been able to test Lazarus trunk for a long time; I got no answer to my request for assistance. So I have to wait for the next installable download before I can give concrete examples.

DoDi
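One way to approach the "how to check, how to fix" question is to read the dynamic code page with StringCodePage and convert with SetCodePage where needed. The following is only a minimal sketch, assuming FPC 2.7.1 or later; the program and the routine name NeedsUtf8 are made up for illustration and do not come from the thread.

```pascal
program DynamicVsStatic;
{$mode objfpc}{$H+}

{$ifdef unix}
uses
  cwstring; // widestring manager, needed for code page conversions on Unix
{$endif}

// Hypothetical routine that expects UTF-8 content.
procedure NeedsUtf8(const s: UTF8String);
var
  fixed: RawByteString;
begin
  WriteLn('dynamic code page on entry: ', StringCodePage(s));
  // Check: compare the dynamic code page with the declared one.
  // Note: a string labelled CP_ACP (0) would also take this branch.
  if StringCodePage(s) <> CP_UTF8 then
  begin
    // Fix: work on a copy and convert its bytes to UTF-8.
    fixed := s;                         // RawByteString keeps the dynamic code page
    SetCodePage(fixed, CP_UTF8, True);  // True = convert the content
    WriteLn('dynamic code page after fix: ', StringCodePage(fixed));
  end;
end;

var
  u: UTF8String;
begin
  u := 'abc';
  NeedsUtf8(u);
  // Relabel without converting; the static type of u is still UTF8String,
  // so the compiler has no reason to insert a conversion at the call below.
  SetCodePage(RawByteString(u), 1252, False);
  NeedsUtf8(u);
end.
```

The second call is the situation described in the message above: the declared type says UTF-8, while the encoding field says something else, and only a run-time check can notice it.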
Re: [Lazarus] UTF8 RTL for Windows
On Tue, 25 Nov 2014 13:10:26 +0100 Hans-Peter Diettrich drdiettri...@aol.com wrote:

> [...]
>
>> Maybe I don't understand the question, but it seems to me this is documented where static and dynamic code page and RawByteString are explained.
>
> More concrete questions: How can a user be sure that a string parameter in a subroutine has the specified encoding? How to check, how to fix if needed?

As you know, in general you cannot find out the encoding of a text. You have to trust that the caller gave the right encoding. This was true before 2.7.1 and it still is. The new thing with 2.7.1 is that String now has an encoding field and that you can use this to let the compiler convert encodings automatically. For example the RTL uses this to convert between OS strings and program strings. This means some RTL functions don't need manual encoding conversions (e.g. UTF8ToAnsi) anymore. You can simply pass the string. Hopefully more and more RTL functions/variables will be converted. In short: Most of the time you code exactly like before.

If your code works with various encodings, then formerly you had to be very careful what you do with the strings. For example when you passed the strings to the RTL you had to convert them to the system codepage. Now you can use for instance UTF8String instead and omit the UTF8ToAnsi. It is like gaining some type safety.

And you can now use SetCodePage. But then you have to be very careful again.

>> http://wiki.freepascal.org/FPC_Unicode_support#Ansistring
>> When a procedure requires a specific encoding it uses a specific String type. If it works with CP_ACP it uses String. If it needs UTF8 it uses UTF8String.
>
> Such specifications are meaningless when the string parameters can have a different dynamic encoding :-(

Please read the paragraph "Dynamic code page" again. The example it describes is the most common case: the system code page. This is the same as in FPC 2.6.5 and below. A String coming from the OS has the system code page, which is dynamic. If you wanted a specific encoding you had to convert it.

With FPC 2.7.1 we have a new possibility. This is the new mode I was talking about. Now we get UTF-8 strings in many places in the RTL. Not all places yet. But we are working on it. And you can help.

> Unicode Delphi works well as long as only one codepage (CP_ACP) is used, in addition to Unicode (UTF-16) strings. As soon as multiple codepages can be involved at the same time, the dynamic string encodings become almost random (observed in Delphi XE). FPC now already has multiple built-in codepage variables (DefaultSystemCodePage...), with possibly different values, so that the observed Delphi mess is inevitable, as long as RawByteString results (of e.g. standard string handling functions) are *not* converted when assigned to a string variable of some specific static encoding.

Well, two weeks ago I was rolling my eyes when I read about this complex system and DefaultSystemCodePage. But then I tried to set it and now we can use one String encoding cross-platform and it works with file functions, TStringList and friends. Almost all of the UTF8ToSys calls are no longer needed and file functions now support full Unicode. We can write a Unicode program cross-platform using our normal strings and classes. And it is pretty compatible. So from the Lazarus point of view this is a great step forward. And last but not least: it is optional.

Of course if you have a product and you have to support all old modes and some of the new possibilities you will curse.

> Unfortunately I have not been able to test Lazarus trunk for a long time; I got no answer to my request for assistance.

?

Mattias
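The automatic conversion described above can be observed with strings whose declared code pages differ. This is a minimal sketch, assuming FPC 2.7.1 or later; the type name TCp1252String and the routine ShowUtf8 are hypothetical names chosen for the example.

```pascal
program AutoConvert;
{$mode objfpc}{$H+}
uses
  {$ifdef unix}cwstring,{$endif} // widestring manager, needed for conversions on Unix
  SysUtils;

type
  // Hypothetical AnsiString type with a fixed declared code page.
  TCp1252String = type AnsiString(1252);

// Prints the dynamic code page and the raw bytes of its argument.
procedure ShowUtf8(const s: UTF8String);
var
  i: Integer;
begin
  Write('cp=', StringCodePage(s), ' bytes=');
  for i := 1 to Length(s) do
    Write(IntToHex(Ord(s[i]), 2), ' ');
  WriteLn;
end;

var
  w: UnicodeString;
  a: TCp1252String;
begin
  w := WideChar($00E9); // e with acute accent, built from its code point
  a := w;               // UTF-16 to Windows-1252, conversion inserted by the compiler
  ShowUtf8(a);          // Windows-1252 to UTF-8, conversion inserted by the compiler
end.
```

No UTF8ToAnsi-style call appears in the program; the conversions follow from the declared types alone, which is also why they cannot help when the encoding field disagrees with the declared type.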
Re: [Lazarus] UTF8 RTL for Windows
On Tue, 25 Nov 2014 11:53:00 +0100 Hans-Peter Diettrich drdiettri...@aol.com wrote:

> [...]
>
>> Correction: *This* Char type needs to be extended.
>
> Please specify.

The ThousandSeparator type is Char, which does not work with Russian in UTF-8. Well, at least if you want the non-breaking space instead of the normal space. There are many cases where Char is enough.

> [...]
>
>> Char in general is very useful.
>
>>> ... at least if it has less than 3 bytes (4 for UTF-8). There exist many more flaws in the RTL/LCL, assuming that a character always fits into a Char (like the Pos overload...).
>
>> There is a Pos overload for strings. Where is the flaw in Pos?
>
> The flaw is the added overload with a Char parameter.

I use that a lot. It is faster than the string variant. Why is that a flaw?

> Furthermore the Pos arguments should never be subject to automatic conversion, otherwise the returned index will be useless.

You can argue the same way in the other direction: If it does not automatically convert it will find crap.

>>> In the best case Char could be retyped into a string (substring),
>
>> That would be wrong in 99.9% of the cases.
>
> Please give at least one example.

Retype Char to String and the compiler will bark. For example in Graphics.

Mattias
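The point about Pos indexes can be made concrete with a UTF-8 string that contains one multi-byte character. This is a minimal sketch under the assumption that the string holds raw UTF-8 bytes; the bytes are written at run time so the result does not depend on the source file encoding or on {$codepage}.

```pascal
program PosIndex;
{$mode objfpc}{$H+}

var
  s: RawByteString;
begin
  // Two characters in UTF-8: a-tilde ($C3 $A3) followed by 'x'.
  SetLength(s, 3);
  s[1] := #$C3;
  s[2] := #$A3;
  s[3] := 'x';

  // Searching for an ASCII Char is safe in UTF-8, because ASCII bytes never
  // occur inside a multi-byte sequence. The result is a byte index, though:
  // 'x' is the 2nd character but the 3rd byte.
  WriteLn('Length(s)   = ', Length(s));
  WriteLn('Pos(''x'', s) = ', Pos('x', s));
end.
```

If either argument were silently converted to another encoding before the search, the returned index would refer to the converted copy and not to the bytes of s.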
Re: [Lazarus] UTF8 RTL for Windows
On Tue, Nov 25, 2014 at 2:45 PM, Mattias Gaertner nc-gaert...@netcologne.de wrote:

> Retype Char to String and the compiler will bark. For example in Graphics.

What about changing to WideChar then?

--
Felipe Monteiro de Carvalho
Re: [Lazarus] UTF8 RTL for Windows
On Tue, 25 Nov 2014 14:49:52 +0100 Felipe Monteiro de Carvalho felipemonteiro.carva...@gmail.com wrote:

> On Tue, Nov 25, 2014 at 2:45 PM, Mattias Gaertner nc-gaert...@netcologne.de wrote:
>
>> Retype Char to String and the compiler will bark. For example in Graphics.
>
> What about changing to WideChar then?

If you mean unit Graphics: It checks for ASCII characters. So a change to WideChar would add implicit conversions without any gain.

In case of ThousandSeparator: That would probably be sufficient. Although some code needs to be adapted.

Mattias
Re: [Lazarus] UTF8 RTL for Windows
On Tue, Nov 25, 2014 at 3:14 PM, Mattias Gaertner nc-gaert...@netcologne.de wrote:

>> What about changing to WideChar then?
>
> If you mean unit Graphics: It checks for ASCII characters. So a change to WideChar would add implicit conversions without any gain. In case of ThousandSeparator: That would probably be sufficient. Although some code needs to be adapted.

I meant for ThousandSeparator.

--
Felipe Monteiro de Carvalho
Re: [Lazarus] UTF8 RTL for Windows
2014-11-25 14:45 GMT+01:00 Mattias Gaertner nc-gaert...@netcologne.de:

> On Tue, 25 Nov 2014 11:53:00 +0100 Hans-Peter Diettrich drdiettri...@aol.com wrote:
>
>> [...]
>>
>>> Correction: *This* Char type needs to be extended.
>>
>> Please specify.
>
> The ThousandSeparator type is Char, which does not work with Russian in UTF-8. Well, at least if you want the non-breaking space instead of the normal space.

French uses the non-breaking space too. According to several sources, the correct character should actually be the narrow no-break space https://en.wikipedia.org/wiki/Thin_space.

--
Frederic Da Vitoria (davitof)
Member of April - "promoting and defending free software" - http://www.april.org
Re: [Lazarus] UTF8 RTL for Windows
Mattias Gaertner wrote:

> On Tue, 25 Nov 2014 13:10:26 +0100 Hans-Peter Diettrich drdiettri...@aol.com wrote:
>
>> [...]
>>
>>> Maybe I don't understand the question, but it seems to me this is documented where static and dynamic code page and RawByteString are explained.
>>
>> More concrete questions: How can a user be sure that a string parameter in a subroutine has the specified encoding? How to check, how to fix if needed?
>
> As you know, in general you cannot find out the encoding of a text. You have to trust that the caller gave the right encoding. This was true before 2.7.1 and it still is. The new thing with 2.7.1 is that String now has an encoding field and that you can use this to let the compiler convert encodings automatically. For example the RTL uses this to convert between OS strings and program strings. This means some RTL functions don't need manual encoding conversions (e.g. UTF8ToAnsi) anymore. You can simply pass the string. Hopefully more and more RTL functions/variables will be converted. In short: Most of the time you code exactly like before.

Fully agreed, so far :-]

> If your code works with various encodings, then formerly you had to be very careful what you do with the strings. For example when you passed the strings to the RTL you had to convert them to the system codepage. Now you can use for instance UTF8String instead and omit the UTF8ToAnsi. It is like gaining some type safety.

The Delphi model already broke that claimed type safety, by omitting conversions of RawByteString results, for speed optimization. That's dangerous, because the compiler can *only* check the static type of string variables, but not the dynamic encoding of their contents.

> And you can now use SetCodePage. But then you have to be very careful again.

SetCodePage is safe, as long as it enforces a corresponding conversion of the dynamic string encoding. The option of only changing the encoding field is reserved for adjustments after reading strings from external sources, or from Char, Char arrays/pointers or ShortString, where the correct codepage is unknown to the compiler and library routines.

>>> http://wiki.freepascal.org/FPC_Unicode_support#Ansistring
>>> When a procedure requires a specific encoding it uses a specific String type. If it works with CP_ACP it uses String. If it needs UTF8 it uses UTF8String.
>>
>> Such specifications are meaningless when the string parameters can have a different dynamic encoding :-(
>
> Please read the paragraph "Dynamic code page" again.

Please read my statement again, you still miss my point.

> With FPC 2.7.1 we have a new possibility. This is the new mode I was talking about. Now we get UTF-8 strings in many places in the RTL. Not all places yet. But we are working on it. And you can help.

I'm trying to help all the time, but if you don't understand my arguments, I cannot help you :-(

I explored the encoded AnsiStrings in Delphi XE years ago, and identified a couple of problems with the Delphi implementation. I can help by explaining these problems, and how to avoid or reduce them in FPC/Lazarus. But the corresponding fixes to legacy code must be applied by the maintainers of that code, who know the *right* way (intended behaviour) to fix every single problem.

> Well, two weeks ago I was rolling my eyes when I read about this complex system and DefaultSystemCodePage. But then I tried to set it and now we can use one String encoding cross-platform and it works with file functions, TStringList and friends. Almost all of the UTF8ToSys calls are no longer needed and file functions now support full Unicode.

This was clear to me just after exploring and understanding encoded strings in Delphi. In FPC/Lazarus we now have a *chance* for simplifications and improvements, when the new features are used in the *right* way.

But many arguments and opinions presented in this thread indicate to me a still incomplete understanding and many misunderstandings, which I am currently trying to spot.

DoDi
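The two uses of SetCodePage distinguished above (convert the content, or only change the encoding field) can be compared at the byte level. This is a minimal sketch, assuming FPC 2.7.1 or later; the program and the helper Dump are made up for illustration.

```pascal
program SetCodePageModes;
{$mode objfpc}{$H+}
uses
  {$ifdef unix}cwstring,{$endif} // widestring manager, needed for conversions on Unix
  SysUtils;

// Hypothetical helper: prints the dynamic code page and the raw bytes.
procedure Dump(const Title: string; const s: RawByteString);
var
  i: Integer;
begin
  Write(Title, ': cp=', StringCodePage(s), ' bytes=');
  for i := 1 to Length(s) do
    Write(IntToHex(Ord(s[i]), 2), ' ');
  WriteLn;
end;

var
  a, b: RawByteString;
begin
  // One byte read from an "external source": $E9 is e-acute in Windows-1252.
  SetLength(a, 1);
  a[1] := #$E9;

  // Legitimate relabel: the bytes are known to be Windows-1252, so only the
  // encoding field is set (Convert = False).
  SetCodePage(a, 1252, False);
  b := a;
  Dump('labelled 1252 ', a);

  // Convert = True rewrites the bytes to match the new code page.
  SetCodePage(a, CP_UTF8, True);
  Dump('converted     ', a);

  // Convert = False on data that is NOT UTF-8: the label now claims UTF-8,
  // but the single byte $E9 is not a valid UTF-8 sequence.
  SetCodePage(b, CP_UTF8, False);
  Dump('relabelled    ', b);
end.
```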
Re: [Lazarus] UTF8 RTL for Windows
Mattias Gaertner wrote:

> On Tue, 25 Nov 2014 14:49:52 +0100 Felipe Monteiro de Carvalho felipemonteiro.carva...@gmail.com wrote:
>
>> On Tue, Nov 25, 2014 at 2:45 PM, Mattias Gaertner nc-gaert...@netcologne.de wrote:
>>
>>> Retype Char to String and the compiler will bark. For example in Graphics.
>>
>> What about changing to WideChar then?
>
> If you mean unit Graphics: It checks for ASCII characters. So a change to WideChar would add implicit conversions without any gain.

You see that Unicode handling requires more than only changing declarations? [Where changing Char to Byte in Graphics might be sufficient, as long as such bytes are not kept in Strings]

> In case of ThousandSeparator: That would probably be sufficient. Although some code needs to be adapted.

Then you should also see that certain means should at least *allow* one to *identify* code that is not sufficiently Unicode-aware. This would not only allow the FPC/Lazarus developers to identify flaws in the standard libraries, but users will also appreciate having flaws spotted in their legacy code :-)

DoDi
Re: [Lazarus] UTF8 RTL for Windows
On 11/23/2014 07:52 PM, Felipe Monteiro de Carvalho wrote:

> Well, the first reports of how the unicode rtl would look were pretty scary: Total break of the string part of millions of lines of code that people have written with Lazarus over the years.

That is why I stopped recommending Lazarus to my colleagues who are doing Delphi. They took a huge amount of pain to convert their software from Delphi one-byte strings to Delphi two-byte strings. Hence they will not be pleased to be forced to convert back to one-byte strings to be able to use Lazarus, and some time later to convert to two-byte strings again once Lazarus might be forced to finally follow Delphi in that regard.

-Michael
Re: [Lazarus] UTF8 RTL for Windows
On 11/22/2014 05:18 PM, Hans-Peter Diettrich wrote:

> Does this mean that Lazarus (new mode) ignores the OS system codepage setting?

IMHO that would be just GREAT for doing portable software. The RTL and LCL interface should be OS-ignorant for portability.

In user code, the user should be allowed to use the string encoding (and byte count per character) he finds the most convenient for his application. OTOH this of course does provide a decent set of problems including but not limited to unnecessary conversions in certain cases.

-Michael
Re: [Lazarus] UTF8 RTL for Windows
2014-11-24 6:29 GMT-03:00 Michael Schnell mschn...@lumino.de:

> On 11/23/2014 07:52 PM, Felipe Monteiro de Carvalho wrote:
>
>> Well, the first reports of how the unicode rtl would look were pretty scary: Total break of the string part of millions of lines of code that people have written with Lazarus over the years.
>
> That is why I stopped recommending Lazarus to my colleagues who are doing Delphi. They took a huge amount of pain to convert their software from Delphi one-byte strings to Delphi two-byte strings. Hence they will not be pleased to be forced to convert back to one-byte strings to be able to use Lazarus, and some time later to convert to two-byte strings again once Lazarus might be forced to finally follow Delphi in that regard.

If the program does not explicitly assume a specific encoding, i.e. it uses only the String type and does no low-level string handling, there will be no need to change.

I did/do convert a lot of Delphi components and can assure you that most will not need changes as things are today.

Luiz
Re: [Lazarus] UTF8 RTL for Windows
On 11/24/2014 11:44 AM, luiz americo pereira camara wrote:

> If the program does not explicitly assume a specific encoding, i.e. it uses only the String type and does no low-level string handling, there will be no need to change.

I don't know the internals of the program(s). It's a huge system and does anything that somehow might be possible :-) .

-Michael
Re: [Lazarus] UTF8 RTL for Windows
On Mon, Nov 24, 2014 at 11:33 AM, Michael Schnell mschn...@lumino.de wrote:

> IMHO that would be just GREAT for doing portable software. The RTL and LCL interface should be OS-ignorant for portability. In user code, the user should be allowed to use the string encoding (and byte count per character) he finds the most convenient for his application. OTOH this of course does provide a decent set of problems including but not limited to unnecessary conversions in certain cases.

See the request from Mattias: Please test and tell what you find out.

Michael Schnell and others, let's keep this thread on a more concrete level. You can start another philosophical thread about how strings should be in a perfect world.

Juha
Re: [Lazarus] UTF8 RTL for Windows
On Sun, 23 Nov 2014 21:37:56 -0300 luiz americo pereira camara luiz...@oi.com.br wrote:

> 2014-11-20 13:21 GMT-03:00 Mattias Gaertner nc-gaert...@netcologne.de:
>
>> [...]

First of all: Thanks for testing.

> Without the {$codepage utf8} directive, String constants will get Code Page 0 (CP_ACP) and not 1200 (UTF16 - UnicodeString).

Beware: There are different types of string constants.

> String variables assigned to those constants will also have Code Page = 0.
> This is because the constant string code page is evaluated at compile time.
> Not sure if there's a compiler command line param with the same effect as {$codepage utf8}.
> The attached program shows how data loss can occur.

The program uses writeln, which converts to the console CP. When you save the strings to a file you can see what they contain. Or write the byte values.

This works with or without {$codepage utf8}:

  S := 'João'; // constant to (Ansi or Short)string
  W:=S; SUTF8:=S;
  const c: string = 'João'; W:=c; // constant to Wide/Unicode/UTF8String

This requires {$codepage utf8} or -Fcutf8:

  W := 'João'; // constant to Wide/Unicode/UTF8string
  const c = 'João'; W:=c;

I guess it would be a good idea to pass -Fcutf8 with FPC 2.7.1. For both modes.

Mattias
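The difference between the two groups of assignments above can be inspected by writing byte values instead of relying on writeln's console conversion. This is a minimal sketch, assuming FPC 2.7.1 or later and a source file saved as UTF-8 without a BOM; the program and the helper DumpBytes are made up for illustration, and the output is deliberately not stated because it depends on {$codepage}, -Fc and DefaultSystemCodePage.

```pascal
program ConstCodePage;
{$mode objfpc}{$H+}
// Compile once as is, and once with the next line enabled (or with -Fcutf8),
// then compare the output of both runs.
// {$codepage utf8}
uses
  {$ifdef unix}cwstring,{$endif} // widestring manager, needed for conversions on Unix
  SysUtils;

// Hypothetical helper: prints the dynamic code page and the raw bytes.
procedure DumpBytes(const Name: string; const s: RawByteString);
var
  i: Integer;
begin
  Write(Name, ': cp=', StringCodePage(s), ' bytes=');
  for i := 1 to Length(s) do
    Write(IntToHex(Ord(s[i]), 2), ' ');
  WriteLn;
end;

var
  S: AnsiString;
  U8: UTF8String;
  W: UnicodeString;
  i: Integer;
begin
  S := 'João';   // constant to AnsiString
  U8 := 'João';  // constant to UTF8String
  W := 'João';   // constant to UnicodeString

  DumpBytes('S ', S);
  DumpBytes('U8', U8);

  // UTF-16 code units of W; a correct conversion yields 4 units.
  Write('W : units=');
  for i := 1 to Length(W) do
    Write(IntToHex(Ord(W[i]), 4), ' ');
  WriteLn;
end.
```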
Re: [Lazarus] UTF8 RTL for Windows
On Mon, 24 Nov 2014 12:15:03 +0100 Mattias Gaertner nc-gaert...@netcologne.de wrote:

> [...] I guess it would be a good idea to pass -Fcutf8 with FPC 2.7.1. For both modes.

On second thought: only for the new mode. Passing it in the old mode will make the wide/unicode/utf8string work, but the Ansi/Shortstring will be wrong.

We need a table in the wiki: FPC 2.6.5 and below, FPC 2.7.1+, and FPC 2.7.1+ with UTF8 as default CP. And with or without {$codepage utf8}.

Mattias
Re: [Lazarus] UTF8 RTL for Windows
On 11/24/2014 12:01 PM, Juha Manninen wrote:

> See the request from Mattias: Please test and tell what you find out.

I have not enough knowledge to be able to patch the compiler :-(

> let's keep this thread on a more concrete level.

Agreed (even if I don't think that will lead to anything fairly portable). As requested by Michael vC, I will do a Wiki page tomorrow and start a new thread based on this.

-Michael
Re: [Lazarus] UTF8 RTL for Windows
On Mon, 24 Nov 2014 13:12:04 +0100 Michael Schnell mschn...@lumino.de wrote:

> On 11/24/2014 12:01 PM, Juha Manninen wrote:
>
>> See the request from Mattias: Please test and tell what you find out.
>
> I have not enough knowledge to be able to patch the compiler :-(

I asked for testing compiling with -dEnableUTF8RTL. Don't hijack threads.

Mattias
Re: [Lazarus] UTF8 RTL for Windows
On Sun, 23 Nov 2014 18:27:12 -0300 luiz americo pereira camara luiz...@oi.com.br wrote:

> 2014-11-20 13:21 GMT-03:00 Mattias Gaertner nc-gaert...@netcologne.de:
>
>> [...] Please test and tell what you find out.
>
> The FormatSettings fields are still encoded with the System Code Page regardless of the DefaultSystemCodePage value. While for English locales there's no problem, other locales like PT-BR have accented names in days and months.
>
> The problem is in the Windows SysUtils.GetLocaleStr function, which uses a non-Unicode Win API function. This problem will also affect the UnicodeString RTL.
>
> Attached is a test app that shows the issue. It also has a version of GetLocaleStr that fixes the issue for the RTL (both versions).

Thanks. It works here too. I reported it: http://bugs.freepascal.org/view.php?id=27086

Mattias
Re: [Lazarus] UTF8 RTL for Windows
Michael Schnell wrote:

> On 11/23/2014 07:52 PM, Felipe Monteiro de Carvalho wrote:
>
>> Well, the first reports of how the unicode rtl would look were pretty scary: Total break of the string part of millions of lines of code that people have written with Lazarus over the years.
>
> That is why I stopped recommending Lazarus to my colleagues who are doing Delphi. They took a huge amount of pain to convert their software from Delphi one-byte strings to Delphi two-byte strings.

I had similar problems, but only in porting a huge codebase from ShortString to AnsiString. The move from D5 to XE was painless then; only the uses lists needed some updates. In that respect it might be a good idea to educate some old-school Delphi coders on how to deal with managed strings and other post-BP items in general.

As for Lazarus, I think that UTF-8 is the best choice for multi-platform projects, with almost no extra conversions required on any platform. Please note that until now Windows did the Ansi to UTF conversions itself, in every API call with strings involved. If this was not noticed before, the conversions won't be noticeable afterwards either.

A move to UTF-16 instead will only favor Windows, while additional string conversions will be required on almost every other platform.

I think that FPC/Lazarus should fork and support separate libraries (RTL...) for UTF-8 and UTF-16 strings, if compatibility with newer Delphi VCL projects is desired. Full Delphi compatibility would also require a FireMonkey replacement for the LCL, and that would be another entirely new project, extending the UTF-16 branch (only).

Just my 0.02€

DoDi
Re: [Lazarus] UTF8 RTL for Windows
On 11/24/2014 02:19 PM, Hans-Peter Diettrich wrote:

> A move to UTF-16 instead will only favor Windows,

Regarding the RTL interface, you of course are right.

Doing the user software with UTF-16 instead of UTF-8 strings in many cases (but of course not perfectly) allows for keeping old-style 1-byte ANSI code using s[n], and manually using the result of Pos().

-Michael
Re: [Lazarus] UTF8 RTL for Windows
On 24.11.2014 14:55, Hans-Peter Diettrich drdiettri...@aol.com wrote:

> Please note that until now Windows did the Ansi to UTF conversions itself, in every API call with strings involved. If this was not noticed before, the conversions won't be noticeable afterwards either.

This is something that one definitely shouldn't forget! Up to now Windows did the conversion for us, and do we see people complaining about the conversion during API calls? No, we don't...

Regards,
Sven
Re: [Lazarus] UTF8 RTL for Windows
On 11/24/2014 02:50 PM, Hans-Peter Diettrich wrote:

>> code, the user should be allowed to use the string encoding (and byte count per character) he finds the most convenient for his application.
>
> I'm not sure what exactly you mean here.

Here I meant that for a *new project* the user might be willing to choose e.g. either UTF-16 (sometimes easier to use) or UTF-8 (sometimes faster and less memory overhead) for his own code, while the RTL might be done specifically in favor of the OS.

-Michael
Re: [Lazarus] UTF8 RTL for Windows
Please don't start a UTF war again. This has been discussed at length and a zillion times.

Mattias
Re: [Lazarus] UTF8 RTL for Windows
On 2014-11-24 10:52, Michael Schnell wrote:

> I don't know the internals of the program(s). It's a huge system and does anything that somehow might be possible :-) .

Luckily you have everything unit tested, right? So it would simply be a case of running the test suite to see what works and what doesn't. ;-)

Regards,
- Graeme -

--
fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal
http://fpgui.sourceforge.net/
Re: [Lazarus] UTF8 RTL for Windows
2014-11-24 8:15 GMT-03:00 Mattias Gaertner nc-gaert...@netcologne.de:

> On Sun, 23 Nov 2014 21:37:56 -0300 luiz americo pereira camara luiz...@oi.com.br wrote:
>
>> The attached program shows how data loss can occur.
>
> The program uses writeln, which converts to the console CP. When you save the strings to a file you can see what they contain. Or write the byte values.

Yes. I improved the program (see the message that followed) to write the byte values, so the comparison should be more exact.

> This works with or without {$codepage utf8}:
>
>   S := 'João'; // constant to (Ansi or Short)string

Without {$codepage utf8}: when DefaultSystemCodePage is CP_ACP, the variable S will have UTF8 content but the encoding will be ACP (in my case 1252), just like it is today. With DefaultSystemCodePage set to CP_UTF8, both content and code page will match.

> [..] I guess it would be a good idea to pass -Fcutf8 with FPC 2.7.1. For both modes.

Probably yes. There's one case that must be tested: when the file is encoded in ANSI, like those shared with Delphi. As I understand it, with -Fcutf8 the compiler will interpret that content as UTF8, creating wrongly encoded constants.

Does the $codepage directive override -Fcutf8? If so, to fix this the developer could use $codepage with the correct file encoding.

Luiz
Re: [Lazarus] UTF8 RTL for Windows
On Mon, 24 Nov 2014 12:45:54 -0300 luiz americo pereira camara luiz...@oi.com.br wrote:
> 2014-11-24 8:15 GMT-03:00 Mattias Gaertner nc-gaert...@netcologne.de:
>> [...]
>> This works with or without {$codepage utf8}:
>> S := 'João'; // constant to (Ansi or Short)string
>
> Without {$codepage utf8}: when DefaultSystemCodePage is CP_ACP, the variable S will have UTF-8 content but its encoding will be ACP (1252 in my case), just as it is today. With DefaultSystemCodePage as CP_UTF8, both content and code page will match.

Yes, but CP_ACP is treated as CP_UTF8. So it does not matter.

>> [..]
>> I guess it would be a good idea to pass -Fcutf8 with FPC 2.7.1. For both modes.
>
> Probably yes. There's one case that must be tested: when the file is encoded in ANSI, like those shared with Delphi. As I understand it, with -Fcutf8 the compiler will interpret that content as UTF-8, creating wrongly encoded constants.

Yes.

> Does the $codepage directive override -Fcutf8?

Yes.

> If so, the developer could fix this by using $codepage with the correct file encoding.

Yes.

Mattias
Re: [Lazarus] UTF8 RTL for Windows
On 2014-11-22 16:38, Michael Van Canneyt wrote:
> The exact behaviour of the RTL is controlled by a couple of variables: DefaultSystemCodePage, DefaultFileSystemCodePage, DefaultRTLFileSystemCodePage.

I've read the updated wiki page, but I'm still confused about something...

  TFormatSettings = record
    CurrencyFormat: Byte;
    NegCurrFormat: Byte;
    ThousandSeparator: Char;
    DecimalSeparator: Char;
    ...snip...

How are ThousandSeparator and DecimalSeparator supposed to work in TFormatSettings? If you switched the RTL to UTF-8 or UTF-16, a Russian thousand separator (4-byte non-breaking white space character), for example, will not fit into a Char type.

I haven't read this whole thread yet, and haven't played with the latest FPC 2.7.1 yet, so maybe I'm just missing some key information for now. Or is TFormatSettings just something that hasn't yet been converted to be Unicode friendly?

Regards,
- Graeme -
--
fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal
http://fpgui.sourceforge.net/
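The size mismatch raised in this message can be checked with a few lines. The following is only a minimal sketch, not code from the thread: it assumes the separator in question is U+00A0 NO-BREAK SPACE, whose UTF-8 encoding is the two bytes C2 A0, and it builds the string byte by byte so the result does not depend on how the source file is saved. The program and variable names are made up for illustration.

```pascal
program SeparatorBytes;
{$mode objfpc}{$H+}

var
  Sep: RawByteString;
begin
  // Assumption: the separator is U+00A0 NO-BREAK SPACE.
  // Its UTF-8 encoding is the byte pair C2 A0.
  SetLength(Sep, 2);
  Sep[1] := #$C2;
  Sep[2] := #$A0;
  // Label the bytes as UTF-8 without converting them.
  SetCodePage(Sep, CP_UTF8, False);

  WriteLn('SizeOf(Char):       ', SizeOf(Char));
  WriteLn('Bytes in separator: ', Length(Sep));
  WriteLn('Fits in one Char:   ', Length(Sep) <= SizeOf(Char));
end.
```

In a mode where Char is a one-byte AnsiChar, indexing such a string with Sep[1] yields only the lead byte, which is the situation the question about ThousandSeparator describes.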
Re: [Lazarus] UTF8 RTL for Windows
On Mon, 24 Nov 2014 16:25:15 + Graeme Geldenhuys mailingli...@geldenhuys.co.uk wrote: [...] Or is TFormatSettings just something that hasn't yet been converted to be Unicode friendly? It has not yet been converted. We can help the FPC team by collecting all places. Mattias -- ___ Lazarus mailing list Lazarus@lists.lazarus.freepascal.org http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus
Re: [Lazarus] UTF8 RTL for Windows
On 2014-11-24 16:36, Mattias Gaertner wrote: It has not yet been converted. Many thanks for confirming that. We can help the FPC team by collecting all places. Where should we report this? Mantis or Unicode page of the Wiki? Regards, - Graeme - -- fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal http://fpgui.sourceforge.net/ -- ___ Lazarus mailing list Lazarus@lists.lazarus.freepascal.org http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus
Re: [Lazarus] UTF8 RTL for Windows
luiz americo pereira camara schrieb: When DefaultSystemCodePage is CP_ACP the variable S will have the content of UTF8 but the encoding will be ACP (in my case 1252), just like is today. With DefaultSystemCodePage as CP_UTF8 both content and code page will match The Delphi (and FPC) encoding model allows for strings of different static (declared) and dynamic (true content) encoding, see the special handling of RawByteString (Wiki). So far it's not a good idea to simply *assume* that a string variable contains bytes of the declared encoding. In detail one should check or force the right dynamic encoding of every string variable, before searching for specific bytes (chars) in it. I'm missing documentation for working safely (and efficiently) with such irregular strings, most probably none of the FPC (and Delphi) developers ever noticed how users are left alone with this problem :-( DoDi -- ___ Lazarus mailing list Lazarus@lists.lazarus.freepascal.org http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus
Re: [Lazarus] UTF8 RTL for Windows
Graeme Geldenhuys wrote:
> How is ThousandSeparator and DecimalSeparator supposed to work it TFormatSettings? If you switched the RTL to UTF-8 or UTF-16 a Russian thousand separator (4-byte non-breaking white space character) for example will not fit into a Char type.

The Char type is quite useless with Unicode, at least if it has fewer than 3 bytes (4 for UTF-8). There are many more flaws in the RTL/LCL that assume a character always fits into a Char (like the Pos overload...).

In the best case Char could be retyped into a string (substring), so that it can hold any Unicode character *and* its encoding. Unicode string handling in general should always use substrings, for the same reasons. Until then, 99.9% of occurrences of Char in UTF-8 aware library or application code can be considered bugs :-(

The FPC team can sort out the real low-level code (most probably only the string conversion routines); the rest will become Delphi incompatible when fixed.

DoDi
Re: [Lazarus] UTF8 RTL for Windows
On Mon, 24 Nov 2014 22:15:29 +0100 Hans-Peter Diettrich drdiettri...@aol.com wrote: [...] The Delphi (and FPC) encoding model allows for strings of different static (declared) and dynamic (true content) encoding, see the special handling of RawByteString (Wiki). So far it's not a good idea to simply *assume* that a string variable contains bytes of the declared encoding. In detail one should check or force the right dynamic encoding of every string variable, before searching for specific bytes (chars) in it. I'm missing documentation for working safely (and efficiently) with such irregular strings, most probably none of the FPC (and Delphi) developers ever noticed how users are left alone with this problem :-( Maybe I don't understand the question, but it seems to me this is documented where static-, dynamic cp and rawbytestring are explained. http://wiki.freepascal.org/FPC_Unicode_support#Ansistring When a procedure requires a specific encoding it uses a specific String type. If it works with CP_ACP it uses String. If it needs UTF8 it uses UTF8String. If it can work with any 8-bit encoding it uses RawByteString. If you need it even more detailed use the StringCodePage function. What else do you need? Mattias -- ___ Lazarus mailing list Lazarus@lists.lazarus.freepascal.org http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus
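The rule of thumb in this reply (use a specific string type when a specific encoding is required, RawByteString when any 8-bit encoding will do, and StringCodePage to inspect the details) can be tried out as follows. This is a hedged sketch rather than code from the thread; the procedure name Describe and the variable names are hypothetical, and the values printed depend on the compiler version and settings.

```pascal
program CodePageCheck;
{$mode objfpc}{$H+}

// RawByteString accepts any single-byte string without a code page
// conversion, so the procedure sees the dynamic code page the caller's
// string actually carries.
procedure Describe(const Name: string; const S: RawByteString);
begin
  WriteLn(Name, ': dynamic code page = ', StringCodePage(S),
          ', length = ', Length(S));
end;

var
  A: AnsiString;
  U: UTF8String;
begin
  A := 'abc';
  U := 'abc';
  Describe('AnsiString', A);
  Describe('UTF8String', U);

  // Relabel the AnsiString as UTF-8 without converting its bytes.
  // After this its dynamic code page no longer matches its declared type,
  // which is the "irregular string" case discussed in the thread.
  SetCodePage(RawByteString(A), CP_UTF8, False);
  Describe('AnsiString (relabelled)', A);
end.
```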
Re: [Lazarus] UTF8 RTL for Windows
On Mon, 24 Nov 2014 22:53:44 +0100 Hans-Peter Diettrich drdiettri...@aol.com wrote:
> Graeme Geldenhuys wrote:
>> How is ThousandSeparator and DecimalSeparator supposed to work it TFormatSettings? If you switched the RTL to UTF-8 or UTF-16 a Russian thousand separator (4-byte non-breaking white space character) for example will not fit into a Char type.
>
> The Char type is quite useless with Unicode,

Correction: *This* Char type needs to be extended. Char in general is very useful.

> at least if it has less than 3 bytes (4 for UTF-8). There exist many more flaws in the RTL/LCL, assuming that a character always fits into a Char (like the Pos overload...).

There is a Pos overload for strings. Where is the flaw in Pos?

> In the best case Char could be retyped into an string (substring),

That would be wrong in 99.9% of the cases.

> so that it can hold any Unicode character *and* its encoding. Unicode stringhandling in general should always use substrings, for the same reasons. Until then 99.9% of occurences of Char in UTF-8 aware library or application code can be considered bugs :-( The FPC team can sort out the real low-level code (most probably only the string conversion routines), the rest will become Delphi incompatible when fixed.

Please give real world examples.

Mattias
Re: [Lazarus] UTF8 RTL for Windows
On Mon, 24 Nov 2014 16:40:06 + Graeme Geldenhuys mailingli...@geldenhuys.co.uk wrote: [...] Where should we report this? Mantis or Unicode page of the Wiki? On a second thought, a programmer need to know what might fail and the alternative/workaround. The latter depends on settings. In case of the new LCL mode we can extend the LCL Unicode support page. Mattias -- ___ Lazarus mailing list Lazarus@lists.lazarus.freepascal.org http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus
Re: [Lazarus] UTF8 RTL for Windows
Am 23.11.2014 00:15 schrieb Mattias Gaertner nc-gaert...@netcologne.de: Additionally, most basic File I/O routines now correctly call the underlying OS-es file routines with the codepage the OS expects (which is WideString on Windows). Is it safe to say UTF-16? Or are there still UCS-2 Windows? Till NT 4 inclusive it's UCS-2, since Windows 2000 it's UTF-16 (I don't know and especially don't care about 9x). Regards, Sven -- ___ Lazarus mailing list Lazarus@lists.lazarus.freepascal.org http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus
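The practical difference between UCS-2 and UTF-16 is that UTF-16 represents code points above U+FFFF as a pair of 16-bit code units (a surrogate pair). The sketch below is an illustration added here, not code from the thread; it assumes U+1D11E as the sample code point, whose UTF-16 encoding is D834 DD1E, and builds the string from code units so the source file encoding does not matter.

```pascal
program SurrogateDemo;
{$mode objfpc}{$H+}

var
  U: UnicodeString;
begin
  // Assumption: U+1D11E is used as the sample code point outside the BMP.
  // UTF-16 stores it as the surrogate pair D834 DD1E.
  SetLength(U, 2);
  U[1] := WideChar($D834);
  U[2] := WideChar($DD1E);

  // Length counts 16-bit code units, not characters.
  WriteLn('UTF-16 code units: ', Length(U));
  // UTF8Encode converts the UTF-16 data to UTF-8 bytes.
  WriteLn('UTF-8 bytes:       ', Length(UTF8Encode(U)));
end.
```

Code that treats one WideChar as one character is effectively assuming UCS-2, which is why the distinction matters even on current Windows versions.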
Re: [Lazarus] UTF8 RTL for Windows
On Sun, 23 Nov 2014, Mattias Gaertner wrote: True. Although many programmers misunderstand what this means. It is not as scary as it sounds. To all the scared people: Don't worry. Computers are not scary, not really. Just look at Terminator (or any other Sci-Fi involving computers), the humans always win in the end... :-) Additionally, most basic File I/O routines now correctly call the underlying OS-es file routines with the codepage the OS expects (which is WideString on Windows). Is it safe to say UTF-16? Or are there still UCS-2 Windows? I think some older versions of Windows are still UCS2, but I believe as of Windows 2000, it is all UTF-16. However, I am not an expert. The exact behaviour of the RTL is controlled by a couple of variables: DefaultSystemCodePage, DefaultFileSystemCodePage , DefaultRTLFileSystemCodePage. Yes, that's the important bit that FPC made better than Delphi. :) Phew... At least something we did better in the whole string mess ... ;) Anyway, I was just trying to say that a 1-byte string is not necessarily UTF-8 in FPC 2.7.1. Michael. -- ___ Lazarus mailing list Lazarus@lists.lazarus.freepascal.org http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus
Re: [Lazarus] UTF8 RTL for Windows
On Sun, 23 Nov 2014 13:56:42 +0100 (CET) Michael Van Canneyt mich...@freepascal.org wrote: [...] Anyway, I was just trying to say that a 1-byte string is not necessarily UTF-8 in FPC 2.7.1. Yes, you can still store anything you like in strings. And you can store UTF-8 in a string and say it is not. Mattias -- ___ Lazarus mailing list Lazarus@lists.lazarus.freepascal.org http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus
Re: [Lazarus] UTF8 RTL for Windows
On 2014-11-23 12:56, Michael Van Canneyt wrote: the humans always win in the end... :-) ROFL Phew... At least something we did better in the whole string mess ... ;) 9/10 times FPC does everything better than Delphi. Regards, - Graeme - -- fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal http://fpgui.sourceforge.net/ -- ___ Lazarus mailing list Lazarus@lists.lazarus.freepascal.org http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus
Re: [Lazarus] UTF8 RTL for Windows
On Sun, Nov 23, 2014 at 1:56 PM, Michael Van Canneyt mich...@freepascal.org wrote: Don't worry. Computers are not scary, not really. Just look at Terminator (or any other Sci-Fi involving computers), the humans always win in the end... :-) Well, the first reports of how the unicode rtl would look like were pretty scary: Total break of the string part of millions of lines of code that people wrote with Lazarus since years. But now reading the latest report of how it will work out, i.e. that Char=WideChar only in a special mode, and that you can set some variables to get UTF-8 strings from RTL system calls, well, I haven't actually tested it yet, but it looks like that maybe our code will not break and maybe we won't need to review/fix hundreds of thousands of lines of code that have worked since years -- Felipe Monteiro de Carvalho -- ___ Lazarus mailing list Lazarus@lists.lazarus.freepascal.org http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus
Re: [Lazarus] UTF8 RTL for Windows
2014-11-20 13:21 GMT-03:00 Mattias Gaertner nc-gaert...@netcologne.de:
> 2. The new mode: The LCL, FCL and RTL treat all String as UTF-8 encoded. Most RTL file functions now work with full Unicode. For example FileExists and aStringList.LoadFromFile(Filename) now support full Unicode.
> [..]
> Please test and tell what you find out.

The FormatSettings fields are still encoded with the system code page regardless of the DefaultSystemCodePage value. While there's no problem for English locales, other locales like PT-BR have accented day and month names.

The problem is in the Windows SysUtils.GetLocaleStr function, which uses a non-Unicode Win API function. This problem will also affect the UnicodeString RTL.

Attached is a test app that shows the issue. It also has a version of GetLocaleStr that fixes the issue for the RTL (both versions).

Luiz

program TestUTF8FormatSettings;

{$mode objfpc}{$H+}

uses
  {$ifdef Windows}
  Windows,
  {$endif}
  Classes, sysutils;

{$ifdef Windows}
// Unicode-aware replacement for SysUtils.GetLocaleStr: queries the locale
// string through the W API and lets the compiler convert it to String.
function GetLocaleStrTest(LID, LT: Longint; const Def: string): String;
var
  L: Integer;
  Buf: array[0..255] of WideChar;
  W: WideString;
begin
  // The last argument is a size in characters, so pass Length(Buf),
  // not SizeOf(Buf) (which is the size in bytes).
  L := GetLocaleInfoW(LID, LT, Buf, Length(Buf));
  if L > 0 then
  begin
    //SetString(Result, PWideChar(@Buf[0]), L - 1) leads to wrong result
    //Bug in Procedure SetString (Out S : AnsiString; Buf : PWideChar; Len : SizeInt) ?
    SetString(W, PWideChar(@Buf[0]), L - 1);
    Result := W;
  end
  else
    Result := Def;
end;
{$endif}

var
  i: Integer;
  S: String;
  List: TStringList;
begin
  WriteLn('DefaultSystemCodePage: ', DefaultSystemCodePage);
  DefaultSystemCodePage:=CP_UTF8;
  DefaultRTLFileSystemCodePage:=CP_UTF8;
  List := TStringList.Create;
  // Print the code page and content of each month name
  for i := 1 to 12 do
  begin
    Write(StringCodePage(DefaultFormatSettings.LongMonthNames[i]), ' - ');
    WriteLn(DefaultFormatSettings.LongMonthNames[i]);
    List.Add(DefaultFormatSettings.LongMonthNames[i]);
  end;
  // Same for each day name
  for i := 1 to 7 do
  begin
    Write(StringCodePage(DefaultFormatSettings.LongDayNames[i]), ' - ');
    WriteLn(DefaultFormatSettings.LongDayNames[i]);
    List.Add(DefaultFormatSettings.LongDayNames[i]);
  end;
  {$ifdef Windows}
  S := GetLocaleStrTest(GetThreadLocale, LOCALE_SDAYNAME1+1, 'xx');
  Write(StringCodePage(S), ' - ');
  WriteLn(S);
  List.Add(S);
  {$endif}
  List.SaveToFile('TestUTF8FormatSettingsOut.txt');
  List.Free;
end.
Re: [Lazarus] UTF8 RTL for Windows
2014-11-20 13:21 GMT-03:00 Mattias Gaertner nc-gaert...@netcologne.de:
> Please test and tell what you find out.

Without the {$codepage utf8} directive, String constants will get code page 0 (CP_ACP) and not 1200 (UTF16 - UnicodeString). String variables assigned to those constants will also have code page = 0.

This is because the constant string code page is evaluated at compile time.

Not sure if there's a compiler command line param with the same effect as {$codepage utf8}.

The attached program shows how data loss can occur.

Luiz

program testStringConstantCP;

{$mode objfpc}{$H+}

uses
  Classes, sysutils;

var
  W: UnicodeString;
  S, S_2: String;
  SUTF8, SUTF8_2: UTF8String;
begin
  SetMultiByteConversionCodePage(CP_UTF8);
  W := 'João';
  Write('W: ': 10, StringCodePage(W), ' - ');
  WriteLn(W);
  S := 'João';
  Write('S: ': 10,StringCodePage(S), ' - ');
  WriteLn(S);
  S_2 := W;
  Write('S_2: ': 10,StringCodePage(S_2), ' - ');
  WriteLn(S_2);
  SUTF8 := W;
  Write('SUTF8: ': 10,StringCodePage(SUTF8), ' - ');
  WriteLn(SUTF8);
  SUTF8_2 := S;
  Write('SUTF8_2: ': 10, StringCodePage(SUTF8_2), ' - ');
  WriteLn(SUTF8_2);
end.
Re: [Lazarus] UTF8 RTL for Windows
I added {.$codepage utf8} and all strings output as Joao. I got confused: I did not expect changes in the constant assigned to the UnicodeString variable.

I need to check what the correct UTF8 output is: JoA£o or Joao.

Luiz

2014-11-23 21:37 GMT-03:00 luiz americo pereira camara luiz...@oi.com.br:
> 2014-11-20 13:21 GMT-03:00 Mattias Gaertner nc-gaert...@netcologne.de:
>> Please test and tell what you find out.
>
> Without {$codepage utf8} directive String constants will get Code Page 0 (CP_ACP) and not the 1200 (UTF16 - UnicodeString). String variables assigned to those constants will also have Code Page = 0
>
> This is because the constant string code page is evaluated at compile time
>
> Not sure if there's a compiler command line param with same effect as {$codepage utf8}
>
> The attached program show how data loss can occur
>
> Luiz
Re: [Lazarus] UTF8 RTL for Windows
I updated the test app to show the hexadecimal representation of the string.

When {$codepage utf8} is set, the encoding and content of every string match each other, regardless of MultiByteConversionCodePage.

Without {$codepage utf8}:

- When MultiByteConversionCodePage is CP_ACP (default), one string gets the UTF8 content but its code page is the system ANSI one (1252 in my case).
- When MultiByteConversionCodePage is UTF8, two strings (converted from WideString) get code page UTF8 but their content is wrong.

Luiz

2014-11-23 22:06 GMT-03:00 luiz americo pereira camara luiz...@oi.com.br:
> I added {.$codepage utf8} and all strings output as Joao. Got confused. I did not to expect changes in the constant assigned to the UnicodeString variable
>
> Need to check what is the correct UTF8 output: JoA£o or Joao
>
> Luiz
>
> 2014-11-23 21:37 GMT-03:00 luiz americo pereira camara luiz...@oi.com.br:
>> 2014-11-20 13:21 GMT-03:00 Mattias Gaertner nc-gaert...@netcologne.de:
>>> Please test and tell what you find out.
>>
>> Without {$codepage utf8} directive String constants will get Code Page 0 (CP_ACP) and not the 1200 (UTF16 - UnicodeString). String variables assigned to those constants will also have Code Page = 0
>>
>> This is because the constant string code page is evaluated at compile time
>>
>> Not sure if there's a compiler command line param with same effect as {$codepage utf8}
>>
>> The attached program show how data loss can occur
>>
>> Luiz

program testStringConstantCP;

{$mode objfpc}{$H+}
{.$codepage utf8}

uses
  Classes, sysutils;

// Returns the bytes of S as hexadecimal, two digits per byte
function StrToHex(const S: String): String;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Length(S) do
    Result := Result + IntToHex(Byte(S[i]), 2);
end;

var
  W: UnicodeString;
  S, S_2: String;
  SUTF8, SUTF8_2: UTF8String;
begin
  SetMultiByteConversionCodePage(CP_UTF8);
  W := 'ã';
  Write('W: ': 10, StringCodePage(W): 6, ' - ');
  WriteLn(W: 6);
  S := 'ã';
  Write('S: ': 10,StringCodePage(S): 6, ' - ');
  WriteLn(S: 6, ' - ', StrToHex(S));
  S_2 := W;
  Write('S_2: ': 10,StringCodePage(S_2): 6, ' - ');
  WriteLn(S_2: 6, ' - ', StrToHex(S_2));
  SUTF8 := W;
  Write('SUTF8: ': 10,StringCodePage(SUTF8): 6, ' - ');
  WriteLn(SUTF8: 6, ' - ', StrToHex(SUTF8));
  SUTF8_2 := S;
  Write('SUTF8_2: ': 10, StringCodePage(SUTF8_2): 6, ' - ');
  WriteLn(SUTF8_2: 6, ' - ', StrToHex(SUTF8_2));
end.
Re: [Lazarus] UTF8 RTL for Windows
On 24.11.2014 01:37, luiz americo pereira camara wrote: 2014-11-20 13:21 GMT-03:00 Mattias Gaertner nc-gaert...@netcologne.de mailto:nc-gaert...@netcologne.de: Please test and tell what you find out. Without {$codepage utf8} directive String constants will get Code Page 0 (CP_ACP) and not the 1200 (UTF16 - UnicodeString). String variables assigned to those constants will also have Code Page = 0 This is because the constant string code page is evaluated at compile time Not sure if there's a compiler command line param with same effect as {$codepage utf8} The attached program show how data loss can occur The command line parameter for this is -Fcutf8. Regards, Sven -- ___ Lazarus mailing list Lazarus@lists.lazarus.freepascal.org http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus
Re: [Lazarus] UTF8 RTL for Windows
On 24.11.2014 03:19, luiz americo pereira camara wrote: I updated the test app to show the hexadecimal representation of the string. When {$codepage utf8} is set, all string encoding and content is right matching each other regardless of MultiByteConversionCodePage Without {$codepage utf8}: When MultiByteConversionCodePage is CP_ACP (default) one string gets the UTF8 content but code page is system ansi (1252 in my case) When MultiByteConversionCodePage is UTF8 and two strings (converted from WideString) get code page UTF8 but content is wrong Yes. $codepage is for the how the compiler parses the constants while MultiByteConversionCodePage is for the runtime behavior. In theory this is all documented at http://wiki.freepascal.org/FPC_Unicode_support#String_constants Regards, Sven -- ___ Lazarus mailing list Lazarus@lists.lazarus.freepascal.org http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus
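To separate the compile-time effect of $codepage from the runtime conversions, it helps to dump the stored bytes through a parameter type that cannot trigger a conversion itself. The sketch below is not code from the thread. It assumes the source file is saved as UTF-8 (hence the directive), the helper name BytesToHex is made up, and cwstring is pulled in on Unix only as a precaution in case a runtime conversion is involved.

```pascal
program DumpBytes;
{$mode objfpc}{$H+}
{$codepage utf8}   // assumption: this source file is saved as UTF-8

uses
  {$ifdef unix}cwstring,{$endif}
  SysUtils;

// A RawByteString parameter accepts any single-byte string without a code
// page conversion, so the dump shows the bytes exactly as they are stored.
function BytesToHex(const S: RawByteString): string;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Length(S) do
    Result := Result + IntToHex(Ord(S[i]), 2);
end;

var
  U: UTF8String;
begin
  U := 'ã';
  // Print the dynamic code page next to the raw bytes.
  WriteLn(StringCodePage(U), ' - ', BytesToHex(U));
end.
```

A helper declared with a plain String parameter may convert a UTF8String argument to the default code page before the bytes are read, so the choice of parameter type can change what a dump shows.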
Re: [Lazarus] UTF8 RTL for Windows
Am 2014-11-20 um 17:21 schrieb Mattias Gaertner: The development version of FPC 2.7.1 has extended Strings and many RTL functions now work for codepages other than the system codepage. 2. The new mode: The LCL, FCL and RTL treat all String as UTF-8 encoded. ... When accessing the WinAPI you must use the W functions or use UTF8ToWinCP and WinCPToUTF8. Is this correct? The W functions of the WinAPI expect UTF16 so a conversion needs to be done in both cases, either to System code page or to UTF16. Or can we use STRING with WinAPI W functions directly? -- ___ Lazarus mailing list Lazarus@lists.lazarus.freepascal.org http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus
Re: [Lazarus] UTF8 RTL for Windows
On Sat, 22 Nov 2014 14:37:00 +0100 Jürgen Hestermann juergen.hesterm...@gmx.de wrote:
> On 2014-11-20 at 17:21, Mattias Gaertner wrote:
>> The development version of FPC 2.7.1 has extended Strings and many RTL functions now work for codepages other than the system codepage.
>> 2. The new mode: The LCL, FCL and RTL treat all String as UTF-8 encoded.
>> ...
>> When accessing the WinAPI you must use the W functions or use UTF8ToWinCP and WinCPToUTF8.
>
> Is this correct? The W functions of the WinAPI expect UTF16 so a conversion needs to be done in both cases, either to System code page or to UTF16. Or can we use STRING with WinAPI W functions directly?

You can use them directly. For example:

procedure TForm1.FormCreate(Sender: TObject);
var
  s: string; // String = AnsiString because of $H+
begin
  s:=GetCommandLineW;
  // GetCommandLineW returns a UTF-16 PWideChar
  // the compiler adds code to convert this to the
  // default system codepage (CP_ACP = CP_UTF8)
  // the resulting string has StringCodePage CP_ACP
  // and is encoded in UTF-8.
  // therefore you can simply use it with the LCL
  Memo1.Lines.Add(s);
end;

You will get a compiler warning (id 4105) that WideString to AnsiString might lose data. The warning is right if the default string codepage is not UTF-8. If your code only runs with the RTL in UTF-8 mode, you can disable this warning.

As an alternative you can use:

  s:=UTF8Encode(GetCommandLineW);

You must also use UTF8Encode if your code should run with both FPC 2.6.4 and 2.7.1.

Mattias
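The explicit UTF8Encode route mentioned at the end of this message can be tried without any Windows API call. The following is a hedged, platform-neutral sketch rather than code from the thread: the UTF-16 input is built by hand instead of coming from GetCommandLineW, and all names are illustrative.

```pascal
program ExplicitUtf8;
{$mode objfpc}{$H+}

var
  W: UnicodeString;
  S: string;
begin
  // Build 'Jo' + U+00E3 + 'o' from code units, so the source file
  // encoding plays no role.
  W := 'Jo';
  W := W + WideChar($00E3);
  W := W + WideChar('o');

  // Explicit conversion: does not rely on the default system code page.
  S := UTF8Encode(W);

  WriteLn('UTF-16 code units:      ', Length(W));
  WriteLn('Bytes after UTF8Encode: ', Length(S));
  WriteLn('StringCodePage(S):      ', StringCodePage(S));
end.
```

Because the conversion is explicit, the bytes in S are UTF-8 whatever DefaultSystemCodePage is set to, which is the property the advice for code shared between FPC 2.6.4 and 2.7.1 relies on.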
Re: [Lazarus] UTF8 RTL for Windows
Am 2014-11-22 um 15:06 schrieb Mattias Gaertner: procedure TForm1.FormCreate(Sender: TObject); var s: string; // String = AnsiString because of $H+ begin s:=GetCommandLineW; // GetCommandLineW returns a UTF-16 PWideChar // the compiler adds code to convert this to the // default system codepage (CP_ACP = CP_UTF8) // the resulting string has StringCodePage CP_ACP // and is encoded in UTF-8. // therefore you can simply use it with the LCL Okay. Does that mean that the compiler *always* assumes that String=UTF-8 encoded AnsiString and converts to other (known) encoded string types if needed? -- ___ Lazarus mailing list Lazarus@lists.lazarus.freepascal.org http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus
Re: [Lazarus] UTF8 RTL for Windows
On Sat, 22 Nov 2014 16:18:09 +0100 Jürgen Hestermann juergen.hesterm...@gmx.de wrote: Am 2014-11-22 um 15:06 schrieb Mattias Gaertner: procedure TForm1.FormCreate(Sender: TObject); var s: string; // String = AnsiString because of $H+ begin s:=GetCommandLineW; // GetCommandLineW returns a UTF-16 PWideChar // the compiler adds code to convert this to the // default system codepage (CP_ACP = CP_UTF8) // the resulting string has StringCodePage CP_ACP // and is encoded in UTF-8. // therefore you can simply use it with the LCL Okay. Does that mean that the compiler *always* assumes that String=UTF-8 encoded AnsiString Yes, with the UTF8 RTL. The default RTL uses system codepage. and converts to other (known) encoded string types if needed? Yes. That's the new feature of FPC 2.7.1. What other encoded string types do you have in mind? Mattias -- ___ Lazarus mailing list Lazarus@lists.lazarus.freepascal.org http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus
Re: [Lazarus] UTF8 RTL for Windows
On Sat, 22 Nov 2014, Mattias Gaertner wrote:
> On Sat, 22 Nov 2014 16:18:09 +0100 Jürgen Hestermann juergen.hesterm...@gmx.de wrote:
>> On 2014-11-22 at 15:06, Mattias Gaertner wrote:
>>> procedure TForm1.FormCreate(Sender: TObject);
>>> var
>>>   s: string; // String = AnsiString because of $H+
>>> begin
>>>   s:=GetCommandLineW;
>>>   // GetCommandLineW returns a UTF-16 PWideChar
>>>   // the compiler adds code to convert this to the
>>>   // default system codepage (CP_ACP = CP_UTF8)
>>>   // the resulting string has StringCodePage CP_ACP
>>>   // and is encoded in UTF-8.
>>>   // therefore you can simply use it with the LCL
>>
>> Okay. Does that mean that the compiler *always* assumes that String=UTF-8 encoded AnsiString
>
> Yes, with the UTF8 RTL. The default RTL uses system codepage.

Careful, there is no such thing as the UTF8 RTL. There is now a Unicode and CodePage-aware RTL. That means it has:

- Codepage-aware single-byte strings. The codepage of a string may, or may not, be UTF8 (i.e. Unicode).
- Widestrings (unicode).

The compiler handles conversion of codepages transparently.

The codepage-aware single-byte strings are not automatically UTF-8. On Linux, this is probably so. But on Windows, this is not necessarily so.

Additionally, most basic File I/O routines now correctly call the underlying OS's file routines with the codepage the OS expects (which is WideString on Windows).

The exact behaviour of the RTL is controlled by a couple of variables: DefaultSystemCodePage, DefaultFileSystemCodePage, DefaultRTLFileSystemCodePage.

See http://wiki.freepascal.org/FPC_Unicode_support.

Michael.
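The three variables named in this message can be inspected directly. The sketch below is an illustration added here, not code from the thread; it prints their values before and after a call to SetMultiByteConversionCodePage, the routine used in the test programs elsewhere in this thread. The actual numbers depend on the platform and the FPC version.

```pascal
program ShowDefaults;
{$mode objfpc}{$H+}

// Prints the three RTL code page variables with a caption.
procedure Dump(const Caption: string);
begin
  WriteLn(Caption);
  WriteLn('  DefaultSystemCodePage:        ', DefaultSystemCodePage);
  WriteLn('  DefaultFileSystemCodePage:    ', DefaultFileSystemCodePage);
  WriteLn('  DefaultRTLFileSystemCodePage: ', DefaultRTLFileSystemCodePage);
end;

begin
  Dump('At startup:');
  // Ask the RTL to use UTF-8 for single-byte string conversions.
  SetMultiByteConversionCodePage(CP_UTF8);
  Dump('After SetMultiByteConversionCodePage(CP_UTF8):');
end.
```

Running this on Linux and on Windows is a quick way to see the point made above that single-byte strings are not automatically UTF-8 on every platform.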
Re: [Lazarus] UTF8 RTL for Windows
Mattias Gaertner schrieb: // GetCommandLineW returns a UTF-16 PWideChar // the compiler adds code to convert this to the // default system codepage (CP_ACP = CP_UTF8) // the resulting string has StringCodePage CP_ACP // and is encoded in UTF-8. Does this mean that Lazarus (new mode) ignores the OS system codepage setting? DoDi -- ___ Lazarus mailing list Lazarus@lists.lazarus.freepascal.org http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus
Re: [Lazarus] UTF8 RTL for Windows
On Sat, 22 Nov 2014 17:18:35 +0100 Hans-Peter Diettrich drdiettri...@aol.com wrote: Mattias Gaertner schrieb: // GetCommandLineW returns a UTF-16 PWideChar // the compiler adds code to convert this to the // default system codepage (CP_ACP = CP_UTF8) // the resulting string has StringCodePage CP_ACP // and is encoded in UTF-8. Does this mean that Lazarus (new mode) ignores the OS system codepage setting? To be exact: Lazarus unit fpcadds sets the default string encoding (DefaultSystemCodePage) to CP_UTF8. The OS system codepage of Windows is not changed. All non W (e.g. A) functions still return and expect strings in the Windows system codepage. You can convert between UTF8 and Windows system codepage with UTF8ToWinCP and WinCPToUTF8. So, yes, a LCL application can now mostly ignore the system codepage. Finding the exceptions and traps is the goal of this mail thread. Please test and report what you find out. Mattias -- ___ Lazarus mailing list Lazarus@lists.lazarus.freepascal.org http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus
Re: [Lazarus] UTF8 RTL for Windows
On Sat, 22 Nov 2014 17:38:33 +0100 (CET) Michael Van Canneyt mich...@freepascal.org wrote: [...] Yes, with the UTF8 RTL. The default RTL uses system codepage. Careful, there is no such thing as the UTF8 RTL. There is now a Unicode and CodePage-aware RTL. Well, yes, you are right of course. But Unicode and CodePage-aware RTL set to UTF-8 is an awkwardly long title. Also many users think that the new string types will break all their code and add lots of overhead. I want to advertise, that this is not so. On the contrary, it is very compatible, you get cross platform Unicode and the overhead is pretty small. And last but not least: Programming Unicode has become easier, because string encoding is now more consistent. That means it has: - Codepage aware single-byte strings. The codepage of a string may, or may not, be UTF8 (i.e. Unicode). - Widestrings (unicode). The compiler handles conversion of codepages transparantly. The codepage aware single-byte strings are not automatically UTF-8. On linux, this is probably so. But on windows, this is not necessarily so, True. Although many programmers misunderstand what this means. It is not as scary as it sounds. Additionally, most basic File I/O routines now correctly call the underlying OS-es file routines with the codepage the OS expects (which is WideString on Windows). Is it safe to say UTF-16? Or are there still UCS-2 Windows? The exact behaviour of the RTL is controlled by a couple of variables: DefaultSystemCodePage, DefaultFileSystemCodePage , DefaultRTLFileSystemCodePage. Yes, that's the important bit that FPC made better than Delphi. :) See http://wiki.freepascal.org/FPC_Unicode_support. Mattias -- ___ Lazarus mailing list Lazarus@lists.lazarus.freepascal.org http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus
Re: [Lazarus] UTF8 RTL for Windows
On Thu, Nov 20, 2014 at 1:21 PM, Mattias Gaertner nc-gaert...@netcologne.de wrote:
> Hi all, especially Windows users,
>
> The development version of FPC 2.7.1 has extended Strings and many RTL functions now work for codepages other than the system codepage. This means Lazarus can now be compiled in two modes:
>
> 1. The old mode: LCL treats all String as UTF-8 encoded. When accessing RTL and WinAPI functions you have to use the UTF8 functions. For example aStringList.LoadFromFile(UTF8ToSys(Filename)) and FileExistsUTF8. Note that UTF8ToSys only supports characters in the Windows code page, while FileExistsUTF8 supports the full Unicode range.
>
> 2. The new mode: The LCL, FCL and RTL treat all String as UTF-8 encoded. Most RTL file functions now work with full Unicode. For example FileExists and aStringList.LoadFromFile(Filename) now support full Unicode. AnsiToUTF8, UTF8ToAnsi, SysToUTF8, UTF8ToSys have no effect. Many UTF8Encode and UTF8Decode calls are no longer needed, because when assigning UnicodeString to String and vice versa the compiler does it automatically for you. When accessing the WinAPI you must use the W functions or use UTF8ToWinCP and WinCPToUTF8.
>
> You can enable the new mode by compiling Lazarus clean with -dEnableUTF8RTL.
>
> More information about the new FPC Unicode Support: http://wiki.freepascal.org/FPC_Unicode_support
> RTL functions that now support Unicode under Windows: http://wiki.freepascal.org/FPC_Unicode_support#RTL_changes
>
> The above links are about the default RTL with system code page. I want to create a Wiki page to gather all information about the UTF8 RTL for Lazarus users and how to adapt their code.
>
> Please test and tell what you find out.
>
> Mattias

The best news of the year! \o/ \o/ \o/

Thanks thanks thanks Lazarus/FPC team! (y)

--
Silvio Clécio
My public projects - github.com/silvioprog