Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default
On 31.03.2016 at 17:04, Juha Manninen wrote:
> Anyway, the original issue was about inserting {codepage UTF8} automatically to every unit. We can conclude it is not a good idea. It does not solve anything when using plain constants with default String type but adds conversion overhead. It breaks things when using constants with ShortString and PChar.

I'm on your side. It was good to ask here, because it showed that the disadvantages outweigh the advantages. The issue is clear to me now, and I can see the same results in my own tests.

Thank you very much
Kind regards
Michl

-- ___ Lazarus mailing list Lazarus@lists.lazarus.freepascal.org http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus
Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default
On 31.03.2016 at 12:44, Mattias Gaertner wrote:
> On Thu, 31 Mar 2016 00:16:13 +0200 "Michael W. Vogel" wrote:
> [...]
>> I've tested the example too and I got different results with different options. The test was:
>> - BOM / no BOM at the beginning of the sourcefile
>> - {$codepage UTF8} or not
>
> The compiler understands -FcUTF8, {$codepage utf8} and BOM. All three sets UTF-8. See here: http://wiki.freepascal.org/FPC_Unicode_support#Source_file_codepage
> BOM has the advantage that it is understood by other text editors as well and the disadvantage that it is hidden, so that people unaware of encodings are easily confused.
> -FcUTF8 has the advantage of applying it to all sources in the project/package and it can easily be turned off. You can unset it for a single unit via {$modeswitch systemcodepage}.
>
>> - fpc -MObjFPC *-Sh* test.pas (with / without -Sh (use reference counted strings))
>
> And this is where the confusion starts. Mixing multiple string types is asking for troubles. FPC has an impressive (aka frightening) list of string types and consequently a vast net of combinations that only graph theorists can appreciate.
>
>> So it is realy more complex as I thought...
>
> Yes. And you have not yet explored the difficulties in code supporting both FPC 2.6.4 and 3+ and LCL 1.4 and 1.6.
> Although Lazarus recommends to "simply" use UTF-8, technically it recommends AnsiString, DefaultSystemCodepage CP_UTF8, no explicit codepage, and the UTF-8 functions in LazUtils. If you need to use other string types in an unit you might want to add an explicit codepage. Maybe a paragraph should be added to the wiki about using non AnsiString with the "Lazarus UTF-8".
>
> Mattias

Thank you very much for your detailed answer! I'll run some more tests to understand why a UTF-8 BOM behaves differently from {$codepage UTF8}. BTW, the conversions here have nothing to do with Lazarus; this is purely an FPC issue. If I can't find an answer myself, I'll ask on the FPC mailing list.

Thanks again
Kind regards
Michl
Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default
On 3/31/16, Juha Manninen wrote:
>> In my fantasy scenario the String would of course have the meaning of UnicodeString.
>
> That is not anyhow better (or worse) inherently than a UTF-8 based solution.

No, but I don't see fpc moving towards String equals AnsiString(CP_UTF8). It would be hugely Delphi incompatible. For me personally Delphi compatibility does not matter at all; I have left Delphi and won't return. But from the fpc side, breaking compatibility in such a way is probably going to be a big no-no.

But you are right, we are getting more and more off-topic. Sorry for that.

Bart
Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default
On Thu, Mar 31, 2016 at 5:20 PM, Bartwrote: > In my fantasy scenario the String would of course have the meaning of > UnicodeString. That is not anyhow better (or worse) inherently than a UTF-8 based solution. Delphi just happened to implement it so, for various reasons. The surprise is that our system is so Delphi compatible even while having a different encoding. > The tests I posted in this thread were plain fpc programs, > so no use of LazUtf8. It pointed out that the Lazarus part of > the wiki (Better Unicode support) could be interpreted wrong. No. You interpreted it wrong for some reason. The page is only about the new Unicode support which is very clearly mentioned there! You don't have to use the UTF-8 mode which is explained in another page : http://wiki.freepascal.org/Lazarus_with_FPC3.0_without_UTF-8_mode The main message however is that the new mode should be used unless there is a very good reason not to. That's why it was made default when LazUtils / LazUTF8 is used. For console apps you must add the dependency / unit explicitly but it does not change any facts about the mode. Anyway, the original issue was about inserting {codepage UTF8} automatically to every unit. We can conclude it is not a good idea. It does not solve anything when using plain constants with default String type but adds conversion overhead. It breaks things when using constants with ShortString and PChar. It only improves things with UnicodeString constants which is not necessarily needed at all, but can be used with added {codepage UTF8}. Simple, no hassle! Besides I feel the problems are exaggerated again. The problems discussed here are only about constants. The automatic conversion between variables works always. Juha -- ___ Lazarus mailing list Lazarus@lists.lazarus.freepascal.org http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus
Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default
On 3/31/16, Juha Manninen wrote:
> I doubt you will change every "String" into "UnicodeString" in your code.
> Somehow you missed the fundamental idea of our new Unicode system.
> "String" has Unicode and you don't need to care about it, or even about endianess.
> Delphi reaches the same goal by mapping String -> UnicodeString.
> When you need Ansi codepages then you need to pay attention obviously, otherwise not.

In my fantasy scenario the String would of course have the meaning of UnicodeString.

> Bart, were your earlier results caused by NOT using the new Unicode support?

The tests I posted in this thread were plain fpc programs, so no use of LazUTF8. They showed that the Lazarus part of the wiki (Better Unicode support) can be misinterpreted.

> The whole discussion was about the new Unicode support and you have been testing it since the beginning.

I have used the "UTF8 in RTL" ever since I switched to the 3.0 compiler (I have to admit I was too scared to test the 2.7 branch, but I started with the first 3.0 RC), and the number of bugs we had to fix was far, far smaller than I thought it would be. Most of my own programs seem to require no change at all. The main exception was my backup program, not because it stopped correctly accessing filenames with Unicode characters in their path, but because I used a procedural parameter in its engine, whose signature had now changed from plain string to either RawByteString or UnicodeString. All this was solved (with ifdefs for the 2.6 compiler) quite easily.

And as I pointed out to you earlier in another thread in the forum, the "new UTF8 system" works even better than I thought. Then again, I do not use databases in any of my programs, so I don't have to deal with that part of the problem. All my textual data is stored in plain text files in UTF8 encoding, probably from the day I started using Lazarus as my main platform (0.9.16).

Anyhow, it's a fascinating issue, and the more I understand about it, the better I am able to fix encoding-related problems in our (or users') code.

Bart
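
The kind of ifdef Bart mentions for a callback signature could look roughly like the sketch below. It is not his backup program: the names TPathString, TFileFoundProc, PrintPath and Walk are hypothetical, and the choice of RawByteString for FPC 3.0+ is only one of the two options he names. FPC_FULLVERSION is the compiler's predefined version macro.

```pascal
program compatdemo;
{$mode objfpc}{$H+}

type
{$if FPC_FULLVERSION >= 30000}
  TPathString = RawByteString;   // FPC 3.0 and later: codepage-aware strings
{$else}
  TPathString = AnsiString;      // FPC 2.6.x: plain string
{$endif}
  // Hypothetical callback type; the real signature is not shown in the thread.
  TFileFoundProc = procedure(const Path: TPathString);

procedure PrintPath(const Path: TPathString);
begin
  writeln(Path);
end;

// Stand-in for an engine that reports each file it finds to a callback.
procedure Walk(Callback: TFileFoundProc);
begin
  Callback('example.txt');
end;

begin
  Walk(@PrintPath);
end.
```

Keeping the version check in a single type alias means the callbacks themselves need no ifdefs.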
Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default
On Thu, Mar 31, 2016 at 4:25 PM, Bartwrote: > in this scenario adding {$codepage utf8} may be the wise thing to do: > it eliminates all confusion about the intended encoding of the string > constant. How is a conversion to UTF-16 and then back to UTF-8 less confusing than a direct copy without conversions? > When you use UnicodeString everywhere and no AnsiString anywhere, then > the only confusion left is Endianess,or am I (as one of the Universes > idiots) oversimplifying here. I doubt you will change every "String" into "UnicodeString" in your code. Somehow you missed the fundamental idea of our new Unicode system. "String" has Unicode and you don't need to care about it, or even about endianess. Delphi reaches the same goal by mapping String -> UnicodeString. When you need Ansi codepages then you need to pay attention obviously, otherwise not. Bart, were your earlier results caused by NOT using the new Unicode support? I am surprised if that is the case. The whole discussion was about the new Unicode support and you have been testing it since the beginning. Juha -- ___ Lazarus mailing list Lazarus@lists.lazarus.freepascal.org http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus
Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default
On Thu, 31 Mar 2016 15:25:03 +0200 Bart wrote:
> On 3/31/16, Mattias Gaertner wrote:
>>> Will all this mess go away if we would go the Delphi way (String=UnicodeString)?
>>> (I know *nix users are going to hate me now)
>>
>> Which mess do you mean?
>> As long as you have to consider codepages, you can get a mess.
>
> When you use UnicodeString everywhere and no AnsiString anywhere, then the only confusion left is Endianess, or am I (as one of the Universes idiots) oversimplifying here.

No, you are right. If you somehow(TM) manage to work only with UnicodeString, you avoid the mess. The same holds if you only use UTF-8 strings, or if you only work in the system codepage as in TP 3.0 times. The problem is that you often have to work with databases/files/libs/etc. in other encodings. So there is always a little bit of mess.

Mattias
Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default
On 3/31/16, Mattias Gaertnerwrote: >> Will all this mess go away if we would go the Delphi way >> (String=UnicodeString)? >> (I know *nix users are going to hate me now) > > Which mess do you mean? > As long as you have to consider codepages, you can get a mess. When you use UnicodeString everywhere and no AnsiString anywhere, then the only confusion left is Endianess,or am I (as one of the Universes idiots) oversimplifying here. In TP 3.0 I didn't have to deal with all this (but then again the guide that came with it was in Hebrew, which was a bit difficult for me). Bart -- ___ Lazarus mailing list Lazarus@lists.lazarus.freepascal.org http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus
Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default
On Thu, 31 Mar 2016 14:32:27 +0200 Bartwrote: > On 3/31/16, Mattias Gaertner wrote: >[...] > So, when my usecase for string constants with diacritics in real life > most of the time is just captions for buttons/menu's etc., the extra > overhead will not really be something to worry about I guess,and in > this scenario adding {$codepage utf8} may be the wise thing to do: it > eliminates all confusion about the intended encoding of the string > constant. Well, I'm not so sure about the "eliminates all confusion" as Rick Cook said: "Programming today is a race between software engineers striving to build bigger and better idiot-proof programs, and the universe trying to build bigger and better idiots. So far, the universe is winning." > So, my current intended approach for GUI applications will be: > - declare all strings as just String > - have stringconstants with unicode character all in one file and add > {$codepage utf8) to that file, and then don't use -FcUTF8 anymore > (which is what I'm doing ATM), > > That should be rather safe then I guess. Yes. If you avoid PChar(Literal), invalid UTF-8 and #0. > Will all this mess go away if we would go the Delphi way > (String=UnicodeString)? > (I know *nix users are going to hate me now) Which mess do you mean? As long as you have to consider codepages, you can get a mess. Mattias -- ___ Lazarus mailing list Lazarus@lists.lazarus.freepascal.org http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus
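
Bart's plan of keeping all non-ASCII constants in one unit that carries the directive might be sketched as follows. The unit name uistrings and both constant names are made up for illustration, and the file is assumed to be saved as UTF-8.

```pascal
unit uistrings;  // hypothetical unit name

{$mode objfpc}{$H+}
{$codepage utf8}  // only this unit declares its source encoding

interface

const
  // Untyped constants, meant to be assigned to plain String variables.
  CaptionOpen  = 'Öffnen';
  CaptionClose = 'Schließen';

implementation

end.
```

A form would then use something like `Button1.Caption := CaptionOpen;`. Following Mattias's caveat, the constants should not be typecast directly, as in `PChar(CaptionOpen)`.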
Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default
On 3/31/16, Mattias Gaertnerwrote: >> AFAIK the IDE does not save the file with a BOM, so the compiler may >> very well decide that my sourcefile has ACP codepage? > > Yes and no. > When the compiler assumes ACP, it treats the string special. It does > not convert it and stores it as byte copy. At runtime the string has > CP_ACP and its codepage is defined by the variable > DefaultSystemCodePage. LazUTF8 sets this to CP_UTF8, so the string is > treated as UTF-8. Note that it does that without any conversion. > > OTOH when you tell the compiler that the source is UTF-8, it converts > the literal to UTF-16. At runtime it converts the string back to UTF-8. > It does that everytime you assign the literal. > > So, with both you get an UTF-8 string, but the latter has a bit more > overhead. Also the latter needs special care when typecasting (e.g. > PChar). So, when my usecase for string constants with diacritics in real life most of the time is just captions for buttons/menu's etc., the extra overhead will not really be something to worry about I guess,and in this scenario adding {$codepage utf8} may be the wise thing to do: it eliminates all confusion about the intended encoding of the string constant. So, my current intended approach for GUI applications will be: - declare all strings as just String - have stringconstants with unicode character all in one file and add {$codepage utf8) to that file, and then don't use -FcUTF8 anymore (which is what I'm doing ATM), That should be rather safe then I guess. Will all this mess go away if we would go the Delphi way (String=UnicodeString)? (I know *nix users are going to hate me now) Bart -- ___ Lazarus mailing list Lazarus@lists.lazarus.freepascal.org http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus
Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default
On Wed, 30 Mar 2016 18:16:32 +0200 Bartwrote: >[...] > > Any valid UTF-8 string should work, including diacritics. > Without the codepage identier? Yes, if you use LazUTF8. If you don't use LazUTF8 and assign a literal to a UnicodeString you need the codepage. > Quote from http://wiki.freepascal.org/FPC_Unicode_support#String_constants: > "Normally, a string constant is interpreted according to the source > file codepage. If the source file codepage is CP_ACP, a default is > used instead: in that case, during conversions the constant strings > are assumed to have code page 28591 (ISO 8859-1 Latin 1; Western > European). " AFAIK this is not entirely correct. The string literals are assumed to be system codepage, which does not need to be code page 28591. I will ask on the fpc list. > ... > "From the above it follows that to ensure predictable interpretation > of string constants in your source code, it is best to either include > an explicit {$codepage xxx} directive (or use the equivalent -Fc > command line option), or to save the source code in UTF-8 with a BOM. > " > > AFAIK the IDE does not save the file with a BOM, so the compiler may > very well decide that my sourcefile has ACP codepage? Yes and no. When the compiler assumes ACP, it treats the string special. It does not convert it and stores it as byte copy. At runtime the string has CP_ACP and its codepage is defined by the variable DefaultSystemCodePage. LazUTF8 sets this to CP_UTF8, so the string is treated as UTF-8. Note that it does that without any conversion. OTOH when you tell the compiler that the source is UTF-8, it converts the literal to UTF-16. At runtime it converts the string back to UTF-8. It does that everytime you assign the literal. So, with both you get an UTF-8 string, but the latter has a bit more overhead. Also the latter needs special care when typecasting (e.g. PChar). >[...] Consider this test sourcefile (encoded as UTF8 without BOM): >[...] > DefaultSystemcodePage = 1252 >[...] 
> I would say that this experiment contradicts the statement in > http://wiki.freepascal.org/Better_Unicode_Support_in_Lazarus#String_Literals > ? No contradiction, because this wiki page is about DefaultSystemcodePage = CP_UTF8. Mattias -- ___ Lazarus mailing list Lazarus@lists.lazarus.freepascal.org http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus
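
The two paths Mattias describes can be observed with a small console program. This is a minimal sketch, not code from the thread: it assumes the file is saved as UTF-8 without BOM, and it calls SetMultiByteConversionCodePage(CP_UTF8) by hand as a stand-in for what LazUTF8 is said to do at startup.

```pascal
program literalcp;
{$mode objfpc}{$H+}
// Deliberately no {$codepage utf8} here; add it for the second run.

const
  Lit = 'ÄAÄ';

var
  s: String;

begin
  // Assumption: this mimics the DefaultSystemCodePage switch made by LazUTF8.
  SetMultiByteConversionCodePage(CP_UTF8);

  s := Lit;
  writeln('DefaultSystemCodePage = ', DefaultSystemCodePage);
  writeln('StringCodePage(s)     = ', StringCodePage(s));
  writeln('Length(s) in bytes    = ', Length(s));
end.
```

Compiling it once as shown and once with {$codepage utf8} added, then comparing the reported codepage, should show whether the literal was stored as a byte copy or went through a conversion.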
Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default
On Thu, 31 Mar 2016 00:16:13 +0200 "Michael W. Vogel"wrote: >[...] > I've tested the example too and I got different results with different > options. The test was: > - BOM / no BOM at the beginning of the sourcefile > - {$codepage UTF8} or not The compiler understands -FcUTF8, {$codepage utf8} and BOM. All three sets UTF-8. See here: http://wiki.freepascal.org/FPC_Unicode_support#Source_file_codepage BOM has the advantage that it is understood by other text editors as well and the disadvantage that it is hidden, so that people unaware of encodings are easily confused. -FcUTF8 has the advantage of applying it to all sources in the project/package and it can easily be turned off. You can unset it for a single unit via {$modeswitch systemcodepage}. > - fpc -MObjFPC *-Sh* test.pas (with / without -Sh (use reference counted > strings)) And this is where the confusion starts. Mixing multiple string types is asking for troubles. FPC has an impressive (aka frightening) list of string types and consequently a vast net of combinations that only graph theorists can appreciate. > So it is realy more complex as I thought... Yes. And you have not yet explored the difficulties in code supporting both FPC 2.6.4 and 3+ and LCL 1.4 and 1.6. Although Lazarus recommends to "simply" use UTF-8, technically it recommends AnsiString, DefaultSystemCodepage CP_UTF8, no explicit codepage, and the UTF-8 functions in LazUtils. If you need to use other string types in an unit you might want to add an explicit codepage. Maybe a paragraph should be added to the wiki about using non AnsiString with the "Lazarus UTF-8". Mattias -- ___ Lazarus mailing list Lazarus@lists.lazarus.freepascal.org http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus
Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default
On Thu, 31 Mar 2016 01:20:14 +0200 Bartwrote: >[...] > I was wondering why DefaultSystemCodepage would return CP_ACP on > Graemes FreeBsd with an UTF8 locale? The problem only exists on Windows (more exact: OS with system codepage<>CP_UTF8). Mattias -- ___ Lazarus mailing list Lazarus@lists.lazarus.freepascal.org http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus
Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default
On Thu, 31 Mar 2016 00:26:44 +0200 Bartwrote: > On 3/30/16, Juha Manninen wrote: >[...] > I think the statement in the wiki that {$codepage utf8} is not needed is > wrong. You can use UTF-8 without the {$codepage utf8}. But there are cases where it is needed. Maybe the wording can be improved. Mattias -- ___ Lazarus mailing list Lazarus@lists.lazarus.freepascal.org http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus
Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default
On 31.03.2016 at 00:48, "Bart" wrote:
>
> On 3/31/16, Graeme Geldenhuys wrote:
>
> > [~]$ echo $LANG
> > en_GB.UTF-8
>
> This is what I think is happening to your test (Sven can probably explain it better):

Jonas would probably be a better choice. Or the wiki page where the changes are documented (I don't know it offhand, but Jonas refers to it rather often :) )

Regards,
Sven
Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default
On 3/31/16, Maxim Ganetskywrote: > Lazarus switches DefaultSystemCodePage to 65001, so your example works > OK here without codepage directive (when inserted into LCL dependent > project, of course). To me it is unclear wether "your example" refers to Graeme or to me. Either way both of us simply used a plain fpc console application, compiled on commandline with fpc. I was wondering why DefaultSystemCodepage would return CP_ACP on Graemes FreeBsd with an UTF8 locale? Bart -- ___ Lazarus mailing list Lazarus@lists.lazarus.freepascal.org http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus
Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default
On 31.03.2016 at 1:48, Bart wrote:
> On 3/31/16, Graeme Geldenhuys wrote:
>> [~]$ echo $LANG
>> en_GB.UTF-8
>
> This is what I think is happening to your test (Sven can probably explain it better):
>
> Since your locale is UTF8, CP_ACP and CP_UTF8 refer to the same codepage, therefore the contents of S1 in either case is correct.
>
> Without the codepage identifier on Windows, after the assignment S1 contains a UTF8 byte-sequence, but the compiler treats it as CP_ACP, and therefore on my system as codepage 1252, and now the string has a completely different meaning.

Lazarus switches DefaultSystemCodePage to 65001, so your example works OK here without the codepage directive (when inserted into an LCL-dependent project, of course).

--
Best regards,
Maxim Ganetsky
mailto:gan...@narod.ru
Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default
On 3/31/16, Graeme Geldenhuys wrote:
> [~]$ echo $LANG
> en_GB.UTF-8

This is what I think is happening in your test (Sven can probably explain it better):

Since your locale is UTF8, CP_ACP and CP_UTF8 refer to the same codepage, so the contents of S1 are correct in either case.

On Windows, without the codepage identifier, S1 contains a UTF8 byte sequence after the assignment, but the compiler treats it as CP_ACP, which on my system means codepage 1252, and now the string has a completely different meaning.

Bart
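
The effect Bart describes, UTF-8 bytes being read as cp1252, can be reproduced on any platform by relabelling a byte string by hand. The sketch below is an illustration under that assumption and not the code from the thread; on Unix it relies on cwstring for the conversion.

```pascal
program relabel;
{$mode objfpc}{$H+}
{$ifdef unix}
uses cwstring;  // widestring manager, needed for codepage conversions on Unix
{$endif}

var
  raw: RawByteString;
  u: UnicodeString;
  i: Integer;

begin
  // The five UTF-8 bytes of "ÄAÄ", written as byte escapes so that the
  // source file encoding plays no role.
  raw := #$C3#$84#$41#$C3#$84;

  // Relabel the bytes as cp1252 without converting them (Convert = False).
  SetCodePage(raw, 1252, False);

  // Decode according to the label; each byte is now read as its own character.
  u := UnicodeString(raw);

  for i := 1 to Length(u) do
    write('U+', HexStr(Ord(u[i]), 4), ' ');
  writeln;
end.
```

Where the cp1252 conversion is available, the loop should list five code units instead of the three characters the bytes were meant to encode.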
Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default
On 2016-03-30 23:29, Bart wrote:
> DefaultSystemCodePage = 0?
>
> What system are you on (Linux, Windows) and what locale?

64-bit FreeBSD 10.1 with English (UK) locale. I also used the released FPC 3.0.0 compiler, installed from the official .tar file.

  [~]$ echo $LANG
  en_GB.UTF-8

My test program is attached. Compiled with: fpc test.pas

Regards,
- Graeme -

--
fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal
http://fpgui.sourceforge.net/
My public PGP key: http://tinyurl.com/graeme-pgp

  program test;

  {$mode objfpc}{$H+}
  {$codepage utf8}

  uses
    SysUtils, StrUtils;

  function StrToHex(const s: string): string;
  begin
    setlength(Result, 2*length(s));
    BinToHex(PChar(s), PChar(Result), length(s));
  end;

  const
    TestUtf8 = 'ÄAÄ';

  var
    s1: String;

  begin
    writeln('DefaultSystemcodePage = ',DefaultSystemcodePage);
    writeln('TestUtf8 = ',StrToHex(TestUtf8));
    s1 := TestUtf8;
    writeln('S1 = ',StrToHex(S1),' [',StringCodePage(S1),']');
    writeln(S1); //will trigger automatic codepage conversion to console's codepage when needed
  end.
Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default
On 3/30/16, Graeme Geldenhuyswrote: > Just thought I would let you know that with or without the {$codepage > utf8}, your code works just fine here. Source code is saved in a UTF-8 > encoding with no BOM marker. > > > [tmp]$ fpc test.pas > Free Pascal Compiler version 3.0.0 [2015/11/16] for x86_64 > Copyright (c) 1993-2015 by Florian Klaempfl and others > > [tmp]$ ./test > DefaultSystemcodePage = 0 DefaultSystemCodePage = 0? What system are you on (Linux, Windows) and what locale? I'm on Windows with Dutch locale (Windows codepage = cp1252) Bart -- ___ Lazarus mailing list Lazarus@lists.lazarus.freepascal.org http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus
Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default
On 3/30/16, Juha Manninenwrote: > If your "s1" is a plain String then something has changed. IIRC it worked > well. It is a plain string. And it behaves like the quote said. The compiler treats my sourcefile as ACP > I am out of energy for the string encoding issue and I don't even have > a proper Windows system to test with. > Could maybe Mattias, you, Michl and whoever take care of the issue please. I think the statement in the wiki that {$codepage utf8} is not needed is wrong. Bart -- ___ Lazarus mailing list Lazarus@lists.lazarus.freepascal.org http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus
Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default
On 2016-03-30 17:16, Bart wrote:
> Without {$codepage utf8} it outputs:
> DefaultSystemcodePage = 1252
> TestUtf8 = $C3 $84 $41 $C3 $84
> S1 = $C3 $84 $41 $C3 $84 [0]
> Ã"AÃ"
>
> The compiler treats my source as if it were written in my system's codepage.
> With cp1252 S1 now contains garbage (Ã"AÃ"). (at least not what I expected it to be)

Just thought I would let you know that with or without the {$codepage utf8}, your code works just fine here. Source code is saved in a UTF-8 encoding with no BOM marker.

  [tmp]$ fpc test.pas
  Free Pascal Compiler version 3.0.0 [2015/11/16] for x86_64
  Copyright (c) 1993-2015 by Florian Klaempfl and others

  [tmp]$ ./test
  DefaultSystemcodePage = 0
  TestUtf8 = C38441C384
  S1 = C38441C384 [0]
  ÄAÄ

[ adding $codepage and testing again.]

  [tmp]$ fpc test.pas
  Free Pascal Compiler version 3.0.0 [2015/11/16] for x86_64
  Copyright (c) 1993-2015 by Florian Klaempfl and others

  [tmp]$ ./test
  DefaultSystemcodePage = 0
  TestUtf8 = C38441C384
  S1 = C38441C384 [65001]
  ÄAÄ

Regards,
- Graeme -

--
fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal
http://fpgui.sourceforge.net/
My public PGP key: http://tinyurl.com/graeme-pgp
Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default
On Wed, Mar 30, 2016 at 7:16 PM, Bartwrote: > [...] > I would say that this experiment contradicts the statement in > http://wiki.freepascal.org/Better_Unicode_Support_in_Lazarus#String_Literals > ? If your "s1" is a plain String then something has changed. IIRC it worked well. I am out of energy for the string encoding issue and I don't even have a proper Windows system to test with. Could maybe Mattias, you, Michl and whoever take care of the issue please. Juha -- ___ Lazarus mailing list Lazarus@lists.lazarus.freepascal.org http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus
Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default
On 3/30/16, Juha Manninen wrote:
> Do your files have UTF-8 encoding? It is a necessity for the Unicode system to work.

Yes, all my code is either from Lazarus or from my own editor (which is a synedit).

> Any valid UTF-8 string should work, including diacritics.

Without the codepage identifier?

Quote from http://wiki.freepascal.org/FPC_Unicode_support#String_constants:
"Normally, a string constant is interpreted according to the source file codepage. If the source file codepage is CP_ACP, a default is used instead: in that case, during conversions the constant strings are assumed to have code page 28591 (ISO 8859-1 Latin 1; Western European). "
...
"From the above it follows that to ensure predictable interpretation of string constants in your source code, it is best to either include an explicit {$codepage xxx} directive (or use the equivalent -Fc command line option), or to save the source code in UTF-8 with a BOM. "

AFAIK the IDE does not save the file with a BOM, so the compiler may very well decide that my sourcefile has ACP codepage?

Consider this test sourcefile (encoded as UTF8 without BOM):

  const
    TestUtf8 = 'ÄAÄ';
  begin
    writeln('DefaultSystemcodePage = ',DefaultSystemcodePage);
    writeln('TestUtf8 = ',StrToHex(TestUtf8));
    s1 := TestUtf8;
    writeln('S1 = ',StrToHex(S1),' [',StringCodePage(S1),']');
    writeln(S1); //will trigger automatic codepage conversion to console's codepage when needed
  end.

Without {$codepage utf8} it outputs:

  DefaultSystemcodePage = 1252
  TestUtf8 = $C3 $84 $41 $C3 $84
  S1 = $C3 $84 $41 $C3 $84 [0]
  Ã"AÃ"

The compiler treats my source as if it were written in my system's codepage. With cp1252, S1 now contains garbage (Ã"AÃ"), or at least not what I expected it to be.

With the proper {$codepage utf8} inserted it will output:

  DefaultSystemcodePage = 1252
  TestUtf8 = $C3 $84 $41 $C3 $84
  S1 = $C3 $84 $41 $C3 $84 [65001]
  ÄAÄ

I would say that this experiment contradicts the statement in http://wiki.freepascal.org/Better_Unicode_Support_in_Lazarus#String_Literals ?

Bart
Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default
On Wed, Mar 30, 2016 at 3:12 PM, Michael W. Vogel wrote:
>> The cases fail with UTF-8 file encoding.
> I don't understand this.

I meant that some cases fail even when the file encoding is UTF-8. File encoding is not the issue.

>> http://wiki.freepascal.org/Better_Unicode_Support_in_Lazarus#String_Literals
>
> And the first example there is wrong (or the words "and without" need to be removed). With no defined codepage
> const s: string = 'äöü';
> has codepoints of a UTF-8 String, the codepage is 0. If you assign it to a string with a declared codepage, you get a corrupted string. See my example.

I edited the page and separated 2 cases:

  const s = 'äöü';

and

  const s: string = 'äöü';

Please check. In a forum discussion it turned out that they behave differently. It may even be a compiler bug. I cannot test it right now. Could you please edit the page as needed.

Anyway, how often do you need to assign a constant to a string that has an explicitly declared codepage?

Juha
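
The two declarations Juha separates on the wiki page can be compared directly. The program below is a minimal sketch for doing so, assuming the file is saved as UTF-8 without BOM; the names Untyped and Typed are chosen here for illustration, and no particular result is claimed, since the thread itself leaves open whether the difference is a compiler bug.

```pascal
program constkinds;
{$mode objfpc}{$H+}
// Run once as-is and once with {$codepage utf8} added below this line.

const
  Untyped = 'äöü';          // untyped constant
  Typed: String = 'äöü';    // typed constant

var
  a, b: String;

begin
  a := Untyped;
  b := Typed;
  // Print the dynamic codepage and the byte length of each string.
  writeln('untyped: codepage ', StringCodePage(a), ', length ', Length(a));
  writeln('typed:   codepage ', StringCodePage(b), ', length ', Length(b));
end.
```

Differences between the two output lines, or between the two runs, would point to the behaviour discussed in the forum thread.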
Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default
On 30.03.2016 at 11:23, "Juha Manninen" wrote:
>
> Ok, FPC had UnicodeString earlier than I remembered.
> Currently WideString is often used with WinAPI when UnicodeString should be used, as Marco reminded in another discussion.

The WinAPI does not know UnicodeString. It only knows PWideChar, or BSTR for COM, which is handled in Pascal by WideString.

Regards,
Sven
Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default
On 30.03.2016 at 13:02, Juha Manninen wrote:
> Conversions in your testproject may work, but you ignored the forum link I gave earlier. There "malcome" gave examples that fail.

I didn't ignore it, but I'm not that fast at testing the examples there. I'll try them myself with and without the codepage definition. Thanks for that link.

> BTW, the hack is not made by LCL but by LazUtils which can be used also with cmd line / server programs.

You are right. I only use LazUtils in combination with the component library, my mistake. Thanks for clearing this up.

>> I also don't want it. I want a added {$codepage UTF8}, if the file is saved as a UTF-8 encoded one.
>
> The cases fail with UTF-8 file encoding.

I don't understand this.

> The issue with constant string encodings is more complex than you seem to understand.

That's true, and I've run dozens of tests, spent a lot of time on it, and try to help other people with it ...

> It is explained here somehow: http://wiki.freepascal.org/Better_Unicode_Support_in_Lazarus#String_Literals

And the first example there is wrong (or the words "and without" need to be removed). With no defined codepage,

  const s: string = 'äöü';

holds the code points of a UTF-8 string, but its codepage is 0. If you assign it to a string with a declared codepage, you get a corrupted string. See my example.

BTW, I forgot ShortStrings in my example, and maybe PChar. I'll add them (if possible), test, and report later.

Thanks for your attention and time
Kind regards
Michl
Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default
On Wed, Mar 30, 2016 at 1:03 PM, Bart wrote:
> The IDE at least runs fine (in my locale on Windows) with -FcUTF8.

The Lazarus IDE has no string constants beyond 7-bit ASCII, so the encoding obviously does not matter there.

> (I have it there because I build all my projects with this define,
> because almost all of them contain some strings with diacritics)

Do your files have UTF-8 encoding? That is a necessity for the Unicode system to work. Any valid UTF-8 string should work, including diacritics.

> Curious though: in what scenario's does it fail?

See the forum link I gave earlier.

Juha
Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default
On Wed, Mar 30, 2016 at 12:38 PM, Michael W. Vogel wrote:
> With the hack that the LCL makes and the added {$codepage UTF8} all
> conversions work like a charm (see added testproject).

Conversions in your test project may work, but you ignored the forum link I gave earlier. There "malcome" gave examples that fail.
BTW, the hack is not made by the LCL but by LazUtils, which can also be used with command-line / server programs.

>> LCL applications nowadays use CP_UTF8 as default. We (laz team) tested
>> adding -FcUTF8 and it failed in too many cases. Also it adds some overhead.
>> So we decided to *not* add it by default.
>
> I also don't want it. I want a added {$codepage UTF8}, if the file is saved
> as a UTF-8 encoded one.

The cases fail with UTF-8 file encoding.
All files created by the Lazarus IDE are saved with UTF-8 encoding by default, so you would get {$codepage UTF8} in every file, which is the same as -FcUTF8 for the whole project.
Adding -FcUTF8 is already extremely easy. There is a button for it on the Project Options -> Custom Options page.

The issue with constant string encodings is more complex than you seem to understand. It is explained here somehow:
http://wiki.freepascal.org/Better_Unicode_Support_in_Lazarus#String_Literals

Juha
Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default
On 3/30/16, Mattias Gaertner wrote:
> LCL applications nowadays use CP_UTF8 as default. We (laz team) tested
> adding -FcUTF8 and it failed in too many cases. Also it adds some overhead.
> So we decided to *not* add it by default.

The IDE at least runs fine (in my locale on Windows) with -FcUTF8.
(I have it there because I build all my projects with this define, since almost all of them contain some strings with diacritics.)
I'm too lazy just yet to insert proper codepage defines in the relevant places.

Curious though: in what scenarios does it fail?

Bart
Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default
On 30.03.2016 at 10:13, Juha Manninen wrote:
> I don't know what is a Predefined String

Sorry, I was not clear. I mean a string with a declared codepage:
http://wiki.freepascal.org/FPC_Unicode_support#Declared_code_page
Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default
On 30.03.2016 at 10:11, Mattias Gaertner wrote:
> You have to distinguish between with CP_UTF8 as default and with CP_ACP as default.

Yes, I know. What I mean by default: go to Project -> New Project ... -> Application. Now a new application is created. With the attached patch, {$codepage UTF8} is added, and that is right, because wherever you save the file, it is UTF-8 encoded. There is nothing wrong.

> Both have cases where some string combinations fail.

With the hack that the LCL makes and the added {$codepage UTF8}, all conversions work like a charm (see attached test project).

If you want to use -dDisableUTF8RTL, you have to know what you are doing:
- remove {$codepage UTF8}, or better, set it to the valid codepage or use -FcCP...
- save all the source files with the wanted encoding/codepage
So IMHO this special case wouldn't be used much. All the puzzled discussions I see in the forums are about the default applications created with Lazarus (not FPC), which use the LCL.

> LCL applications nowadays use CP_UTF8 as default. We (laz team) tested adding -FcUTF8 and it failed in too many cases. Also it adds some overhead. So we decided to *not* add it by default.

I also don't want that. I want an added {$codepage UTF8} if the file is saved as a UTF-8 encoded one.

> > Offtopic: In the added project: Why is a const 'abc' with {$codepage UTF8} a Unicodestring (Windows7, 64bit, Lazarus 1.7 r52077M FPC 3.1.1 i386-win32-win32/win64 on FPC 3.1.1 r33371)?
> There is no compile-time flag to tell the compiler what codepage the system is using at runtime. So it assumes current Windows codepage. Any string literal not in this codepage is stored as UTF-16.

Thanks for that hint!
Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default
On Wednesday 30 March 2016 11:23:49 Juha Manninen wrote:
> > If one wants to handle BMP-chars comfortably and with good performance
> > one has to convert from utf-8 in AnsiString to UnicodeString first.
>
> Maybe, but BMP-chars are not enough for a proper Unicode support.

But they are enough for Russian and German pupils' homework; UTF-8 code units are not. Please read lazarusforum.de. ;-)

Martin
Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default
Ok, FPC had UnicodeString earlier than I remembered.
Currently WideString is often used with WinAPI when UnicodeString should be used, as Marco reminded in another discussion.

Anyway, the problems found by Michael W. Vogel and "malcome" all deal with constants. Assignment between variables always works thanks to their dynamic encoding.
If there is doubt about how a constant is interpreted, it can first be assigned to a "String" type variable, which can then be typecast to a WinAPI UnicodeString parameter or whatever. In the worst case an extra "String" variable is needed.
Or, if one wants to use UnicodeString constants in a unit, one can add {$codepage utf8}. No big deal.
IMO the problems are exaggerated.

On Wed, Mar 30, 2016 at 11:58 AM, Martin Schreiber wrote:
> If one wants to handle BMP-chars comfortably and with good performance one has
> to convert from utf-8 in AnsiString to UnicodeString first.

Maybe, but BMP-chars are not enough for proper Unicode support.
Besides, dealing with codepoints is the easy part regardless of encoding. The associated problems are exaggerated again. The true complexity of Unicode is beyond codepoints.

Juha
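Editorial sketch, not part of Juha's message: the second option he mentions, a UnicodeString constant in a unit that carries {$codepage utf8}, might look roughly like this. It assumes FPC 3.x and a source file actually saved as UTF-8; the program and constant names are hypothetical. As Mattias explains elsewhere in this thread, with the directive set the compiler can store a non-ASCII literal as UTF-16, which is the form a UnicodeString constant needs.

```pascal
program UnicodeConstSketch; // hypothetical name, for illustration only
{$mode objfpc}{$H+}
{$codepage utf8} // tells the compiler the source literals are UTF-8

const
  // With the directive above, the literal is decoded at compile time
  // rather than copied byte for byte.
  Greeting: UnicodeString = 'äöü';

begin
  // Length counts UTF-16 code units for a UnicodeString.
  WriteLn(Length(Greeting));
end.
```

The thread's conclusion still applies: the directive affects every literal in the unit, including those assigned to ShortString and PChar, so it is something to add per unit when needed, not everywhere.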
Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default
On Wednesday 30 March 2016 10:13:36 Juha Manninen wrote:
> With Unicodestring we don't need to care about backwards compatibility
> really because it is so new type.

Ouch!
WideString was introduced in Delphi 4, IIRC. FPC had a 16-bit string type, reference counted on all platforms, which worked like the current UnicodeString; IIRC it was around version 1.8 when FPC introduced it. Kylix WideString (Linux) was also reference counted. Later FPC changed WideString on Windows (against my strong opposition, mind you ;-) ) to the non-reference-counted OLE string. A little later FPC added UnicodeString, again reference counted on all platforms.
So one can say that at the moment when Lazarus became Unicode capable, a UnicodeString-like string type was available in FPC. It was very buggy, which was probably one of the reasons Lazarus used UTF-8 in AnsiString instead. For MSEgui, on the other hand, I used WideString/UnicodeString from the beginning and wrote FPC bug reports until FPC WideString became production ready.

> What more, Unicodestring is not needed often when using our new Unicode
> system.

If one wants to handle BMP chars comfortably and with good performance, one has to convert from UTF-8 in AnsiString to UnicodeString first.

Martin
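Editorial sketch, not part of Martin's message: the conversion he refers to can be illustrated with UTF8Decode from the System unit, assuming FPC 3.x. The UTF-8 bytes for 'äöü' are written as explicit character codes so the result does not depend on how the source file is saved; the program and variable names are made up for illustration.

```pascal
program BmpIndexingSketch; // hypothetical name, for illustration only
{$mode objfpc}{$H+}

var
  Utf8Text: RawByteString;
  WideText: UnicodeString;
begin
  // 'äöü' spelled out as UTF-8 bytes: each character takes two bytes.
  Utf8Text := #$C3#$A4#$C3#$B6#$C3#$BC;

  // UTF8Decode interprets the bytes as UTF-8 and returns UTF-16.
  WideText := UTF8Decode(Utf8Text);

  WriteLn(Length(Utf8Text)); // 6 bytes (UTF-8 code units)
  WriteLn(Length(WideText)); // 3 UTF-16 code units

  // In the UTF-16 form, WideText[2] is the whole character 'ö';
  // in the UTF-8 form, Utf8Text[2] is only the second byte of 'ä'.
  WriteLn(HexStr(Ord(WideText[2]), 4)); // 00F6
end.
```

This one-index-per-character property holds only for characters in the BMP, which is Juha's objection; characters outside it take two UTF-16 code units.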
Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default
No, originally we had -FcUTF8 set by default but it caused more problems. See:
http://forum.lazarus.freepascal.org/index.php?topic=30022

> In the most cases the string magic works without a defined {$codepage utf8},
> but not if you want to assign a const to a Predefined String or Unicodestring.

I don't know what a Predefined String is, but assigning a const to a UnicodeString can be seen as a special case, and then the programmer can take special action (add {$codepage utf8} himself).
Leaving out {$codepage utf8} is the most backwards compatible way, and the behaviour is the most intuitive.
With UnicodeString we don't really need to care about backwards compatibility because it is such a new type.
What's more, UnicodeString is not often needed when using our new Unicode system.

Juha
Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default
> "Michael W. Vogel" wrote on 29 March 2016 at 23:20:
> [...]
> I'm thinking about the thread here
> http://forum.lazarus.freepascal.org/index.php/topic,31939.msg206688.html#msg206688
> [...]
> In the most cases the string magic works without a defined {$codepage utf8},
> but not if you want to assign a const to a Predefined String or Unicodestring.

You have to distinguish between CP_UTF8 as default and CP_ACP as default.
Both have cases where some string combinations fail.

LCL applications nowadays use CP_UTF8 as default. We (laz team) tested adding -FcUTF8 and it failed in too many cases. It also adds some overhead. So we decided *not* to add it by default.

> Offtopic: In the added project: Why is a const 'abc' with {$codepage UTF8} a
> Unicodestring (Windows7, 64bit, Lazarus 1.7 r52077M FPC 3.1.1
> i386-win32-win32/win64 on FPC 3.1.1 r33371)?

There is no compile-time flag to tell the compiler what codepage the system is using at runtime. So it assumes the current Windows codepage. Any string literal not in this codepage is stored as UTF-16.

Mattias