Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default

2016-03-31 Thread Michael W. Vogel



On 31.03.2016 at 17:04, Juha Manninen wrote:

> Anyway, the original issue was about inserting {codepage UTF8}
> automatically to every unit.
> We can conclude it is not a good idea. It does not solve anything when
> using plain constants with default String type but adds conversion
> overhead. It breaks things when using constants with ShortString and
> PChar.

I'm on your side. It was a good thing to ask here, because it showed
that the disadvantages outweigh the advantages.


The issue is clear to me now, and I can see the same results in my tests too.

Thank you very much

Kind regards

Michl

--
___
Lazarus mailing list
Lazarus@lists.lazarus.freepascal.org
http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus


Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default

2016-03-31 Thread Michael W. Vogel

On 31.03.2016 at 12:44, Mattias Gaertner wrote:

> On Thu, 31 Mar 2016 00:16:13 +0200
> "Michael W. Vogel"  wrote:
>
>> [...]
>> I've tested the example too and I got different results with different
>> options. The test was:
>> - BOM / no BOM at the beginning of the sourcefile
>> - {$codepage UTF8} or not
>
> The compiler understands -FcUTF8, {$codepage utf8}
> and BOM. All three set UTF-8. See here:
> http://wiki.freepascal.org/FPC_Unicode_support#Source_file_codepage
>
> BOM has the advantage that it is understood by other text editors as
> well and the disadvantage that it is hidden, so that people unaware
> of encodings are easily confused.
>
> -FcUTF8 has the advantage of applying it to all sources in the
> project/package and it can easily be turned off. You can unset it for a
> single unit via {$modeswitch systemcodepage}.
>
>> - fpc -MObjFPC *-Sh* test.pas (with / without -Sh (use reference counted
>> strings))
>
> And this is where the confusion starts. Mixing multiple string
> types is asking for trouble. FPC has an impressive (aka frightening)
> list of string types and consequently a vast net of combinations that
> only graph theorists can appreciate.
>
>> So it is really more complex than I thought...
>
> Yes.
> And you have not yet explored the difficulties in code supporting
> both FPC 2.6.4 and 3+ and LCL 1.4 and 1.6.
>
> Although Lazarus recommends "simply" using UTF-8,
> technically it recommends AnsiString, DefaultSystemCodepage CP_UTF8, no
> explicit codepage, and the UTF-8 functions in LazUtils.
> If you need to use other string types in a unit you might want to add
> an explicit codepage. Maybe a paragraph should be added to the wiki
> about using non-AnsiString types with the "Lazarus UTF-8".
>
> Mattias



Thank you very much, for your detailed answer!

I'll try to run some more tests to understand why a BOM for UTF-8 behaves
differently from {$codepage UTF8}.


BTW, the conversions here have nothing to do with Lazarus; this is purely an
FPC issue. If I can't find an answer myself, I'll ask on the FPC
mailing list.


Thanks again

Kind regards

Michl



Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default

2016-03-31 Thread Bart
On 3/31/16, Juha Manninen  wrote:

>> In my fantasy scenario the String would of course have the meaning of
>> UnicodeString.
>
> That is not anyhow better (or worse) inherently than a UTF-8 based
> solution.

No, but I don't see fpc moving towards String equals AnsiString(CP_UTF8).
It would be hugely Delphi incompatible.

For me personally Delphi compatibility does not matter at all, I have
left Delphi and won't return.
But from the fpc side, breaking compatibility in such a way is
probably going to be a big no-no.

But you are right, we are getting off-topic more and more.
Sorry for that.

Bart



Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default

2016-03-31 Thread Juha Manninen
On Thu, Mar 31, 2016 at 5:20 PM, Bart  wrote:
> In my fantasy scenario the String would of course have the meaning of
> UnicodeString.

That is not inherently any better (or worse) than a UTF-8 based solution.
Delphi just happened to implement it so, for various reasons.
The surprise is that our system is so Delphi compatible even while
having a different encoding.

> The tests I posted in this thread were plain fpc programs,
> so no use of LazUtf8. It pointed out that the Lazarus part of
> the wiki (Better Unicode support) could be interpreted wrong.

No. You interpreted it wrong for some reason. The page is only about
the new Unicode support which is very clearly mentioned there!
You don't have to use the UTF-8 mode; doing without it is explained on another page:
  http://wiki.freepascal.org/Lazarus_with_FPC3.0_without_UTF-8_mode

The main message however is that the new mode should be used unless
there is a very good reason not to. That's why it was made default
when LazUtils / LazUTF8 is used.
For console apps you must add the dependency / unit explicitly but it
does not change any facts about the mode.

Anyway, the original issue was about inserting {codepage UTF8}
automatically to every unit.
We can conclude it is not a good idea. It does not solve anything when
using plain constants with default String type but adds conversion
overhead. It breaks things when using constants with ShortString and
PChar.

It only improves things with UnicodeString constants which is not
necessarily needed at all, but can be used with added {codepage UTF8}.
Simple, no hassle!

Besides I feel the problems are exaggerated again.
The problems discussed here are only about constants.
The automatic conversion between variables always works.

Juha



Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default

2016-03-31 Thread Bart
On 3/31/16, Juha Manninen  wrote:

> I doubt you will change every "String" into "UnicodeString" in your code.
> Somehow you missed the fundamental idea of our new Unicode system.
> "String" has Unicode and you don't need to care about it, or even
> about endianness.
> Delphi reaches the same goal by mapping String -> UnicodeString.
> When you need Ansi codepages then you need to pay attention obviously,
> otherwise not.

In my fantasy scenario the String would of course have the meaning of
UnicodeString.

> Bart, were your earlier results caused by NOT using the new Unicode
> support?
The tests I posted in this thread were plain fpc programs, so no use of LazUtf8.
It pointed out that the Lazarus part of the wiki (Better Unicode
support) could be interpreted wrong.

> The whole discussion was about the new Unicode support and you have
> been testing it since the beginning.

I have used "Utf8 in RTL" ever since I switched to the 3.0 compiler
(I have to admit I was too scared to test the 2.7 branch, but I started
with the first 3.0 RC), and the number of bugs we had to fix was far,
far smaller than I thought it would be.

Most of my own programs seem to require no change at all.
The main exception is my backup program, not because it stopped
correctly accessing filenames with Unicode characters in their path,
but because I used a procedural parameter in its engine, for which
the signature had now changed from plain string to either RawByteString or
UnicodeString.
All this was solved (with ifdefs for the 2.6 compiler) quite easily.
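
The ifdef approach mentioned above could look roughly like the following sketch. It is only an illustration: TPathString, TPathCallback, ShowPath and Walk are made-up names, and the real signature in the backup program is not shown in this thread. FPC_FULLVERSION is the compiler's built-in version macro (30000 for FPC 3.0.0).

```pascal
program callbackcompat;

{$mode objfpc}{$H+}

type
  // Pick the string type of the callback parameter depending on the
  // compiler version, so the same source builds with FPC 2.6.x and 3.x.
{$IF FPC_FULLVERSION >= 30000}
  TPathString = RawByteString;
{$ELSE}
  TPathString = AnsiString;
{$ENDIF}

  // Hypothetical callback type used by a file-walking engine.
  TPathCallback = procedure(const Path: TPathString);

procedure ShowPath(const Path: TPathString);
begin
  WriteLn('visiting: ', Path);
end;

// Stand-in for the engine: calls the callback once with a fixed name.
procedure Walk(Callback: TPathCallback);
begin
  Callback('example.txt');
end;

begin
  Walk(@ShowPath);
end.
```

Keeping the version check in a single type alias means the callback declarations themselves need no further ifdefs.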

And as I pointed out to you earlier in another thread in the forum,
the "new UTF8 system" works even better than I thought.

Then again, I do not use databases in any of my programs, so I don't
have to deal with that part of the problem.
All my textual data is stored in plain textfiles in UTF8 encoding,
probably from the day I started using Lazarus as my main platform
(0.9.16).

Anyhow, it's a fascinating issue, and the more I understand about it,
the better I am able to fix encoding related problems in our (or
user's) code.

Bart



Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default

2016-03-31 Thread Juha Manninen
On Thu, Mar 31, 2016 at 4:25 PM, Bart  wrote:
> in this scenario adding {$codepage utf8} may be the wise thing to do:
> it eliminates all confusion about the intended encoding of the string 
> constant.

How is a conversion to UTF-16 and then back to UTF-8 less confusing
than a direct copy without conversions?

> When you use UnicodeString everywhere and no AnsiString anywhere, then
> the only confusion left is endianness, or am I (as one of the universe's
> idiots) oversimplifying here.

I doubt you will change every "String" into "UnicodeString" in your code.
Somehow you missed the fundamental idea of our new Unicode system.
"String" has Unicode and you don't need to care about it, or even
about endianness.
Delphi reaches the same goal by mapping String -> UnicodeString.
When you need Ansi codepages then you need to pay attention obviously,
otherwise not.

Bart, were your earlier results caused by NOT using the new Unicode support?
I am surprised if that is the case.
The whole discussion was about the new Unicode support and you have
been testing it since the beginning.

Juha



Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default

2016-03-31 Thread Mattias Gaertner
On Thu, 31 Mar 2016 15:25:03 +0200
Bart  wrote:

> On 3/31/16, Mattias Gaertner  wrote:
> 
> >> Will all this mess go away if we would go the Delphi way
> >> (String=UnicodeString)?
> >> (I know *nix users are going to hate me now)
> >
> > Which mess do you mean?
> > As long as you have to consider codepages, you can get a mess.
> 
> When you use UnicodeString everywhere and no AnsiString anywhere, then
> the only confusion left is endianness, or am I (as one of the universe's
> idiots) oversimplifying here.

No, you are right. If you somehow(TM) manage to work only with
UnicodeString you avoid the mess. The same holds if you only use UTF-8
strings, or if you only work in the system codepage like in TP 3.0 times.

The problem is that you often have to work with
databases/files/libs/etc in other encodings.
So there is always a little bit of mess.

Mattias



Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default

2016-03-31 Thread Bart
On 3/31/16, Mattias Gaertner  wrote:

>> Will all this mess go away if we would go the Delphi way
>> (String=UnicodeString)?
>> (I know *nix users are going to hate me now)
>
> Which mess do you mean?
> As long as you have to consider codepages, you can get a mess.

When you use UnicodeString everywhere and no AnsiString anywhere, then
the only confusion left is endianness, or am I (as one of the universe's
idiots) oversimplifying here?

In TP 3.0 I didn't have to deal with all this (but then again the
guide that came with it was in Hebrew, which was a bit difficult for
me).

Bart



Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default

2016-03-31 Thread Mattias Gaertner
On Thu, 31 Mar 2016 14:32:27 +0200
Bart  wrote:

> On 3/31/16, Mattias Gaertner  wrote:
>[...]
> So, when my use case for string constants with diacritics in real life
> most of the time is just captions for buttons/menus etc., the extra
> overhead will not really be something to worry about I guess, and in
> this scenario adding {$codepage utf8} may be the wise thing to do: it
> eliminates all confusion about the intended encoding of the string
> constant.

Well, I'm not so sure about the "eliminates all confusion". As Rick Cook
said:
"Programming today is a race between software engineers striving to
build bigger and better idiot-proof programs, and the universe trying
to build bigger and better idiots. So far, the universe is winning."

 
> So, my current intended approach for GUI applications will be:
> - declare all strings as just String
> - have string constants with Unicode characters all in one file and add
> {$codepage utf8} to that file, and then don't use -FcUTF8 anymore
> (which is what I'm doing ATM),
> 
> That should be rather safe then I guess.

Yes. 
If you avoid PChar(Literal), invalid UTF-8 and #0.

 
> Will all this mess go away if we would go the Delphi way 
> (String=UnicodeString)?
> (I know *nix users are going to hate me now)

Which mess do you mean?
As long as you have to consider codepages, you can get a mess.

Mattias



Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default

2016-03-31 Thread Bart
On 3/31/16, Mattias Gaertner  wrote:

>> AFAIK the IDE does not save the file with a BOM, so the compiler may
>> very well decide that my sourcefile has ACP codepage?
>
> Yes and no.
> When the compiler assumes ACP, it treats the string specially. It does
> not convert it and stores it as a byte copy. At runtime the string has
> CP_ACP and its codepage is defined by the variable
> DefaultSystemCodePage. LazUTF8 sets this to CP_UTF8, so the string is
> treated as UTF-8. Note that it does that without any conversion.
>
> OTOH when you tell the compiler that the source is UTF-8, it converts
> the literal to UTF-16. At runtime it converts the string back to UTF-8.
> It does that every time you assign the literal.
>
> So, with both you get a UTF-8 string, but the latter has a bit more
> overhead. Also the latter needs special care when typecasting (e.g.
> PChar).

So, when my use case for string constants with diacritics in real life
most of the time is just captions for buttons/menus etc., the extra
overhead will not really be something to worry about I guess, and in
this scenario adding {$codepage utf8} may be the wise thing to do: it
eliminates all confusion about the intended encoding of the string
constant.

So, my current intended approach for GUI applications will be:
- declare all strings as just String
- have string constants with Unicode characters all in one file and add
{$codepage utf8} to that file, and then don't use -FcUTF8 anymore
(which is what I'm doing ATM),

That should be rather safe then I guess.
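
A minimal sketch of the "all Unicode constants in one file" idea follows. The unit name uiconsts and the constant names are hypothetical; the file is assumed to be saved as UTF-8, and the directive applies to this unit only.

```pascal
unit uiconsts; // hypothetical unit name, saved as uiconsts.pas in UTF-8

{$mode objfpc}{$H+}
{$codepage utf8} // tells the compiler that this source file is UTF-8

interface

const
  // Untyped string constants containing non-ASCII characters.
  CaptionOpen  = 'Öffnen';
  CaptionClose = 'Schließen';

implementation

end.
```

Other units would then add uiconsts to their uses clause and assign the constants to plain String variables or properties, without carrying a {$codepage} directive of their own.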

Will all this mess go away if we would go the Delphi way (String=UnicodeString)?
(I know *nix users are going to hate me now)

Bart



Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default

2016-03-31 Thread Mattias Gaertner
On Wed, 30 Mar 2016 18:16:32 +0200
Bart  wrote:

>[...]
> > Any valid UTF-8 string should work, including diacritics.
> Without the codepage identifier?

Yes, if you use LazUTF8.
If you don't use LazUTF8 and assign a literal to a UnicodeString you
need the codepage.

 
> Quote from http://wiki.freepascal.org/FPC_Unicode_support#String_constants:
> "Normally, a string constant is interpreted according to the source
> file codepage. If the source file codepage is CP_ACP, a default is
> used instead: in that case, during conversions the constant strings
> are assumed to have code page 28591 (ISO 8859-1 Latin 1; Western
> European). "

AFAIK this is not entirely correct. The string literals are assumed
to be system codepage, which does not need to be code page 28591.
I will ask on the fpc list.


> ...
> "From the above it follows that to ensure predictable interpretation
> of string constants in your source code, it is best to either include
> an explicit {$codepage xxx} directive (or use the equivalent -Fc
> command line option), or to save the source code in UTF-8 with a BOM.
> "
> 
> AFAIK the IDE does not save the file with a BOM, so the compiler may
> very well decide that my sourcefile has ACP codepage?

Yes and no.
When the compiler assumes ACP, it treats the string specially. It does
not convert it and stores it as a byte copy. At runtime the string has
CP_ACP and its codepage is defined by the variable
DefaultSystemCodePage. LazUTF8 sets this to CP_UTF8, so the string is
treated as UTF-8. Note that it does that without any conversion.

OTOH when you tell the compiler that the source is UTF-8, it converts
the literal to UTF-16. At runtime it converts the string back to UTF-8.
It does that every time you assign the literal.

So, with both you get a UTF-8 string, but the latter has a bit more
overhead. Also the latter needs special care when typecasting (e.g.
PChar).
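
The first case (no codepage directive, DefaultSystemCodePage switched to UTF-8 at runtime) can be tried in a plain FPC 3.0+ console program with a sketch like the one below. It assumes the file is saved as UTF-8 without a BOM and compiled without -Fc. The call to SetMultiByteConversionCodePage is meant to imitate what LazUTF8 does at startup; that LazUTF8 works exactly this way is an assumption here, not something stated in this thread.

```pascal
program literalcp;

{$mode objfpc}{$H+}
// Deliberately no {$codepage ...} directive in this file.

var
  s: String;
begin
  // Assumption: imitates the runtime switch LazUTF8 performs.
  SetMultiByteConversionCodePage(CP_UTF8);

  // The literal is stored as a byte copy of what is in the source file.
  s := 'ÄAÄ';

  WriteLn('DefaultSystemCodePage = ', DefaultSystemCodePage);
  WriteLn('StringCodePage(s)     = ', StringCodePage(s));
  WriteLn('Length(s) in bytes    = ', Length(s));
end.
```

Adding {$codepage utf8} at the top and running it again gives the second case for comparison.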


>[...] Consider this test sourcefile (encoded as UTF8 without BOM):
>[...]
> DefaultSystemcodePage = 1252
>[...]
> I would say that this experiment contradicts the statement in
> http://wiki.freepascal.org/Better_Unicode_Support_in_Lazarus#String_Literals
> ?

No contradiction, because this wiki page is about DefaultSystemcodePage
= CP_UTF8. 

Mattias



Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default

2016-03-31 Thread Mattias Gaertner
On Thu, 31 Mar 2016 00:16:13 +0200
"Michael W. Vogel"  wrote:

>[...]
> I've tested the example too and I got different results with different 
> options. The test was:
> - BOM / no BOM at the beginning of the sourcefile
> - {$codepage UTF8} or not

The compiler understands -FcUTF8, {$codepage utf8}
and BOM. All three set UTF-8. See here:
http://wiki.freepascal.org/FPC_Unicode_support#Source_file_codepage

BOM has the advantage that it is understood by other text editors as
well and the disadvantage that it is hidden, so that people unaware
of encodings are easily confused.

-FcUTF8 has the advantage of applying it to all sources in the
project/package and it can easily be turned off. You can unset it for a
single unit via {$modeswitch systemcodepage}.


> - fpc -MObjFPC *-Sh* test.pas (with / without -Sh (use reference counted 
> strings))

And this is where the confusion starts. Mixing multiple string
types is asking for trouble. FPC has an impressive (aka frightening)
list of string types and consequently a vast net of combinations that
only graph theorists can appreciate.

> So it is really more complex than I thought...

Yes. 
And you have not yet explored the difficulties in code supporting
both FPC 2.6.4 and 3+ and LCL 1.4 and 1.6.

Although Lazarus recommends "simply" using UTF-8,
technically it recommends AnsiString, DefaultSystemCodepage CP_UTF8, no
explicit codepage, and the UTF-8 functions in LazUtils.
If you need to use other string types in a unit you might want to add
an explicit codepage. Maybe a paragraph should be added to the wiki
about using non-AnsiString types with the "Lazarus UTF-8".
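
As a rough illustration of that recommendation, the sketch below uses a plain String, no codepage directive, and two functions from the LazUTF8 unit (UTF8Length and UTF8Copy). It assumes LazUtils is on the unit search path and the file is saved as UTF-8 without a BOM; the program name and variable names are made up.

```pascal
program lazutf8sketch;

{$mode objfpc}{$H+}

uses
  LazUTF8; // part of the LazUtils package

var
  s, first: String;
begin
  s := 'ÄAÄ';
  // UTF8Copy works on codepoints, Length works on bytes.
  first := UTF8Copy(s, 1, 1);

  WriteLn('DefaultSystemCodePage = ', DefaultSystemCodePage);
  WriteLn('bytes in s            = ', Length(s));
  WriteLn('codepoints in s       = ', UTF8Length(s));
  WriteLn('bytes in first char   = ', Length(first));
end.
```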

 
Mattias



Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default

2016-03-31 Thread Mattias Gaertner
On Thu, 31 Mar 2016 01:20:14 +0200
Bart  wrote:

>[...]
> I was wondering why DefaultSystemCodepage would return CP_ACP on
> Graeme's FreeBSD with a UTF8 locale?

The problem only exists on Windows (more exactly: any OS with system
codepage<>CP_UTF8).

Mattias



Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default

2016-03-31 Thread Mattias Gaertner
On Thu, 31 Mar 2016 00:26:44 +0200
Bart  wrote:

> On 3/30/16, Juha Manninen  wrote:
>[...]
> I think the statement in the wiki that {$codepage utf8} is not needed is 
> wrong.

You can use UTF-8 without the {$codepage utf8}. But there are cases
where it is needed. Maybe the wording can be improved.

Mattias



Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default

2016-03-31 Thread Sven Barth
On 31.03.2016 at 00:48, "Bart" wrote:
>
> On 3/31/16, Graeme Geldenhuys  wrote:
>
> > [~]$ echo $LANG
> > en_GB.UTF-8
>
> This is what I think is happening to your test (Sven can probably
> explain it better):

Jonas would probably be a better choice. Or the wiki page where the changes
are documented (don't know it right now, but Jonas refers to it rather
often :) )

Regards,
Sven


Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default

2016-03-30 Thread Bart
On 3/31/16, Maxim Ganetsky  wrote:

> Lazarus switches DefaultSystemCodePage to 65001, so your example works
> OK here without codepage directive (when inserted into LCL dependent
> project, of course).

To me it is unclear whether "your example" refers to Graeme or to me.
Either way both of us simply used a plain fpc console application,
compiled on the command line with fpc.

I was wondering why DefaultSystemCodepage would return CP_ACP on
Graeme's FreeBSD with a UTF8 locale?

Bart



Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default

2016-03-30 Thread Maxim Ganetsky

On 31.03.2016 at 1:48, Bart wrote:

> On 3/31/16, Graeme Geldenhuys  wrote:
>
>> [~]$ echo $LANG
>> en_GB.UTF-8
>
> This is what I think is happening to your test (Sven can probably
> explain it better):
>
> Since your locale is UTF8, CP_ACP and CP_UTF8 refer to the same
> codepage, therefore the contents of S1 in either case is correct.
>
> Without the codepage identifier on Windows, after the assignment S1
> contains a UTF8 byte-sequence, but the compiler treats it as CP_ACP,
> and therefore on my system as codepage 1252, and now the string has a
> completely different meaning.

Lazarus switches DefaultSystemCodePage to 65001, so your example works
OK here without codepage directive (when inserted into LCL dependent
project, of course).


--
Best regards,
 Maxim Ganetsky  mailto:gan...@narod.ru



Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default

2016-03-30 Thread Bart
On 3/31/16, Graeme Geldenhuys  wrote:

> [~]$ echo $LANG
> en_GB.UTF-8

This is what I think is happening to your test (Sven can probably
explain it better):

Since your locale is UTF8, CP_ACP and CP_UTF8 refer to the same
codepage, therefore the contents of S1 in either case is correct.

Without the codepage identifier on Windows, after the assignment S1
contains a UTF8 byte-sequence, but the compiler treats it as CP_ACP,
and therefore on my system as codepage 1252, and now the string has a
completely different meaning.

Bart



Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default

2016-03-30 Thread Graeme Geldenhuys
On 2016-03-30 23:29, Bart wrote:
> DefaultSystemCodePage = 0?
> 
> What system are you on (Linux, Windows) and what locale?


64-bit FreeBSD 10.1 with English (UK) locale. I also used FPC 3.0.0
released compiler installed from the official .tar file.

[~]$ echo $LANG
en_GB.UTF-8


My test program is attached. Compiled with:  fpc test.pas


Regards,
  - Graeme -

-- 
fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal
http://fpgui.sourceforge.net/

My public PGP key:  http://tinyurl.com/graeme-pgp

program test;

{$mode objfpc}{$H+}
{$codepage utf8}

uses
  SysUtils, StrUtils;

  function StrToHex(const s: string): string;
  begin
    setlength(Result, 2*length(s));
    BinToHex(PChar(s), PChar(Result), length(s));
  end;

const
  TestUtf8 = 'ÄAÄ';
var
  s1: String;
begin
  writeln('DefaultSystemcodePage = ',DefaultSystemcodePage);
  writeln('TestUtf8 = ',StrToHex(TestUtf8));
  s1 := TestUtf8;
  writeln('S1   = ',StrToHex(S1),' [',StringCodePage(S1),']');
  writeln(S1); //will trigger automatic codepage conversion to console's codepage when needed
end.


Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default

2016-03-30 Thread Bart
On 3/30/16, Graeme Geldenhuys  wrote:

> Just thought I would let you know that with or without the {$codepage
> utf8}, your code works just fine here. Source code is saved in a UTF-8
> encoding with no BOM marker.
>
> 
> [tmp]$ fpc test.pas
> Free Pascal Compiler version 3.0.0 [2015/11/16] for x86_64
> Copyright (c) 1993-2015 by Florian Klaempfl and others
>
> [tmp]$ ./test
> DefaultSystemcodePage = 0

DefaultSystemCodePage = 0?

What system are you on (Linux, Windows) and what locale?
I'm on Windows with Dutch locale (Windows codepage = cp1252)

Bart



Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default

2016-03-30 Thread Bart
On 3/30/16, Juha Manninen  wrote:

> If your "s1" is a plain String then something has changed. IIRC it worked
> well.
It is a plain string.
And it behaves like the quote said.
The compiler treats my sourcefile as ACP

> I am out of energy for the string encoding issue and I don't even have
> a proper Windows system to test with.
> Could Mattias, you, Michl or whoever take care of the issue, please?

I think the statement in the wiki that {$codepage utf8} is not needed is wrong.

Bart



Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default

2016-03-30 Thread Graeme Geldenhuys
On 2016-03-30 17:16, Bart wrote:
> Without {$codepage utf8} it outputs:
> DefaultSystemcodePage = 1252
> TestUtf8 = $C3 $84 $41 $C3 $84
> S1   = $C3 $84 $41 $C3 $84 [0]
> Ã"AÃ"
> 
> The compiler treats my source as if it were written in my system's codepage.
> With cp1252 S1 now contains garbage (Ã"AÃ"). (at least not what I
> expected it to be)

Just thought I would let you know that with or without the {$codepage
utf8}, your code works just fine here. Source code is saved in a UTF-8
encoding with no BOM marker.


[tmp]$ fpc test.pas
Free Pascal Compiler version 3.0.0 [2015/11/16] for x86_64
Copyright (c) 1993-2015 by Florian Klaempfl and others

[tmp]$ ./test
DefaultSystemcodePage = 0
TestUtf8 = C38441C384
S1   = C38441C384 [0]
ÄAÄ

[ adding $codepage and testing again.]

[tmp]$ fpc test.pas
Free Pascal Compiler version 3.0.0 [2015/11/16] for x86_64
Copyright (c) 1993-2015 by Florian Klaempfl and others

[tmp]$ ./test
DefaultSystemcodePage = 0
TestUtf8 = C38441C384
S1   = C38441C384 [65001]
ÄAÄ


Regards,
  - Graeme -

-- 
fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal
http://fpgui.sourceforge.net/

My public PGP key:  http://tinyurl.com/graeme-pgp



Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default

2016-03-30 Thread Juha Manninen
On Wed, Mar 30, 2016 at 7:16 PM, Bart  wrote:
> [...]
> I would say that this experiment contradicts the statement in
> http://wiki.freepascal.org/Better_Unicode_Support_in_Lazarus#String_Literals
> ?

If your "s1" is a plain String then something has changed. IIRC it worked well.
I am out of energy for the string encoding issue and I don't even have
a proper Windows system to test with.
Could Mattias, you, Michl or whoever take care of the issue, please?

Juha



Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default

2016-03-30 Thread Bart
On 3/30/16, Juha Manninen  wrote:

> Do your files have UTF-8 encoding? It is a necessity for the Unicode
> system to work.

Yes, all my code is either from Lazarus or from my own editor (which
is a synedit).

> Any valid UTF-8 string should work, including diacritics.
Without the codepage identifier?

Quote from http://wiki.freepascal.org/FPC_Unicode_support#String_constants:
"Normally, a string constant is interpreted according to the source
file codepage. If the source file codepage is CP_ACP, a default is
used instead: in that case, during conversions the constant strings
are assumed to have code page 28591 (ISO 8859-1 Latin 1; Western
European). "
...
"From the above it follows that to ensure predictable interpretation
of string constants in your source code, it is best to either include
an explicit {$codepage xxx} directive (or use the equivalent -Fc
command line option), or to save the source code in UTF-8 with a BOM.
"

AFAIK the IDE does not save the file with a BOM, so the compiler may
very well decide that my sourcefile has ACP codepage?

Consider this test sourcefile (encoded as UTF8 without BOM; StrToHex is a
small helper that is not shown here):

const
  TestUtf8 = 'ÄAÄ';
var
  s1: String;
begin
  writeln('DefaultSystemcodePage = ',DefaultSystemcodePage);
  writeln('TestUtf8 = ',StrToHex(TestUtf8));
  s1 := TestUtf8;
  writeln('S1   = ',StrToHex(S1),' [',StringCodePage(S1),']');
  writeln(S1); //will trigger automatic codepage conversion to console's codepage when needed
end.

Without {$codepage utf8} it outputs:
DefaultSystemcodePage = 1252
TestUtf8 = $C3 $84 $41 $C3 $84
S1   = $C3 $84 $41 $C3 $84 [0]
Ã"AÃ"

The compiler treats my source as if it were written in my system's codepage.
With cp1252 S1 now contains garbage (Ã"AÃ"). (at least not what I
expected it to be)

With the proper {$codepage utf8} inserted it will output:
DefaultSystemcodePage = 1252
TestUtf8 = $C3 $84 $41 $C3 $84
S1   = $C3 $84 $41 $C3 $84 [65001]
ÄAÄ

I would say that this experiment contradicts the statement in
http://wiki.freepascal.org/Better_Unicode_Support_in_Lazarus#String_Literals
?

Bart



Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default

2016-03-30 Thread Juha Manninen
On Wed, Mar 30, 2016 at 3:12 PM, Michael W. Vogel  wrote:
>> The cases fail with UTF-8 file encoding.
> I don't understand this.

I meant that some cases fail even when the file encoding is UTF-8.
File encoding is not the issue.

>> http://wiki.freepascal.org/Better_Unicode_Support_in_Lazarus#String_Literals
>
> And the first example there is wrong (or the words "and without" need to be
> removed). With no defined codepage
> const s: string = 'äöü';
> has codepoints of a UTF-8 String, the codepage is 0. If you assign it to a
> string with a declared codepage, you get a corrupted string. See my example.

I edited the page and separated 2 cases:
  const s = 'äöü';
and
  const s: string = 'äöü';

Please check. In a forum discussion it turned out they behave
differently. It may even be a compiler bug.
I cannot test it right now. Could you please edit the page as needed.
Anyway, how often do you need to assign a constant to a string that
has an explicitly declared codepage?
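
To compare the two cases on a given system, a small test along these lines might help. It is a sketch only: the names are made up, the file is assumed to be saved as UTF-8 without a BOM, and it is meant to be run once as shown and once with {$codepage utf8} added. No particular output is claimed here, since the thread reports that results differ between setups.

```pascal
program constkinds;

{$mode objfpc}{$H+}

const
  UntypedConst = 'äöü';          // case 1: const s = '...'
  TypedConst: String = 'äöü';    // case 2: const s: string = '...'

var
  s: String;
  u: UTF8String; // a string type with an explicitly declared codepage
begin
  s := UntypedConst;
  WriteLn('untyped -> String:   codepage ', StringCodePage(s), ', bytes ', Length(s));

  s := TypedConst;
  WriteLn('typed   -> String:   codepage ', StringCodePage(s), ', bytes ', Length(s));

  // Assigning to a declared codepage may trigger a conversion.
  u := TypedConst;
  WriteLn('typed   -> UTF8String: codepage ', StringCodePage(u), ', bytes ', Length(u));
end.
```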

Juha



Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default

2016-03-30 Thread Sven Barth
On 30.03.2016 at 11:23, "Juha Manninen" wrote:
>
> Ok, FPC had UnicodeString earlier than I remembered.
> Currently WideString is often used with WinAPI when UnicodeString
> should be used, as Marco reminded in another discussion.

The WinAPI does not know UnicodeString. It only knows PWideChar, or BSTR
for COM, which in Pascal is handled by WideString.
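
As a rough illustration of the PWideChar side: a routine that takes PWideChar only sees a pointer to zero-terminated UTF-16 code units, and a UnicodeString can be handed over with a cast. CountUnits below is a hypothetical stand-in for a real "W" API function, not something from the WinAPI.

```pascal
program PassWide;
{$mode objfpc}{$H+}

// Hypothetical stand-in for a WinAPI "W" function: it receives a pointer to
// zero-terminated UTF-16 code units and knows nothing about Pascal strings.
function CountUnits(P: PWideChar): SizeInt;
begin
  Result := 0;
  if P = nil then
    Exit;
  while P[Result] <> #0 do
    Inc(Result);
end;

var
  U: UnicodeString;
begin
  U := 'abc';
  // The cast passes a pointer to the string's character data; no copy is made.
  WriteLn(CountUnits(PWideChar(U)));
end.
```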

Regards,
Sven


Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default

2016-03-30 Thread Michael W. Vogel

On 30.03.2016 at 13:02, Juha Manninen wrote:
> Conversions in your testproject may work, but you ignored the forum
> link I gave earlier. There "malcome" gave examples that fail.

I didn't ignore it, but I'm not that fast at testing the examples there.
I'll try them myself with and without the codepage definition. Thanks
for that link.



On 30.03.2016 at 13:02, Juha Manninen wrote:
> BTW, the hack is not made by LCL but by LazUtils which can be used
> also with cmd line / server programs.

You are right. I only use LazUtils in combination with the component
library, my mistake. Thanks for clearing this up.



On 30.03.2016 at 13:02, Juha Manninen wrote:
>> I also don't want it. I want an added {$codepage UTF8} if the file is
>> saved as a UTF-8 encoded one.
>
> The cases fail with UTF-8 file encoding.

I don't understand this.


On 30.03.2016 at 13:02, Juha Manninen wrote:
> The issue with constant string encodings is more complex than you seem
> to understand.

That's true, and I've made dozens of tests, spent a lot of time on it and
try to help other people with it ...



On 30.03.2016 at 13:02, Juha Manninen wrote:
> It is explained here somehow:
>   http://wiki.freepascal.org/Better_Unicode_Support_in_Lazarus#String_Literals

And the first example there is wrong (or the words "and without" need to
be removed). With no defined codepage,

const s: string = 'äöü';

holds UTF-8 encoded bytes while its codepage is 0. If you assign it to
a string with a declared codepage, you get a corrupted string. See my
example.



BTW, I forgot ShortStrings in my example, and maybe PChar. I'll add
them (if possible), test, and report later.


Thanks for your attention and time

Kind regards

Michl



Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default

2016-03-30 Thread Juha Manninen
On Wed, Mar 30, 2016 at 1:03 PM, Bart  wrote:
> The IDE at least runs fine (in my locale on Windows) with -FcUTF8.

The Lazarus IDE has no string constants beyond 7-bit ASCII, so the
encoding obviously does not matter there.

> (I have it there because I build all my projects with this define,
> because almost all of them contain some strings with diacritics)

Do your files have UTF-8 encoding? It is a necessity for the Unicode
system to work.
Any valid UTF-8 string should work, including diacritics.

> Curious though: in what scenarios does it fail?

See the forum link I gave earlier.

Juha



Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default

2016-03-30 Thread Juha Manninen
On Wed, Mar 30, 2016 at 12:38 PM, Michael W. Vogel  wrote:
> With the hack that the LCL makes and the added {$codepage UTF8} all
> conversions work like a charm (see added testproject).

Conversions in your testproject may work, but you ignored the forum
link I gave earlier. There "malcome" gave examples that fail.
BTW, the hack is not made by LCL but by LazUtils which can be used
also with cmd line / server programs.

>> LCL applications nowadays use CP_UTF8 as default. We (laz team) tested
>> adding -FcUTF8 and it failed in too many cases. Also it adds some overhead.
>> So we decided to *not* add it by default.
>
> I also don't want it. I want an added {$codepage UTF8} if the file is saved
> as a UTF-8 encoded one.

The cases fail with UTF-8 file encoding.
All files created by Lazarus IDE are by default saved with UTF-8
encoding, thus you would get {$codepage UTF8} in every file which is
the same as -FcUTF8 for the whole project.
Adding -FcUTF8 is already extremely easy. There is a button for it in
Project Options -> Custom Options page.

The issue with constant string encodings is more complex than you seem
to understand.
It is explained here somehow:
  http://wiki.freepascal.org/Better_Unicode_Support_in_Lazarus#String_Literals

Juha



Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default

2016-03-30 Thread Bart
On 3/30/16, Mattias Gaertner  wrote:

> LCL applications nowadays use CP_UTF8 as default. We (laz team) tested
> adding
> -FcUTF8 and it failed in too many cases. Also it adds some overhead. So we
> decided to *not* add it by default.

The IDE at least runs fine (in my locale on Windows) with -FcUTF8.
(I have it there because I build all my projects with this define,
because almost all of them contain some strings with diacritics)

I'm too lazy just yet to insert proper codepage defines in the relevant places.

Curious though: in what scenarios does it fail?

Bart



Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default

2016-03-30 Thread Michael W. Vogel

On 30.03.2016 at 10:13, Juha Manninen wrote:
> I don't know what is a Predefined String

Sorry, I was not clear. I mean a string with a declared codepage:
http://wiki.freepascal.org/FPC_Unicode_support#Declared_code_page




Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default

2016-03-30 Thread Michael W. Vogel

On 30.03.2016 at 10:11, Mattias Gaertner wrote:
> You have to distinguish between with CP_UTF8 as default and with
> CP_ACP as default.

Yes, I know. What I mean by default:

Go to Project -> New Project ... -> Application

Now a new application is created. With the attached patch, {$codepage UTF8}
is added, and that's right, because wherever you save the file, it is
UTF-8 encoded. There is nothing wrong with that.



On 30.03.2016 at 10:11, Mattias Gaertner wrote:
> Both have cases where some string combinations fail.

With the hack that the LCL makes and the added {$codepage UTF8}, all
conversions work like a charm (see attached test project).

If you want to use -dDisableUTF8RTL, you have to know what you are doing:
- remove {$codepage UTF8}, or better, set it to the valid codepage or use
  -FcCP...
- save all the source files with the wanted encoding/codepage

So IMHO this special case won't be used much. All the puzzled
discussions I see in the forums are about the default applications
created with Lazarus (not FPC) that use the LCL.



On 30.03.2016 at 10:11, Mattias Gaertner wrote:
> LCL applications nowadays use CP_UTF8 as default. We (laz team) tested
> adding -FcUTF8 and it failed in too many cases. Also it adds some
> overhead. So we decided to *not* add it by default.

I don't want that either. I want {$codepage UTF8} to be added if the file
is saved as UTF-8 encoded.



On 30.03.2016 at 10:11, Mattias Gaertner wrote:
>> Offtopic: In the added project: Why is a const 'abc' with {$codepage UTF8} a
>> Unicodestring (Windows7, 64bit, Lazarus 1.7 r52077M FPC 3.1.1
>> i386-win32-win32/win64 on FPC 3.1.1 r33371)?
>
> There is no compile-time flag to tell the compiler what codepage the system is
> using at runtime. So it assumes current Windows codepage. Any string literal not
> in this codepage is stored as UTF-16.

Thanks for that hint!



Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default

2016-03-30 Thread Martin Schreiber
On Wednesday 30 March 2016 11:23:49 Juha Manninen wrote:
> > If one wants to handle BMP-chars comfortably and with good performance
> > one has to convert from utf-8 in AnsiString to UnicodeString first.
>
> Maybe, but BMP-chars are not enough for a proper Unicode support.

But they are enough for Russian and German pupils' homework, whereas UTF-8
code units are not. Please read lazarusforum.de. ;-)

Martin



Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default

2016-03-30 Thread Juha Manninen
Ok, FPC had UnicodeString earlier than I remembered.
Currently WideString is often used with WinAPI when UnicodeString
should be used, as Marco reminded in another discussion.

Anyway, the problems found by Michael W. Vogel and "malcome" all deal
with constants. Assignment between variables always works thanks to
their dynamic encoding.
If there is doubt about how a constant is interpreted, it can first be
assigned to a "String" type variable, which can then be typecast to a
WinAPI's UnicodeString parameter or whatever.
In the worst case, an extra "String" variable is needed.
Or, if one wants to use UnicodeString constants in a unit, one can add
{$codepage utf8}. No big deal.
IMO the problems are exaggerated.
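
The detour through a "String" variable could look roughly like this. It is a sketch under assumptions: the file is saved as UTF-8 without {$codepage utf8}, and the call to SetMultiByteConversionCodePage only stands in for the UTF-8 default that LazUtils sets up in a real Lazarus application. The constant name Msg is made up.

```pascal
program ViaStringVar;
{$mode objfpc}{$H+}
// Assumption: saved as UTF-8, no {$codepage utf8} and no -FcUTF8.

uses
  {$ifdef unix}cwstring,{$endif}  // widestring manager for conversions on Unix
  SysUtils;

const
  Msg = 'äöü';

var
  S: string;
  U: UnicodeString;
begin
  // Stand-in for the LazUtils setup: make UTF-8 the default code page
  // used when converting AnsiString data.
  SetMultiByteConversionCodePage(CP_UTF8);

  S := Msg;  // the variable carries the bytes exactly as saved in the file
  U := S;    // conversion driven by the default system code page

  WriteLn(Length(S), ' bytes -> ', Length(U), ' UTF-16 code units');
end.
```

Whether the second number comes out as expected is exactly the kind of thing that depends on the compiler options discussed in this thread.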


On Wed, Mar 30, 2016 at 11:58 AM, Martin Schreiber  wrote:
> If one wants to handle BMP-chars comfortably and with good performance one has
> to convert from utf-8 in AnsiString to UnicodeString first.

Maybe, but BMP-chars are not enough for a proper Unicode support.
Besides, dealing with codepoints is the easy part regardless of
encoding. The associated problems are exaggerated again.
The true complexity of Unicode is beyond codepoints.

Juha



Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default

2016-03-30 Thread Martin Schreiber
On Wednesday 30 March 2016 10:13:36 Juha Manninen wrote:

> With Unicodestring we don't need to care about backwards compatibility
> really because it is so new type.

Ouch!

WideString was introduced in Delphi 4, IIRC. FPC had a 16-bit string that was
reference counted on all platforms and worked like the current UnicodeString.
IIRC it was about version 1.8 when FPC introduced this string type.
Kylix WideString (Linux) was also reference counted.
Later FPC changed WideString on Windows (against my strong opposition,
mind you ;-) ) to the OLE string, which is not reference counted.
A little later FPC again added UnicodeString, which is reference counted on
all platforms.
So one can say that at the moment when Lazarus became Unicode capable, a
UnicodeString-like string type was available in FPC. It was very buggy, so
this was probably one of the reasons that Lazarus used UTF-8 in AnsiString
instead.
For MSEgui, on the other hand, I used WideString/UnicodeString from the
beginning and wrote FPC bug reports until FPC WideString became production ready.

> What more, Unicodestring is not needed often when using our new Unicode
> system.
>
If one wants to handle BMP-chars comfortably and with good performance one has 
to convert from utf-8 in AnsiString to UnicodeString first.

Martin



Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default

2016-03-30 Thread Juha Manninen
No, originally we had -FcUTF8 set by default but it caused more problems.
See:
  http://forum.lazarus.freepascal.org/index.php?topic=30022

> In the most cases the string magic works without a defined {$codepage utf8},
> but not if you want to assign a const to a Predefined String or Unicodestring.

I don't know what a Predefined String is, but assigning a const to
Unicodestring can be seen as a special case and then a programmer can
take special actions (add {$codepage utf8} himself).

Leaving out {$codepage utf8} is the most backwards compatible way, and
the behavior is the most intuitive.
With Unicodestring we don't really need to care about backwards compatibility
because it is such a new type.
What is more, Unicodestring is not often needed when using our new Unicode system.

Juha



Re: [Lazarus] Feature Request: Insert {codepage UTF8} per default

2016-03-30 Thread Mattias Gaertner

> "Michael W. Vogel" wrote on 29 March 2016 at 23:20:
> [...]
>  I'm thinking about the thread here
> http://forum.lazarus.freepascal.org/index.php/topic,31939.msg206688.html#msg206688
> [...]
>  In the most cases the string magic works without a defined {$codepage utf8},
> but not if you want to assign a const to a Predefined String or Unicodestring.
> 

You have to distinguish between with CP_UTF8 as default and with CP_ACP as
default.
Both have cases where some string combinations fail.

LCL applications nowadays use CP_UTF8 as default. We (laz team) tested adding
-FcUTF8 and it failed in too many cases. Also it adds some overhead. So we
decided to *not* add it by default.

 
>  Offtopic: In the added project: Why is a const 'abc' with {$codepage UTF8} a
> Unicodestring (Windows7, 64bit, Lazarus 1.7 r52077M FPC 3.1.1
> i386-win32-win32/win64 on FPC 3.1.1 r33371)? 

There is no compile-time flag to tell the compiler what codepage the system is
using at runtime. So it assumes current Windows codepage. Any string literal not
in this codepage is stored as UTF-16.

Mattias
