
Re: [Lazarus] UTF8 RTL for Windows

2014-11-26 Thread Michael Schnell


On 11/25/2014 09:39 PM, Hans-Peter Diettrich wrote:


The Delphi model already broke that claimed type safety, by omitting 
conversions of RawByteString results, for speed optimization. That's 
dangerous, because the compiler can *only* check the static type of 
string variables, but not the dynamic encoding of their contents.

This was clear to me just after exploring and understanding encoded 
strings in Delphi. In FPC/Lazarus we now have a *chance* for 
simplifications and improvements, when the new features are used in 
the *right* way.


On that subject I just posted a set of questions about the FPC Unicode 
support wiki page to the fpc-devel mailing list. Please continue this 
discussion there.



But many arguments and opinions presented in this thread indicate to 
me a still incomplete understanding and many misunderstandings, which I 
am actually trying to spot. 


See the new wiki page 
http://wiki.freepascal.org/not_Delphi_compatible_enhancement_for_Unicode_Support


-Michael

--
___
Lazarus mailing list
Lazarus@lists.lazarus.freepascal.org
http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus

Re: [Lazarus] UTF8 RTL for Windows

2014-11-25 Thread Michael Schnell


On 11/24/2014 10:15 PM, Hans-Peter Diettrich wrote:



I'm missing documentation for working safely (and efficiently) with 
such irregular strings, most probably none of the FPC (and Delphi) 
developers ever noticed how users are left alone with this problem :-(


Hmm. On the fpc-devel and lazarus-devel lists and in the German Lazarus 
Forum I have been involved in lots of long-winded threads on this issue. 
So the developers do listen to the users!


Unfortunately the design of Delphi does not seem to be really nice. AFAIK 
details of its behavior changed between the first versions that 
offered NewStrings. This suggests that the design goal was not well 
defined in the first release, and the facts established by that release 
unfortunately had to be kept to avoid breaking user code (which 
happened to my colleagues nonetheless).


Unfortunately fpc seems to need to follow whatever Delphi does.

Michael (due to come back with a dedicated thread soon).

-Michael


Re: [Lazarus] UTF8 RTL for Windows

2014-11-25 Thread Hans-Peter Diettrich


Mattias Gaertner schrieb:

On Mon, 24 Nov 2014 22:53:44 +0100
Hans-Peter Diettrich drdiettri...@aol.com wrote:


Graeme Geldenhuys schrieb:


How are ThousandSeparator and DecimalSeparator supposed to work in
TFormatSettings? If you switched the RTL to UTF-8 or UTF-16 a Russian
thousand separator (a multi-byte non-breaking space character in UTF-8) for
example will not fit into a Char type.

The Char type is quite useless with Unicode,


Correction: *This* Char type needs to be extended.


Please specify.


Char in general is very useful.

at least if it has less 
than 3 bytes (4 for UTF-8). There exist many more flaws in the RTL/LCL, 
assuming that a character always fits into a Char (like the Pos 
overload...).


There is a Pos overload for strings. Where is the flaw in Pos?


The flaw is the added overload with a Char parameter.
Furthermore the Pos arguments should never be subject to automatic 
conversion, otherwise the returned index will be useless.
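
To make the index issue concrete, here is a minimal sketch, assuming FPC 3.x in a mode where Char is one byte. The helper FromBytes is hypothetical (it is not part of the RTL); it builds the UTF-8 text from explicit bytes so that the result does not depend on the encoding of the source file.

```pascal
program PosByteIndex;
{$mode objfpc}{$H+}

// Hypothetical helper: build a string from raw bytes and label it as UTF-8
// without converting anything.
function FromBytes(const B: array of Byte): RawByteString;
var
  i: Integer;
begin
  Result := '';
  SetLength(Result, Length(B));
  for i := 0 to High(B) do
    Result[i + 1] := Chr(B[i]);
  SetCodePage(Result, CP_UTF8, False);
end;

var
  Line, Sep: RawByteString;
begin
  // '1', NO-BREAK SPACE (U+00A0 = C2 A0 in UTF-8), '000'
  Line := FromBytes([$31, $C2, $A0, $30, $30, $30]);
  Sep  := FromBytes([$C2, $A0]);

  WriteLn('SizeOf(Char)   = ', SizeOf(Char));   // 1 in this mode
  WriteLn('Length(Sep)    = ', Length(Sep));    // 2 bytes, too big for one Char
  WriteLn('Length(Line)   = ', Length(Line));   // 6 bytes for 5 characters
  WriteLn('Pos(Sep, Line) = ', Pos(Sep, Line)); // 2, a byte index
end.
```

Pos compares bytes here, so the index it returns is only meaningful for the byte sequence it actually searched. If either argument were silently converted to another encoding first, the index would refer to a different byte sequence than the caller's string.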




In the best case Char could be retyped into a string (substring),


That would be wrong in 99.9% of the cases.


Please give at least one example.

DoDi



Re: [Lazarus] UTF8 RTL for Windows

2014-11-25 Thread Hans-Peter Diettrich


Mattias Gaertner schrieb:

On Mon, 24 Nov 2014 22:15:29 +0100
Hans-Peter Diettrich drdiettri...@aol.com wrote:


[...]
The Delphi (and FPC) encoding model allows for strings of different 
static (declared) and dynamic (true content) encoding, see the special 
handling of RawByteString (Wiki).


So far it's not a good idea to simply *assume* that a string variable 
contains bytes of the declared encoding. In detail one should check or 
force the right dynamic encoding of every string variable, before 
searching for specific bytes (chars) in it.


I'm missing documentation for working safely (and efficiently) with such 
irregular strings, most probably none of the FPC (and Delphi) developers 
ever noticed how users are left alone with this problem :-(


Maybe I don't understand the question, but it seems to me this is
documented where static-, dynamic cp and rawbytestring are explained.


More concrete questions:

How can a user be sure that a string parameter in a subroutine has the 
specified encoding?

How to check, how to fix if needed?




http://wiki.freepascal.org/FPC_Unicode_support#Ansistring

When a procedure requires a specific encoding it uses a specific String
type. If it works with CP_ACP it uses String. If it needs UTF8 it
uses UTF8String.


Such specifications are meaningless when the string parameters can have 
a different dynamic encoding :-(


Unicode Delphi works well as long as only one codepage (CP_ACP) is used, 
in addition to Unicode (UTF-16) strings. As soon as multiple codepages 
can be involved at the same time, the dynamic string encodings become 
almost random (observed in Delphi XE). FPC now already has multiple 
built-in codepage variables (DefaultSystemCodePage...), with possibly 
different values, so that the observed Delphi mess is inevitable, as 
long as RawByteString results (of e.g. standard stringhandling 
functions) are *not* converted when assigned to a string variable of 
some specific static encoding.
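
For the "how to check, how to fix" question, a minimal sketch of what the RTL offers, assuming FPC 3.x. IsTaggedUTF8 and ForceUTF8 are hypothetical helper names; StringCodePage and SetCodePage are the RTL routines they wrap. Note that this only inspects the code page field of the string; it cannot prove that the bytes really are in that encoding.

```pascal
program CheckDynamicCP;
{$mode objfpc}{$H+}

// Hypothetical helper: report whether the dynamic code page of S is UTF-8.
// It only reads the code page field; it does not validate the bytes.
function IsTaggedUTF8(const S: RawByteString): Boolean;
begin
  Result := StringCodePage(S) = CP_UTF8;
end;

// Hypothetical helper: convert S to UTF-8 if its tag says something else.
procedure ForceUTF8(var S: RawByteString);
begin
  if not IsTaggedUTF8(S) then
    SetCodePage(S, CP_UTF8, True); // True = convert the bytes, not just the tag
end;

var
  S: RawByteString;
begin
  S := 'plain ASCII';
  WriteLn('before: cp=', StringCodePage(S), ' utf8=', IsTaggedUTF8(S));
  ForceUTF8(S);
  WriteLn('after:  cp=', StringCodePage(S), ' utf8=', IsTaggedUTF8(S));
end.
```

A routine that searches for specific bytes could call such a check on entry instead of trusting the declared parameter type.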


Unfortunately I have not been able to test Lazarus trunk for a long time; 
my request for assistance got no answer. So I have to wait for the next 
installable download before I can give concrete examples.


DoDi



Re: [Lazarus] UTF8 RTL for Windows

2014-11-25 Thread Mattias Gaertner

On Tue, 25 Nov 2014 13:10:26 +0100
Hans-Peter Diettrich drdiettri...@aol.com wrote:

[...]
  Maybe I don't understand the question, but it seems to me this is
  documented where static-, dynamic cp and rawbytestring are explained.
 
 More concrete questions:
 
 How can a user be sure that a string parameter in a subroutine has the 
 specified encoding?
 How to check, how to fix if needed?

As you know in general you cannot find out the encoding of a text. You
have to trust that the caller gave the right encoding.
This was true before 2.7.1 and it still is.
The new thing with 2.7.1 is that String now has an encoding field and
that you can use this to let the compiler convert encodings
automatically.
For example the RTL uses this to convert between OS strings and program
strings. This means some RTL functions don't need manual encoding
conversions (e.g. UTF8ToAnsi) anymore. You can simply pass the string.
Hopefully more and more RTL functions/variables will be converted.

In short: Most of the time you code exactly like before.

If your code works with various encodings, then formerly you had to be
very careful what you do with the strings. For example when you pass
the strings to the RTL you had to convert them to the system codepage.
Now you can use for instance UTF8String instead and omit the
UTF8ToAnsi. It is like gaining some type safety.
And you can now use SetCodePage. But then you have to be very careful
again.


  http://wiki.freepascal.org/FPC_Unicode_support#Ansistring
  
  When a procedure requires a specific encoding it uses a specific String
  type. If it works with CP_ACP it uses String. If it needs UTF8 it
  uses UTF8String.
 
 Such specifications are meaningless when the string parameters can have 
 a different dynamic encoding :-(

Please read the paragraph Dynamic code page again. The example it
describes is the most common case: the system code page. This is
the same as FPC 2.6.5 and below. A String coming from the OS has
the system code page, which is dynamic. If you want a specific 
encoding you had to convert it.
With FPC 2.7.1 we have a new possibility. This is the new mode I was
talking about. Now we get UTF-8 strings in many places in the RTL. Not
all places yet. But we are working on it. And you can help.

 
 Unicode Delphi works well as long as only one codepage (CP_ACP) is used, 
 in addition to Unicode (UTF-16) strings. As soon as multiple codepages 
 can be involved at the same time, the dynamic string encodings become 
 almost random (observed in Delphi XE). FPC now already has multiple 
 built-in codepage variables (DefaultSystemCodePage...), with possibly 
 different values, so that the observed Delphi mess is inevitable, as 
 long as RawByteString results (of e.g. standard stringhandling 
 functions) are *not* converted when assigned to a string variable of 
 some specific static encoding.

Well, two weeks ago I was rolling my eyes when I read about this
complex system and DefaultSystemCodePage. But then I tried to set it
and now we can use one String encoding cross platform and it works
with file functions, TStringList and friends. Almost all of the
UTF8ToSys calls are no longer needed and file functions now support
full Unicode.
We can write a Unicode program cross platform using our normal strings
and classes.
And it is pretty compatible. 
So from Lazarus' point of view this is a great step forward.
And last but not least: it is optional.

Of course if you have a product and you have to support all old modes
and some of the new possibilities you will curse.


 Unfortunately I have not been able to test Lazarus trunk for a long time; 
 my request for assistance got no answer.

?

Mattias


Re: [Lazarus] UTF8 RTL for Windows

2014-11-25 Thread Mattias Gaertner

On Tue, 25 Nov 2014 11:53:00 +0100
Hans-Peter Diettrich drdiettri...@aol.com wrote:

[...]
  Correction: *This* Char type needs to be extended.
 
 Please specify.

The ThousandSeparator type is Char, which does not work with
Russian in UTF-8. Well, at least if you want the non breakable space
instead of the normal space.
There are many cases where Char is enough.

 
[...]  Char in general is very useful.
 ... 
  at least if it has less 
  than 3 bytes (4 for UTF-8). There exist many more flaws in the RTL/LCL, 
  assuming that a character always fits into a Char (like the Pos 
  overload...).
  
  There is a Pos overload for strings. Where is the flaw in Pos?
 
 The flaw is the added overload with a Char parameter.

I use that a lot. It is faster than the string variant.
Why is that a flaw?

 Furthermore the Pos arguments should never be subject to automatic 
 conversion, otherwise the returned index will be useless.

You can argue the same way in the direction: If it does not
automatically convert it will find crap.

 
  In the best case Char could be retyped into a string (substring),
  
  That would be wrong in 99.9% of the cases.
 
 Please give at least one example.

Retype Char to String and the compiler will bark. For example in
Graphics.

Mattias


Re: [Lazarus] UTF8 RTL for Windows

2014-11-25 Thread Felipe Monteiro de Carvalho

On Tue, Nov 25, 2014 at 2:45 PM, Mattias Gaertner
nc-gaert...@netcologne.de wrote:
 Retype Char to String and the compiler will bark. For example in
 Graphics.

What about changing to WideChar then?

-- 
Felipe Monteiro de Carvalho


Re: [Lazarus] UTF8 RTL for Windows

2014-11-25 Thread Mattias Gaertner

On Tue, 25 Nov 2014 14:49:52 +0100
Felipe Monteiro de Carvalho felipemonteiro.carva...@gmail.com wrote:

 On Tue, Nov 25, 2014 at 2:45 PM, Mattias Gaertner
 nc-gaert...@netcologne.de wrote:
  Retype Char to String and the compiler will bark. For example in
  Graphics.
 
 What about changing to WideChar then?

If you mean unit Graphics: It checks for ASCII characters. So a change
to WideChar would add implicit conversions without any gain.

In case of ThousandSeparator:
That would probably be sufficient. Although some code needs to
be adapted.

Mattias


Re: [Lazarus] UTF8 RTL for Windows

2014-11-25 Thread Felipe Monteiro de Carvalho

On Tue, Nov 25, 2014 at 3:14 PM, Mattias Gaertner
nc-gaert...@netcologne.de wrote:
 What about changing to WideChar then?

 If you mean unit Graphics: It checks for ASCII characters. So a change
 to WideChar would add implicit conversions without any gain.

 In case of ThousandSeparator:
 That would probably be sufficient. Although some code needs to
 be adapted.

I meant for ThousandSeparator.

-- 
Felipe Monteiro de Carvalho


Re: [Lazarus] UTF8 RTL for Windows

2014-11-25 Thread Frederic Da Vitoria

2014-11-25 14:45 GMT+01:00 Mattias Gaertner nc-gaert...@netcologne.de:

 On Tue, 25 Nov 2014 11:53:00 +0100
 Hans-Peter Diettrich drdiettri...@aol.com wrote:

 [...]
   Correction: *This* Char type needs to be extended.
 
  Please specify.

 The ThousandSeparator type is Char, which does not work with
 Russian in UTF-8. Well, at least if you want the non breakable space
 instead of the normal space.


French uses non-breakable space too. According to several sources, the
correct character should actually be narrow no-break space
https://en.wikipedia.org/wiki/Thin_space.

-- 
Frederic Da Vitoria
(davitof)

Membre de l'April - « promouvoir et défendre le logiciel libre » -
http://www.april.org

Re: [Lazarus] UTF8 RTL for Windows

2014-11-25 Thread Hans-Peter Diettrich


Mattias Gaertner schrieb:

On Tue, 25 Nov 2014 13:10:26 +0100
Hans-Peter Diettrich drdiettri...@aol.com wrote:


[...]

Maybe I don't understand the question, but it seems to me this is
documented where static-, dynamic cp and rawbytestring are explained.

More concrete questions:

How can a user be sure that a string parameter in a subroutine has the 
specified encoding?

How to check, how to fix if needed?


As you know in general you cannot find out the encoding of a text. You
have to trust that the caller gave the right encoding.
This was true before 2.7.1 and it still is.
The new thing with 2.7.1 is that String now has an encoding field and
that you can use this to let the compiler convert encodings
automatically.
For example the RTL uses this to convert between OS strings and program
strings. This means some RTL functions don't need manual encoding
conversions (e.g. UTF8ToAnsi) anymore. You can simply pass the string.
Hopefully more and more RTL functions/variables will be converted.

In short: Most of the time you code exactly like before.


FACK, so far :-]


If your code works with various encodings, then formerly you had to be
very careful what you do with the strings. For example when you pass
the strings to the RTL you had to convert them to the system codepage.
Now you can use for instance UTF8String instead and omit the
UTF8ToAnsi. It is like gaining some type safety.


The Delphi model already broke that claimed type safety, by omitting 
conversions of RawByteString results, for speed optimization. That's 
dangerous, because the compiler can *only* check the static type of 
string variables, but not the dynamic encoding of their contents.



And you can now use SetCodePage. But then you have to be very careful
again.


SetCodePage is safe, as long as it enforces a corresponding conversion of 
the dynamic string encoding. The option of changing only the encoding 
field is reserved for adjustments after reading strings from external 
sources, or from Char, Char arrays/pointers or ShortString, where the 
correct codepage is unknown to the compiler and library routines.
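
The two uses of SetCodePage can be sketched as follows, assuming FPC 3.x. Dump is a hypothetical helper that prints the code page field and the raw bytes; code page 1252 is only an example target. On Unix a widestring manager (e.g. the cwstring unit) is assumed to be needed for the converting call to produce a meaningful result.

```pascal
program RelabelVersusConvert;
{$mode objfpc}{$H+}

// Hypothetical helper: print the dynamic code page and the bytes of S.
procedure Dump(const Caption: string; const S: RawByteString);
var
  i: Integer;
begin
  Write(Caption, ': cp=', StringCodePage(S), ' bytes=');
  for i := 1 to Length(S) do
    Write(HexStr(Ord(S[i]), 2), ' ');
  WriteLn;
end;

var
  FromFile, Converted: RawByteString;
begin
  // Pretend these two bytes were read from an external source that we
  // know to be UTF-8: C3 A4 encodes U+00E4.
  FromFile := '';
  SetLength(FromFile, 2);
  FromFile[1] := #$C3;
  FromFile[2] := #$A4;

  // Relabel only: the bytes stay, the code page field becomes CP_UTF8.
  SetCodePage(FromFile, CP_UTF8, False);
  Dump('relabelled', FromFile);

  // Convert: the bytes are re-encoded to match the new code page.
  Converted := FromFile;
  SetCodePage(Converted, 1252, True);
  Dump('converted ', Converted);
end.
```

Relabelling a string whose bytes are not in the claimed encoding leaves tag and content out of step, which is why that form is only appropriate right after external input.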




http://wiki.freepascal.org/FPC_Unicode_support#Ansistring

When a procedure requires a specific encoding it uses a specific String
type. If it works with CP_ACP it uses String. If it needs UTF8 it
uses UTF8String.
Such specifications are meaningless when the string parameters can have 
a different dynamic encoding :-(


Please read the paragraph Dynamic code page again.


Please read my statement again, you still miss my point.



With FPC 2.7.1 we have a new possibility. This is the new mode I was
talking about. Now we get UTF-8 strings in many places in the RTL. Not
all places yet. But we are working on it. And you can help.


I'm trying to help all the time, but if you don't understand my 
arguments, I cannot help you :-(


I've explored the encoded AnsiStrings in Delphi XE, years ago, and 
identified a couple of problems with the Delphi implementation. I can 
help by explaining these problems, and how to avoid or reduce these 
problems in FPC/Lazarus. But the corresponding fixes to legacy code must be 
applied by the maintainers of that code, who know about the *right* way 
(intended behaviour) to fix every single problem.




Well, two weeks ago I was rolling my eyes when I read about this
complex system and DefaultSystemCodePage. But then I tried to set it
and now we can use one String encoding cross platform and it works
with file functions, TStringList and friends. Almost all of the
UTF8ToSys calls are no longer needed and file functions now support
full Unicode.


This was clear to me just after exploring and understanding encoded 
strings in Delphi. In FPC/Lazarus we now have a *chance* for 
simplifications and improvements, when the new features are used in the 
*right* way. But many arguments and opinions presented in this thread 
indicate to me a still incomplete understanding and many 
misunderstandings, which I am actually trying to spot.


DoDi



Re: [Lazarus] UTF8 RTL for Windows

2014-11-25 Thread Hans-Peter Diettrich




Mattias Gaertner schrieb:

On Tue, 25 Nov 2014 14:49:52 +0100
Felipe Monteiro de Carvalho felipemonteiro.carva...@gmail.com wrote:


On Tue, Nov 25, 2014 at 2:45 PM, Mattias Gaertner
nc-gaert...@netcologne.de wrote:

Retype Char to String and the compiler will bark. For example in
Graphics.

What about changing to WideChar then?


If you mean unit Graphics: It checks for ASCII characters. So a change
to WideChar would add implicit conversions without any gain.


You see that Unicode handling requires more than only changing declarations?
[Where changing Char to Byte in Graphics might be sufficient, as long as 
such bytes are not kept in Strings]



In case of ThousandSeparator:
That would probably be sufficient. Although some code needs to
be adapted.


Then you should also see that some means should at least make it *possible* to 
*identify* code that is not sufficiently Unicode-aware. This would not 
only allow the FPC/Lazarus developers to identify flaws in the standard 
libraries; users will also appreciate flaws being spotted in their legacy 
code :-)


DoDi



Re: [Lazarus] UTF8 RTL for Windows

2014-11-24 Thread Michael Schnell


On 11/23/2014 07:52 PM, Felipe Monteiro de Carvalho wrote:


Well, the first reports of what the unicode rtl would look like were
pretty scary: Total break of the string part of millions of lines of
code that people have written with Lazarus over the years.

That is why I stopped recommending Lazarus to my colleagues who are 
doing Delphi.


They took a huge amount of pain to convert their software from Delphi 
one-byte strings to Delphi two-byte strings. Hence they will not be 
pleased to be forced to convert back to one-byte strings to be able to 
use Lazarus and some time later convert to two-byte strings again once 
Lazarus might be forced to finally follow Delphi in that respect.


-Michael


Re: [Lazarus] UTF8 RTL for Windows

2014-11-24 Thread Michael Schnell


On 11/22/2014 05:18 PM, Hans-Peter Diettrich wrote:
Does this mean that Lazarus (new mode) ignores the OS system codepage 
setting?


IMHO that would be just GREAT for writing portable software. The 
RTL and LCL interface should be OS ignorant for portability. In user 
code, the user should be allowed to use the string encoding (and byte 
count per character) he finds most convenient for his application.


OTOH this of course does create a decent set of problems, including but 
not limited to unnecessary conversions in certain cases.


-Michael


Re: [Lazarus] UTF8 RTL for Windows

2014-11-24 Thread luiz americo pereira camara

2014-11-24 6:29 GMT-03:00 Michael Schnell mschn...@lumino.de:

 On 11/23/2014 07:52 PM, Felipe Monteiro de Carvalho wrote:


 Well, the first reports of what the unicode rtl would look like were
 pretty scary: Total break of the string part of millions of lines of
 code that people have written with Lazarus over the years.

  That is why I stopped recommending Lazarus to my colleagues who are
 doing Delphi.

 They took a huge amount of pain to convert their software from Delphi
 one-byte strings to Delphi two-byte strings. Hence they will not be pleased to
 be forced to convert back to one-byte strings to be able to use Lazarus and
 some time later convert to two-byte strings again once Lazarus might be
 forced to finally follow Delphi in that respect.


If the program does not explicitly assume a specific encoding, i.e. uses
only the String type and does no low-level string handling, there will be no
need to change.

I did/do convert a lot of Delphi components and can assure you that most will
not need changes as things stand today.

Luiz

Re: [Lazarus] UTF8 RTL for Windows

2014-11-24 Thread Michael Schnell


On 11/24/2014 11:44 AM, luiz americo pereira camara wrote:


If the program does not explicitly assume a specific encoding, i.e. 
uses only the String type and does no low-level string handling, there 
will be no need to change.


I don't know the internals of the program(s). It's a huge system and 
does anything that somehow might be possible :-) .


-Michael

Re: [Lazarus] UTF8 RTL for Windows

2014-11-24 Thread Juha Manninen

On Mon, Nov 24, 2014 at 11:33 AM, Michael Schnell mschn...@lumino.de wrote:
 IMHO that would be just GREAT for writing portable software. The RTL
 and LCL interface should be OS ignorant for portability. In user code, the
 user should be allowed to use the string encoding (and byte count per
 character) he finds most convenient for his application.

 OTOH this of course does create a decent set of problems, including but not
 limited to unnecessary conversions in certain cases.

See the request from Mattias :
Please test and tell what you find out.

Michael Schnell and others, let's keep this thread on a more concrete level.
You can start another philosophical thread about how strings should be
in a perfect world.

Juha


Re: [Lazarus] UTF8 RTL for Windows

2014-11-24 Thread Mattias Gaertner

On Sun, 23 Nov 2014 21:37:56 -0300
luiz americo pereira camara luiz...@oi.com.br wrote:

 2014-11-20 13:21 GMT-03:00 Mattias Gaertner nc-gaert...@netcologne.de:
[...]

First of all: Thanks for testing.

 Without {$codepage utf8} directive String constants will get Code Page 0
 (CP_ACP) and not the 1200 (UTF16 - UnicodeString).

Beware: There are different types of string constants.

 
 String variables assigned to those constants will also have Code Page = 0
 
 This is because the constant string code page is evaluated at compile time
 
 Not sure if there's a compiler command line param with same effect as
 {$codepage utf8}
 
 The attached program show how data loss can occur

The program uses writeln, which converts to console CP.
When you save the strings to a file you can see what they contain. Or
write the byte values.

This works with or without {$codepage utf8}:

S := 'João'; // constant to (Ansi or Short)string
W:=S; 
SUTF8:=S;

const c: string = 'João';
W:=c; // constant to Wide/Unicode/UTF8String

This requires {$codepage utf8} or -Fcutf8:

W := 'João'; // constant to Wide/Unicode/UTF8string 

const c = 'João';
W:=c;
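
A self-contained way to try these cases, as a sketch under assumptions: the file is saved as UTF-8, FPC 3.x (or 2.7.1) is used, and Dump is a hypothetical helper that prints the code page field and the raw bytes instead of going through writeln's console conversion. No expected output is given because it depends on the compiler options and on DefaultSystemCodePage.

```pascal
program ConstCodePage;
{$mode objfpc}{$H+}
// Save this file as UTF-8. Compile it once as it is and once with -Fcutf8
// (or with {$codepage utf8} added here), then compare the two outputs.

// Hypothetical helper: print the dynamic code page and the bytes of S.
procedure Dump(const Caption: string; const S: RawByteString);
var
  i: Integer;
begin
  Write(Caption, ': cp=', StringCodePage(S), ' bytes=');
  for i := 1 to Length(S) do
    Write(HexStr(Ord(S[i]), 2), ' ');
  WriteLn;
end;

var
  S: string;
  U: UTF8String;
begin
  S := 'João';   // constant to AnsiString
  U := 'João';   // constant to UTF8String
  Dump('string    ', S);
  Dump('UTF8String', U);
end.
```

Passing the strings as RawByteString avoids any conversion on the way into Dump, so the bytes shown are the ones actually stored in each variable.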

I guess it would be a good idea to pass -Fcutf8 with FPC 2.7.1. For
both modes.


Mattias


Re: [Lazarus] UTF8 RTL for Windows

2014-11-24 Thread Mattias Gaertner

On Mon, 24 Nov 2014 12:15:03 +0100
Mattias Gaertner nc-gaert...@netcologne.de wrote:

[...]
 I guess it would be a good idea to pass -Fcutf8 with FPC 2.7.1. For
 both modes.

On second thought: only for new mode. 
Passing it in the old mode will make the wide/unicode/utf8string work,
but the Ansi/Shortstring will be wrong.

We need a table in the wiki. FPC 2.6.5 and below, FPC 2.7.1+
and FPC 2.7.1+ with UTF8 as default CP. And with or without {$codepage
utf8}.

Mattias


Re: [Lazarus] UTF8 RTL for Windows

2014-11-24 Thread Michael Schnell


On 11/24/2014 12:01 PM, Juha Manninen wrote:

See the request from Mattias : Please test and tell what you find out.


I have not enough knowledge to be able to patch the compiler :-(


let's keep this thread on a more concrete level.

Agreed (even if I don't think that will lead to anything fairly portable).

As requested by Michael vC, I will do a Wiki page tomorrow and start a 
new Thread based on this.


-Michael




Re: [Lazarus] UTF8 RTL for Windows

2014-11-24 Thread Mattias Gaertner

On Mon, 24 Nov 2014 13:12:04 +0100
Michael Schnell mschn...@lumino.de wrote:

 On 11/24/2014 12:01 PM, Juha Manninen wrote:
  See the request from Mattias : Please test and tell what you find out.
 
 I have not enough knowledge to be able to patch the compiler :-(

I asked for testing compiling with -dEnableUTF8RTL.
Don't hijack threads.

Mattias


Re: [Lazarus] UTF8 RTL for Windows

2014-11-24 Thread Mattias Gaertner

On Sun, 23 Nov 2014 18:27:12 -0300
luiz americo pereira camara luiz...@oi.com.br wrote:

 2014-11-20 13:21 GMT-03:00 Mattias Gaertner nc-gaert...@netcologne.de:
[...]
 Please test and tell what you find out.
 
 
 The FormatSettings fields are still encoded with System Code Page
 regardless of DefaultSystemCodePage value.
 
 While for English locales there's no problem, other locales like PT-BR have
 accented names in days and months.
 
 The problem is in the Windows SysUtils.GetLocaleStr function, which uses a
 non-Unicode Win API function. This problem will also affect the UnicodeString
 RTL.
 
 Attached is a test app that shows the issue. It also has a version of
 GetLocaleStr that fixes the issue for the RTL (both versions)

Thanks. It works here too.

I reported it:
http://bugs.freepascal.org/view.php?id=27086

Mattias


Re: [Lazarus] UTF8 RTL for Windows

2014-11-24 Thread Hans-Peter Diettrich


Michael Schnell schrieb:

On 11/23/2014 07:52 PM, Felipe Monteiro de Carvalho wrote:


Well, the first reports of what the unicode rtl would look like were
pretty scary: Total break of the string part of millions of lines of
code that people have written with Lazarus over the years.

That is why I stopped recommending Lazarus to my colleagues who are 
doing Delphi.


They took a huge amount of pain to convert their software from Delphi 
one-byte strings to Delphi two-byte strings.


I had similar problems, but only in porting a huge codebase from 
ShortString to AnsiString. The move from D5 to XE was painless then, 
only the uses lists needed some updates. In that respect it might be a good 
idea to teach some old-school Delphi coders how to deal with managed 
strings and other post-BP items in general.


As for Lazarus, I think that UTF-8 is the best choice for multi-platform 
projects, with almost no extra conversions required on any platform.
Please note that until now Windows did the Ansi to UTF conversions 
itself, in every API call with strings involved. If this was not noticed 
before, the conversions won't be noticeable afterwards either.


A move to UTF-16 instead will only favor Windows, while additional 
string conversions will be required on almost every other platform. I 
think that FPC/Lazarus should fork and support separate libraries 
(RTL...) for UTF-8 and UTF-16 strings, if compatibility with newer 
Delphi VCL projects is desired. Full Delphi compatibility would also 
require a FireMonkey replacement for the LCL, and that would be another 
entirely new project, extending the UTF-16 branch (only).


Just my 0.02€
DoDi



Re: [Lazarus] UTF8 RTL for Windows

2014-11-24 Thread Michael Schnell


On 11/24/2014 02:19 PM, Hans-Peter Diettrich wrote:


A move to UTF-16 instead will only favor Windows,

Regarding the RTL interface, you of course are right.

Doing the user software with UTF-16 instead of UTF-8 strings in many 
cases (but of course not perfectly) allows for keeping old-style 1-byte 
ANSI code that uses s[n] and manually uses the result of pos().
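
A small sketch of the s[n] difference, assuming FPC 3.x. The UTF-8 text is built from explicit bytes so the example does not depend on the source file encoding; UTF8Decode is the RTL routine that turns it into a UnicodeString.

```pascal
program IndexDemo;
{$mode objfpc}{$H+}

var
  U8: RawByteString;
  U16: UnicodeString;
begin
  // 'J', 'o', U+00E3 (C3 A3 in UTF-8), 'o' as raw UTF-8 bytes
  U8 := '';
  SetLength(U8, 5);
  U8[1] := 'J';
  U8[2] := 'o';
  U8[3] := #$C3;
  U8[4] := #$A3;
  U8[5] := 'o';
  SetCodePage(U8, CP_UTF8, False); // label the bytes as UTF-8, no conversion

  U16 := UTF8Decode(U8);

  WriteLn('UTF-8 length in bytes:       ', Length(U8));           // 5
  WriteLn('UTF-16 length in code units: ', Length(U16));          // 4
  WriteLn('U8[3]  = $', HexStr(Ord(U8[3]), 2));                   // C3, half a character
  WriteLn('U16[3] = $', HexStr(Ord(U16[3]), 4));                  // 00E3, the whole character
end.
```

This holds only for characters in the Basic Multilingual Plane; a character outside it takes two UTF-16 code units, which is the "not perfectly" part.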


-Michael


Re: [Lazarus] UTF8 RTL for Windows

2014-11-24 Thread Sven Barth

Am 24.11.2014 14:55 schrieb Hans-Peter Diettrich drdiettri...@aol.com:
 Please note that until now Windows did the Ansi to UTF conversions
itself, in every API call with strings involved. If this was not noticed
before, the conversions won't be noticeable afterwards as well.

This is something that one definitely shouldn't forget! Up to now Windows
did the conversion for us, and do we see people complaining about the
conversion during API calls? No, we don't...

Regards,
Sven

Re: [Lazarus] UTF8 RTL for Windows

2014-11-24 Thread Michael Schnell


On 11/24/2014 02:50 PM, Hans-Peter Diettrich wrote:


code, the user should be allowed to use the string encoding (and byte 
count per character), he finds the most convenient for his application.


I'm not sure what exactly you mean here.
Here I meant that for a *new project* the user might want to 
choose e.g. either UTF-16 (sometimes easier to use) or UTF-8 (sometimes 
faster and less memory overhead) for his own code, while the RTL might 
be done specifically in favor of the OS.


-Michael


Re: [Lazarus] UTF8 RTL for Windows

2014-11-24 Thread Mattias Gaertner


Please don't start an UTF war again.

This has been discussed at length and a zillion times.

Mattias


Re: [Lazarus] UTF8 RTL for Windows

2014-11-24 Thread Graeme Geldenhuys

On 2014-11-24 10:52, Michael Schnell wrote:
 I don't know the internals of the program(s). It's a huge system and 
 does anything that somehow might be possible :-) .

Luckily you have everything unit tested right. So it would simply be a
case of running the test suite to see what works and what doesn't. ;-)


Regards,
  - Graeme -

-- 
fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal
http://fpgui.sourceforge.net/


Re: [Lazarus] UTF8 RTL for Windows

2014-11-24 Thread luiz americo pereira camara

2014-11-24 8:15 GMT-03:00 Mattias Gaertner nc-gaert...@netcologne.de:

 On Sun, 23 Nov 2014 21:37:56 -0300
 luiz americo pereira camara luiz...@oi.com.br wrote:

  The attached program shows how data loss can occur

 The program uses writeln, which converts to console CP.
 When you save the strings to a file you can see what they contain. Or
 write the byte values.


Yes. I improved the program (see the message that followed) to write the byte
values so the comparison should be more exact.


 This works with or without {$codepage utf8}:

 S := 'João'; // constant to (Ansi or Short)string


Without {$codepage utf8}:
When DefaultSystemCodePage is CP_ACP, the variable S will have UTF-8 content
but the encoding will be ACP (in my case 1252), just like it is today.
With DefaultSystemCodePage as CP_UTF8, both content and code page will match.

[..]

I guess it would be a good idea to pass -Fcutf8 with FPC 2.7.1. For
 both modes.


Probably yes.
There's one case that must be tested: when the file is encoded in ANSI, like
those shared with Delphi.
As I understand it, with -Fcutf8 the compiler will interpret that content
as UTF-8, creating wrongly encoded constants.

Does the $codepage directive override -Fcutf8?
If so, the developer could fix this by using $codepage with the correct file
encoding.

Luiz

Re: [Lazarus] UTF8 RTL for Windows

2014-11-24 Thread Mattias Gaertner

On Mon, 24 Nov 2014 12:45:54 -0300
luiz americo pereira camara luiz...@oi.com.br wrote:

 2014-11-24 8:15 GMT-03:00 Mattias Gaertner nc-gaert...@netcologne.de:
[...]
  This works with or without {$codepage utf8}:
 
  S := 'João'; // constant to (Ansi or Short)string
 
 
 Without {$codepage utf8}:
 When DefaultSystemCodePage is CP_ACP, the variable S will have UTF-8 content
 but the encoding will be ACP (in my case 1252), just like it is today.
 With DefaultSystemCodePage as CP_UTF8, both content and code page will match.

Yes, but CP_ACP is treated as CP_UTF8. So it does not matter.

 
 [..]
 
 I guess it would be a good idea to pass -Fcutf8 with FPC 2.7.1. For
  both modes.
 
 
 Probably yes.
 There's one case that must be tested: when the file is encoded in ANSI, like
 those shared with Delphi.
 As I understand it, with -Fcutf8 the compiler will interpret that content
 as UTF-8, creating wrongly encoded constants.

Yes.
 
 Does the $codepage directive override -Fcutf8?

Yes.

 If so, the developer could fix this by using $codepage with the correct file
 encoding.

Yes.

Mattias


Re: [Lazarus] UTF8 RTL for Windows

2014-11-24 Thread Graeme Geldenhuys

On 2014-11-22 16:38, Michael Van Canneyt wrote:
 The exact behaviour of the RTL is controlled by a couple of variables:
 DefaultSystemCodePage, DefaultFileSystemCodePage , 
 DefaultRTLFileSystemCodePage.

I've read the updated wiki page, but still confused about something...

  TFormatSettings = record
    CurrencyFormat: Byte;
    NegCurrFormat: Byte;
    ThousandSeparator: Char;
    DecimalSeparator: Char;
    ...snip...


How are ThousandSeparator and DecimalSeparator supposed to work in
TFormatSettings? If you switched the RTL to UTF-8 or UTF-16 a Russian
thousand separator (4-byte non-breaking white space character) for
example will not fit into a Char type.

I haven't read this whole thread yet, and haven't played with the latest
FPC 2.7.1 yet - so maybe I'm just missing some key information for now.

Or is TFormatSettings just something that hasn't yet been converted to
be Unicode friendly?


Regards,
  - Graeme -

-- 
fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal
http://fpgui.sourceforge.net/


Re: [Lazarus] UTF8 RTL for Windows

2014-11-24 Thread Mattias Gaertner

On Mon, 24 Nov 2014 16:25:15 +
Graeme Geldenhuys mailingli...@geldenhuys.co.uk wrote:

[...]
 Or is TFormatSettings just something that hasn't yet been converted to
 be Unicode friendly?

It has not yet been converted.

We can help the FPC team by collecting all places.


Mattias


Re: [Lazarus] UTF8 RTL for Windows

2014-11-24 Thread Graeme Geldenhuys

On 2014-11-24 16:36, Mattias Gaertner wrote:
 It has not yet been converted.

Many thanks for confirming that.


 We can help the FPC team by collecting all places.

Where should we report this? Mantis or Unicode page of the Wiki?


Regards,
  - Graeme -

-- 
fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal
http://fpgui.sourceforge.net/


Re: [Lazarus] UTF8 RTL for Windows

2014-11-24 Thread Hans-Peter Diettrich


luiz americo pereira camara schrieb:

When DefaultSystemCodePage is CP_ACP, the variable S will have UTF-8 
content but the encoding will be ACP (in my case 1252), just 
like it is today.

With DefaultSystemCodePage as CP_UTF8, both content and code page will match.


The Delphi (and FPC) encoding model allows for strings of different 
static (declared) and dynamic (true content) encoding, see the special 
handling of RawByteString (Wiki).


So far it's not a good idea to simply *assume* that a string variable 
contains bytes of the declared encoding. In detail one should check or 
force the right dynamic encoding of every string variable, before 
searching for specific bytes (chars) in it.


I'm missing documentation for working safely (and efficiently) with such 
irregular strings, most probably none of the FPC (and Delphi) developers 
ever noticed how users are left alone with this problem :-(


DoDi



Re: [Lazarus] UTF8 RTL for Windows

2014-11-24 Thread Hans-Peter Diettrich


Graeme Geldenhuys schrieb:


How are ThousandSeparator and DecimalSeparator supposed to work in
TFormatSettings? If you switched the RTL to UTF-8 or UTF-16 a Russian
thousand separator (4-byte non-breaking white space character) for
example will not fit into a Char type.


The Char type is quite useless with Unicode, at least if it has less 
than 3 bytes (4 for UTF-8). There exist many more flaws in the RTL/LCL, 
assuming that a character always fits into a Char (like the Pos 
overload...).


In the best case Char could be retyped into a string (substring), so 
that it can hold any Unicode character *and* its encoding. Unicode 
string handling in general should always use substrings, for the same 
reasons. Until then 99.9% of occurrences of Char in UTF-8 aware library 
or application code can be considered bugs :-(
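
To make the substring idea concrete, here is a minimal sketch (not from the original message) that walks a UTF-8 string one code point at a time and keeps each code point as a substring instead of a Char. UTF8SeqLen is a hypothetical helper written for this example, not an RTL function, and it does no validation of continuation bytes.

```pascal
program CodepointSketch;

{$mode objfpc}{$H+}

// Hypothetical helper: byte length of the UTF-8 sequence that starts with
// lead byte B. Unexpected lead bytes are treated as length 1.
function UTF8SeqLen(B: Byte): Integer;
begin
  if B < $80 then
    Result := 1
  else if (B and $E0) = $C0 then
    Result := 2
  else if (B and $F0) = $E0 then
    Result := 3
  else if (B and $F8) = $F0 then
    Result := 4
  else
    Result := 1;
end;

var
  S, Ch: UTF8String;
  i, n: Integer;
begin
  // '1', no-break space (U+00A0), '0': the separator takes two bytes in
  // UTF-8, so it cannot be stored in a single one-byte Char.
  S := UTF8Encode(UnicodeString('1') + WideChar($00A0) + '0');
  i := 1;
  while i <= Length(S) do
  begin
    n := UTF8SeqLen(Byte(S[i]));
    Ch := Copy(S, i, n); // one code point, held as a substring
    WriteLn('code point at byte ', i, ' uses ', Length(Ch), ' byte(s)');
    Inc(i, n);
  end;
end.
```

A substring-typed separator field would hold Ch as-is, whereas a Char field would only keep its first byte.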


The FPC team can sort out the real low-level code (most probably only 
the string conversion routines), the rest will become Delphi 
incompatible when fixed.


DoDi



Re: [Lazarus] UTF8 RTL for Windows

2014-11-24 Thread Mattias Gaertner

On Mon, 24 Nov 2014 22:15:29 +0100
Hans-Peter Diettrich drdiettri...@aol.com wrote:

[...]
 The Delphi (and FPC) encoding model allows for strings of different 
 static (declared) and dynamic (true content) encoding, see the special 
 handling of RawByteString (Wiki).
 
 So far it's not a good idea to simply *assume* that a string variable 
 contains bytes of the declared encoding. In detail one should check or 
 force the right dynamic encoding of every string variable, before 
 searching for specific bytes (chars) in it.
 
 I'm missing documentation for working safely (and efficiently) with such 
 irregular strings, most probably none of the FPC (and Delphi) developers 
 ever noticed how users are left alone with this problem :-(

Maybe I don't understand the question, but it seems to me this is
documented where static-, dynamic cp and rawbytestring are explained.

http://wiki.freepascal.org/FPC_Unicode_support#Ansistring

When a procedure requires a specific encoding it uses a specific String
type. If it works with CP_ACP it uses String. If it needs UTF8 it
uses UTF8String. If it can work with any 8-bit encoding it uses
RawByteString. If you need it even more detailed use the
StringCodePage function.
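
A small sketch of that last option (not from the original message; EnsureUTF8 is a hypothetical helper name): a routine that accepts any 8-bit string through RawByteString, inspects the dynamic code page with StringCodePage, and converts with SetCodePage only when needed. It assumes FPC 2.7.1 or later, and it trusts that the dynamic code page really describes the bytes, which is exactly the assumption questioned earlier in this thread.

```pascal
program DynamicCodePageSketch;

{$mode objfpc}{$H+}

// Hypothetical helper: return S as UTF-8, converting only if the dynamic
// code page says the content is something else.
function EnsureUTF8(const S: RawByteString): UTF8String;
var
  Tmp: RawByteString;
begin
  Tmp := S;
  if StringCodePage(Tmp) <> CP_UTF8 then
    // Convert = True re-encodes the bytes; False would only relabel them.
    SetCodePage(Tmp, CP_UTF8, True);
  Result := Tmp;
end;

var
  A: AnsiString;
  U: UTF8String;
begin
  A := 'plain ASCII text';
  U := EnsureUTF8(A);
  WriteLn('dynamic code page of A: ', StringCodePage(A));
  WriteLn('dynamic code page of U: ', StringCodePage(U));
end.
```

If a string is labelled CP_ACP but already holds UTF-8 bytes, as described above for constants compiled without {$codepage utf8}, a converting call like this would re-encode bytes that are already UTF-8 whenever the default system code page is not UTF-8.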

What else do you need?

Mattias


Re: [Lazarus] UTF8 RTL for Windows

2014-11-24 Thread Mattias Gaertner

On Mon, 24 Nov 2014 22:53:44 +0100
Hans-Peter Diettrich drdiettri...@aol.com wrote:

 Graeme Geldenhuys schrieb:
 
  How are ThousandSeparator and DecimalSeparator supposed to work in
  TFormatSettings? If you switched the RTL to UTF-8 or UTF-16 a Russian
  thousand separator (4-byte non-breaking white space character) for
  example will not fit into a Char type.
 
 The Char type is quite useless with Unicode,

Correction: *This* Char type needs to be extended.
Char in general is very useful.

 at least if it has less 
 than 3 bytes (4 for UTF-8). There exist many more flaws in the RTL/LCL, 
 assuming that a character always fits into a Char (like the Pos 
 overload...).

There is a Pos overload for strings. Where is the flaw in Pos?

 
 In the best case Char could be retyped into a string (substring),

That would be wrong in 99.9% of the cases.

 so 
 that it can hold any Unicode character *and* its encoding. Unicode 
 string handling in general should always use substrings, for the same 
 reasons. Until then 99.9% of occurrences of Char in UTF-8 aware library 
 or application code can be considered bugs :-(
 
 The FPC team can sort out the real low-level code (most probably only 
 the string conversion routines), the rest will become Delphi 
 incompatible when fixed.

Please give real world examples.

Mattias


Re: [Lazarus] UTF8 RTL for Windows

2014-11-24 Thread Mattias Gaertner

On Mon, 24 Nov 2014 16:40:06 +
Graeme Geldenhuys mailingli...@geldenhuys.co.uk wrote:

[...]
 Where should we report this? Mantis or Unicode page of the Wiki?

On second thought, a programmer needs to know what might fail and the
alternative/workaround. The latter depends on settings.
In case of the new LCL mode we can extend the LCL Unicode support page.


Mattias


Re: [Lazarus] UTF8 RTL for Windows

2014-11-23 Thread Sven Barth

Am 23.11.2014 00:15 schrieb Mattias Gaertner nc-gaert...@netcologne.de:
  Additionally, most basic File I/O routines now correctly call the
underlying
  OS-es file routines with the codepage the OS expects (which is
WideString on Windows).

 Is it safe to say UTF-16? Or are there still UCS-2 Windows?

Till NT 4 inclusive it's UCS-2, since Windows 2000 it's UTF-16 (I don't
know and especially don't care about 9x).

Regards,
Sven

Re: [Lazarus] UTF8 RTL for Windows

2014-11-23 Thread Michael Van Canneyt




On Sun, 23 Nov 2014, Mattias Gaertner wrote:


True. Although many programmers misunderstand what this means. It is not
as scary as it sounds.


To all the scared people:

Don't worry. Computers are not scary, not really. 
Just look at Terminator (or any other Sci-Fi involving computers), 
the humans always win in the end... :-)






Additionally, most basic File I/O routines now correctly call the underlying
OS-es file routines with the codepage the OS expects (which is WideString on 
Windows).


Is it safe to say UTF-16? Or are there still UCS-2 Windows?


I think some older versions of Windows are still UCS2, 
but I believe as of Windows 2000, it is all UTF-16. 
However, I am not an expert.



The exact behaviour of the RTL is controlled by a couple of variables:
DefaultSystemCodePage, DefaultFileSystemCodePage , DefaultRTLFileSystemCodePage.


Yes, that's the important bit that FPC made better than Delphi. :)


Phew... At least something we did better in the whole string mess ... ;)

Anyway, I was just trying to say that a 1-byte string is not necessarily UTF-8 
in FPC 2.7.1.

Michael.


Re: [Lazarus] UTF8 RTL for Windows

2014-11-23 Thread Mattias Gaertner

On Sun, 23 Nov 2014 13:56:42 +0100 (CET)
Michael Van Canneyt mich...@freepascal.org wrote:

[...]
 Anyway, I was just trying to say that a 1-byte string is not necessarily 
 UTF-8 in FPC 2.7.1.

Yes, you can still store anything you like in strings.
And you can store UTF-8 in a string and say it is not.

Mattias


Re: [Lazarus] UTF8 RTL for Windows

2014-11-23 Thread Graeme Geldenhuys

On 2014-11-23 12:56, Michael Van Canneyt wrote:
 the humans always win in the end... :-)

ROFL

 Phew... At least something we did better in the whole string mess ... ;)

9/10 times FPC does everything better than Delphi.


Regards,
  - Graeme -

-- 
fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal
http://fpgui.sourceforge.net/


Re: [Lazarus] UTF8 RTL for Windows

2014-11-23 Thread Felipe Monteiro de Carvalho

On Sun, Nov 23, 2014 at 1:56 PM, Michael Van Canneyt
mich...@freepascal.org wrote:
 Don't worry. Computers are not scary, not really. Just look at Terminator
 (or any other Sci-Fi involving computers), the humans always win in the
 end... :-)

Well, the first reports of what the Unicode RTL would look like were
pretty scary: a total break of the string part of millions of lines of
code that people have written with Lazarus over the years.

But now reading the latest report of how it will work out, i.e. that
Char=WideChar only in a special mode, and that you can set some
variables to get UTF-8 strings from RTL system calls, well, I haven't
actually tested it yet, but it looks like maybe our code will not
break and maybe we won't need to review/fix hundreds of thousands of
lines of code that have worked for years.

-- 
Felipe Monteiro de Carvalho


Re: [Lazarus] UTF8 RTL for Windows

2014-11-23 Thread luiz americo pereira camara

2014-11-20 13:21 GMT-03:00 Mattias Gaertner nc-gaert...@netcologne.de:


 2. The new mode: The LCL, FCL and RTL treat all String as UTF-8
 encoded. Most RTL file functions now work with full Unicode.
 For example FileExists and aStringList.LoadFromFile(Filename) now
 support full Unicode.


[..]

Please test and tell what you find out.


The FormatSettings fields are still encoded with the system code page
regardless of the DefaultSystemCodePage value.

While there's no problem for English locales, other locales like PT-BR have
accented characters in day and month names.

The problem is in the Windows SysUtils.GetLocaleStr function, which uses a
non-Unicode WinAPI function. This problem will also affect the UnicodeString
RTL.

Attached is a test app that shows the issue. It also has a version of
GetLocaleStr that fixes the issue for the RTL (both versions).

Luiz
program TestUTF8FormatSettings;

{$mode objfpc}{$H+}

uses
  {$ifdef Windows}
  Windows,
  {$endif}
  Classes, sysutils;

{$ifdef Windows}
function GetLocaleStrTest(LID, LT: Longint; const Def: string): String;
var
  L: Integer;
  Buf: array[0..255] of WideChar;
  W: WideString;
begin
  // The buffer size is passed in characters, not bytes, so use Length, not SizeOf
  L := GetLocaleInfoW(LID, LT, Buf, Length(Buf));
  if L > 0 then
  begin
    //SetString(Result, PWideChar(@Buf[0]), L - 1) leads to wrong result
    //Bug in Procedure SetString (Out S : AnsiString; Buf : PWideChar; Len : SizeInt) ?
    SetString(W, PWideChar(@Buf[0]), L - 1);
    Result := W;
  end
  else
    Result := Def;
end;
{$endif}

var
  i: Integer;
  S: String;
  List: TStringList;
begin
  WriteLn('DefaultSystemCodePage: ', DefaultSystemCodePage);
  DefaultSystemCodePage:=CP_UTF8;
  DefaultRTLFileSystemCodePage:=CP_UTF8;
  List := TStringList.Create;
  for i := 1 to 12 do
  begin
    Write(StringCodePage(DefaultFormatSettings.LongMonthNames[i]), ' - ');
    WriteLn(DefaultFormatSettings.LongMonthNames[i]);
    List.Add(DefaultFormatSettings.LongMonthNames[i]);
  end;
  for i := 1 to 7 do
  begin
    Write(StringCodePage(DefaultFormatSettings.LongDayNames[i]), ' - ');
    WriteLn(DefaultFormatSettings.LongDayNames[i]);
    List.Add(DefaultFormatSettings.LongDayNames[i]);
  end;
  {$ifdef Windows}
  S := GetLocaleStrTest(GetThreadLocale, LOCALE_SDAYNAME1+1, 'xx');
  Write(StringCodePage(S), ' - ');
  WriteLn(S);
  List.Add(S);
  {$endif}

  List.SaveToFile('TestUTF8FormatSettingsOut.txt');
  List.Free;
end.


Re: [Lazarus] UTF8 RTL for Windows

2014-11-23 Thread luiz americo pereira camara

2014-11-20 13:21 GMT-03:00 Mattias Gaertner nc-gaert...@netcologne.de:


 Please test and tell what you find out.



Without the {$codepage utf8} directive, String constants will get code page 0
(CP_ACP) and not 1200 (UTF-16 - UnicodeString).

String variables assigned from those constants will also have code page 0.

This is because the code page of a constant string is determined at compile time.

I'm not sure if there's a compiler command-line parameter with the same effect as
{$codepage utf8}.

The attached program shows how data loss can occur.

Luiz
program testStringConstantCP;

{$mode objfpc}{$H+}

uses
  Classes, sysutils;
var
  W: UnicodeString;
  S, S_2: String;
  SUTF8, SUTF8_2: UTF8String;
begin
  SetMultiByteConversionCodePage(CP_UTF8);
  W := 'João';
  Write('W: ': 10, StringCodePage(W), ' - ');
  WriteLn(W);

  S := 'João';
  Write('S: ': 10,StringCodePage(S), ' - ');
  WriteLn(S);

  S_2 := W;
  Write('S_2: ': 10,StringCodePage(S_2), ' - ');
  WriteLn(S_2);

  SUTF8 := W;
  Write('SUTF8: ': 10,StringCodePage(SUTF8), ' - ');
  WriteLn(SUTF8);

  SUTF8_2 := S;
  Write('SUTF8_2: ': 10, StringCodePage(SUTF8_2), ' - ');
  WriteLn(SUTF8_2);
end.


Re: [Lazarus] UTF8 RTL for Windows

2014-11-23 Thread luiz americo pereira camara

I added {.$codepage utf8} and all strings output as Joao.

I got confused. I did not expect changes in the constant assigned to the
UnicodeString variable.

I need to check what the correct UTF8 output is:  JoA£o or Joao

Luiz


Re: [Lazarus] UTF8 RTL for Windows

2014-11-23 Thread luiz americo pereira camara

I updated the test app to show the hexadecimal representation of the string.

When {$codepage utf8} is set, the encoding and content of all strings match
each other regardless of MultiByteConversionCodePage.

Without {$codepage utf8}:

When MultiByteConversionCodePage is CP_ACP (default), one string gets the
UTF-8 content but its code page is the system ANSI one (1252 in my case).
When MultiByteConversionCodePage is UTF8, two strings (converted from
WideString) get code page UTF8 but their content is wrong.

Luiz


program testStringConstantCP;

{$mode objfpc}{$H+}
{.$codepage utf8}

uses
  Classes, sysutils;

function StrToHex(const S: String): String;
var
  i: Integer;
begin
  Result := '';
  if S = '' then
    Exit;
  for i := 1 to Length(S) do
  begin
    // Two hex digits per byte so that values below $10 stay unambiguous
    Result := Result + IntToHex(Byte(S[i]), 2);
  end;
end;

var
  W: UnicodeString;
  S, S_2: String;
  SUTF8, SUTF8_2: UTF8String;
begin
  SetMultiByteConversionCodePage(CP_UTF8);
  W := 'ã';
  Write('W: ': 10, StringCodePage(W): 6, ' - ');
  WriteLn(W: 6);

  S := 'ã';
  Write('S: ': 10,StringCodePage(S): 6, ' - ');
  WriteLn(S: 6, ' - ', StrToHex(S));

  S_2 := W;
  Write('S_2: ': 10,StringCodePage(S_2): 6, ' - ');
  WriteLn(S_2: 6, ' - ', StrToHex(S_2));

  SUTF8 := W;
  Write('SUTF8: ': 10,StringCodePage(SUTF8): 6, ' - ');
  WriteLn(SUTF8: 6, ' - ', StrToHex(SUTF8));

  SUTF8_2 := S;
  Write('SUTF8_2: ': 10, StringCodePage(SUTF8_2): 6, ' - ');
  WriteLn(SUTF8_2: 6, ' - ', StrToHex(SUTF8_2));
end.


Re: [Lazarus] UTF8 RTL for Windows

2014-11-23 Thread Sven Barth


On 24.11.2014 01:37, luiz americo pereira camara wrote:



2014-11-20 13:21 GMT-03:00 Mattias Gaertner nc-gaert...@netcologne.de
mailto:nc-gaert...@netcologne.de:


Please test and tell what you find out.



Without the {$codepage utf8} directive, String constants will get code page 0
(CP_ACP) and not 1200 (UTF-16 - UnicodeString).

String variables assigned from those constants will also have code page 0.

This is because the code page of a constant string is determined at compile time.

I'm not sure if there's a compiler command-line parameter with the same effect as
{$codepage utf8}.

The attached program shows how data loss can occur.


The command line parameter for this is -Fcutf8.

Regards,
Sven


Re: [Lazarus] UTF8 RTL for Windows

2014-11-23 Thread Sven Barth


On 24.11.2014 03:19, luiz americo pereira camara wrote:

I updated the test app to show the hexadecimal representation of the string.

When {$codepage utf8} is set, the encoding and content of all strings match
each other regardless of MultiByteConversionCodePage.

Without {$codepage utf8}:

When MultiByteConversionCodePage is CP_ACP (default), one string gets the
UTF-8 content but its code page is the system ANSI one (1252 in my case).
When MultiByteConversionCodePage is UTF8, two strings (converted from
WideString) get code page UTF8 but their content is wrong.


Yes. $codepage controls how the compiler parses the constants, while 
MultiByteConversionCodePage controls the runtime behavior. In theory this 
is all documented at 
http://wiki.freepascal.org/FPC_Unicode_support#String_constants


Regards,
Sven



Re: [Lazarus] UTF8 RTL for Windows

2014-11-22 Thread Jürgen Hestermann


Am 2014-11-20 um 17:21 schrieb Mattias Gaertner:
 The development version of FPC 2.7.1 has extended Strings and many RTL
 functions now work for codepages other than the system codepage.

 2. The new mode: The LCL, FCL and RTL treat all String as UTF-8 encoded.
...
 When accessing the WinAPI you must use the W functions or use
 UTF8ToWinCP and WinCPToUTF8.

Is this correct?
The W functions of the WinAPI expect UTF16 so
a conversion needs to be done in both cases,
either to System code page or to UTF16.
Or can we use STRING with WinAPI W functions directly?


Re: [Lazarus] UTF8 RTL for Windows

2014-11-22 Thread Mattias Gaertner

On Sat, 22 Nov 2014 14:37:00 +0100
Jürgen Hestermann juergen.hesterm...@gmx.de wrote:

 Am 2014-11-20 um 17:21 schrieb Mattias Gaertner:
   The development version of FPC 2.7.1 has extended Strings and many RTL
   functions now work for codepages other than the system codepage.
 
   2. The new mode: The LCL, FCL and RTL treat all String as UTF-8 encoded.
 ...
   When accessing the WinAPI you must use the W functions or use
   UTF8ToWinCP and WinCPToUTF8.
 
 Is this correct?
 The W functions of the WinAPI expect UTF16 so
 a conversion needs to be done in both cases,
 either to System code page or to UTF16.
 Or can we use STRING with WinAPI W functions directly?

You can use them directly.

For example:

procedure TForm1.FormCreate(Sender: TObject);
var
  s: string; // String = AnsiString because of $H+
begin
  s:=GetCommandLineW;
  // GetCommandLineW returns a UTF-16 PWideChar
  // the compiler adds code to convert this to the
  // default system codepage (CP_ACP = CP_UTF8)
  // the resulting string has StringCodePage CP_ACP
  // and is encoded in UTF-8.
  // therefore you can simply use it with the LCL
  Memo1.Lines.Add(s);
end;

You will get a compiler warning (id 4105) that the WideString to AnsiString
conversion might lose data. The warning is right if the default string codepage is
not UTF-8. If your code only runs with the RTL in UTF-8 mode, you
can disable this warning.
As an alternative you can use:
  s:=UTF8Encode(GetCommandLineW);

You must also use UTF8Encode if your code should run with both FPC 2.6.4
and 2.7.1.
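
A console-program sketch of the explicit variant (not from the original message; the program name is made up). It is meant to compile with both compiler versions and goes through a UnicodeString variable first so that the conversion steps are visible. The Windows unit and GetCommandLineW are only used on Windows; elsewhere the program just prints a note.

```pascal
program CmdLineSketch;

{$mode objfpc}{$H+}

{$ifdef Windows}
uses
  Windows;
{$endif}

var
  W: UnicodeString;
  S: String;
begin
  {$ifdef Windows}
  // GetCommandLineW returns a UTF-16 PWideChar; copy it into a UnicodeString.
  W := GetCommandLineW;
  // Explicit UTF-16 to UTF-8 conversion, independent of the default
  // system code page.
  S := UTF8Encode(W);
  WriteLn('UTF-16 code units: ', Length(W));
  WriteLn('UTF-8 bytes:       ', Length(S));
  {$else}
  WriteLn('This sketch only applies to Windows.');
  {$endif}
end.
```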

Mattias


Re: [Lazarus] UTF8 RTL for Windows

2014-11-22 Thread Jürgen Hestermann


Am 2014-11-22 um 15:06 schrieb Mattias Gaertner:
 procedure TForm1.FormCreate(Sender: TObject);
 var s: string; // String = AnsiString because of $H+
 begin
   s:=GetCommandLineW;
   // GetCommandLineW returns a UTF-16 PWideChar
   // the compiler adds code to convert this to the
   // default system codepage (CP_ACP = CP_UTF8)
   // the resulting string has StringCodePage CP_ACP
   // and is encoded in UTF-8.
   // therefore you can simply use it with the LCL

Okay.
Does that mean that the compiler *always* assumes that
String=UTF-8 encoded AnsiString and converts to
other (known) encoded string types if needed?




Re: [Lazarus] UTF8 RTL for Windows

2014-11-22 Thread Mattias Gaertner

On Sat, 22 Nov 2014 16:18:09 +0100
Jürgen Hestermann juergen.hesterm...@gmx.de wrote:

 Am 2014-11-22 um 15:06 schrieb Mattias Gaertner:
   procedure TForm1.FormCreate(Sender: TObject);
   var s: string; // String = AnsiString because of $H+
   begin
 s:=GetCommandLineW;
 // GetCommandLineW returns a UTF-16 PWideChar
 // the compiler adds code to convert this to the
 // default system codepage (CP_ACP = CP_UTF8)
 // the resulting string has StringCodePage CP_ACP
 // and is encoded in UTF-8.
 // therefore you can simply use it with the LCL
 
 Okay.
 Does that mean that the compiler *always* assumes that
 String=UTF-8 encoded AnsiString 

Yes, with the UTF8 RTL. The default RTL uses system codepage.

 and converts to other (known) encoded string types if needed?

Yes. That's the new feature of FPC 2.7.1.
What other encoded string types do you have in mind?

Mattias


Re: [Lazarus] UTF8 RTL for Windows

2014-11-22 Thread Michael Van Canneyt




On Sat, 22 Nov 2014, Mattias Gaertner wrote:


On Sat, 22 Nov 2014 16:18:09 +0100
Jürgen Hestermann juergen.hesterm...@gmx.de wrote:


Am 2014-11-22 um 15:06 schrieb Mattias Gaertner:
  procedure TForm1.FormCreate(Sender: TObject);
  var s: string; // String = AnsiString because of $H+
  begin
s:=GetCommandLineW;
// GetCommandLineW returns a UTF-16 PWideChar
// the compiler adds code to convert this to the
// default system codepage (CP_ACP = CP_UTF8)
// the resulting string has StringCodePage CP_ACP
// and is encoded in UTF-8.
// therefore you can simply use it with the LCL

Okay.
Does that mean that the compiler *always* assumes that
String=UTF-8 encoded AnsiString 


Yes, with the UTF8 RTL. The default RTL uses system codepage.


Careful, there is no such thing as the UTF8 RTL.

There is now a Unicode and CodePage-aware RTL.

That means it has:
- Codepage aware single-byte strings.
  The codepage of a string may, or may not, be UTF8 (i.e. Unicode).
- Widestrings (unicode).
The compiler handles conversion of codepages transparently.

The codepage aware single-byte strings are not automatically UTF-8.
On Linux, this is probably so. But on Windows, this is not necessarily so.

Additionally, most basic File I/O routines now correctly call the underlying 
OS-es file routines with the codepage the OS expects (which is WideString on Windows).


The exact behaviour of the RTL is controlled by a couple of variables:
DefaultSystemCodePage, DefaultFileSystemCodePage , DefaultRTLFileSystemCodePage.

See http://wiki.freepascal.org/FPC_Unicode_support.

Michael.

Re: [Lazarus] UTF8 RTL for Windows

2014-11-22 Thread Hans-Peter Diettrich


Mattias Gaertner schrieb:


  // GetCommandLineW returns a UTF-16 PWideChar
  // the compiler adds code to convert this to the
  // default system codepage (CP_ACP = CP_UTF8)
  // the resulting string has StringCodePage CP_ACP
  // and is encoded in UTF-8.


Does this mean that Lazarus (new mode) ignores the OS system codepage 
setting?


DoDi



Re: [Lazarus] UTF8 RTL for Windows

2014-11-22 Thread Mattias Gaertner

On Sat, 22 Nov 2014 17:18:35 +0100
Hans-Peter Diettrich drdiettri...@aol.com wrote:

 Mattias Gaertner schrieb:
 
// GetCommandLineW returns a UTF-16 PWideChar
// the compiler adds code to convert this to the
// default system codepage (CP_ACP = CP_UTF8)
// the resulting string has StringCodePage CP_ACP
// and is encoded in UTF-8.
 
 Does this mean that Lazarus (new mode) ignores the OS system codepage 
 setting?

To be exact:
The Lazarus unit fpcadds sets the default string encoding
(DefaultSystemCodePage) to CP_UTF8. The OS system codepage of
Windows is not changed. All non-W (e.g. A) functions still return
and expect strings in the Windows system codepage.
You can convert between UTF-8 and the Windows system codepage with
UTF8ToWinCP and WinCPToUTF8.
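
As a rough sketch of the effect (not a copy of the fpcadds source), the default can be switched with the System unit's SetMultiByteConversionCodePage and SetMultiByteRTLFileSystemCodePage. That fpcadds uses exactly these calls is an assumption here; the program name is made up.

```pascal
program Utf8Default;
{$mode objfpc}{$H+}
begin
  // Value picked up from the OS at program start
  WriteLn('before: ', DefaultSystemCodePage);

  // Assumed equivalent of what the Lazarus unit does on initialization:
  // make UTF-8 the encoding for String and for RTL file name handling.
  SetMultiByteConversionCodePage(CP_UTF8);
  SetMultiByteRTLFileSystemCodePage(CP_UTF8);

  // Only the program's own default changed, not the OS setting
  WriteLn('after:  ', DefaultSystemCodePage);
end.
```

The Windows A functions are unaffected by this, which is why strings passed to them still need UTF8ToWinCP and their results WinCPToUTF8.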

So, yes, an LCL application can now mostly ignore the system codepage.
Finding the exceptions and traps is the goal of this mail thread.

Please test and report what you find out.

Mattias


Re: [Lazarus] UTF8 RTL for Windows

2014-11-22 Thread Mattias Gaertner

On Sat, 22 Nov 2014 17:38:33 +0100 (CET)
Michael Van Canneyt mich...@freepascal.org wrote:

[...]
  Yes, with the UTF8 RTL. The default RTL uses system codepage.
 
 Careful, there is no such thing as the UTF8 RTL.
 
 There is now a Unicode and CodePage-aware RTL.

Well, yes, you are right of course.
But "Unicode and CodePage-aware RTL set to UTF-8" is an awkwardly long
title.
Also, many users think that the new string types will break all
their code and add lots of overhead. I want to advertise that this is
not so. On the contrary, it is very compatible, you get cross-platform
Unicode, and the overhead is pretty small.
And last but not least: Programming Unicode has become
easier, because string encoding is now more consistent.

 
 That means it has:
 - Codepage aware single-byte strings.
The codepage of a string may, or may not, be UTF8 (i.e. Unicode).
 - Widestrings (unicode).
 The compiler handles conversion of codepages transparently.
 
 The codepage-aware single-byte strings are not automatically UTF-8.
 On Linux, this is probably so. But on Windows, this is not necessarily so.

True. Although many programmers misunderstand what this means. It is not
as scary as it sounds.

 
 Additionally, most basic file I/O routines now correctly call the underlying 
 OS's file routines with the codepage the OS expects (which is WideString on 
 Windows).

Is it safe to say UTF-16? Or are there still UCS-2 Windows?

 
 The exact behaviour of the RTL is controlled by a couple of variables:
 DefaultSystemCodePage, DefaultFileSystemCodePage , 
 DefaultRTLFileSystemCodePage.

Yes, that's the important bit that FPC made better than Delphi. :)

 
 See http://wiki.freepascal.org/FPC_Unicode_support.


Mattias


Re: [Lazarus] UTF8 RTL for Windows

2014-11-20 Thread silvioprog

On Thu, Nov 20, 2014 at 1:21 PM, Mattias Gaertner nc-gaert...@netcologne.de
 wrote:

 Hi all, especially Windows users,

 The development version of FPC 2.7.1 has extended Strings and many RTL
 functions now work for codepages other than the system codepage.

 This means Lazarus can now be compiled in two modes:

 1. The old mode: LCL treats all String as UTF-8 encoded. When
 accessing RTL and WinAPI functions you have to use the UTF8 functions.
 For example aStringList.LoadFromFile(UTF8ToSys(Filename)) and
 FileExistsUTF8. Note that UTF8ToSys only supports characters in the
 Windows code page, while FileExistsUTF8 supports the full Unicode range.

 2. The new mode: The LCL, FCL and RTL treat all String as UTF-8
 encoded. Most RTL file functions now work with full Unicode.
 For example FileExists and aStringList.LoadFromFile(Filename) now
 support full Unicode.
 AnsiToUTF8, UTF8ToAnsi, SysToUTF8, UTF8ToSys have no effect. Many
 UTF8Encode and UTF8Decode calls are no longer needed, because when
 assigning UnicodeString to String and vice versa the compiler does it
 automatically for you.
 When accessing the WinAPI you must use the W functions or use
 UTF8ToWinCP and WinCPToUTF8.
 You can enable the new mode by compiling Lazarus clean with
 -dEnableUTF8RTL.
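
A minimal console sketch of what the new mode means for user code, under these assumptions: the two SetMultiByte... calls stand in for what Lazarus does when built with -dEnableUTF8RTL, and 'example.txt' is a hypothetical file name.

```pascal
program NewModeSketch;
{$mode objfpc}{$H+}
uses
  {$IFDEF UNIX}cwstring,{$ENDIF}  // widestring manager for conversions on Unix
  SysUtils, Classes;
var
  FileName: string;        // UTF-8 encoded once the default is CP_UTF8
  Wide: UnicodeString;     // UTF-16
  Lines: TStringList;
begin
  // Assumption: stand-in for the setup Lazarus performs in the new mode
  SetMultiByteConversionCodePage(CP_UTF8);
  SetMultiByteRTLFileSystemCodePage(CP_UTF8);

  FileName := 'example.txt';          // hypothetical file name

  // No UTF8Decode / UTF8Encode needed: the compiler converts on assignment
  Wide := UnicodeString(FileName);
  FileName := string(Wide);

  // No FileExistsUTF8 / UTF8ToSys needed: the plain RTL calls take the name as is
  if FileExists(FileName) then
  begin
    Lines := TStringList.Create;
    try
      Lines.LoadFromFile(FileName);
      WriteLn(Lines.Count, ' lines read');
    finally
      Lines.Free;
    end;
  end
  else
    WriteLn('file not found: ', FileName);
end.
```

In the old mode the same program would need FileExistsUTF8 and LoadFromFile(UTF8ToSys(FileName)) to handle non-ASCII names.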

 More information about the new FPC Unicode Support:
 http://wiki.freepascal.org/FPC_Unicode_support

 RTL functions that now support Unicode under Windows:
 http://wiki.freepascal.org/FPC_Unicode_support#RTL_changes

 The above links are about the default RTL with system code page.
 I want to create a Wiki page to gather all information about
 the UTF8 RTL for Lazarus users and how to adapt their code.

 Please test and tell what you find out.


 Mattias


The best news of the year! \o/ \o/ \o/

Thanks thanks thanks Lazarus/FPC team! (y)

-- 
Silvio Clécio
My public projects - github.com/silvioprog