Re: [fpc-devel] UTF-8 string literals

2017-05-11 Thread Mattias Gaertner
On Thu, 11 May 2017 14:52:37 +0300
Juha Manninen  wrote:

> On Wed, May 10, 2017 at 7:00 PM, Martok  wrote:
> > I just searched for "Unicode".  
> 
> I wanted to delete the old page
>  http://wiki.freepascal.org/LCL_Unicode_Support
> completely but I don't know how to do it so I just made it empty.
> Anybody knows how to delete it?

Don't delete a page. Add a hint where the new content is.

Mattias
___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


Re: [fpc-devel] UTF-8 string literals

2017-05-11 Thread Juha Manninen
On Wed, May 10, 2017 at 7:00 PM, Martok  wrote:
> I just searched for "Unicode".

I wanted to delete the old page
 http://wiki.freepascal.org/LCL_Unicode_Support
completely but I don't know how to do it so I just made it empty.
Does anybody know how to delete it?

I also renamed the "Better Unicode Support ..." page to "Unicode Support ...".
 http://wiki.freepascal.org/Unicode_Support_in_Lazarus
I am now improving and simplifying it.
I am trying to concentrate on how to code in a Delphi-compatible way.

Martok, please take care of the other pages you found. Mark them as
invalid or deprecated, delete wrong info, rename them ... whatever. Be
creative.
FYI, the wiki can be edited by anybody.

Juha


Re: [fpc-devel] UTF-8 string literals

2017-05-11 Thread Michael Van Canneyt



On Wed, 10 May 2017, Martok wrote:


> But apparently everything is rainbows and unicorns and there is absolutely no
> problem with the documentation at all, so I guess this week-long discussion here
> never happened anyway.


This is not quite correct. I have proposed to add a table to the official
documentation, documenting in the programmer's guide how the strings are
stored.

But I personally lack the information to put in this table, and I am waiting
for input from others. I have put what I know in a mail; it needs to be
amended by the compiler people.

Michael.


Re: [fpc-devel] UTF-8 string literals

2017-05-11 Thread Martok
That's the one I also think Sven was talking about.
I just searched for "Unicode". Michael's proposal comes up, but I guess the
title is fairly obvious.


But apparently everything is rainbows and unicorns and there is absolutely no
problem with the documentation at all, so I guess this week-long discussion here
never happened anyway.


Martok

On 10.05.2017 at 08:38, Mattias Gaertner wrote:
> On Tue, 9 May 2017 14:59:16 +0200
> Michael Schnell  wrote:
> 
>> On 06.05.2017 09:39, Sven Barth via fpc-devel wrote:
>>> That might be the one from Michael Schnell.  
>> Very unlikely, as this text does not mention anything about how a source 
>> file byte sequence is converted in a String constant / literal.
> 
> I think he meant this one:
> http://wiki.lazarus.freepascal.org/index.php?title=not_Delphi_compatible_enhancement_for_Unicode_Support=history
> 
> I thought Mschnell is Michael Schnell. Was this wrong?
> 
> Mattias


Re: [fpc-devel] UTF-8 string literals

2017-05-10 Thread Mattias Gaertner
On Tue, 9 May 2017 14:59:16 +0200
Michael Schnell  wrote:

> On 06.05.2017 09:39, Sven Barth via fpc-devel wrote:
> > That might be the one from Michael Schnell.  
> Very unlikely, as this text does not mention anything about how a source 
> file byte sequence is converted in a String constant / literal.

I think he meant this one:
http://wiki.lazarus.freepascal.org/index.php?title=not_Delphi_compatible_enhancement_for_Unicode_Support=history

I thought Mschnell is Michael Schnell. Was this wrong?

Mattias


Re: [fpc-devel] UTF-8 string literals

2017-05-10 Thread Michael Schnell

On 06.05.2017 09:39, Sven Barth via fpc-devel wrote:

> That might be the one from Michael Schnell.

Very unlikely, as this text does not mention anything about how a source
file byte sequence is converted into a string constant / literal.


-Michael


Re: [fpc-devel] UTF-8 string literals

2017-05-10 Thread Michael Schnell

On 06.05.2017 09:39, Sven Barth via fpc-devel wrote:


> That might be the one from Michael Schnell. Probably it should be
> marked with a big, fat warning that it's merely a user's suggestion
> and nothing official.


I hope it is absolutely clear in the text that this is only a suggestion 
and not something that is real (or will be real in the near future).


-Michael



Re: [fpc-devel] UTF-8 string literals

2017-05-08 Thread Martok
> That might be the one from Michael Schnell. Probably it should be marked with
> a big, fat warning that it's merely a user's suggestion and nothing official.
Not even that. This one looks relatively obvious to me ;)

I've filed a bug as  for 
reference.


Martok



Re: [fpc-devel] UTF-8 string literals

2017-05-07 Thread Sven Barth via fpc-devel
On 07.05.2017 14:16, Marco van de Voort wrote:
> In our previous episode, Sven Barth via fpc-devel said:
>>>> Is there a plan to fix it?
>>>
>>> Now it is fixed :D (revision 36116; maybe we should merge that to fixes
>>> once I or someone else tested a big endian target)
>>
>> Okay, it works correctly on big endian targets as well (and Mac OS X 10.4
>> even has valid characters for the console to test with :D ). Thus this
>> change could be merged to 3.0.3.
> 
> Done.

Thanks :)

Regards,
Sven



Re: [fpc-devel] UTF-8 string literals

2017-05-07 Thread Marco van de Voort
In our previous episode, Sven Barth via fpc-devel said:
> > > Is there a plan to fix it?
> >
> > Now it is fixed :D (revision 36116; maybe we should merge that to fixes
> > once I or someone else tested a big endian target)
> 
> Okay, it works correctly on big endian targets as well (and Mac OS X 10.4
> even has valid characters for the console to test with :D ). Thus this
> change could be merged to 3.0.3.

Done.


Re: [fpc-devel] UTF-8 string literals

2017-05-07 Thread Sven Barth via fpc-devel
On 05.05.2017 16:08, "Sven Barth" wrote:
>
> On 05.05.2017 16:03, "Juha Manninen" wrote:
> >
> > On Fri, May 5, 2017 at 2:53 PM, Mattias Gaertner
> >  wrote:
> > > 1. When using a character outside BMP FPC stops with:
> > > Error: UTF-8 code greater than 65535 found
> > > For example:
> > > const Eyes = '';
> >
> > I copy a related post from Lazarus list by myself and Sven Barth.
> > It belongs here:
> >
> > On Fri, May 5, 2017 at 3:56 PM, Sven Barth via Lazarus
> >  wrote:
> > > That is mainly due to the compiler not supporting surrogate pairs for the
> > > UTF-8 -> UTF-16 conversion. If it would support them, then there wouldn't be
> > > a problem anymore...
> >
> > That is a serious bug. Getting codepoints right is the absolute
> > minimum requirement for Unicode support. Surrogate pairs are the
> > UTF-16 equivalent of multi-byte codepoints in UTF-8.
> >
> > Now I understand this was not caused by our UTF-8 run-time switch
> > "hack". It is a plain bug in FPC.
> > Is there a plan to fix it?
>
> Now it is fixed :D (revision 36116; maybe we should merge that to fixes
> once I or someone else tested a big endian target)

Okay, it works correctly on big endian targets as well (and Mac OS X 10.4
even has valid characters for the console to test with :D ). Thus this
change could be merged to 3.0.3.

Regards,
Sven


Re: [fpc-devel] UTF-8 string literals

2017-05-07 Thread Michael Van Canneyt



On Sun, 7 May 2017, Mattias Gaertner wrote:


> On Sun, 7 May 2017 10:27:58 +0200
> Florian Klaempfl  wrote:
>
>> [...]
>> 2. What would happen then the other way around? When casting the string
>> constant to a PUnicodeChar (what probably a lot of delphi code does)?
>
> Good point.
>
>> [...]
>> I think, it would nice if Michael (v. C.) prepares some section for the
>> docs and we comment and help him to improve it.
>
> That would be highly appreciated.


I would be glad to do so, but I need something to start with.

In my reply to Sven I asked if a set of rules exist.

As far as I understand:

- By default, strings are stored internally as UTF-16.
- Unless it is an ASCII string, in which case it is stored as plain ASCII.
- In special cases such as a typecast, the compiler stores them as UTF-8?

A bit shallow...


Michael.


Re: [fpc-devel] UTF-8 string literals

2017-05-07 Thread Mattias Gaertner
On Sun, 7 May 2017 10:27:58 +0200
Florian Klaempfl  wrote:

>[...]
> 2. What would happen then the other way around? When casting the string
> constant to a PUnicodeChar (what probably a lot of delphi code does)?

Good point.


>[...]
> I think, it would nice if Michael (v. C.) prepares some section for the
> docs and we comment and help him to improve it.

That would be highly appreciated.

Mattias


Re: [fpc-devel] UTF-8 string literals

2017-05-07 Thread Florian Klaempfl
On 05.05.2017 at 13:53, Mattias Gaertner wrote:
> Hi,
> 
> AFAIK FPC stores UTF-8 string literals (-Fcutf8) 

-Fc only tells the compiler the code page of the source file; it
says nothing about how string constants shall be encoded.

> as widestrings
> instead of UTF8String. Please correct me if I'm wrong.
> 
> This has several side effects:
> 
> 1. When using a character outside BMP FPC stops with:
> Error: UTF-8 code greater than 65535 found
> For example:
> const Eyes = '';
> 
> 2. Assigning a UTF-8 literal to an UTF8String requires a
> widestringmanager.
> For example non ISO-8859-1 chars are mangled:
> var u: UTF8String = 'äöüالعَرَبِيَّة';
> 
> 3. PChar on a string literal does not work as expected. You get the
> bytes of a widestring instead.

Well, it depends on what you expect :)

> 
> 
> What would happen if FPC would be extended to store UTF-8
> literals as UTF8String? 
> What are the disadvantages?

1. Backward compatibility. Due to its Windows origins and history, the
default Unicode encoding in FPC is UTF-16, and FPC also uses UTF-16
internally everywhere.

2. What would happen the other way around, when casting the string
constant to a PUnicodeChar (which a lot of Delphi code probably does)?

3. Personally, I still think UTF-16 is the "native" Unicode type: all
important APIs use UTF-16. For me, UTF-8 is a hack.

What we could do of course is that, if a constant is assigned to a
string with explicit UTF-8 encoding, the compiler does the
conversion at run time. But it complicates things even more. This does
not solve the PChar problem, but I think that when somebody uses Unicode
source files and PChar, he is on his own :)
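
As a rough illustration of what an explicit UTF-16 to UTF-8 conversion at run time looks like, here is a minimal sketch using UTF8Encode from the System unit. The program name and the hard-coded code units are assumptions chosen for illustration; the source deliberately contains no non-ASCII literal, so the result should not depend on -Fc or {$codepage}.

```pascal
program Utf16ToUtf8Sketch;
{$mode objfpc}{$H+}

var
  w: UnicodeString;
  u: UTF8String;
  i: Integer;
begin
  // Build the UTF-16 string for a-umlaut, o-umlaut, u-umlaut from
  // explicit code units, so the source file itself stays pure ASCII.
  SetLength(w, 3);
  w[1] := WideChar($00E4);
  w[2] := WideChar($00F6);
  w[3] := WideChar($00FC);

  // Explicit UTF-16 -> UTF-8 conversion, performed at run time.
  u := UTF8Encode(w);

  // Dump the resulting bytes in hex.
  for i := 1 to Length(u) do
    Write(HexStr(Ord(u[i]), 2), ' ');
  WriteLn;
end.
```

Each of the three characters takes two bytes in UTF-8, so this should print the six bytes C3 A4 C3 B6 C3 BC.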

I think it would be nice if Michael (v. C.) prepares a section for the
docs and we comment and help him to improve it.



Re: [fpc-devel] UTF-8 string literals

2017-05-06 Thread Mattias Gaertner
On Fri, 5 May 2017 16:08:41 +0200
Sven Barth via fpc-devel  wrote:

>[...]
> Now it is fixed :D (revision 36116; maybe we should merge that to fixes
> once I or someone else tested a big endian target)

Thank You!

Mattias


Re: [fpc-devel] UTF-8 string literals

2017-05-06 Thread Sven Barth via fpc-devel
On 06.05.2017 08:18, "Martok" wrote:
> PS: adding to the discussion over on the Lazarus ML: I just found a fourth wiki
> page describing a slightly different Unicode support. This is getting ridiculous.

That might be the one from Michael Schnell. Probably it should be marked
with a big, fat warning that it's merely a user's suggestion and nothing
official.

Regards,
Sven


Re: [fpc-devel] UTF-8 string literals

2017-05-06 Thread Martok

> You should weigh the advantages you outline here against the disadvantages of
> no longer knowing how string literals will be encoded.
As a programmer, either I don't want to know (declared const without giving
explicit type) or I do, then I did declare it correctly:

{$codepage utf8}
var u: UTF8String = 'äöüالعَرَبِيَّة';
  -> UTF8String containing the characters I entered in the source file (in this
case(!!) just 1:1 copy).

{$codepage utf8}
var u: UCS4String= 'äöü';
  -> UCS4 encoded Version, either 00e4 00f6 00fc or the equivalent
with combining characters

There should probably be an error if the characters I typed don't actually exist
in the declared type (emoji in a UCS2String), but otherwise, there's no good
reason why that shouldn't "just work".

> It means e.g. the resource string tables will have entries that are UTF16
> encoded or entries that are UTF8 encoded, depending on the unit they come from.
> This is highly undesirable.
Always convert from "unit CP" to UTF8 (or UTF16 if some binary compat is
required), done. Aren't they just internal anyway?

> By forcing everything UTF16 we ensure delphi compatibility (yes it does matter)
> and we also ensure a uniform set of string tables.
If that was what happened, ok. But from the error message Mattias listed as (1)
I would assume that the actual string type is UCS2String, at least at some point
in the process.

Just my 2 cents...

Martok

PS: adding to the discussion over on the Lazarus ML: I just found a fourth wiki
page describing a slightly different Unicode support. This is getting 
ridiculous.




Re: [fpc-devel] UTF-8 string literals

2017-05-05 Thread Michael Van Canneyt



On Fri, 5 May 2017, Sven Barth via fpc-devel wrote:


> On 05.05.2017 15:55, "Michael Van Canneyt" wrote:
>>
>> On Fri, 5 May 2017, Mattias Gaertner wrote:
>>
>>> On Fri, 5 May 2017 14:30:32 +0200 (CEST)
>>> Michael Van Canneyt  wrote:
>>>
>>>> [...]
>>>> > AFAIK FPC stores UTF-8 string literals (-Fcutf8) as widestrings
>>>> > instead of UTF8String. Please correct me if I'm wrong.
>>
>> To make sure I was presenting correct facts, I did some tests.
>>
>> As a result of the tests, I think the above statement is wrong.
>
> In all three cases you are either explicitly or implicitly forcing the
> compiler to convert it to Ansi/UTF-8 and since it's a constant it takes a
> compiletime shortcut.

That was on purpose because Mattias' example on the Lazarus list required
this. The point was that PChar() is not usable on string literals.

See also his initial mail, which contains the statement:

"3. PChar on a string literal does not work as expected. You get the
bytes of a widestring instead."

So, I did a typecast (even though I think it is horrible code).

> If you'd do a Writeln without the typecast then it will be a UTF-16
> constant that is stored in the binary *if* the string contains a
> character > $7F.


Well, at least now I understand very well why people find it confusing :-)

I think we'll need a comprehensive table in the documentation.
Can this be produced somehow ?

Michael.


Re: [fpc-devel] UTF-8 string literals

2017-05-05 Thread Sven Barth via fpc-devel
On 05.05.2017 15:55, "Michael Van Canneyt" wrote:
>
>
>
> On Fri, 5 May 2017, Mattias Gaertner wrote:
>
>> On Fri, 5 May 2017 14:30:32 +0200 (CEST)
>> Michael Van Canneyt  wrote:
>>
>>> [...]
>>> > AFAIK FPC stores UTF-8 string literals (-Fcutf8) as widestrings
>>> > instead of UTF8String. Please correct me if I'm wrong.
>
>
> To make sure I was presenting correct facts, I did some tests.
>
> As a result of the tests, I think the above statement is wrong.

In all three cases you are either explicitly or implicitly forcing the
compiler to convert it to Ansi/UTF-8 and since it's a constant it takes a
compiletime shortcut.
If you'd do a Writeln without the typecast then it will be a UTF-16
constant that is stored in the binary *if* the string contains a
character > $7F.
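
The "> $7F" condition can be phrased as a simple predicate over the bytes of a string. The sketch below is only an illustration of that rule; IsPureAscii is a hypothetical helper, not something taken from the compiler sources, and the two test strings are made up.

```pascal
program AsciiCheckSketch;
{$mode objfpc}{$H+}

// Hypothetical helper, not compiler code: returns True when every byte
// of S is <= $7F, i.e. when the text is plain 7-bit ASCII.
function IsPureAscii(const S: RawByteString): Boolean;
var
  i: Integer;
begin
  Result := True;
  for i := 1 to Length(S) do
    if Ord(S[i]) > $7F then
      Exit(False);
end;

var
  NonAscii: RawByteString;
begin
  // Two bytes above $7F, set explicitly so the source stays pure ASCII.
  SetLength(NonAscii, 2);
  NonAscii[1] := Chr($C3);
  NonAscii[2] := Chr($A4);

  WriteLn(IsPureAscii('some string literal'));  // TRUE
  WriteLn(IsPureAscii(NonAscii));               // FALSE
end.
```

Only strings for which such a check fails would fall under the UTF-16 case described above; pure ASCII literals stay 8-bit.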

Regards,
Sven


Re: [fpc-devel] UTF-8 string literals

2017-05-05 Thread Michael Van Canneyt



On Fri, 5 May 2017, Mattias Gaertner wrote:


> On Fri, 5 May 2017 15:55:32 +0200 (CEST)
> Michael Van Canneyt  wrote:
>
>> On Fri, 5 May 2017, Mattias Gaertner wrote:
>>
>> > On Fri, 5 May 2017 14:30:32 +0200 (CEST)
>> > Michael Van Canneyt  wrote:
>> >
>> >> [...]
>> >> > AFAIK FPC stores UTF-8 string literals (-Fcutf8) as widestrings
>> >> > instead of UTF8String. Please correct me if I'm wrong.
>>
>> To make sure I was presenting correct facts, I did some tests.
>>
>> As a result of the tests, I think the above statement is wrong.
>
> Naah, not wrong, just a non precise term "UTF-8 string literal". ;)
>
> ASCII is stored by FPC as 8-bit string. No problem with that.
> The interesting part are the non ASCII strings. Try your code with
> the string examples I gave.


I used non-ASCII. Did you not see the Russian characters?

Michael.


Re: [fpc-devel] UTF-8 string literals

2017-05-05 Thread Mattias Gaertner
On Fri, 5 May 2017 15:55:32 +0200 (CEST)
Michael Van Canneyt  wrote:

> On Fri, 5 May 2017, Mattias Gaertner wrote:
> 
> > On Fri, 5 May 2017 14:30:32 +0200 (CEST)
> > Michael Van Canneyt  wrote:
> >  
> >> [...]  
> >> > AFAIK FPC stores UTF-8 string literals (-Fcutf8) as widestrings
> >> > instead of UTF8String. Please correct me if I'm wrong.  
> 
> To make sure I was presenting correct facts, I did some tests.
> 
> As a result of the tests, I think the above statement is wrong.

Naah, not wrong, just an imprecise term: "UTF-8 string literal". ;)

ASCII is stored by FPC as an 8-bit string. No problem with that.
The interesting part is the non-ASCII strings. Try your code with
the string examples I gave.

Mattias


Re: [fpc-devel] UTF-8 string literals

2017-05-05 Thread Sven Barth via fpc-devel
On 05.05.2017 16:03, "Juha Manninen" wrote:
>
> On Fri, May 5, 2017 at 2:53 PM, Mattias Gaertner
>  wrote:
> > 1. When using a character outside BMP FPC stops with:
> > Error: UTF-8 code greater than 65535 found
> > For example:
> > const Eyes = '';
>
> I copy a related post from Lazarus list by myself and Sven Barth.
> It belongs here:
>
> On Fri, May 5, 2017 at 3:56 PM, Sven Barth via Lazarus
>  wrote:
> > That is mainly due to the compiler not supporting surrogate pairs for the
> > UTF-8 -> UTF-16 conversion. If it would support them, then there wouldn't be
> > a problem anymore...
>
> That is a serious bug. Getting codepoints right is the absolute
> minimum requirement for Unicode support. Surrogate pairs are the
> UTF-16 equivalent of multi-byte codepoints in UTF-8.
>
> Now I understand this was not caused by our UTF-8 run-time switch
> "hack". It is a plain bug in FPC.
> Is there a plan to fix it?

Now it is fixed :D (revision 36116; maybe we should merge that to fixes
once I or someone else tested a big endian target)

Regards,
Sven


Re: [fpc-devel] UTF-8 string literals

2017-05-05 Thread Michael Van Canneyt



On Fri, 5 May 2017, Juha Manninen wrote:


> On Fri, May 5, 2017 at 2:53 PM, Mattias Gaertner
>  wrote:
>> 1. When using a character outside BMP FPC stops with:
>> Error: UTF-8 code greater than 65535 found
>> For example:
>> const Eyes = '';
>
> I copy a related post from Lazarus list by myself and Sven Barth.
> It belongs here:
>
> On Fri, May 5, 2017 at 3:56 PM, Sven Barth via Lazarus
>  wrote:
>> That is mainly due to the compiler not supporting surrogate pairs for the
>> UTF-8 -> UTF-16 conversion. If it would support them, then there wouldn't be
>> a problem anymore...
>
> That is a serious bug. Getting codepoints right is the absolute
> minimum requirement for Unicode support. Surrogate pairs are the
> UTF-16 equivalent of multi-byte codepoints in UTF-8.
>
> Now I understand this was not caused by our UTF-8 run-time switch
> "hack". It is a plain bug in FPC.
> Is there a plan to fix it?


Incomplete UTF-16 support is a bug. Bugs should always be fixed?

Michael.


Re: [fpc-devel] UTF-8 string literals

2017-05-05 Thread Juha Manninen
On Fri, May 5, 2017 at 2:53 PM, Mattias Gaertner
 wrote:
> 1. When using a character outside BMP FPC stops with:
> Error: UTF-8 code greater than 65535 found
> For example:
> const Eyes = '';

I am copying a related post from the Lazarus list by Sven Barth and myself.
It belongs here:

On Fri, May 5, 2017 at 3:56 PM, Sven Barth via Lazarus
 wrote:
> That is mainly due to the compiler not supporting surrogate pairs for the
> UTF-8 -> UTF-16 conversion. If it would support them, then there wouldn't be
> a problem anymore...

That is a serious bug. Getting codepoints right is the absolute
minimum requirement for Unicode support. Surrogate pairs are the
UTF-16 equivalent of multi-byte codepoints in UTF-8.
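
For readers unfamiliar with surrogate pairs, the arithmetic is short. The sketch below shows the standard UTF-16 split of a code point above U+FFFF into two 16-bit code units; it is not the code from the compiler fix, and ToSurrogatePair and the sample code point U+1F440 are assumptions chosen for illustration.

```pascal
program SurrogateSketch;
{$mode objfpc}{$H+}

// Split a code point above U+FFFF into a UTF-16 surrogate pair.
procedure ToSurrogatePair(CodePoint: Cardinal; out Lead, Trail: Word);
var
  Offset: Cardinal;
begin
  Offset := CodePoint - $10000;              // 20 significant bits remain
  Lead   := Word($D800 + (Offset shr 10));   // upper 10 bits
  Trail  := Word($DC00 + (Offset and $3FF)); // lower 10 bits
end;

var
  Lead, Trail: Word;
begin
  // U+1F440 is just an example of a code point outside the BMP.
  ToSurrogatePair($1F440, Lead, Trail);
  WriteLn(HexStr(Lead, 4), ' ', HexStr(Trail, 4));
end.
```

For U+1F440 this prints D83D DC40: one code point, two UTF-16 code units, which is exactly the case the "UTF-8 code greater than 65535" error refused to handle.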

Now I understand this was not caused by our UTF-8 run-time switch
"hack". It is a plain bug in FPC.
Is there a plan to fix it?

Juha


Re: [fpc-devel] UTF-8 string literals

2017-05-05 Thread Michael Van Canneyt



On Fri, 5 May 2017, Mattias Gaertner wrote:


> On Fri, 5 May 2017 14:30:32 +0200 (CEST)
> Michael Van Canneyt  wrote:
>
>> [...]
>> > AFAIK FPC stores UTF-8 string literals (-Fcutf8) as widestrings
>> > instead of UTF8String. Please correct me if I'm wrong.


To make sure I was presenting correct facts, I did some tests.

As a result of the tests, I think the above statement is wrong.

{$codepage utf8}

var
  p : pchar;

begin
  P:=Pchar('some string literal');
end.

Results in the following assembler:

.globl  _$PROGRAM$_Ld1
_$PROGRAM$_Ld1:
.ascii  "some string literal\000"
.Le11:

Not a widestring, as far as I can see?

To be sure, I added some Russian characters:

.Ld1:
.ascii  "some string literal \320\272\320\270\321\202\320\260"
.ascii  "\320\271\321\201\320\272\320\276\320\263\320\276\000"

Again, not widestring ?

home: >cat u.pp
{$codepage utf8}
var
  p : pchar;

begin
  P:=Pchar('some string literal китайского');
end.

So, I tried a resourcestring:


.Ld3$strlab:
.short  65001,1
.long   0
.quad   -1,30
.Ld3:
.ascii  "some more \320\272\320\270\321\202\320\260\320\271\321"
.ascii  "\201\320\272\320\276\320\263\320\276\000"

Again, no widestring, as far as I can see.
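
The same check can be made without reading assembler, by dumping the bytes behind the PChar at run time. This is a minimal sketch that reuses the literal from u.pp above; the program name and the loop are additions for illustration only.

```pascal
{$codepage utf8}
program DumpLiteralSketch;

var
  p: PChar;
  i: Integer;
begin
  // Same typecast as in u.pp above.
  p := PChar('some string literal китайского');

  // Print every byte up to the terminating #0 in hex.
  i := 0;
  while p[i] <> #0 do
  begin
    Write(HexStr(Ord(p[i]), 2), ' ');
    Inc(i);
  end;
  WriteLn;
end.
```

If the literal is stored as UTF-8, as the .ascii lines above suggest, the Cyrillic part should show up as two-byte sequences starting with D0 or D1. A UTF-16 constant would instead end the loop early at the first zero byte.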

Michael.


Re: [fpc-devel] UTF-8 string literals

2017-05-05 Thread Mattias Gaertner
On Fri, 5 May 2017 14:30:32 +0200 (CEST)
Michael Van Canneyt  wrote:

>[...]
> > AFAIK FPC stores UTF-8 string literals (-Fcutf8) as widestrings
> > instead of UTF8String. Please correct me if I'm wrong.
> >
> > This has several side effects:
> >
> > 1. When using a character outside BMP FPC stops with:
> > Error: UTF-8 code greater than 65535 found
> > For example:
> > const Eyes = '';
> >
> > 2. Assigning a UTF-8 literal to an UTF8String requires a
> > widestringmanager.
> > For example non ISO-8859-1 chars are mangled:
> > var u: UTF8String = 'äöüالعَرَبِيَّة';  
> 
> I assume you mean UTF-16 literal ?

Huh? The codepage is utf-8, the string type is utf-8, FPC stores UCS-2,
why do you ask about UTF-16?


> > 3. PChar on a string literal does not work as expected. You get the
> > bytes of a widestring instead.  
> 
> You should weigh the advantages you outline here against the disadvantages of
> no longer knowing how string literals will be encoded.

At the moment string literals are encoded in two different ways
depending on codepage, character values, literal format and probably
some more attributes I don't know. That often confuses users. IMO it
would be less confusing if matching string type and codepage would work
without conversion.

 
> It means e.g. the resource string tables will have entries that are UTF16
> encoded or entries that are UTF8 encoded, depending on the unit they come from.
> This is highly undesirable.

Ehm, the compiled-in resourcestring tables are AnsiString.
AFAIK you need the UTF-8 system codepage to use the full UTF-16
capabilities of the rsj files.

 
> By forcing everything UTF16 we ensure delphi compatibility (yes it does matter)
> and we also ensure a uniform set of string tables.

It will be a glorious day when this is accomplished.
But some people can't wait that long.

Mattias


Re: [fpc-devel] UTF-8 string literals

2017-05-05 Thread Michael Van Canneyt



On Fri, 5 May 2017, Mattias Gaertner wrote:


> Hi,
>
> AFAIK FPC stores UTF-8 string literals (-Fcutf8) as widestrings
> instead of UTF8String. Please correct me if I'm wrong.
>
> This has several side effects:
>
> 1. When using a character outside BMP FPC stops with:
> Error: UTF-8 code greater than 65535 found
> For example:
> const Eyes = '';
>
> 2. Assigning a UTF-8 literal to an UTF8String requires a
> widestringmanager.
> For example non ISO-8859-1 chars are mangled:
> var u: UTF8String = 'äöüالعَرَبِيَّة';


I assume you mean UTF-16 literal ?



> 3. PChar on a string literal does not work as expected. You get the
> bytes of a widestring instead.


You should weigh the advantages you outline here against the disadvantages of
no longer knowing how string literals will be encoded.

It means e.g. the resource string tables will have entries that are UTF16
encoded or entries that are UTF8 encoded, depending on the unit they come from.
This is highly undesirable.


By forcing everything UTF16 we ensure Delphi compatibility (yes it does matter)
and we also ensure a uniform set of string tables.


Michael.