Re: [fpc-pascal] Unicode chars losing information

2021-03-09 Thread Michael Van Canneyt via fpc-pascal
On Tue, 9 Mar 2021, Florian Klämpfl via fpc-pascal wrote: By using the necessary IFDEF mechanism in the config file, we can avoid inserting it for Windows (which does not need it) or the smaller embedded platforms (which cannot handle it). People who don't need/want this can remove the

Re: [fpc-pascal] Unicode chars losing information

2021-03-09 Thread Florian Klämpfl via fpc-pascal
> On 09.03.2021 at 10:06, Michael Van Canneyt via fpc-pascal wrote: >> On Tue, 9 Mar 2021, Graeme Geldenhuys via fpc-pascal wrote: >>> On 09/03/2021 1:44 am, Tomas Hajny via fpc-pascal wrote: >>> UnicodeString may be used in a program simply because the included unit has

Re: [fpc-pascal] Unicode chars losing information

2021-03-09 Thread Tomas Hajny via fpc-pascal
On 2021-03-09 09:46, Graeme Geldenhuys via fpc-pascal wrote: On 09/03/2021 1:44 am, Tomas Hajny via fpc-pascal wrote: UnicodeString may be used in a program simply because the included unit has it used in its interface. That may be the case even if there's no use of characters outside of US

Re: [fpc-pascal] Unicode chars losing information

2021-03-09 Thread Michael Van Canneyt via fpc-pascal
On Tue, 9 Mar 2021, Mattias Gaertner via fpc-pascal wrote: On Tue, 9 Mar 2021 08:04:54 +0100 Sven Barth via fpc-pascal wrote: [...] FPC is not Java. In FPC you have more fine-grained control over the resulting binary than "install big, fat runtime". Not to mention that FPC can target

Re: [fpc-pascal] Unicode chars losing information

2021-03-09 Thread Michael Van Canneyt via fpc-pascal
On Tue, 9 Mar 2021, Graeme Geldenhuys via fpc-pascal wrote: On 09/03/2021 1:44 am, Tomas Hajny via fpc-pascal wrote: UnicodeString may be used in a program simply because the included unit has it used in its interface. That may be the case even if there's no use of characters outside of US

Re: [fpc-pascal] Unicode chars losing information

2021-03-09 Thread Mattias Gaertner via fpc-pascal
On Tue, 9 Mar 2021 08:04:54 +0100 Sven Barth via fpc-pascal wrote: >[...] > FPC is not Java. In FPC you have more fine-grained control over the > resulting binary than "install big, fat runtime". Not to mention that > FPC can target resource constrained systems as well. Optional is good. Maybe

Re: [fpc-pascal] Unicode chars losing information

2021-03-09 Thread Graeme Geldenhuys via fpc-pascal
On 09/03/2021 1:44 am, Tomas Hajny via fpc-pascal wrote: > UnicodeString may be used in a program simply because the included unit > has it used in its interface. That may be the case even if there's no > use of characters outside of US ASCII at all. So FPC rather goes with the fact that data

Re: [fpc-pascal] Unicode chars losing information

2021-03-08 Thread Michael Van Canneyt via fpc-pascal
On Tue, 9 Mar 2021, Graeme Geldenhuys via fpc-pascal wrote: On 08/03/2021 2:49 pm, Michael Van Canneyt via fpc-pascal wrote: In that sense, unicode conversion support is something optional and so we require you to enable it explicitly, since enabling it has some drawbacks: Surely if you

Re: [fpc-pascal] Unicode chars losing information

2021-03-08 Thread Sven Barth via fpc-pascal
Graeme Geldenhuys via fpc-pascal wrote on Tue, 9 March 2021, 00:56: > > On 07/03/2021 5:48 pm, Nikolay Nikolov via fpc-pascal wrote: > > It depends on what you mean by "just working". > > No, "just worked" is exactly what it says on the tin. It is FPC that is > overcomplicating matters. > > > As

Re: [fpc-pascal] Unicode chars losing information

2021-03-08 Thread Nikolay Nikolov via fpc-pascal
On 3/9/21 2:18 AM, Graeme Geldenhuys via fpc-pascal wrote: On 08/03/2021 7:49 pm, Jonas Maebe via fpc-pascal wrote: It's not possible to safely use unicodestring without knowing how 16bit unicode works. The compiler can't solve that. I disagree. Java does just that! The issue is the

Re: [fpc-pascal] Unicode chars losing information

2021-03-08 Thread Martin Frb via fpc-pascal
On 08/03/2021 23:26, Tomas Hajny via fpc-pascal wrote: On 2021-03-08 21:36, Martin Frb via fpc-pascal wrote: I can think of 2 groups already. 1) Conversion due to explicit declared different encoding.    AnAnsiString := SomeWideString;   AnAsciiString := AnUtf8String; // declared as "type

Re: [fpc-pascal] Unicode chars losing information

2021-03-08 Thread Graeme Geldenhuys via fpc-pascal
On 08/03/2021 7:49 pm, Jonas Maebe via fpc-pascal wrote: > It's not possible to safely use unicodestring without > knowing how 16bit unicode works. The compiler can't solve that. I disagree. Java does just that! The issue is the assumption of using array indexing into a string. I guess

Re: [fpc-pascal] Unicode chars losing information

2021-03-08 Thread Graeme Geldenhuys via fpc-pascal
On 08/03/2021 2:49 pm, Michael Van Canneyt via fpc-pascal wrote: > In that sense, unicode conversion support is something optional and so we > require you to enable it explicitly, since enabling it has some drawbacks: Surely if you explicitly use the UnicodeString type, the compiler should know

Re: [fpc-pascal] Unicode chars losing information

2021-03-08 Thread Graeme Geldenhuys via fpc-pascal
On 07/03/2021 5:48 pm, Nikolay Nikolov via fpc-pascal wrote: > It depends on what you mean by "just working". No, "just worked" is exactly what it says on the tin. It is FPC that is overcomplicating matters. As an example, here is Java that also uses UTF-16 encoding, just like FPC's UnicodeString

Re: [fpc-pascal] Unicode chars losing information

2021-03-08 Thread Tomas Hajny via fpc-pascal
On 2021-03-08 21:36, Martin Frb via fpc-pascal wrote: . . In the example the index access should have returned a single codeunit, which was known to be a complete codepoint. As far as I understand the unexpected part was, that the unicode string did not contain the content of the string
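As a side note on the code unit versus code point distinction raised here: the following is only a rough sketch, not something posted in the thread. It assumes UnicodeString holds UTF-16 code units (as stated elsewhere in this discussion), and the function name CodePointCount is made up for illustration. The test string is built from numeric code unit values so that the result does not depend on the source file encoding or on a widestring manager.

```pascal
{$mode objfpc}
program countcp;

// Hypothetical helper (not part of the RTL): counts code points in a
// UTF-16 string by treating a high surrogate that is immediately
// followed by a low surrogate as a single code point.
function CodePointCount(const s: UnicodeString): SizeInt;
var
  i: SizeInt;
begin
  Result := 0;
  i := 1;
  while i <= Length(s) do
  begin
    // Relies on the default short-circuit boolean evaluation, so
    // s[i + 1] is only read when i < Length(s).
    if (Ord(s[i]) >= $D800) and (Ord(s[i]) <= $DBFF) and
       (i < Length(s)) and
       (Ord(s[i + 1]) >= $DC00) and (Ord(s[i + 1]) <= $DFFF) then
      Inc(i, 2)   // surrogate pair: two code units, one code point
    else
      Inc(i);     // BMP character (or a lone surrogate): one code unit
    Inc(Result);
  end;
end;

var
  s: UnicodeString;
begin
  SetLength(s, 3);
  s[1] := WideChar($2318);  // U+2318, inside the BMP: one code unit
  s[2] := WideChar($D83D);  // high surrogate of U+1F600
  s[3] := WideChar($DE00);  // low surrogate of U+1F600
  writeln('code units:  ', Length(s));
  writeln('code points: ', CodePointCount(s));
end.
```

With this input, Length reports 3 code units while the helper reports 2 code points. For a string whose characters are all in the BMP, both numbers are the same, and s[i] is then a complete code point.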

Re: [fpc-pascal] Unicode chars losing information

2021-03-08 Thread Michael Van Canneyt via fpc-pascal
On Mon, 8 Mar 2021, Martin Frb via fpc-pascal wrote: Obviously knowing the presence/absence of a widestring manager allows to refine warnings. It does not. The compiler has no way to know if the widestring manager actually does a complete or even a good job. Maybe it just does logging and

Re: [fpc-pascal] Unicode chars losing information

2021-03-08 Thread Martin Frb via fpc-pascal
On 08/03/2021 20:49, Jonas Maebe via fpc-pascal wrote: On 08/03/2021 19:16, Ryan Joseph via fpc-pascal wrote: I agree it would be nice to have some warning that indexing the unicodeString wouldn't work as expected. Then the compiler would have to give a warning for any indexing of

Re: [fpc-pascal] Unicode chars losing information

2021-03-08 Thread Jonas Maebe via fpc-pascal
On 08/03/2021 19:16, Ryan Joseph via fpc-pascal wrote: > I agree it would be nice to have some warning that indexing the unicodeString > wouldn't work as expected. Then the compiler would have to give a warning for any indexing of unicodestring. That would render it useless, because everyone

Re: [fpc-pascal] Unicode chars losing information

2021-03-08 Thread Ryan Joseph via fpc-pascal
So I was indeed able to solve the problem using {$codepage utf8} and using the CWString unit. Does this do anything besides change the backend of the UnicodeString/UnicodeChar type? I'm using other string types in that unit and I'm curious if I've put some kind of performance burden on the other

Re: [fpc-pascal] Unicode chars losing information

2021-03-08 Thread Michael Van Canneyt via fpc-pascal
On Mon, 8 Mar 2021, Tomas Hajny via fpc-pascal wrote: On 2021-03-08 15:49, Michael Van Canneyt via fpc-pascal wrote: On Mon, 8 Mar 2021, Adriaan van Os via fpc-pascal wrote: Michael Van Canneyt via fpc-pascal wrote: You didn't configure your environment to deal correctly with Unicode.

Re: [fpc-pascal] Unicode chars losing information

2021-03-08 Thread Michael Van Canneyt via fpc-pascal
On Mon, 8 Mar 2021, Adriaan van Os via fpc-pascal wrote: Michael Van Canneyt wrote: The output for me is the same, regardless of the -FcUTF-8 flag being present or not: question marks. But if I add uses cwstring; all will be well. Rationale: Without that, the RTL cannot convert
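For readers skimming the thread, here is a minimal sketch of the combination discussed in this message: the {$codepage utf8} directive from the original test program plus the cwstring unit. The {$ifdef unix} guard and the program name are my own assumptions and not taken from the thread; the guard is there only because cwstring is a Unix-side unit. The file itself has to be saved as UTF-8 for the literal to be read correctly.

```pascal
{$mode objfpc}
{$codepage utf8}   // the source file, including the literal below, is UTF-8
program unidemo;

{$ifdef unix}
uses
  cwstring;        // installs a widestring manager on Unix-like targets
{$endif}

var
  chars: UnicodeString;
begin
  chars := '⌘⌥⌫⇧^';
  writeln(chars);                       // whole string
  writeln(chars[1]);                    // first UTF-16 code unit
  writeln('length ', Length(chars));    // number of UTF-16 code units
end.
```

Without the widestring manager, the thread reports question marks in the output on the affected systems; what actually appears also depends on the terminal encoding.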

Re: [fpc-pascal] Unicode chars losing information

2021-03-08 Thread Adriaan van Os via fpc-pascal
Michael Van Canneyt wrote: The output for me is the same, regardless of the -FcUTF-8 flag being present or not: question marks. But if I add uses cwstring; all will be well. Rationale: Without that, the RTL cannot convert whatever the compiler wrote in the binary to UTF8 to display it on

Re: [fpc-pascal] Unicode chars losing information

2021-03-08 Thread Tomas Hajny via fpc-pascal
On 2021-03-08 11:59, Adriaan van Os via fpc-pascal wrote: Hi, adriaan% cat uniquizz-utf8.pas {$codepage utf8} program uniquizz; var chars: UnicodeString; begin chars := '⌘ key'; writeln(chars); writeln(chars[1]); writeln( 'size ', sizeOf( chars)); writeln( 'length ', length(

Re: [fpc-pascal] Unicode chars losing information

2021-03-08 Thread Michael Van Canneyt via fpc-pascal
On Mon, 8 Mar 2021, Adriaan van Os via fpc-pascal wrote: adriaan% cat uniquizz-utf8.pas {$codepage utf8} program uniquizz; var chars: UnicodeString; begin chars := '⌘ key'; writeln(chars); writeln(chars[1]); writeln( 'size ', sizeOf( chars)); writeln( 'length ', length( chars));

Re: [fpc-pascal] Unicode chars losing information

2021-03-08 Thread Adriaan van Os via fpc-pascal
adriaan% cat uniquizz-utf8.pas {$codepage utf8} program uniquizz; var chars: UnicodeString; begin chars := '⌘ key'; writeln(chars); writeln(chars[1]); writeln( 'size ', sizeOf( chars)); writeln( 'length ', length( chars)); end. adriaan% fpc uniquizz-utf8.pas -FcUTF-8 Free Pascal

Re: [fpc-pascal] Unicode chars losing information

2021-03-07 Thread Marco van de Voort via fpc-pascal
On 2021-03-07 at 22:26, Bart via fpc-pascal wrote: On Sun, Mar 7, 2021 at 5:31 PM Marco van de Voort via fpc-pascal wrote: Probably it is not in the BMP and thus needs more position than one. Length(Char) is 5 according to fpc, I see 5 "graphemes" Indeed: .Ld1$strlab: .short

Re: [fpc-pascal] Unicode chars losing information

2021-03-07 Thread Bart via fpc-pascal
On Sun, Mar 7, 2021 at 5:31 PM Marco van de Voort via fpc-pascal wrote: > Probably it is not in the BMP and thus needs more position than one. Length(Char) is 5 according to fpc, I see 5 "graphemes", which suggests that all of them fit into 1 WideChar? -- Bart

Re: [fpc-pascal] Unicode chars losing information

2021-03-07 Thread Nikolay Nikolov via fpc-pascal
On 3/7/21 7:21 PM, Ryan Joseph via fpc-pascal wrote: On Mar 7, 2021, at 10:11 AM, Marco van de Voort via fpc-pascal wrote: Yes it is. And there are about 1114000 unicode codepoints, or about 17 times what fits in a 2-byte wide char. https://en.wikipedia.org/wiki/Code_point
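To put numbers on the "about 17 times" remark: Unicode has 17 planes of 65536 code points each, and UTF-16 stores anything above U+FFFF as a pair of 16-bit surrogates. The sketch below just works through that arithmetic. The procedure name ShowUtf16 is made up for illustration, and U+1F600 is an arbitrary example of a code point outside the BMP; neither comes from the thread.

```pascal
{$mode objfpc}
program surrogates;

const
  PlaneCount = 17;      // planes 0..16
  PlaneSize  = 65536;   // code points per plane

// Prints the UTF-16 code unit(s) for one code point, in hexadecimal.
procedure ShowUtf16(cp: LongInt);
var
  v: LongInt;
begin
  if cp < $10000 then
    // BMP: the code point is stored as a single 16-bit code unit
    writeln('U+', HexStr(cp, 4), ' -> one code unit: ', HexStr(cp, 4))
  else
  begin
    // Above the BMP: subtract $10000 and split the remaining 20 bits
    // into a high surrogate (top 10 bits) and a low surrogate (low 10 bits)
    v := cp - $10000;
    writeln('U+', HexStr(cp, 5), ' -> surrogate pair: ',
            HexStr($D800 + (v shr 10), 4), ' ',
            HexStr($DC00 + (v and $3FF), 4));
  end;
end;

begin
  writeln('total code points: ', PlaneCount * PlaneSize);
  ShowUtf16($2318);    // the command key symbol from the original post
  ShowUtf16($1F600);   // example code point outside the BMP
end.
```

The total is 1114112, which matches the rounded figure quoted above. The characters in the original test string are all below U+10000, so each one fits in a single WideChar.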

Re: [fpc-pascal] Unicode chars losing information

2021-03-07 Thread Ryan Joseph via fpc-pascal
> On Mar 7, 2021, at 10:21 AM, Ryan Joseph wrote: > > I thought unicode strings "just worked" but maybe that's UTF-8 and the > character I want is maybe UTF-16. What are you supposed to do then? > UnicodeString knows how to print the full string so all the data is there but > I can't index

Re: [fpc-pascal] Unicode chars losing information

2021-03-07 Thread Ryan Joseph via fpc-pascal
> On Mar 7, 2021, at 10:11 AM, Marco van de Voort via fpc-pascal > wrote: > > > Yes it is. And there are about 1114000 unicode codepoints, or about 17 times > what fits in a 2-byte wide char. > > https://en.wikipedia.org/wiki/Code_point > > https://en.wikipedia.org/wiki/UTF-16 I thought

Re: [fpc-pascal] Unicode chars losing information

2021-03-07 Thread Marco van de Voort via fpc-pascal
On 2021-03-07 at 17:38, Ryan Joseph via fpc-pascal wrote: On Mar 7, 2021, at 9:31 AM, Marco van de Voort via fpc-pascal wrote: Probably it is not in the BMP and thus needs more position than one. Isn't char[1] a 2 byte wide char? Not sure I understand "more position than one" though.

Re: [fpc-pascal] Unicode chars losing information

2021-03-07 Thread Ryan Joseph via fpc-pascal
> On Mar 7, 2021, at 9:31 AM, Marco van de Voort via fpc-pascal > wrote: > > Probably it is not in the BMP and thus needs more position than one. Isn't char[1] a 2 byte wide char? Not sure I understand "more position than one" though. Regards, Ryan Joseph

Re: [fpc-pascal] Unicode chars losing information

2021-03-07 Thread Marco van de Voort via fpc-pascal
On 2021-03-07 at 17:21, Ryan Joseph via fpc-pascal wrote: I came across a bug which was caused by a unicode character losing information and narrowed it down to this. Why doesn't the chars[1] print the same character as appeared in the string? var chars: UnicodeString; begin chars :=

[fpc-pascal] Unicode chars losing information

2021-03-07 Thread Ryan Joseph via fpc-pascal
I came across a bug which was caused by a unicode character losing information and narrowed it down to this. Why doesn't the chars[1] print the same character as appeared in the string? var chars: UnicodeString; begin chars := '⌘⌥⌫⇧^'; writeln(chars); writeln(chars[1]); end. Prints: