Re: [fpc-pascal] Parse unicode scalar

2023-07-04 Thread Nikolay Nikolov via fpc-pascal
On 7/4/23 09:12, Hairy Pixels via fpc-pascal wrote: On Jul 4, 2023, at 12:38 PM, Nikolay Nikolov via fpc-pascal wrote: For console apps that use the Unicode KVM video unit, I've introduced two functions for determining the display width of a Unicode string in the video unit: function

Re: [fpc-pascal] Parse unicode scalar

2023-07-04 Thread Hairy Pixels via fpc-pascal
> On Jul 4, 2023, at 12:38 PM, Nikolay Nikolov via fpc-pascal > wrote: > > For console apps that use the Unicode KVM video unit, I've introduced two > functions for determining the display width of a Unicode string in the video > unit: > > function

Re: [fpc-pascal] Parse unicode scalar

2023-07-03 Thread Nikolay Nikolov via fpc-pascal
On 7/4/23 08:08, Nikolay Nikolov wrote: On 7/4/23 07:56, Hairy Pixels via fpc-pascal wrote: On Jul 4, 2023, at 11:50 AM, Hairy Pixels wrote: You know you're right, with properly enclosed patterns you can capture everything inside and it works. You won't know if you had unicode in your

Re: [fpc-pascal] Parse unicode scalar

2023-07-03 Thread Nikolay Nikolov via fpc-pascal
On 7/4/23 07:56, Hairy Pixels via fpc-pascal wrote: On Jul 4, 2023, at 11:50 AM, Hairy Pixels wrote: You know you're right, with properly enclosed patterns you can capture everything inside and it works. You won't know if you had unicode in your string or not though but that depends on

Re: [fpc-pascal] Parse unicode scalar

2023-07-03 Thread Hairy Pixels via fpc-pascal
> On Jul 4, 2023, at 11:50 AM, Hairy Pixels wrote: > > You know you're right, with properly enclosed patterns you can capture > everything inside and it works. You won't know if you had unicode in your > string or not though but that depends on what's being parsed and if you care > or not

Re: [fpc-pascal] Parse unicode scalar

2023-07-03 Thread Hairy Pixels via fpc-pascal
> On Jul 4, 2023, at 11:45 AM, Nikolay Nikolov via fpc-pascal > wrote: > > But you just don't need to do this, in order to tokenize Pascal. The > beginning and the end of the string literal is the apostrophe, which is > ASCII. The bear is a sequence of UTF-8 code units (opaque to the
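A minimal sketch of the point made above, not code from the thread: in UTF-8 every byte of a multi-byte sequence has its high bit set, so a byte-wise scan for the ASCII apostrophe cannot produce a false match inside an encoded character. The helper name FindLiteralEnd and the sample line are made up for illustration.

```pascal
program LiteralScan;
{$mode objfpc}{$H+}

// Hypothetical helper: Start is the 1-based index of an opening apostrophe
// in Src. Returns the index of the matching closing apostrophe, or 0 if the
// literal is unterminated. A doubled apostrophe ('') inside the literal is
// treated as an escaped quote, as in Pascal source.
function FindLiteralEnd(const Src: RawByteString; Start: SizeInt): SizeInt;
var
  i: SizeInt;
begin
  Result := 0;
  i := Start + 1;
  while i <= Length(Src) do
  begin
    if Src[i] = '''' then
    begin
      if (i < Length(Src)) and (Src[i + 1] = '''') then
        Inc(i)      // escaped quote: skip the second apostrophe
      else
        Exit(i);    // closing apostrophe found
    end;
    Inc(i);         // any other byte, UTF-8 or not, is opaque to the scan
  end;
end;

var
  Line: RawByteString;
begin
  // #$F0#$9F#$92#$96 is the UTF-8 encoding of U+1F496.
  Line := 'x := ''a' + #$F0#$9F#$92#$96 + 'b'';';
  WriteLn('literal ends at byte ', FindLiteralEnd(Line, 6));
end.
```

The scan never decodes anything; whether the literal contained non-ASCII text is only known if the caller inspects the captured bytes afterwards.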

Re: [fpc-pascal] Parse unicode scalar

2023-07-03 Thread Nikolay Nikolov via fpc-pascal
On 7/4/23 07:45, Nikolay Nikolov wrote: On 7/4/23 07:40, Hairy Pixels via fpc-pascal wrote: On Jul 4, 2023, at 11:28 AM, Nikolay Nikolov via fpc-pascal wrote: For what grammar? What characters are allowed in a token? For example, Free Pascal also has a parser/tokenizer, but since Pascal

Re: [fpc-pascal] Parse unicode scalar

2023-07-03 Thread Nikolay Nikolov via fpc-pascal
On 7/4/23 07:40, Hairy Pixels via fpc-pascal wrote: On Jul 4, 2023, at 11:28 AM, Nikolay Nikolov via fpc-pascal wrote: For what grammar? What characters are allowed in a token? For example, Free Pascal also has a parser/tokenizer, but since Pascal keywords are ASCII only, it doesn't need

Re: [fpc-pascal] Parse unicode scalar

2023-07-03 Thread Hairy Pixels via fpc-pascal
> On Jul 4, 2023, at 11:28 AM, Nikolay Nikolov via fpc-pascal > wrote: > > For what grammar? What characters are allowed in a token? For example, Free > Pascal also has a parser/tokenizer, but since Pascal keywords are ASCII only, > it doesn't need to understand Unicode characters, so it

Re: [fpc-pascal] Parse unicode scalar

2023-07-03 Thread Nikolay Nikolov via fpc-pascal
On 7/4/23 07:17, Hairy Pixels via fpc-pascal wrote: On Jul 4, 2023, at 9:58 AM, Nikolay Nikolov via fpc-pascal wrote: You need to understand all these terms and know exactly what you need to do. E.g. are you dealing with keyboard input, are you dealing with the low level parts of text

Re: [fpc-pascal] Parse unicode scalar

2023-07-03 Thread Hairy Pixels via fpc-pascal
> On Jul 4, 2023, at 9:58 AM, Nikolay Nikolov via fpc-pascal > wrote: > > You need to understand all these terms and know exactly what you need to do. > E.g. are you dealing with keyboard input, are you dealing with the low level > parts of text display, are you searching for something in

Re: [fpc-pascal] Parse unicode scalar

2023-07-03 Thread Nikolay Nikolov via fpc-pascal
On 7/4/23 04:03, Hairy Pixels via fpc-pascal wrote: On Jul 4, 2023, at 1:15 AM, Mattias Gaertner via fpc-pascal wrote:
function ReadUTF8(p: PChar; ByteCount: PtrInt): PtrInt;
// returns the number of codepoints
var
  CodePointLen: longint;
  CodePoint: longword;
begin
  Result:=0;
  while
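The snippet above is cut off in the archive, so the following is not a reconstruction of ReadUTF8. It is a separate, minimal sketch of one way to count code points, assuming the input is already well-formed UTF-8: every code point has exactly one byte that is not a continuation byte (10xxxxxx), so counting those bytes gives the code point count. The name CountCodePoints is hypothetical.

```pascal
program CountDemo;
{$mode objfpc}{$H+}

// Hypothetical helper, assuming well-formed UTF-8: counts code points by
// counting every byte whose top two bits are not 10 (i.e. every byte that
// is not a continuation byte). No validation is performed.
function CountCodePoints(p: PChar; ByteCount: PtrInt): PtrInt;
begin
  Result := 0;
  while ByteCount > 0 do
  begin
    if (Ord(p^) and $C0) <> $80 then
      Inc(Result);
    Inc(p);
    Dec(ByteCount);
  end;
end;

var
  S: RawByteString;
begin
  // 'a', U+1F496 (four bytes in UTF-8), 'b': six bytes in total.
  S := 'a' + #$F0#$9F#$92#$96 + 'b';
  WriteLn(Length(S), ' bytes, ', CountCodePoints(PChar(S), Length(S)), ' code points');
end.
```

If broken or malicious input is possible, this shortcut is not enough and each sequence has to be validated while decoding.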

Re: [fpc-pascal] Parse unicode scalar

2023-07-03 Thread Hairy Pixels via fpc-pascal
> On Jul 4, 2023, at 1:15 AM, Mattias Gaertner via fpc-pascal wrote:
>
> function ReadUTF8(p: PChar; ByteCount: PtrInt): PtrInt;
> // returns the number of codepoints
> var
>   CodePointLen: longint;
>   CodePoint: longword;
> begin
>   Result:=0;
>   while (ByteCount>0) do begin

Re: [fpc-pascal] Parse unicode scalar

2023-07-03 Thread Mattias Gaertner via fpc-pascal
On Mon, 3 Jul 2023 17:18:56 +0700 Hairy Pixels via fpc-pascal wrote: >[...] > > First of all: Is it valid UTF-8 or do you have to check for broken > > or malicious sequences? > > If they give the parser broken files that's their problem they need > to fix? the user has control over the file

Re: [fpc-pascal] Parse unicode scalar

2023-07-03 Thread José Mejuto via fpc-pascal
On 03/07/2023 at 10:27, Hairy Pixels via fpc-pascal wrote: Right now I've just read the file into an AnsiString and am indexing it assuming a fixed character size, which of course breaks if multi-byte characters exist. I also need to know if I come across something like \u1F496 I need to

Re: [fpc-pascal] Parse unicode scalar

2023-07-03 Thread Hairy Pixels via fpc-pascal
> On Jul 3, 2023, at 4:29 PM, Mattias Gaertner via fpc-pascal > wrote: > >> What I'm really trying to do is improve a parser so it can read UTF-8 >> files and decode unicode literals in the grammar. > > First of all: Is it valid UTF-8 or do you have to check for broken or > malicious

Re: [fpc-pascal] Parse unicode scalar

2023-07-03 Thread Jer Haan via fpc-pascal
Hi Ryan, I've created the attached unit, which takes a code point and returns the UTF-8 char as a string. It's based on the Wikipedia article on UTF-8. UTF-8 encodes code points in one to four bytes, depending on the value of the code point. The x characters are replaced by the bits of the code point: This
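The attached unit is not part of this archive page, so what follows is only a sketch of the encoding rule described above, not the unit itself. The function name CodePointToUTF8 is hypothetical, and the choice to return an empty string for surrogates and out-of-range values is an assumption of this sketch.

```pascal
program EncodeDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: returns the UTF-8 bytes for one code point, or an
// empty string for surrogates ($D800..$DFFF) and values above $10FFFF.
function CodePointToUTF8(CodePoint: LongWord): RawByteString;
begin
  Result := '';
  if (CodePoint > $10FFFF) or
     ((CodePoint >= $D800) and (CodePoint <= $DFFF)) then
    Exit;
  if CodePoint <= $7F then
  begin
    SetLength(Result, 1);
    Result[1] := Chr(Byte(CodePoint));                          // 0xxxxxxx
  end
  else if CodePoint <= $7FF then
  begin
    SetLength(Result, 2);
    Result[1] := Chr(Byte($C0 or (CodePoint shr 6)));           // 110xxxxx
    Result[2] := Chr(Byte($80 or (CodePoint and $3F)));         // 10xxxxxx
  end
  else if CodePoint <= $FFFF then
  begin
    SetLength(Result, 3);
    Result[1] := Chr(Byte($E0 or (CodePoint shr 12)));          // 1110xxxx
    Result[2] := Chr(Byte($80 or ((CodePoint shr 6) and $3F))); // 10xxxxxx
    Result[3] := Chr(Byte($80 or (CodePoint and $3F)));         // 10xxxxxx
  end
  else
  begin
    SetLength(Result, 4);
    Result[1] := Chr(Byte($F0 or (CodePoint shr 18)));           // 11110xxx
    Result[2] := Chr(Byte($80 or ((CodePoint shr 12) and $3F))); // 10xxxxxx
    Result[3] := Chr(Byte($80 or ((CodePoint shr 6) and $3F)));  // 10xxxxxx
    Result[4] := Chr(Byte($80 or (CodePoint and $3F)));          // 10xxxxxx
  end;
end;

var
  S: RawByteString;
  i: Integer;
begin
  S := CodePointToUTF8($1F496);
  for i := 1 to Length(S) do
    Write(HexStr(Ord(S[i]), 2), ' ');
  WriteLn;
end.
```

For U+1F496, the code point used as the example in this thread, the UTF-8 encoding is the four bytes F0 9F 92 96.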

Re: [fpc-pascal] Parse unicode scalar

2023-07-03 Thread Mattias Gaertner via fpc-pascal
On Mon, 3 Jul 2023 15:27:10 +0700 Hairy Pixels via fpc-pascal wrote: >[...] > I was just curious how ChatGPTs implementation compared to other > programmer. Apparently the quality is often terrible. But it can be useful. > What I'm really trying to do is improve a parser so it can read UTF-8

Re: [fpc-pascal] Parse unicode scalar

2023-07-03 Thread Hairy Pixels via fpc-pascal
> On Jul 3, 2023, at 3:05 PM, Mattias Gaertner via fpc-pascal > wrote: > > I wonder, is this thread about testing ChatGPT or do you want to > implement something useful? > There are already plenty of optimized UTF-8 functions in the FPC and > Lazarus sources. Maybe too many, and you have

Re: [fpc-pascal] Parse unicode scalar

2023-07-03 Thread Mattias Gaertner via fpc-pascal
On Mon, 3 Jul 2023 12:01:11 +0700 Hairy Pixels via fpc-pascal wrote: > > On Jul 3, 2023, at 11:36 AM, Mattias Gaertner via fpc-pascal > > wrote: > > > > Useless array of. > > And it does not return the bytecount. > > it's an open array so what's the problem? >[...] > > Wrong for byteCount=1

Re: [fpc-pascal] Parse unicode scalar

2023-07-03 Thread Mattias Gaertner via fpc-pascal
On Mon, 3 Jul 2023 14:12:03 +0700 Hairy Pixels via fpc-pascal wrote: > > On Jul 3, 2023, at 2:04 PM, Tomas Hajny via fpc-pascal > > wrote: > > > > No - in this case, the "header" is the highest bit of that byte > > being 0. > > Oh it's the header BIT. Admittedly I don't understand how this

Re: [fpc-pascal] Parse unicode scalar

2023-07-03 Thread Tomas Hajny via fpc-pascal
On 3 July 2023 9:12:03 +0200, Hairy Pixels via fpc-pascal wrote: >> On Jul 3, 2023, at 2:04 PM, Tomas Hajny via fpc-pascal >> wrote: >> >> No - in this case, the "header" is the highest bit of that byte being 0. > >Oh it's the header BIT. Admittedly I don't understand how this function

Re: [fpc-pascal] Parse unicode scalar

2023-07-03 Thread Hairy Pixels via fpc-pascal
> On Jul 3, 2023, at 2:04 PM, Tomas Hajny via fpc-pascal > wrote: > > No - in this case, the "header" is the highest bit of that byte being 0. Oh it's the header BIT. Admittedly I don't understand how this function returns the highest bit using that case, which I think he was suggesting.

Re: [fpc-pascal] Parse unicode scalar

2023-07-03 Thread Tomas Hajny via fpc-pascal
On 3 July 2023 8:42:05 +0200, Hairy Pixels via fpc-pascal wrote: >> On Jul 3, 2023, at 12:04 PM, Mattias Gaertner via fpc-pascal >> wrote: >> >> No, the header of a codepoint to figure out the length. > >so the smallest character UTF-8 can represent is 2 bytes? 1 for the header and >1 for

Re: [fpc-pascal] Parse unicode scalar

2023-07-03 Thread Hairy Pixels via fpc-pascal
> On Jul 3, 2023, at 12:04 PM, Mattias Gaertner via fpc-pascal > wrote: > > No, the header of a codepoint to figure out the length. so the smallest character UTF-8 can represent is 2 bytes? 1 for the header and 1 for the character? ASCII #100 is the same character in UTF-8 but it needs a

Re: [fpc-pascal] Parse unicode scalar

2023-07-02 Thread Mattias Gaertner via fpc-pascal
On Mon, 3 Jul 2023 11:58:33 +0700 Hairy Pixels via fpc-pascal wrote: > > On Jul 3, 2023, at 11:43 AM, Mattias Gaertner via fpc-pascal > > wrote: > > > > There is a header byte. > > > > It depends, if you want to check for invalid UTF-8 sequences. > > > > From LazUTF8: > > > > function

Re: [fpc-pascal] Parse unicode scalar

2023-07-02 Thread Hairy Pixels via fpc-pascal
> On Jul 3, 2023, at 11:36 AM, Mattias Gaertner via fpc-pascal wrote:
>
> Useless array of.
> And it does not return the bytecount.

it's an open array so what's the problem?

>> var
>>   i: Integer;
>>   byteCount: Integer;
>> begin
>>   // Number of bytes required to represent the

Re: [fpc-pascal] Parse unicode scalar

2023-07-02 Thread Hairy Pixels via fpc-pascal
> On Jul 3, 2023, at 11:43 AM, Mattias Gaertner via fpc-pascal wrote:
>
> There is a header byte.
>
> It depends, if you want to check for invalid UTF-8 sequences.
>
> From LazUTF8:
>
> function UTF8CodepointSizeFast(p: PChar): integer;
> begin
>   case p^ of
>     #0..#191 : Result :=
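The LazUTF8 quote above is truncated in the archive and is left as it is. As a separate illustration of the "header" idea discussed in this thread, here is a minimal sketch that classifies a first byte by its leading bits. It is not the LazUTF8 function: the name LeadByteLength is hypothetical, and returning 0 for bytes that cannot start a sequence is a choice made for this sketch.

```pascal
program LeadByteDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: returns the length in bytes of the UTF-8 sequence
// that starts with byte b, judged only by its leading bits. Returns 0 when
// b is a continuation byte (10xxxxxx) or cannot start a sequence at all.
// No further validation (overlong forms, surrogates, range) is attempted.
function LeadByteLength(b: Byte): Integer;
begin
  if (b and $80) = 0 then
    Result := 1                 // 0xxxxxxx: ASCII, the whole code point
  else if (b and $E0) = $C0 then
    Result := 2                 // 110xxxxx
  else if (b and $F0) = $E0 then
    Result := 3                 // 1110xxxx
  else if (b and $F8) = $F0 then
    Result := 4                 // 11110xxx
  else
    Result := 0;                // 10xxxxxx or 11111xxx
end;

begin
  WriteLn(LeadByteLength(100));   // ASCII #100: high bit is 0
  WriteLn(LeadByteLength($F0));   // first byte of the encoding of U+1F496
  WriteLn(LeadByteLength($96));   // a continuation byte, not a start byte
end.
```

This also answers the question raised elsewhere in the thread about ASCII: a byte below 128 carries its "header" in the top bit being 0, so it needs no extra byte.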

Re: [fpc-pascal] Parse unicode scalar

2023-07-02 Thread Mattias Gaertner via fpc-pascal
On Mon, 3 Jul 2023 08:29:11 +0700 Hairy Pixels via fpc-pascal wrote: > > On Jul 2, 2023, at 11:16 PM, Jer Haan wrote: > > > > This table is copied from Wikipedia. Hope it’s useful > > for you. If you improve the code pls let me know. > > This is perfect, thanks! Much more complicated than I

Re: [fpc-pascal] Parse unicode scalar

2023-07-02 Thread Mattias Gaertner via fpc-pascal
On Mon, 3 Jul 2023 09:34:10 +0700 Hairy Pixels via fpc-pascal wrote: >[...] > Ok today I just tried to ask ChatGPT and got an answer. I must have > asked the wrong thing yesterday but it got it right today (with one > syntax error using an inline "var" in the code section for some > reason).

Re: [fpc-pascal] Parse unicode scalar

2023-07-02 Thread Hairy Pixels via fpc-pascal
> On Jul 3, 2023, at 12:20 AM, Nikolay Nikolov via fpc-pascal > wrote: > > There's no such thing as "unicode scalar" in Unicode terminology: > > https://unicode.org/glossary/ I got it from here

Re: [fpc-pascal] Parse unicode scalar

2023-07-02 Thread Hairy Pixels via fpc-pascal
> On Jul 2, 2023, at 11:16 PM, Jer Haan wrote: > > This table is copied from Wikipedia. Hope it’s useful for you. > If you improve the code pls let me know. > This is perfect, thanks! Much more complicated than I thought. I'm curious now, if you were going the other direction and parsing a

Re: [fpc-pascal] Parse unicode scalar

2023-07-02 Thread Nikolay Nikolov via fpc-pascal
On 7/2/23 20:38, Martin Frb via fpc-pascal wrote: On 02/07/2023 19:20, Nikolay Nikolov via fpc-pascal wrote: On 7/2/23 16:30, Hairy Pixels via fpc-pascal wrote: I'm interested in parsing unicode scalars (I think they're called) to byte sized values but I'm not sure where to start. First thing

Re: [fpc-pascal] Parse unicode scalar

2023-07-02 Thread Martin Frb via fpc-pascal
On 02/07/2023 19:20, Nikolay Nikolov via fpc-pascal wrote: On 7/2/23 16:30, Hairy Pixels via fpc-pascal wrote: I'm interested in parsing unicode scalars (I think they're called) to byte sized values but I'm not sure where to start. First thing I did was choose the unicode scalar U+1F496 (💖).

Re: [fpc-pascal] Parse unicode scalar

2023-07-02 Thread Nikolay Nikolov via fpc-pascal
On 7/2/23 16:30, Hairy Pixels via fpc-pascal wrote: I'm interested in parsing unicode scalars (I think they're called) to byte sized values but I'm not sure where to start. First thing I did was choose the unicode scalar U+1F496 (💖). There's no such thing as "unicode scalar" in Unicode