Re: readchars, seek back, and readchars again

2020-04-28 Thread Samantha McVey
On Saturday 25 April 2020 21:51:41 CEST Joseph Brenner wrote: > > Yary has an issue posted regarding 'display-width' of UTF-16 encoded strings: > > https://github.com/rakudo/rakudo/issues/3461 > > > > I know it might be far-fetched, but what if your UTF-8 issue and > > Yary's UTF-16 issue

Re: readchars, seek back, and readchars again

2020-04-28 Thread Samantha McVey
On Monday 27 April 2020 09:49:20 CEST Joseph Brenner wrote: > After you do a .readchars, what point in the file would you expect to > be "current"? I would expect it would be the point right after the > last char read. Instead that's true if you're reading ascii > characters but not unicode

Re: readchars, seek back, and readchars again

2020-04-27 Thread Joseph Brenner
After you do a .readchars, what point in the file would you expect to be "current"? I would expect it would be the point right after the last char read. Instead that's true if you're reading ascii characters but not unicode characters up above the ascii range, in which case the "current" point

Re: readchars, seek back, and readchars again

2020-04-26 Thread Joseph Brenner
I decided to open an issue for this one. Even if there's no practical fix for the behavior of readchars, I'd think this odd meaning of the "current" point in the file would need to be better documented: https://github.com/rakudo/rakudo/issues/3646 I simplified the test I've been using: use

Re: readchars, seek back, and readchars again

2020-04-25 Thread Joseph Brenner
> Yary has an issue posted regarding 'display-width' of UTF-16 encoded strings: > https://github.com/rakudo/rakudo/issues/3461 > I know it might be far-fetched, but what if your UTF-8 issue and Yary's UTF-16 issue were related Well, an issue with handling combining characters could easily

Re: readchars, seek back, and readchars again

2020-04-24 Thread William Michels via perl6-users
Hi Joe, I was able to run the code you posted and reproduced the exact same result (Rakudo version 2020.02.1..1 built on MoarVM version 2020.02.1 implementing Raku 6.d). I tried playing with file encodings a bit (e.g. UTF8-C8), but I didn't see any improvement. Yary has an issue posted

Re: readchars, seek back, and readchars again

2020-04-24 Thread Joseph Brenner
I was just posting that. On 4/24/20, Elizabeth Mattijsen wrote: > > >> On 24 Apr 2020, at 22:03, Joseph Brenner wrote: >> >> Thanks, yes I understand unicode and utf-8 reasonably well. >> >>> So Rakudo has to read the next codepoint to make sure that it isn't a >>> combining codepoint. >> >>>

Re: readchars, seek back, and readchars again

2020-04-24 Thread Joseph Brenner
Another version of my test code, checking .tell throughout: use v6; use Test; my $tmpdir = IO::Spec::Unix.tmpdir; my $file = "$tmpdir/scratch_file.txt"; my $unichar_str = "\x[1200]\x[2D80]\x[4DFC]\x[AAAA]\x[2CA4]\x[2C8E]"; # ሀⶀ䷼ꪪⲤⲎ my $ascii_str = "ABCDEFGHI";

Re: readchars, seek back, and readchars again

2020-04-24 Thread Elizabeth Mattijsen
> On 24 Apr 2020, at 22:03, Joseph Brenner wrote: > > Thanks, yes I understand unicode and utf-8 reasonably well. > >> So Rakudo has to read the next codepoint to make sure that it isn't a >> combining codepoint. > >> It is probably faking up the reads to look right when reading ASCII, but
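The lookahead described in the quoted text ("read the next codepoint to make sure that it isn't a combining codepoint") can be illustrated with a small sketch. This is not Rakudo's or MoarVM's implementation; it is a simplified Python model that treats a character as a base codepoint plus any following combining marks, and the function name `read_chars` is hypothetical. It shows why a reader that delivers N characters may have consumed one codepoint more than it returned.

```python
import unicodedata

def read_chars(codepoints, n):
    """Return (chars, consumed).

    chars    -- up to n user-visible characters (base + combining marks)
    consumed -- how many codepoints had to be examined to produce them

    Simplified model: real grapheme segmentation has more rules than
    "base codepoint followed by combining marks".
    """
    chars = []
    consumed = 0
    current = None  # character being assembled
    for cp in codepoints:
        consumed += 1
        if current is None:
            current = cp
        elif unicodedata.combining(cp):
            # Combining mark: it belongs to the character in progress.
            current += cp
        else:
            # A new base codepoint proves the previous character is complete.
            # This codepoint is the lookahead: examined, but not yet delivered.
            chars.append(current)
            current = cp
            if len(chars) == n:
                return chars, consumed
    # Input ended: whatever was being assembled is complete.
    if current is not None and len(chars) < n:
        chars.append(current)
    return chars, consumed

chars, consumed = read_chars("e\u0301abc", 1)
print(chars, consumed)
```

In this model, asking for one character from "e" + U+0301 + "abc" examines three codepoints but delivers only two of them, so a position tracked by "codepoints examined" sits past the last character actually returned.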

Re: readchars, seek back, and readchars again

2020-04-24 Thread Joseph Brenner
Thanks, yes I understand unicode and utf-8 reasonably well. > So Rakudo has to read the next codepoint to make sure that it isn't a > combining codepoint. > It is probably faking up the reads to look right when reading ASCII, but > failing to do that for wider codepoints. I think it'd be the

Re: readchars, seek back, and readchars again

2020-04-24 Thread Brad Gilbert
In UTF8 characters can be 1 to 4 bytes long. UTF8 was designed so that 7-bit ASCII is a subset of it. Any 8bit byte that has its most significant bit set cannot be ASCII. So multi-byte codepoints have the most significant bit set for all of the bytes. The first byte can tell you the number of
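As a rough illustration of the point about the first byte, here is a minimal Python sketch that derives the length of a UTF-8 sequence from its leading byte. The function name `utf8_sequence_length` is made up for this example, and the check looks only at the leading bit pattern; it does not validate the sequence (for instance, it does not reject overlong or out-of-range lead bytes).

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Number of bytes in the UTF-8 sequence that starts with first_byte."""
    if first_byte < 0x80:            # 0xxxxxxx: 7-bit ASCII, one byte
        return 1
    if first_byte >> 5 == 0b110:     # 110xxxxx: two-byte sequence
        return 2
    if first_byte >> 4 == 0b1110:    # 1110xxxx: three-byte sequence
        return 3
    if first_byte >> 3 == 0b11110:   # 11110xxx: four-byte sequence
        return 4
    # 10xxxxxx is a continuation byte; anything else is not a valid lead byte.
    raise ValueError(f"not a UTF-8 lead byte: {first_byte:#04x}")

for ch in ["A", "\u00e9", "\u1200", "\U0001F600"]:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}: lead byte {encoded[0]:#04x}, "
          f"{utf8_sequence_length(encoded[0])} byte(s), actual {len(encoded)}")
```

For the Ethiopic character U+1200 used in the thread's test string, the lead byte indicates a three-byte sequence, which is why a byte offset after reading such characters advances three bytes per character rather than one.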